Wikiwix crawling robot

The Wikiwix robot works in 3 different modes:
  • Crawling the Wikimedia projects: for all projects of the Wikimedia Foundation, our robot listens to the recent-changes (RC) IRC feed on irc.wikimedia.org and crawls pages as soon as they have been modified or created (see the sketch after this list). This keeps our search engine on Wikipedia and its sister projects up to date.
  • Crawling web pages for our Twitter-based search engine. This is complementary to the encyclopedic Wikipedia search and gives you access to what is buzzing on the internet right now.
  • Crawling entire web sites on demand: you can build a custom search engine over your favorite websites with Wikimarks.
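
For readers curious about the first mode, here is a minimal sketch, in Python, of listening to the public Wikimedia recent-changes IRC feed. The server name irc.wikimedia.org comes from the description above; the channel and nickname are only examples, and this is not Wikiwix's actual crawler code, just an illustration of reacting to edits as they are announced.

    import socket

    SERVER = "irc.wikimedia.org"   # public recent-changes feed (read-only)
    PORT = 6667
    CHANNEL = "#en.wikipedia"      # example channel; each wiki has its own
    NICK = "rc-listener-example"   # hypothetical nickname

    def listen():
        sock = socket.create_connection((SERVER, PORT))
        sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())
        sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
        buf = b""
        while True:
            buf += sock.recv(4096)
            while b"\r\n" in buf:
                line, buf = buf.split(b"\r\n", 1)
                text = line.decode("utf-8", errors="replace")
                if text.startswith("PING"):
                    # answer server keep-alives so the connection stays open
                    sock.sendall(text.replace("PING", "PONG", 1).encode() + b"\r\n")
                elif "PRIVMSG" in text:
                    # each PRIVMSG announces one edit; a real crawler would
                    # parse out the page title and queue it for re-crawling
                    print(text)

    if __name__ == "__main__":
        listen()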

How to control our robot

Prevent our robot from crawling your website

  • In the robots.txt file, located at the root of your website (http://yoursite.com/robots.txt), put the following lines:

    User-agent: wikiwix
    Disallow: /

  • Or, to block just a subtree or particular pages (you can check your rules with the sketch after these examples):

    User-agent: wikiwix
    Disallow: /subtree
    Disallow: /somewhere/particular-page.html
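
A quick way to sanity-check rules like the ones above is Python's standard urllib.robotparser module, which applies standard robots.txt matching. This is only a convenience sketch, not our robot's own code; yoursite.com is the placeholder host from the examples.

    from urllib import robotparser

    # the same rules as in the second example above
    rules = [
        "User-agent: wikiwix",
        "Disallow: /subtree",
        "Disallow: /somewhere/particular-page.html",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(rules)

    # paths covered by a Disallow rule are blocked for the "wikiwix" agent
    print(parser.can_fetch("wikiwix", "http://yoursite.com/subtree/page.html"))               # False
    print(parser.can_fetch("wikiwix", "http://yoursite.com/somewhere/particular-page.html"))  # False
    print(parser.can_fetch("wikiwix", "http://yoursite.com/other/page.html"))                 # True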

Prevent our robot from crawling specific pages

We respect the robots meta tag:
  • Putting the line <meta name="robots" content="noindex,nofollow" /> in a page will prevent us both from indexing the page and from following the links it contains (see the sketch after this list).
  • Any combination of noindex or index with nofollow or follow is honored: it tells us whether or not to index the page, and whether or not to follow the links it contains.
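
As an illustration of how such directives are interpreted on the crawler side, here is a small Python sketch that extracts the robots meta tag from a page and derives the index/follow decision. It uses only the standard library, the sample page is made up, and it is not Wikiwix's actual parsing code.

    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        """Collect index/follow directives from <meta name="robots"> tags."""

        def __init__(self):
            super().__init__()
            self.index = True    # defaults when no directive is present
            self.follow = True

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() != "robots":
                return
            directives = [d.strip().lower() for d in (attrs.get("content") or "").split(",")]
            if "noindex" in directives:
                self.index = False
            if "nofollow" in directives:
                self.follow = False

    page = '<html><head><meta name="robots" content="noindex,nofollow" /></head><body></body></html>'
    parser = RobotsMetaParser()
    parser.feed(page)
    print("index:", parser.index, "follow:", parser.follow)   # index: False follow: False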