We all want to prevent our websites from being ripped, and the desire is even stronger when the design is unique. So we need an effective way to stop rippers from getting an exact replica of our website. A plethora of website ripping applications, such as WinHTTrack and WebReaper, have surfaced lately, and rippers use them to dump a site and then build an exact clone of it from that dump.
The image below shows how an effective robots.txt file in our website root prevented WinHTTrack from copying our content and data:
We can use the robots.txt file to stop website ripper bots from crawling our website, so that the program has nothing to copy. A bot's access to a website can be controlled through robots.txt, which sits in the website root directory. When a ripping program that honours robots.txt starts, its bot first checks the rules in that file. If it finds a rule that disallows it, it does not crawl the site, which is a win for us. A rule in robots.txt is either an allow rule or a disallow rule, and looks like this:
User-agent: <User Agent name>
Allow: /

User-agent: <User Agent name>
Disallow: /
In the above examples, the / indicates the root directory. It can be replaced with the path of the directory to which we wish to allow or deny access. So when a bot reads robots.txt and finds itself under a Disallow rule, it stops right there and goes no further into the website.
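To see how a bot that honours robots.txt would read such rules, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the example.com URLs, the /private/ directory and the SomeBrowser agent name are assumptions made up for illustration, and the check only reflects well-behaved bots.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: WebReaper is shut out of the whole site,
# every other bot is only kept out of the /private/ directory.
RULES = """\
User-agent: WebReaper
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Return True if a bot that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("WebReaper", "https://example.com/index.html"),
        ("SomeBrowser", "https://example.com/index.html"),
        ("SomeBrowser", "https://example.com/private/page.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(RULES, agent, url))
```

Replacing / with a directory path, as in the second group of rules, narrows the rule to that directory only.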
This robots.txt method can be used to keep email harvesters, leechers and other such ill-intentioned applications away from your site. We have compiled a list of commonly used programs of this kind and created a disallow rule for each of them. You can use the list below in your robots.txt file as a shield against programs like WinHTTrack, Email Collector or Web Downloader. If you do not have a robots.txt file in your website root, create one with the following content; if you already have one, just copy and paste the content into it.
# This list is compiled by Techie Zone part of Qlogix Network.

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: WebReaper
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: Web Downloader
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Offline Explorer Pro
Disallow: /

User-agent: HTTrack Website Copier
Disallow: /

User-agent: Offline Commander
Disallow: /

User-agent: Leech
Disallow: /

User-agent: WebSnake
Disallow: /

User-agent: BlackWidow
Disallow: /

User-agent: HTTP Weazel
Disallow: /
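If you would rather generate these rules than paste them by hand, the list can be kept as data and written out with a short script. The following is only a sketch: the function names build_robots_txt and still_allowed and the example.com URL are hypothetical, and the check at the end shows how Python's standard robots.txt parser reads the file, not how any particular ripper behaves.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from the list above.
BLOCKED_AGENTS = [
    "Teleport", "TeleportPro", "EmailCollector", "EmailSiphon",
    "WebBandit", "WebZIP", "WebReaper", "WebStripper",
    "Web Downloader", "WebCopier", "Offline Explorer Pro",
    "HTTrack Website Copier", "Offline Commander", "Leech",
    "WebSnake", "BlackWidow", "HTTP Weazel",
]


def build_robots_txt(agents):
    """Build one 'User-agent / Disallow: /' group per agent name."""
    groups = [f"User-agent: {name}\nDisallow: /" for name in agents]
    return "\n\n".join(groups) + "\n"


def still_allowed(robots_txt, agents, url="https://example.com/"):
    """Return the agents that the generated rules fail to block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in agents if parser.can_fetch(name, url)]


if __name__ == "__main__":
    content = build_robots_txt(BLOCKED_AGENTS)
    # Write the file locally; upload it to the website root afterwards.
    with open("robots.txt", "w", encoding="utf-8") as handle:
        handle.write(content)
    print("Agents not blocked:", still_allowed(content, BLOCKED_AGENTS))
```

Keeping the names in one list makes it easy to add another program later without retyping the rule for it.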
As this post is aimed at webmasters, and reading it suggests that you are one of them, we also suggest reading our post on increasing your website speed.