
Use robots.txt to prevent website from being ripped

We all want to prevent our websites from being ripped. If you have a unique design, the desire to keep it from being duplicated is even stronger, so we need an effective way of preventing rippers from getting an exact replica of our website. A plethora of website-ripping applications such as WinHTTrack or WebReaper have surfaced these days, and rippers use these tools to create website replicas. They can then use the downloaded dump to build an exact clone of your website.


These rippers use bots to crawl the target website and create a static HTML version of every page, copying the images, style sheets, JavaScript and so on (they cannot copy the server-side PHP or ASP source) and saving it all to the hard drive while preserving the directory structure and hyperlinks. Once the complete website has been downloaded to a PC, it can be browsed offline without an Internet connection, and the downloaded data can also be used to build an exact replica of the site. For a static HTML website the cloning becomes even easier. No webmaster would ever willingly allow anyone to rip their website, yet most wonder how to prevent such an operation.

The image below shows how an effective robots.txt file in our website root prevented WinHTTrack from copying our content and data:

We can use the robots.txt file to stop a ripper's bot from crawling our website, so the program cannot copy anything. We know that bots' access to a website can be controlled using robots.txt, which sits in the website root directory. Whenever someone runs a ripping program, the bot it uses is expected to check robots.txt first; if it finds a rule there that disallows it, it will not crawl the site, which is exactly what we want. A robots.txt rule is either an Allow or a Disallow and looks like this:

User-agent: <User Agent name>
Disallow: /

User-agent: <User Agent name>
Allow: /

In the examples above, the / indicates the root directory; it can be replaced with the path of any directory to which we wish to allow or deny access. So when a bot reads robots.txt and finds itself covered by a Disallow rule, it stops there and does not enter the website.
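
For illustration, here is a minimal sketch of how a well-behaved crawler is expected to consult robots.txt before fetching a page, using Python's standard urllib.robotparser module; the site URL and the user-agent string are placeholders, not part of the original post:

# A well-behaved crawler checks robots.txt before fetching any page.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # download and parse the rules

# can_fetch() returns False if this user agent falls under a Disallow rule
if rp.can_fetch("WebReaper", "https://example.com/some-page.html"):
    print("Allowed to crawl")
else:
    print("Blocked by robots.txt")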

This robots.txt method can be used to keep email harvesters, leechers and other such ill-intentioned applications away from your site. We have compiled a list of commonly used offending programs and created a Disallow rule for each of them. You can use the list below in your robots.txt file as a preventive shield against programs like WinHTTrack, Email Collector or Web Downloader. If you do not have a robots.txt file in your website root, create one with the following content; if you already have one, simply paste these lines into your existing robots.txt file.

# This list is compiled by Techie Zone, part of the Qlogix Network.
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: WebBandit
Disallow: /
User-agent: WebZIP
Disallow: /
User-agent: WebReaper
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: Web Downloader
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: Offline Explorer Pro
Disallow: /
User-agent: HTTrack Website Copier
Disallow: /
User-agent: Offline Commander
Disallow: /
User-agent: Leech
Disallow: /
User-agent: WebSnake
Disallow: /
User-agent: BlackWidow
Disallow: /
User-agent: HTTP Weazel
Disallow: /
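
If you want to check the list before deploying it, the small sketch below (assuming the rules above are saved as robots.txt in the current directory) uses Python's standard urllib.robotparser to confirm that the ripper agents are disallowed while unlisted crawlers remain allowed:

import urllib.robotparser

# Parse a local copy of the rules above (assumed saved as robots.txt)
with open("robots.txt") as f:
    rules = f.read().splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# User agents from the block list should be denied everywhere
print(rp.can_fetch("WebZIP", "/index.html"))        # False
print(rp.can_fetch("EmailCollector", "/contact/"))  # False

# Anything not listed (for example a search engine crawler) is still allowed
print(rp.can_fetch("Googlebot", "/index.html"))     # True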

Since this post is targeted at webmasters, and the fact that you are reading it suggests you are one of them, we would also suggest reading the post on Increasing your website speed.


By Ajay Meher

Ajay is the editor and webmaster of Techie Zone. He is also a WordPress expert and provides WordPress consultancy and web services. Recently he launched his maiden start-up, Qlogix Solutions. You can follow him on Twitter @ajaykumarmeher

9 replies on “Use robots.txt to prevent website from being ripped”

To be honest, this is not very effective.

Go to HTTrack, change the browser ID, configure it to disobey robots.txt and TADA. Website downloaded.

You can’t do anything about it. The web is open thanks to Tim Berners-Lee. 😉

There’s an error in your reasoning:
“We know that bots' access to a website can be controlled using robots.txt”
should be
“We know that well-behaved bots' access to a website can be controlled using robots.txt”
The ‘control’ is in the programming of the bots, and nothing in a robots.txt file can stop the bot-programmer from simply ignoring it.

Thanks for posting this. I was googling for how to prevent HTTrack and didn’t find much on the subject. Just because HTTrack can be told to ignore robots.txt doesn’t mean we shouldn’t use it to at least stop people who are new to ripping and don’t know better. Every little extra step counts.

I also like an idea I just read about: putting download folders outside the web root and using a separate ASP or PHP page to retrieve and serve them. This will at least protect any download files you might have for paying customers. I’m betting there are a ton of WordPress and similar blog and CMS sites out there that have the download directory right inside the site directory, with links straight to it.
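
A minimal sketch of the approach this comment describes, using Python with Flask instead of the ASP or PHP page mentioned; the directory path and the paying-customer check are hypothetical placeholders:

# Serve files from a directory that lives outside the web root,
# only after an application-level check (hypothetical session flag).
from flask import Flask, abort, send_from_directory, session

app = Flask(__name__)
app.secret_key = "change-me"              # needed for session support
DOWNLOAD_DIR = "/srv/private_downloads"   # placeholder path outside the web root

@app.route("/download/<path:name>")
def download(name):
    if not session.get("is_paying_customer"):  # hypothetical auth check
        abort(403)
    # send_from_directory refuses paths that escape DOWNLOAD_DIR
    return send_from_directory(DOWNLOAD_DIR, name)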
