Robots.txt files have been around for many years; however, the potential risks of incorrect use still aren’t well known. A robots.txt file tells search engine bots about your website and its structure. Web developers and SEO Sydney specialists can easily access the robots.txt file of any website and understand what that website is doing.
There are a few things you should always look to include in your robots.txt file. You should tell the web crawler where to find your sitemaps, how often it may crawl, and which pages it should not crawl. A good crawler bot (like the Googlebot) will always look for a robots.txt file first and then follow any directions given.
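For illustration, a minimal robots.txt covering those three points might look like the sketch below. The sitemap URL and paths are placeholders, and note that the Crawl-delay rule is honoured by some crawlers (such as Bingbot) but ignored by Googlebot, which takes its crawl rate from Search Console instead:

```
User-agent: *
Crawl-delay: 10
Disallow: /tmp/

Sitemap: https://www.example.com/sitemap.xml
```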
There is, however, a risk with robots.txt files that you need to be aware of. Not all crawler bots are good citizens that respect the instructions in the file. Some particularly bad crawler bots will find your robots.txt file and go straight to the disallow list. You have then effectively given these bad bots a road map to the places you don’t want them to look.
As well as this, your robots.txt file is open for your competitors to view freely. You could be handing information you wouldn’t want to share straight to a competitor. While this may not be a major issue, it is still a security risk, and you should look for ways to minimise it.
So, have a look below at some of the best practices for you to reduce the risk of your robots.txt file:
Best Practice #1: Understanding The Purpose of a Robots.txt File
It is vital that you have a proper understanding of what your robots.txt file is actually used for. Importantly, the robots exclusion standard (the protocol behind robots.txt) does not in any way control whether a URL is removed from a search engine’s index, and it will not stop the search engine from adding further URLs to the index.
Search engines don’t consult robots.txt files when adding (or removing) URLs in their index. Crawling and indexing are separate processes, and robots.txt only governs crawling, so make sure you are not trying to use your robots.txt file to control what appears in the index.
Best Practice #2: When to use Noindex
Sometimes you’ll have a few pages that you want to keep out of search results but still publicly accessible. In this case you should be using “noindex” and not a “disallow” rule. Note that noindex belongs in the page itself, as a meta robots tag or an X-Robots-Tag HTTP header, rather than in robots.txt; Google stopped supporting the unofficial noindex rule inside robots.txt in 2019. A noindex directive tells the crawler not to add the URL to the index, even though it can still crawl the page.
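As a sketch, the noindex directive is typically placed in the page’s head section:

```html
<meta name="robots" content="noindex">
```

For non-HTML files (such as PDFs), the equivalent is the HTTP response header `X-Robots-Tag: noindex`.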
If you have any pages that should be private and not accessed by the public, then the best course of action would be to password protect the page, or even whitelist the IP. If you aren’t sure about your private pages then you should talk to your web designer/developer, who can give you better guidance for your individual situation.
Best Practice #3: Don’t Disallow Specific Pages, Just Directories
If you list specific pages, you are making it that much easier for bad crawlers, or your competitors, to find the places you don’t want them to look. That being said, listing a directory doesn’t make it impossible for those bad bots to find what you were trying to “hide”, but it does make it a lot more difficult.
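As an illustration (the paths here are hypothetical), compare the two approaches:

```
# Too specific: this hands a bad bot the exact pages you are protecting
# Disallow: /admin/login.php
# Disallow: /reports/q3-financials.html

# Better: disallow the whole directory instead
Disallow: /admin/
Disallow: /reports/
```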
Best Practice #4: Use Caution When Using Disallow and Noindex at the Same Time
This combination is typically quite rare, but it is something that you should be aware of. If a page is disallowed in robots.txt, the crawler can never fetch it, so it never sees the noindex directive on the page; the URL can then still end up in the index via external links, but with no content to describe it. If this issue does arise on your page, Google is known to display the following message in the description for your website in its index: “No information is available for this page”. Obviously, this does not give the user a positive experience, so if you find this issue on your site it should be resolved as soon as possible.
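As a hypothetical illustration of how the conflict arises (the path is a placeholder), the disallow rule stops the crawler from ever fetching the page, so the noindex inside it goes unread:

```
# robots.txt
Disallow: /members/

# /members/welcome.html contains a noindex the crawler can now never see:
# <meta name="robots" content="noindex">
```

If the URL is linked from elsewhere, it can still be indexed, just without a description.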
Best Practice #5: Honeypots for IP Blacklisting
This is something you should consider if you are thinking about taking your security to another level. As the name suggests, the honeypot will attract the bad bots (like bees to honey), and you should make the honeypot appealing by listing the type of things they would be looking for.
You then also need to set up a way to block IP addresses. Once you have trapped a bad bot, you can blacklist its IP so that it cannot access any section of your site. This is quite extreme, but if you are particularly concerned about the safety of your data, it is certainly something you should consider.
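One possible shape for this, as a minimal sketch only: the trap path, IP addresses, and handler function below are all hypothetical, and a real deployment would do this in the web server or firewall rather than in application code:

```python
# Hypothetical honeypot sketch. The trap path is listed under Disallow in
# robots.txt, so well-behaved crawlers never request it. Any client that
# requests it anyway has ignored robots.txt and gets its IP blacklisted.

TRAP_PATH = "/private-reports/"   # hypothetical trap URL from the disallow list
blacklist = set()                 # IPs caught requesting the trap

def handle_request(ip, path):
    """Return an HTTP status code: 403 for blacklisted IPs, 200 otherwise."""
    if ip in blacklist:
        return 403                # already caught: deny everything
    if path.startswith(TRAP_PATH):
        blacklist.add(ip)         # fell into the honeypot: blacklist the IP
        return 403
    return 200

# A well-behaved visitor browses normally...
print(handle_request("203.0.113.7", "/blog/"))              # 200
# ...while a bad bot that probes the trap path is caught...
print(handle_request("198.51.100.9", "/private-reports/"))  # 403
# ...and then blocked on every later request, even to public pages.
print(handle_request("198.51.100.9", "/blog/"))             # 403
```

The same idea can be pushed down to the firewall level by exporting the blacklist, so trapped bots are rejected before they ever reach your site.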