Noindex, Nofollow & Disallow: What Are They & How Do I Use Them?
For a site with hundreds of thousands or even millions of URLs, it is imperative to understand the notion of “crawl budget” when search engine robots come to crawl it. Samuel Graham, Online Content Specialist at Greater Bank, confirms: “The idea with this practice is to guide the search engine robots during the exploration of your website to focus on important pages. Therefore, your rankings will improve for the relevant pages of your business.”
The means of communicating with search engine bots about the crawl are the Disallow (and Allow), Noindex and Nofollow directives. They are placed either in the head section of a page’s HTML or in the robots.txt file. Here’s what they mean:
- Disallow tells robots which pages or directories not to explore;
- Noindex tells search engines which pages not to index in the search results;
- Nofollow asks the bots not to follow the links on the page (i.e., not to pass them any credit).
The Disallow command is mainly used in the robots.txt file at the root of a site. Noindex and Nofollow are used in the robots meta tag in the HTML code or set through the CMS.
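As a quick illustration, a minimal robots.txt sketch might look like this (the /private/ directory is a hypothetical example):

User-agent: *
Disallow: /private/

Noindex and Nofollow, by contrast, are placed in the robots meta tag shown in the next section.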
Depending on your purpose, these directives will have an impact on how your site is crawled. They can be an ally or a monster that destroys your site’s visibility. It all depends on how well you use them!
Directive in the meta tag
In the head section of your pages’ HTML, you can have these combinations:
<meta name="robots" content="noindex, follow">
<meta name="robots" content="nofollow">
<meta name="robots" content="noindex, nofollow">
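For context, here is a sketch of where the tag sits within a page; the surrounding markup is purely illustrative:

<!DOCTYPE html>
<html>
<head>
  <title>Example page</title>
  <!-- Tells robots not to index this page but to follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
<body>
  ...
</body>
</html>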
You should therefore avoid sending conflicting messages to search engines, for example by placing these instructions on pages you have already decided to block and that the robots are prohibited from accessing.
To deepen your knowledge on this topic, you can read this summary provided by Google.
Directive in the robots.txt file
It’s important that you use the robots.txt file only if you understand what you are doing. This file is so important and powerful that mishandling it can have disastrous consequences for your site and subsequently, your business.
For sites with relatively few pages (tens of thousands or fewer), it is practically useless to try to manipulate this file. On the other hand, managing this file is important for sites with millions of URLs to index, such as e-commerce websites.
Disallow: How does it work?
With this directive, you can block the exploration of a part of your site (or a directory), or even of URLs with certain parameters. This is often crucial when you have an e-commerce site with filters that need to be managed well. The result is simple: the robot will not be able to visit the pages of that directory.
Imagine that you sell shoes online. You’ll have dozens of filters that are important to the user, such as size and color. Each filtered search will create a unique URL that is of no interest for indexing. By disallowing these pages, you can save dozens, even hundreds of thousands of pages in your “crawl budget”.
Please be aware that this also prevents the passage of any link juice (PageRank), which makes any page in a disallowed directory useless for link building.
In this instance, it is enough to add the directive Disallow: /the-page-that-does-not-need-to-be-indexed to your robots.txt file.
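For the shoe-store scenario above, a robots.txt sketch could look like the following; the size and color parameter names are hypothetical and would need to match your actual filter URLs (the * wildcard is supported by major search engines such as Google):

User-agent: *
Disallow: /*?size=
Disallow: /*?color=
Disallow: /the-page-that-does-not-need-to-be-indexed

Any URL matching these patterns will then be skipped during the crawl, freeing budget for the pages you actually want indexed.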
Why not combine Disallow and Noindex?
This is a frequent question I get from my clients, so I would like to address it here so that it’s clear. You can use Noindex to prevent pages from being indexed by search engines. However, for an HTML page, combining the Disallow and Noindex directives does not work, because the content of a page that is blocked from crawling (including its Noindex tag) can never be discovered or read. The Disallow command in the robots.txt file and the Noindex tag in the head of the HTML can only be set up separately.
In practice, when the robots discover a new page covered by a Disallow directive, the page is simply not explored. If a page has already been crawled and indexed, however, the search engines will deindex it the next time they recrawl it and find the Noindex tag. From then on, the page no longer exists in the search engines’ index, and no further exploration will take place.
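As a concrete illustration of that order of operations, here is a sketch for deindexing a page that is already in the index (the /old-page/ path is hypothetical). First, leave the page crawlable and add this tag to its head so the robots can actually read it:

<meta name="robots" content="noindex, follow">

Only once the page has dropped out of the index would you add a Disallow rule for it in robots.txt, if you also want to stop future crawling:

User-agent: *
Disallow: /old-page/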
In summary, combining Disallow and Noindex in the robots.txt file was intended to solve both problems at once: preventing the crawl of the pages and deindexing them if they are already present in the search engines’ index. Note, however, that Google has announced it no longer supports a Noindex directive in robots.txt, so the meta tag approach described above remains the reliable way to deindex a page.