Robots & Sitemaps: What Happens Behind the Scenes?

Written on 26 May, 2015 by Lazar Dusanovic
Categories Search Engine Optimisation

When potential customers visit your website, many of them are not aware that, at the very same time, there could be other, non-human visitors browsing it too.

Relax, we're talking about search engine robots here!

When a search engine like Google wants to visit your site, it doesn't ask some guy in the basement to do so; instead, it sends out tiny bits of code called robots. Robots look for one very important file before going anywhere else – that file is aptly named the robots.txt file.

The robots file is literally the gatekeeper to your site – it tells search engine spiders (another name for robots) what to do and where they are allowed to go.

Robot names

You may see references in your logs and in your robots.txt file to other kinds of robots visiting your site. For example, User-Agent: Googlebot would indicate that Google's main search engine spider visited your site, or that your robots.txt file has special rules just for it. There are hundreds of robots out there, and not all of them are connected to search engines. Other sites that want to read your website's information may have bots of their own. Moz, for example, has a bot named rogerBot, and some of its tools use a robot called dotbot as well. Even Google has several – Googlebot, Googlebot-News and Googlebot-Mobile, to name a few.
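
To give a quick illustration of what "special rules just for it" can look like, here is a small, purely hypothetical robots.txt fragment that shuts one named robot (dotbot, in this made-up example) out of the site entirely while leaving every other robot free to crawl:

User-Agent: dotbot
Disallow: /

User-Agent: *
Allow: /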

Anatomy of a robots.txt file

User-Agent: *

This line says that the rules below are for any search engine. It can also name a specific search engine spider such as User-Agent: Googlebot-Mobile.

Allow: /

This says that the robots are allowed to crawl the site.

Disallow: /wp-admin and Disallow: /wp-includes

These lines say that robots are not allowed to crawl the wp-admin folder (the typical administration folder for WordPress) or the wp-includes folder, which typically contains scripts, theme and plugin information and other things that robots don't usually need to know about.

Sitemap:

This line specifies where your website's sitemap is located (more on this below).
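
Put together, a minimal robots.txt file for a typical WordPress site (the domain is just a placeholder) might look like this:

User-Agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Sitemap: http://www.mydomain.com/sitemap.xml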

Some tricks and advanced uses

One of the biggest things this file is typically used for is blocking a search engine from seeing, crawling or indexing a specific page – but what if you have dozens of pages that you need to block?

 If they are in a folder, simply block the folder:

Disallow: /wp-admin/

If they are not, then it gets a bit trickier. First, you need to look for commonalities in the URL.

Some common URLs that appear on e-commerce sites contain parameters for sorting and filtering products, such as these:

url?width=400

url?Size=xl

url?color=blue

The good news is that these can be easily blocked:

Disallow: *width=*

Disallow: *size=*

Note the asterisk (*) – that is a wildcard. In short, you are instructing robots not to crawl any URL that contains "size=", no matter where it appears in the URL.

If you are confident that you don't need search engines to index or crawl any parameter URLs, you can even block anything with an equals sign (Disallow: *=*). Be careful, though! Many CMSs (Content Management Systems) and e-commerce shopping carts use parameters for product ID numbers and pagination – and while the latter should be blocked to avoid duplicate content issues, blocking the former could drop your site's traffic as dozens if not hundreds of product pages are dropped from the index.
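
A safer approach is to name only the parameters you actually want to keep robots away from. The parameter names below are hypothetical – check what your own CMS or shopping cart actually uses before copying anything:

Disallow: *sort=*
Disallow: *page=*
# sorting and pagination parameters are blocked, but a product URL such as url?id=123 can still be crawled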

Sitemaps

Another important line in your robots file is the sitemap location:

Sitemap: http://www.mydomain.com/sitemap.xml

So what is the sitemap.xml file?

Where robots.txt tells search engines where they cannot go, the sitemap.xml tells them where your website's pages (as well as posts, products and images) are located. The sitemap is an XML file (XML is Extensible Markup Language – essentially a set of rules and definitions for structuring data).
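
To give you an idea of what that looks like, here is a stripped-down, hypothetical sitemap.xml listing just two pages (the URLs and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mydomain.com/</loc>
    <lastmod>2015-05-26</lastmod>
  </url>
  <url>
    <loc>http://www.mydomain.com/about-us/</loc>
  </url>
</urlset>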

Generally, you could have a separate sitemap for each section or type of content, and in turn link them all together with a sitemap index file, like this:

This is a screenshot of the Yoast XML Sitemap settings. Yoast is a popular SEO plugin for WordPress and, once set up, it will generate a sitemap_index.xml file that contains all the segmented content of your site, such as posts, pages and even images.
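
A sitemap index is itself just a small XML file that points to the individual sitemaps. A hypothetical example (with placeholder file names) looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.mydomain.com/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.mydomain.com/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>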

Don’t know how to make a sitemap file?

The good news is that many CMSs will generate one for you. If you are not using a CMS, or yours does not generate sitemap files, you can always visit a site like https://www.xml-sitemaps.com/ to have one created for you. Then all you need to do is upload it to your website's server.

Important considerations when dealing with robots and sitemap files

Some tips:

  • If you block a page in robots.txt, make sure it is not listed in your sitemap – remove it manually if you need to.
  • Make sure you do not Disallow: / as this will block EVERYTHING on your site. Typically this is only done deliberately before a website is launched, and it needs to be removed when the site goes live.
  • If you are using a sitemap index, be sure that your robots.txt file points to its correct location. You don’t need to add a line for each sitemap you have in your robots file, as that is what your index is for.

Having 100 pages in your sitemap does not guarantee that every one of them will be indexed by Google or other search engines. Pages can be left out for any number of reasons, from thin content to duplicated or very similar pages. Even the age of a particular page matters when it comes to indexing.

Robots and Sitemap errors

Google’s Webmaster Tools is a treasure trove of information for identifying issues with these files.

Firstly, Google lets you use their system to test your current robots.txt file and to make “temporary” changes so you can see the effects on the pages of your site.

Common errors include unnecessary directives such as Crawl-Delay (which is not supported by Google), or misspellings of "disallow" (although Google is often smart enough to recognize a misspelled directive anyway).

Sitemaps can have a lot of issues too, typically as a result of miscoding, of listing URLs that are blocked in robots.txt, or of general crawling problems – e.g. when a website goes down or is re-launched without its original sitemap or robots file.

What if I make a mistake?

Mistakes happen, even to the most skilled of us. Sometimes a website is launched while still blocking critical pages, or even all of its pages, and suddenly your rankings drop. Or perhaps the newest product in your store is not getting any traffic and you can't figure out why – it could be a line in your robots file!

If you are at a loss and need help, you can always call on the friendly specialists at Webcentral – 1300 663 995.
