Search engine not able to index blogspot pages

I recently upgraded to the new version of Blogger. It has more features and is more fun, really. Among the new features are publishing without a separate publish step, more AJAX features, labelling of blog posts (tagging with keywords), and administration of user comments.

However, after the upgrade the internal search provided by FreeFind stopped working. I found this out from the weekly reports, which showed that FreeFind's spider had managed to index only one page.

The first thing I realized was that I had put the archive list into a drop-down box, which means that spiders may not be able to read the archive listing and discover the archive pages from it.
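To illustrate the drop-down issue, here is a minimal sketch of a link extractor that only looks at `<a href>` tags. It assumes a simple spider works roughly this way; FreeFind's actual crawler may behave differently. The markup in `sample` and the archive URLs are hypothetical.

```python
from html.parser import HTMLParser


class AnchorLinkExtractor(HTMLParser):
    """Collect href values from <a> tags only, as a simple spider might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def discover_links(html):
    """Return every link a spider limited to <a href> would find."""
    extractor = AnchorLinkExtractor()
    extractor.feed(html)
    return extractor.links


# Hypothetical archive markup: one plain link and one drop-down entry.
sample = """
<a href="/2007_01_01_archive.html">January 2007</a>
<select onchange="location.href=this.value">
  <option value="/2007_02_01_archive.html">February 2007</option>
</select>
"""

print(discover_links(sample))
```

Only the plain link is found; the URL stored in the `<option>` value is never seen, so the February archive page would not be reached through this listing.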

However, my biggest observation was that FreeFind's spider was also unable to follow the links on the main page. This is quite unusual.

I don't remember making any changes to FreeFind's spider settings, so logically the problem should be coming from the upgrade to the new version of Blogger.

robots nofollow through problems...

FreeFind's FAQ section narrows the problem down to "You are using a nofollow robots meta tag".

I then checked the HTML rendered for this blog and found the following...

<meta name="robots" content="index,follow">
<meta name="document-classification" content="Commercial">
<meta name="document-rating" content="Safe for Kids">
<meta name="document-distribution" content="Global">
<meta name="rating" content="General">
<meta name="author" content="Brandon Teoh">
<meta http-equiv="Content-Language" content="en-us">

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="MSSmartTagsPreventParsing" content="true" />
<meta name="generator" content="Blogger" />
<link rel="alternate" type="application/atom+xml" title="IT-Sideways: Tech Blog Malaysia - Atom" href="http://brandonteohno1-it.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" title="IT-Sideways: Tech Blog Malaysia - RSS" href="http://brandonteohno1-it.blogspot.com/feeds/posts/default?alt=rss" />
<link rel="service.post" type="application/atom+xml" title="IT-Sideways: Tech Blog Malaysia - Atom" href="http://www.blogger.com/feeds/8024740/posts/default" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://www2.blogger.com/rsd.g?blogID=8024740" />
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />

As you can see, the second group of tags (from the Content-Type tag down to the ROBOTS tag) is generated by Blogger itself from the template tag below.

<$BlogMetaData$> (provided by default from blogger's template)

Since <meta name="ROBOTS" content="NOINDEX,NOFOLLOW" /> is generated and comes after <meta name="robots" content="index,follow">, follow-through is turned off.
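A quick way to spot this kind of conflict is to list every robots meta tag in the rendered page. The sketch below does that with Python's standard HTML parser. It assumes that any `noindex`, `nofollow` or `none` directive anywhere on the page is enough to block a spider; how FreeFind actually resolves conflicting tags is not documented here, so treat that rule as an assumption. The helper name `robots_summary` is made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag, in page order."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The name is compared case-insensitively, so "ROBOTS" matches too.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.append(
                [part.strip().lower() for part in content.split(",") if part.strip()]
            )


def robots_summary(html):
    """Report all robots tags and whether any of them blocks indexing or following."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    flat = [d for tag in collector.directives for d in tag]
    return {
        "tags": collector.directives,
        "blocks_index": "noindex" in flat or "none" in flat,
        "blocks_follow": "nofollow" in flat or "none" in flat,
    }


page = (
    '<meta name="robots" content="index,follow">\n'
    '<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />\n'
)
print(robots_summary(page))
```

For the two tags found on this blog, the summary lists both and reports that indexing and following are blocked.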

By default, Blogger's robots meta setting prevents spiders (from search engines and so on) from following the links on the main page. This is perhaps a policy to protect your privacy. However, it may keep your blog(s) from being spidered and indexed by search engines.

Solutions ..

Turn-on the follow-through features...

Remove Blogger's default meta tag (~~<$BlogMetaData$>~~) from the template and add the meta tags you want to the template manually. In this situation, that is simply...

<meta name="robots" content="index,follow">
<meta name="document-classification" content="Commercial">
<meta name="document-rating" content="Safe for Kids">
<meta name="document-distribution" content="Global">
<meta name="rating" content="General">
<meta name="author" content="Brandon Teoh">
<meta http-equiv="Content-Language" content="en-us">

You may also want to copy some of the important tags generated by Blogger's default meta tag and insert them manually.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="MSSmartTagsPreventParsing" content="true" />
<meta name="generator" content="Blogger" />
<link rel="alternate" type="application/atom+xml" title="IT-Sideways: Tech Blog Malaysia - Atom" href="http://brandonteohno1-it.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" title="IT-Sideways: Tech Blog Malaysia - RSS" href="http://brandonteohno1-it.blogspot.com/feeds/posts/default?alt=rss" />
<link rel="service.post" type="application/atom+xml" title="IT-Sideways: Tech Blog Malaysia - Atom" href="http://www.blogger.com/feeds/8024740/posts/default" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://www2.blogger.com/rsd.g?blogID=8024740" />

Or, if you are using FreeFind, just add the following line below all the other meta tags.

<meta name=FreeFind content="all">

That is ...

<meta name="robots" content="index,follow">
<meta name="document-classification" content="Commercial">
<meta name="document-rating" content="Safe for Kids">
<meta name="document-distribution" content="Global">
<meta name="rating" content="General">
<meta name="author" content="Brandon Teoh">
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="MSSmartTagsPreventParsing" content="true" />
<meta name="generator" content="Blogger" />
<link rel="alternate" type="application/atom+xml" title="IT-Sideways: Tech Blog Malaysia - Atom" href="http://brandonteohno1-it.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" title="IT-Sideways: Tech Blog Malaysia - RSS" href="http://brandonteohno1-it.blogspot.com/feeds/posts/default?alt=rss" />
<link rel="service.post" type="application/atom+xml" title="IT-Sideways: Tech Blog Malaysia - Atom" href="http://www.blogger.com/feeds/8024740/posts/default" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://www2.blogger.com/rsd.g?blogID=8024740" />
<meta name=FreeFind content="all">

Characteristic of FreeFind ...

All of this is partly due to FreeFind's real-time indexing and updating. Every time FreeFind's spider does a crawl, it does not keep the old index in an archive. The old index is purged and a new one is created.

Therefore, your index may be healthy yesterday but run into problems tomorrow. If your server is down while the spider is crawling, there is a high chance that your index will be temporarily erased.
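The effect of purge-and-rebuild can be shown with a toy model. This is only a sketch of the behaviour described above, not FreeFind's actual implementation; the function names, the dictionary-based index and the page URLs are all hypothetical.

```python
def rebuild_index(old_index, crawl_results):
    """Purge-and-rebuild: the old index is discarded, only this crawl counts."""
    return dict(crawl_results)


def merge_index(old_index, crawl_results):
    """Alternative for comparison: keep old entries, update with the new crawl."""
    merged = dict(old_index)
    merged.update(crawl_results)
    return merged


# Yesterday's healthy index (hypothetical pages).
yesterday = {
    "/index.html": "main page",
    "/2007_01_01_archive.html": "January archive",
}

# Today the server was down during the crawl, so nothing was fetched.
failed_crawl = {}

print(rebuild_index(yesterday, failed_crawl))  # the index is now empty
print(merge_index(yesterday, failed_crawl))    # the old entries survive
```

With purge-and-rebuild, one failed crawl empties the search index until the next successful crawl. Keeping old entries would avoid that, at the cost of serving stale results.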

The good thing is that this helps with the problem of stale caches. But each crawl may take longer as the number of pages on your blog grows.

Also, note that if you are using the free version, the limit on the maximum number of pages indexed will cause problems in the future.

Comments

Anonymous said…

I am facing the same problem with my blog. I changed the meta description of my blog two days ago, but Google hasn't crawled it yet. I have emailed the Blogger team and am waiting for a reply.
How much more time will Google take to crawl my blog again?

Feb 21, 2007, 6:38:00 AM