PDF indexing and ranking test

TL;DR

  • Adding document properties in Word is likely to help indexing and ranking once the document is converted to PDF

Google (in particular with its Quick View functionality) and other search engines have recently shown interest in document types other than the standard HTML web page. I was curious to understand whether a well-optimised PDF document has a better chance of being indexed than one made without any sort of optimisation.

The test uses essentially the same text for each PDF document; some changes were necessary to highlight the differences between the individual documents.

The table below shows the differences between the generated PDFs:

Summary table of the tests run

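Test name  H1  H1 fake  H2  H2 fake  Bottom  Title  Subject  Keywords  Comments
Test 1     X   -        X   -        X       X      X        X         X
Test 1.1   X   -        X   -        X       X      -        -         -
Test 1.2   X   -        X   -        X       X      -        -         X
Test 1.3   X   -        X   -        X       X      -        X         -
Test 2     -   -        -   -        X       X      X        X         X
Test 3     -   -        -   -        X       -      -        -         -
Test 4     X   -        -   -        X       -      -        -         -
Test 4.1   X   -        -   -        X       -      -        -         -
Test 5     -   -        X   -        X       -      -        -         -
Test 5.1   -   -        X   -        X       -      -        -         -
Test 6     -   -        -   X        X       -      -        -         -
Test 7     X   -        -   X        X       X      X        X         X
Test 8     -   -        -   -        X       -      -        -         -
Test 9     -   X        -   -        X       -      -        -         -
Test 10    -   X        -   -        X       -      -        -         -
Test 11    -   X        -   X        X       -      -        -         -
Test 12    -   X        -   X        X       X      X        X         X
Test 13    -   -        -   -        X       -      -        -         -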

Unique Research Key (URK): seiunamicone

Assumptions

  1. A keyword density (KD) of about 14% is the minimum for each document;
  2. A KD of 100% is assumed when the URK appears in all the fields listed in the table above;
  3. An H1 or H2 is considered fake when it uses the same font size but a different type of emphasis.
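
One possible reading of assumptions 1 and 2 is that KD is counted per field rather than per word: one field out of seven is about 14%, and all seven give 100%. This is an interpretation, not something stated explicitly above; it treats a real or fake H1 as one field (likewise for H2), and it happens to agree with the KD43, KD57%, KD71% and KD100% labels in the PDF file names that appear in the crawl logs. The names `FIELDS` and `field_kd` are made up for this sketch.

```python
# Assumed set of seven fields (a real or fake H1 counts as "h1", same for "h2").
FIELDS = ("h1", "h2", "bottom", "title", "subject", "keywords", "comments")


def field_kd(fields_with_urk):
    """Return the share of fields containing the URK, as a percentage."""
    present = set(fields_with_urk) & set(FIELDS)
    return 100 * len(present) / len(FIELDS)


# URK only in the bottom text: the assumed minimum.
print(round(field_kd({"bottom"})))                          # 14
# URK in H1, H2, bottom text and title.
print(round(field_kd({"h1", "h2", "bottom", "title"})))     # 57
# URK in every field.
print(round(field_kd(FIELDS)))                              # 100
```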

Facts

The website was registered recently, and no SEO activity has been carried out to increase the presence of this domain on the search engines (SEs).

I’ll keep updating this page with SE accesses and SERP ranking improvements as soon as I notice any difference (or as soon as I can).

Indexation results

26th October: The test was announced publicly on Twitter (and other social media channels) to speed up indexing and to see how the new partnership that Google and Bing made with the tweeting site really works.

Preparing the test took a bit of time, and everything was ready at about 11:30 GMT. Manual submission to the search engines was done during the past 30 minutes.

27th October: Some preparation. The first crawler to fetch the page was Google's Mediabot (Mediapartners-Google). I’m not surprised, since the page contains an AdSense banner and Google wants to understand the content of the page before showing the banners.

2009-10-26 13:53:54 /SEO-Test-PDF/index.html - 66.249.71.164 Mediapartners-Google 200

Just a couple of hours later, Googlebot visited the website, followed by Yahoo!.

2009-10-26 14:25:25 /sitemap.xml - 66.249.71.164 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-26 14:33:04 /robots.txt - 74.6.22.91 Mozilla/5.0+(compatible;+Yahoo!+Slurp;+http://help.yahoo.com/help/us/ysearch/slurp) 404 0 2

It’s interesting to compare the behaviour of these two search engines, to both of which I had submitted the specific URL. Google requested the sitemap.xml file, whereas Yahoo! requested the exclusion file robots.txt and ignored the rest.
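
The log entries quoted in this post appear to be space-separated: date, time, requested path, query string, client IP, user agent and status code (some entries carry extra trailing numbers). Assuming that layout, a minimal Python sketch for splitting an entry into fields and guessing the crawler from its user agent could look like the following. The `Hit` type, the function names and the `CRAWLERS` table are illustrative only and were not part of the test.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    timestamp: str
    path: str
    ip: str
    agent: str
    status: int


def parse_line(line: str) -> Hit:
    """Split one log entry; assumes the seven leading fields described above."""
    date, time, path, _query, ip, agent, status = line.split()[:7]
    return Hit(f"{date} {time}", path, ip, agent, int(status))


# Substrings taken from the user agents quoted in this post (assumed mapping).
CRAWLERS = {
    "Mediapartners-Google": "Google AdSense",
    "Googlebot": "Google",
    "Yahoo!+Slurp": "Yahoo!",
    "msnbot": "MSN/Bing",
    "Baiduspider": "Baidu",
    "Teoma": "Ask",
}


def crawler_name(agent: str) -> str:
    """Return a readable crawler name, or 'unknown' if nothing matches."""
    for marker, name in CRAWLERS.items():
        if marker in agent:
            return name
    return "unknown"


SAMPLE = ("2009-10-26 14:25:25 /sitemap.xml - 66.249.71.164 "
          "Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200")

hit = parse_line(SAMPLE)
print(crawler_name(hit.agent), hit.path, hit.status)  # Google /sitemap.xml 200
```

Grouping the parsed hits by crawler name is then enough to see which file each search engine asked for first.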

I’m not surprised by the delay of the MSN/Bing bot, which came by about five hours later and requested both the robots.txt file and the index page of the test directory.

2009-10-26 16:55:53 /robots.txt - 65.55.209.107 msnbot/1.1+(+http://search.msn.com/msnbot.htm) 200

2009-10-26 16:55:53 /SEO-Test-PDF/index.html - 65.55.209.107 msnbot/1.1+(+http://search.msn.com/msnbot.htm) 200

The robots.txt file contains the address of the sitemap, so Yahoo! and Bing now know that the sitemap exists even though they didn’t explicitly request it.

You can never be too careful, so MSN decided to come back an hour later to make sure the test was still there.
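
A Sitemap line in robots.txt advertises the sitemap to any crawler that reads the file. As a minimal sketch, Python's standard `urllib.robotparser` module (Python 3.8 or later for `site_maps()`) can extract it. The robots.txt content and the example.com URL below are placeholders, since the actual file from the test site is not reproduced in this post, and `sitemaps_from_robots` is a helper name invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content; the real file from the test is not shown here.
ROBOTS_TXT = """\
User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap.xml
"""


def sitemaps_from_robots(text: str) -> list:
    """Return the sitemap URLs declared in a robots.txt body (may be empty)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(sitemaps_from_robots(ROBOTS_TXT))  # ['http://www.example.com/sitemap.xml']
```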

2009-10-26 17:50:59 /robots.txt - 65.55.209.106 msnbot/1.1+(+http://search.msn.com/msnbot.htm) 200

2009-10-26 17:50:59 /sitemap.xml - 65.55.209.106 msnbot/1.1+(+http://search.msn.com/msnbot.htm) 200

The last search engine to access the site, at least yesterday, was Baidu, which to be honest I hadn’t invited directly (nor do I know how it became aware of the site, considering its limited presence).

2009-10-26 22:29:20 /robots.txt - 220.181.7.16 Baiduspider+(+http://www.baidu.com/search/spider.htm) 200

Similar situation for Ask.

28th October: This morning (it’s 9:20 GMT), the only interesting activity I can see is Googlebot fetching my sitemap and robots.txt again.

2009-10-27 04:54:03 /robots.txt - 66.249.71.164 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-27 04:54:03 /sitemap.xml - 66.249.71.164 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

Later today Baidu scanned the robots.txt file again.

Considering these first results, three search engines out of four are now aware of the website and the test, although only MSN/Bing actually visited the specific test directory.

The test will continue over the following days/weeks, and I’ll post more updates as soon as they become available.

29th October: Google is now aware of the test. This morning Googlebot visited the test website and successfully crawled all the PDF files belonging to this test. It first came by early this morning …

2009-10-28 03:50:51 /robots.txt - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 03:50:51 /SEO-Test-PDF/index.html - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

then some hours later scanned all the PDF files.

2009-10-28 05:19:02 /SEO-Test-PDF/PDF-test-search-in-the-header1-ver2.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:19:02 /SEO-Test-PDF/PDF-test-without-headers.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:21:00 /SEO-Test-PDF/PDF-test-without-header2.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:22:45 /SEO-Test-PDF/PDF-test-normal-with-headers.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:24:37 /SEO-Test-PDF/PDF-test-without-headers-KD43.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:26:29 /SEO-Test-PDF/PDF-test-search-in-the-header2.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:28:22 /SEO-Test-PDF/PDF-test-without-headers-KD100.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:30:14 /SEO-Test-PDF/PDF-test-without-header2-KD100.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:32:07 /SEO-Test-PDF/PDF-test-search-in-the-header1.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:33:59 /SEO-Test-PDF/PDF-test-without-headers-doublekey.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:35:51 /SEO-Test-PDF/PDF-test-without-header2-doublekey.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:37:44 /SEO-Test-PDF/PDF-test-normal-with-headers-KD57%.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:39:36 /SEO-Test-PDF/PDF-test-search-in-the-header2-ver2..pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:41:28 /SEO-Test-PDF/PDF-test-normal-with-headers-KD71%.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:43:21 /SEO-Test-PDF/PDF-test-normal-with-headers-KD100%.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:45:13 /SEO-Test-PDF/PDF-test-normal-with-headers-KD71%-ver2.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

2009-10-28 05:47:06 /SEO-Test-PDF/PDF-test-normal-with-headers-search-in-the-properties.pdf - 66.249.65.155 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200

3rd November: Yahoo! is crawling. Yahoo! has started to crawl the PDF files and will hopefully index them.

2009-11-03 14:03:25 /SEO-Test-PDF/PDF-test-normal-with-headers-KD100%.pdf - 67.195.113.250 Mozilla/5.0+(compatible;+Yahoo!+Slurp/3.0;+http://help.yahoo.com/help/us/ysearch/slurp) 200

2009-11-03 14:03:46 /SEO-Test-PDF/PDF-test-without-headers-doublekey.pdf - 67.195.113.250 Mozilla/5.0+(compatible;+Yahoo!+Slurp/3.0;+http://help.yahoo.com/help/us/ysearch/slurp) 200

2009-11-03 14:06:07 /SEO-Test-PDF/PDF-test-normal-with-headers.pdf - 67.195.113.250 Mozilla/5.0+(compatible;+Yahoo!+Slurp/3.0;+http://help.yahoo.com/help/us/ysearch/slurp) 200

2009-11-03 14:08:26 /SEO-Test-PDF/PDF-test-without-headers-KD43.pdf - 67.195.113.250 Mozilla/5.0+(compatible;+Yahoo!+Slurp/3.0;+http://help.yahoo.com/help/us/ysearch/slurp) 200

5th November: Ask/Teoma is aware of the test. After a delay of many days, Ask finally decided to visit the website and crawl the robots.txt file and one document.

2009-11-05 20:58:09 /robots.txt - 66.235.124.58 Mozilla/5.0+(compatible;+Ask+Jeeves/Teoma;++http://about.ask.com/en/docs/about/webmasters.shtml) 200

2009-11-05 20:58:09 /SEO-test-PDF/PDF-test-without-header2-KD100.pdf - 66.235.124.58 Mozilla/5.0+(compatible;+Ask+Jeeves/Teoma;++http://about.ask.com/en/docs/about/webmasters.shtml) 200

8th November: Ask shows results. There are some results in Ask's SERP. It doesn’t show all the files belonging to the test, but it certainly did something in a very short time.

10th November: Yahoo! is lazy. Although Yahoo! first crawled the PDF files about ten days ago, today it is still impossible to find any result in the SERP when searching for the URK.
