Dave Sifry is chiming in on some analysis by Tristan Louis of how well Google, Yahoo and Technorati cover the blogosphere. Briefly, here’s what Tristan did: he ran link: queries on Google, Yahoo and Technorati for each of the blogs in the Technorati Top 100 and recorded the number of results reported by each search engine. For example, taking BoingBoing, the first blog on that list:
- For the query link:boingboing.net, Google reports “about 40,700 results”, i.e. pages in its index that link to BoingBoing.net
- Technorati has indexed 23,358 links to BoingBoing.
- Yahoo’s results claim that it has indexed about 1,320,000 pages linking to BoingBoing.
Interestingly, Technorati is the only one of the three that gives the same count whether the link query includes www before the domain name or not (I happen to think that’s the correct behavior).
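Tristan’s method boils down to scraping the reported count off each engine’s results page. The exact wording each engine used is an assumption on my part (the snippets below are illustrative, not live output), but the extraction step can be sketched like this:

```python
import re

def extract_reported_count(results_text: str) -> int:
    """Pull the reported total out of 'Results 1 - 10 of about 40,700'
    style text on a search results page."""
    match = re.search(r"of (?:about )?([\d,]+)", results_text)
    if match is None:
        raise ValueError("no result count found")
    return int(match.group(1).replace(",", ""))

# Illustrative snippets in the style each engine used at the time:
print(extract_reported_count("Results 1 - 10 of about 40,700 for link:boingboing.net"))  # 40700
print(extract_reported_count("1 - 10 of 23,358"))  # 23358
```

Run that for the hundred blogs on the list and you have Tristan’s table.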
That, in short, is Tristan’s method: transparent and easily reproducible. The picture that emerges is that Technorati’s coverage of the blogosphere is worse than Google’s, which in turn is [much] worse than Yahoo’s. By the way, Tristan’s post has more depth than is relevant here, including some interesting statistics that slice this data further. Read it.
Anyway, Dave smelled something fishy in Tristan’s data (he’s onto the right question, but he goes after a red herring and misses a different, interesting feature in the data):
… I believe that Tristan’s analysis begs a question that hasn’t been asked yet: How accurate are the numbers that search engines report about the size of their result sets? … For example, when you search for all the results for “Tristan Louis” on Google, it reports “about 575,000”.
Whoa, hold it. That’s a keyword query, which means Dave’s now running a different experiment from Tristan’s, which uses link queries. I recommend you read Dave’s entire post, but from this point forward, he’s on a different track, using keyword queries instead of link queries throughout.
Dave’s objection is to the limit on “viewable results” that Yahoo and Google implement (Technorati doesn’t). Both Yahoo and Google serve only about the first 1,000 results of a result set. Serving results 1 through N not only gets more expensive for larger N, but the value to the user falls off pretty rapidly after a while. As a bonus, this limit keeps pranksters with robots from chewing up bandwidth by paging through millions of results and wreaking havoc on caching. Not to mention that nobody wants to wade through more than a few pages of results anyway, rather than just rephrasing the query to get better ones. [Someone should do something about the excessive recall of these keyword search engines…] Anyway, the 1,000-result limit is an interesting discovery, but it’s an obvious optimization. The results count given at the top of the page (“results 1-10 of about 40,700”) is of course an estimate, again as an optimization: getting exact counts from massively distributed indexes isn’t free, and who needs an exact count at this level of recall, anyway?
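To make the estimation point concrete, here’s a toy sketch of the kind of shortcut an engine can take; this is my own illustration, not how Google or Yahoo actually compute their counts. For a query that intersects two posting lists, scan only a prefix of the first list and extrapolate the match rate over its full length:

```python
def estimate_and_count(postings_a, postings_b, sample_cap=1000):
    """Estimate how many documents match an AND of two terms by
    intersecting only a prefix of the first posting list and
    extrapolating linearly over its full length."""
    b = set(postings_b)
    sample = postings_a[:sample_cap]
    hits = sum(1 for doc in sample if doc in b)
    if len(postings_a) <= sample_cap:
        return hits  # short list: the count is exact
    # assume matches are spread evenly through the rest of the list
    return round(hits * len(postings_a) / sample_cap)

# A distributed engine would do something like this per index shard
# and sum the per-shard estimates into the number shown to the user.
```

The estimate is cheap and usually in the right ballpark, but it inherits whatever bias the sampled prefix has, which is one way “about N results” can drift far from the truth.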
So, the limit on viewable results is very straightforwardly explained as an optimization, benign to the user experience. I don’t think there’s anything to get worked up about in only being able to see the first 1,000 results for a query; that’s what query refinement is for. The interesting thing is the estimated total number of results, specifically for link queries.
When I went through Tristan’s original experiment and ran some link queries, it became pretty obvious (as if it wasn’t obvious in Tristan’s post) that there’s something weird about Yahoo’s method of estimating the total number of results. Estimating the total number of results (as opposed to computing it precisely) is a necessary optimization in a search engine that wants to operate at Google or Yahoo scale, and the estimates from both engines seem plausible (and within an order of magnitude of each other) for most keyword queries that I tried. But for link queries, that’s not the case. Let’s look again at the estimated total results counts for pages linking to BoingBoing:
- Google: about 40,700
- Technorati: 23,358
- Yahoo: about 1,320,000
Now, the fact that Technorati found only about half as many links to BoingBoing as Google isn’t a big deal and shouldn’t give Technorati an inferiority complex. A sizeable chunk of the links may be from sites that Technorati isn’t indexing because those sites aren’t blogs or don’t use ping services that Technorati is monitoring. Also, Technorati’s index isn’t as old as Google’s, and other factors, like multiple links per page to the same blog, make the comparison even more difficult. In any case, the difference between Google and Technorati is relatively small (if the Technorati team spends some time on the back end now that the new UI is up, they’ll narrow that gap). What’s interesting, however, is Yahoo’s estimate for the number of results for this particular query. At 1.3 million, it’s about 30x larger than Google’s count and 60x larger than Technorati’s. That seems implausible to me, and it looks like some wacky calculations are happening in Yahoo’s estimation of the results count for this query. For several blogs I tried, Google’s results count is plausible and roughly 2-4x Technorati’s, whereas Yahoo’s is, well, out there. Here are small, medium and medium-large examples (we covered extra-large above, with BoingBoing):
I told you, it’s wacky. Tristan’s conclusion is that Yahoo! is more focused on indexing the blogosphere and has more data. That may be true. But these counts are so far out there that I can’t help but think there’s a problem with the way they’re calculated. So there. Fix it. I may check 🙂
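For the record, the rough 30x and 60x ratios come straight from the three BoingBoing counts quoted at the top of this post:

```python
# Reported counts of pages linking to BoingBoing, per engine:
counts = {"Google": 40_700, "Technorati": 23_358, "Yahoo": 1_320_000}

for engine, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{engine:<10} {n:>9,}")

print(f"Yahoo vs Google:     {counts['Yahoo'] / counts['Google']:.0f}x")
print(f"Yahoo vs Technorati: {counts['Yahoo'] / counts['Technorati']:.0f}x")
```

That works out to roughly 32x and 57x, which I’ve rounded to 30x and 60x above.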
And, to end on a nitpicky note: as I mentioned above, if you add or remove www in the link: query, both Google’s and Yahoo’s total counts jump around like crazy. I don’t know about you, but in the context of searching WWW content, I think www should be treated as a special hostname equivalent to the bare domain, i.e. www.domain.com == domain.com.
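The normalization I’m arguing for is trivial to implement. A minimal sketch (my own helper, not anything the engines expose):

```python
from urllib.parse import urlparse

def canonical_host(url_or_host: str) -> str:
    """Treat www.domain.com and domain.com as the same site when
    counting inbound links."""
    # urlparse only fills in netloc when a scheme is present,
    # so fall back to treating the input as a bare hostname.
    host = urlparse(url_or_host).netloc or url_or_host
    host = host.lower().rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return host

print(canonical_host("http://www.boingboing.net"))  # boingboing.net
print(canonical_host("boingboing.net"))             # boingboing.net
```

Key link counts by `canonical_host` and the www/no-www discrepancy disappears, which is exactly the behavior Technorati already has.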