WWW 2008 / Poster Paper, April 21-25, 2008 · Beijing, China

A Larger Scale Study of Robots.txt

Santanu Kolay, Paolo D'Alberto, Ali Dasdan, and Arnab Bhattacharjee
Yahoo! Inc., Sunnyvale, CA, USA
{santanuk,pdalberto,dasdan,arnab}@yahoo-inc.com

ABSTRACT
A website can regulate search engine crawler access to its content using the robots exclusion protocol, specified in its robots.txt file. The rules in the protocol enable the site to allow or disallow part or all of its content to certain crawlers, resulting in a favorable or unfavorable bias towards some of them. A 2007 survey of the robots.txt usage of 7,593 sites found some evidence of such biases, the news of which led to widespread discussions on the web. In this paper, we report on our survey of about 6 million sites. Our survey corrects the shortcomings of the previous survey and shows the lack of any significant preference towards any particular search engine.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Search Process
General Terms: Experimentation, Measurement
Keywords: Crawler, robots exclusion, robots.txt, search engine

1. INTRODUCTION
Search engines retrieve content from (web)sites using crawling agents called robots or crawlers. Since a site needs to deal with many crawlers using its resources, it needs to regulate their behavior. The robots exclusion protocol [2] is a partial solution to this regulation problem, providing an advisory set of rules for crawlers to follow. To use this protocol, a site typically specifies its rules in a file called robots.txt. The rules allow the site to allow or disallow part or all of its content to specific crawlers.

Despite the importance of this protocol for both content providers and search engines, the first reasonably large scale study of its usage was done only recently, in 2007 [4, 5]. That study covered 2,925 distinct robots.txt files from 7,593 sites. It reported many useful observations, the most surprising of which was that there is a bias towards specific crawlers, a bias that was also shown to correlate strongly with the respective search engine market shares. This observation was discussed widely in many well-known blogs following the search engine field; a web search for the query "robots.txt study bias" returns many pages discussing the findings of that study.

Intrigued by the observations of the previous study, we started our own investigation into robots.txt usage. Using the crawler of the Yahoo! search engine, we began regular retrieval of about 2.2M non-empty robots.txt files from 6M sites (5M top and 1M randomly selected, where "top" is in terms of HostRank, a quality score similar to PageRank but computed on the host graph). Having access to many properties of these sites, we analyzed the collected data in many ways and also replicated the analysis of the previous study. Among our many findings, the main result is that some sites may have a bias towards specific crawlers, but overall the top two crawlers appear to have access to the same amount of content. This is contrary to the point made by the previous study.

2. EXPERIMENTS AND RESULTS
An ideal measure of the bias towards a crawler is the value of the allowed and disallowed content. Unfortunately, this is almost impossible to measure because, for example, a search engine cannot value the disallowed content without serving it to its users. Hence, all approaches to measuring the bias resort to approximating content via directory or URL counts. In [4, 5], the bias was measured by counting the number of directories disallowed; i.e., the crawler with the highest (lowest) such count was regarded as having the most unfavorable (favorable) bias. The drawback of this approach is that the number of directories disallowed may not correlate well with the amount of content or the number of URLs disallowed.

In our study, we replicated the previous approach (Figs. 1(b) and 1(c)), but we also tried to estimate the number of URLs disallowed by sending path queries to search engines. Though better than directory counts, URL counts still cannot scale due to the daily search limits imposed by search engines. We were, however, able to get the counts for at least 1,500 sites per month. Once we had the URL counts for two search engines Y and G, we computed the relative bias (Fig. 1(d)) of a site H towards these engines in two steps: (1) find the percentage PY (PG) of H's URLs that only Y (G) has, and then (2) report a bias towards Y if PY is larger than PG by some threshold ε, and a bias towards G otherwise.
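As an illustration of the directory-count measure replicated from [4, 5], the following Python sketch counts, for each named crawler, the non-empty Disallow rules in a robots.txt file that apply to it. This is not the authors' tooling: the group handling is deliberately simplified, and the crawler tokens in the example input are merely illustrative.

    # Minimal sketch of the directory-count measure: count, per user-agent
    # token, how many non-empty Disallow rules a robots.txt file declares.
    # Group handling is simplified: a rule group applies to every User-agent
    # line that immediately precedes it. Crawler names are illustrative.
    from collections import defaultdict

    def disallow_counts(robots_txt):
        counts = defaultdict(int)
        current_agents = []
        last_was_agent = False
        for raw in robots_txt.splitlines():
            line = raw.split("#", 1)[0].strip()        # drop comments
            if not line or ":" not in line:
                continue
            field, value = (part.strip() for part in line.split(":", 1))
            field = field.lower()
            if field == "user-agent":
                if not last_was_agent:
                    current_agents = []                # start a new group
                current_agents.append(value.lower())
                last_was_agent = True
            else:
                last_was_agent = False
                if field == "disallow" and value:      # empty Disallow allows all
                    for agent in current_agents:
                        counts[agent] += 1
        return dict(counts)

    example = """
    User-agent: slurp
    Disallow: /private/

    User-agent: *
    Disallow: /tmp/
    Disallow: /cgi-bin/
    """
    print(disallow_counts(example))   # -> {'slurp': 1, '*': 2}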
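The two-step relative-bias rule above can be sketched in Python as follows. This is only an illustration of the stated rule, not the authors' exact procedure: the function name, the sample counts, and the explicit "no significant bias" outcome when the two percentages are within the threshold are assumptions.

    # Sketch of the relative-bias rule for a host H: only_y and only_g are the
    # numbers of H's sampled URLs found exclusively in engine Y's and engine
    # G's index (e.g., via path queries); total is the number of sampled URLs.
    # The default threshold epsilon is an assumed value.
    def relative_bias(only_y, only_g, total, epsilon=0.01):
        if total == 0:
            return "unknown"
        p_y = only_y / total   # PY: fraction of H's URLs that only Y has
        p_g = only_g / total   # PG: fraction of H's URLs that only G has
        if p_y - p_g > epsilon:
            return "bias towards Y"
        if p_g - p_y > epsilon:
            return "bias towards G"
        return "no significant bias"

    # Example: 120 of 1,000 sampled URLs are indexed only by Y, 30 only by G.
    print(relative_bias(only_y=120, only_g=30, total=1000))   # -> bias towards Y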
We performed other experiments to understand the significance of the robots.txt bias: the use of Crawl-Delay (Fig. 1(e)), which may introduce bias if a crawler does not honor it, and the use of the sitemap protocol. The sitemap protocol [3] enables a site to inform a search engine about any of its content it prefers to be crawled. It was introduced in Apr. 2007, and the big four search engines all support it. Our evaluation [1] has shown that since Oct. 2007, sitemap usage has been increasing by 10,000 robots.txt files a month. However, only about 6% of the 1.7M valid robots.txt files had a sitemap link specified as of Dec. 2007.

Figure 1: Our comparison of the robots.txt bias towards two major search engine crawlers. Note that these engines get almost the same share of bias in all cases and that the amount of bias in each case is actually very small. (a) Number of non-empty robots.txt files processed. (b) Average number of disallowed directories per robots.txt file. (c) Bias via directory count occurs in less than 1.5% of valid robots.txt files. (d) Bias via URL count is almost the same for both engines. (e) Bias via Crawl-Delay is mostly created by small delay values. (f) Bias via HostRank is almost the same for both engines; the bias mostly occurs with low-rank hosts.

3. CONCLUSIONS
We have performed a comprehensive and larger scale study of robots.txt usage. Our results show that the bias towards specific search engines is not as serious as reported by the recent prior work [4, 5].

4. REFERENCES
[1] A. Dasdan, D. Pavlovski, and A. Bhattacharjee. A large scale study of sitemap usage. Available by email, Jan. 2008.
[2] M. Koster. A method for web robots control. Internet draft, The Internet Engineering Task Force (IETF), 1996.
[3] The sitemap protocol. URL: http://www.sitemaps.org, Apr. 2007.
[4] Y. Sun, Z. Zhuang, I. G. Councill, and C. L. Giles. Determining bias to search engines from robots.txt. In Proc. of Int. Conf. on Web Intelligence (WI), pages 149-155. IEEE/WIC/ACM, 2007.
[5] Y. Sun, Z. Zhuang, and C. L. Giles. A large scale study of robots.txt. In Proc. of Int. Conf. on World Wide Web (WWW), pages 1123-1124. ACM, 2007.