Bing Indexing of gitweb.cgi Links
21 January, 2012
[Figure: Bing indexing of gitweb.cgi links]
First, here are some stats for indexing bots from major search engines across all cipherdyne.org Apache log data for hits against gitweb.cgi from June, 2011 to today:
Hits | Percentage | User-Agent |
505055 | 81.01% | Mozilla/5.0 (compatible; bingbot/2.0;) |
50242 | 8.06% | msnbot/2.0b (+http://search.msn.com/msnbot.htm)._ |
25707 | 4.12% | Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com) |
6583 | 1.06% | Feedfetcher-Google; (+http://www.google.com/feedfetcher.html;) |
4310 | 0.69% | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
1956 | 0.31% | Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/) |
1905 | 0.31% | Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/) |
1751 | 0.28% | Mozilla/5.0 (compatible; Yahoo! Slurp;) |
1625 | 0.26% | Mozilla/5.0 (compatible; MJ12bot/v1.4.0;) |
1451 | 0.23% | TwengaBot-Discover (http://www.twenga.fr/bot-discover.html) |
Wow! So bots associated with Microsoft's Bing search engine take the top two spots, for a combined total of well over 500,000 hits since June, 2011. If spread evenly over the entire time period (which they are not, as we'll see), that would be an average of about 2,600 hits per day, and the combined total is more than 20 times that of the third-place bot. Google is in a distant fourth place, even though Google used to heavily index Trac repositories.
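For readers who want to produce a similar table from their own logs, here is a minimal Python sketch. It is not the script used for this post. It assumes the logs are in Apache's "combined" format (user agent as the last quoted field) and that no field contains an embedded double quote. The function name `agent_stats` and its `path_filter` parameter are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Assumed Apache "combined" layout at the end of each line:
#   "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
LOG_RE = re.compile(
    r'"(?P<request>[^"]*)" \d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)

def agent_stats(lines, path_filter="gitweb.cgi"):
    """Return (user_agent, hits, percentage) tuples, most hits first.

    Only requests whose request line contains path_filter are counted,
    and percentages are relative to all counted hits.
    """
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m and path_filter in m.group("request"):
            counts[m.group("agent")] += 1
    total = sum(counts.values())
    return [(agent, hits, 100.0 * hits / total)
            for agent, hits in counts.most_common()]
```

Passing an open log file (or `sys.stdin`) as `lines` and printing each tuple gives rows in the same Hits, Percentage, User-Agent order as the table above.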
So, let's see how the search engine hits are distributed since June, 2011. First, here are graphs of just gitweb hits by the top five crawlers, on a linear scale and then on a logarithmic scale:
[Figure: top 5 gitweb indexers]
[Figure: top 5 gitweb indexers, logarithmic scale]
How does this compare with hits across other portions of cipherdyne.org? Bing indexing is still far and away the largest outlier:
[Figure: top 5 indexers of cipherdyne.org]
Update 01/23: There are tons of web analysis tools out there, but I wrote a couple of quick scripts to generate the data in this blog post. The first, "user_agent_stats.pl", parses Apache logs and produces user-agent graphs with Gnuplot as shown in this post. The second, "uniq_hits.pl", is extremely simple and just counts the number of hits against the same links within the Apache log data. Both scripts accept log data via STDIN. Here is an example where user agents that hit any "index.html" link are plotted (graph is not shown):
$ zcat ../logs/cipherdyne.org*.gz |grep "index.html" | ./user_agent_stats.pl -p index_hits
[+] Parsing Apache log data...
[+] Total agents: 1769 (abbreviated to: 174 agents)
[+] Executing gnuplot...
Plot file: index_hits.gif
Agent stats: index_hits.agents
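The per-link counting that "uniq_hits.pl" is described as doing can be approximated in a few lines. The following Python sketch is an illustration only, not the actual Perl script. It assumes that the first double-quoted field of each log line is the request line (`METHOD PATH PROTOCOL`), and the function name `uniq_hits` is borrowed from the script name purely for readability.

```python
from collections import Counter

def uniq_hits(lines):
    """Count hits per requested link, returning (link, hits) pairs,
    most-requested first. Lines without a parsable request are skipped."""
    counts = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field on this line
        request = parts[1].split()
        if len(request) >= 2:
            counts[request[1]] += 1  # request[1] is the requested path
    return counts.most_common()
```

To mirror the STDIN usage above, the function can be fed `sys.stdin` directly, for example from the same `zcat ... | grep ...` pipeline.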