Scraping Google for Jan!
Since starting my quest to rank first for "Jan", I’ve been looking at all kinds of traffic analysis tools, keyword checkers and so on, trying to get an overview of my current position. For this quest I’ve also written my own Google scraper to find out which results page I’m on. I wrote it in Ruby (what else), with a little help (read: a lot) from hpricot. Here is the code:
```ruby
#!/usr/bin/env ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'

# Settings!
Search = ARGV[0] or abort("Usage: #{$0} KEYWORD")  # keyword to search for
URL    = /^http:\/\/blog\.defv\.be.*$/             # pattern the result link must match
Max    = 30                                        # number of result pages to check

# No touchy
class BlogFound < StandardError; end

page = 0
item = 0

begin
  # Fetch one result page; URL-encode the whole query, not just the first space
  query = URI.encode_www_form_component(Search)
  doc = Hpricot(open("http://www.google.be/search?hl=nl&q=#{query}&start=#{page * 10}"))
  page += 1

  # Each result sits in a <div class="g">; its first link is the result URL
  doc.search("//div[@class='g']").each do |div|
    link = div.search("a").first
    next unless link  # skip blocks that contain no link
    item += 1
    raise BlogFound if link.attributes['href'] =~ URL
  end
rescue BlogFound
  print "Search for #{Search} || Page: #{page} || Item: #{item}\n"
  exit
end until page >= Max

print "Not found in the first #{Max} pages on Google with keyword #{Search}.. Too bad!\n"
```
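The URL-building step can also be looked at on its own. What follows is only a sketch: `search_url` and `RESULTS_PER_PAGE` are hypothetical names that do not appear in the script, and ten results per page is an assumption based on the `start` offset the script uses. The query is escaped with `URI.encode_www_form_component` from Ruby's standard library, so every space and special character in a multi-word keyword is encoded.

```ruby
require 'uri'

# Assumption: Google serves ten results per page, so page N starts at N * 10.
RESULTS_PER_PAGE = 10

# Hypothetical helper (not in the original script): build the search URL
# for a zero-based result page, escaping the whole query string.
def search_url(query, page)
  escaped = URI.encode_www_form_component(query)
  "http://www.google.be/search?hl=nl&q=#{escaped}&start=#{page * RESULTS_PER_PAGE}"
end

puts search_url("jan de vries", 2)
```

Spaces come out as `+`, which is the usual form encoding for a query string.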
Results of the program:
jan@odd:~/apps/mephisto$ ./find_on_google.rb jan
Search for jan || Page: 7 || Item: 67
So, conclusion: I’m on page 7! I’m getting to page 1.. Just a little bit of patience I guess.. and some linking? :D
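The page and item numbers in that output fit together with simple arithmetic. A rough sketch, assuming exactly ten results per page (the script itself just counts whatever result blocks it finds, so real pages may differ); `page_for` is a hypothetical helper, not part of the script:

```ruby
# Assumption: every result page holds exactly ten items.
RESULTS_PER_PAGE = 10

# Hypothetical helper: one-based page number for a one-based item position.
def page_for(item)
  (item - 1) / RESULTS_PER_PAGE + 1
end

puts page_for(67)  # item 67 lands on page 7
```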
Comments
- Don't forget to test on multiple datacenters in your script ;) as they don't always sync very well.
- Come on, I helped you out a bit with linking on my blog and on tweakers.net