Scraping Google for Jan!

written by Jan on February 7th, 2007 @ 06:44 PM

Since starting my quest to become the first result for “Jan”, I’ve been looking at all kinds of traffic analysis tools, search term checkers and so on, trying to get an overview of my current position. For this quest I’ve also written my own Google scraper to find out which results page I’m on. I wrote it in Ruby (what else), with a little help (read: a lot) from hpricot. Here is the code:

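  #!/usr/bin/env ruby
  require 'rubygems'
  require 'open-uri'   # lets open() fetch URLs (use URI.open on Ruby 2.5 and later)
  require 'hpricot'
  require 'cgi'
  
  # Settings!
  abort "Usage: #{$0} KEYWORD" unless ARGV[0]
  Search = ARGV[0]
  URL = /^http:\/\/blog\.defv\.be.*$/   # dots escaped so they only match a literal "."
  Max = 30
  
  # No touchy
  class BlogFound < StandardError; end
  
  page = 0
  item = 0
  begin
      # CGI.escape encodes every space and special character in the keyword;
      # start is the zero-based offset of the first result on the page
      doc = Hpricot(open("http://www.google.be/search?hl=nl&q=#{CGI.escape(Search)}&start=#{page * 10}"))
      page += 1
      # Each organic result sits in a <div class="g">
      doc.search("//div[@class='g']").each do |div|
          link = div.search("a").first
          next unless link   # skip result blocks without a link
          uri = link.attributes['href']
          item += 1
          raise BlogFound if uri =~ URL
      end
  rescue BlogFound
      print "Search for #{Search} || Page: #{page} || Item: #{item}\n"
      exit
  end until page >= Max   # stop after exactly Max pages
  
  print "Not found in the first #{Max} pages on Google with keyword #{Search}.. Too bad!\n"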
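
The counting logic can also be tried without touching Google at all. What follows is only a minimal sketch, not the script itself: `find_position` is a hypothetical helper, and the `pages` array is made-up sample data standing in for the links scraped from each results page. It returns early instead of raising an exception, which is a different design choice from the script.

```ruby
# Hypothetical helper, not part of the original script: scans pages of result
# links and returns [page, item] (both 1-based) for the first link matching
# pattern, or nil when nothing matches.
def find_position(pages, pattern)
  item = 0
  pages.each_with_index do |links, index|
    links.each do |href|
      item += 1
      return [index + 1, item] if href =~ pattern
    end
  end
  nil
end

# Made-up sample data standing in for two scraped result pages.
pages = [
  ['http://example.com/a', 'http://example.org/b'],
  ['http://example.net/c', 'http://blog.defv.be/']
]

p find_position(pages, %r{^http://blog\.defv\.be})  # => [2, 4]
```

Keeping the search separate from the fetching makes it easy to check the page and item counters by hand before pointing the scraper at real results.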

Results of the program:


  jan@odd:~/apps/mephisto$ ./find_on_google.rb jan
  Search for jan || Page: 7 || Item: 67

So, conclusion: I’m on page 7! I’ll get to page 1.. Just a little bit of patience I guess.. and some linking? :D
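
As a quick sanity check on that output: the script steps through results ten at a time (`start=#{page*10}`), so item 67 should indeed fall on page 7. A small sketch of the arithmetic, assuming every page really holds ten results (`page_for` is a hypothetical name, not something the script defines):

```ruby
# Hypothetical helper: converts a 1-based result position into a 1-based
# page number, assuming a fixed number of results per page.
def page_for(item, per_page = 10)
  (item - 1) / per_page + 1
end

puts page_for(67)  # => 7
```

If Google returns more or fewer than ten `div.g` blocks on some page, the item count and the page number can drift apart, so treat the item number as approximate.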



Comments

  • Benjamin on 07 Feb 19:42

    Don't forget to test on multiple datacenters in your script ;) as they don't always sync very well.
  • Bjorn on 08 Feb 19:20

    Well, I've helped you out a bit with linking on my blog and on tweakers.net

Comments are closed
