--------------000706010906020103020108
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Ruby said:
This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:
http://software77.net/cgi-bin/ip-country/geo-ip.pl
To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.
$ time ruby ip_to_country.rb 68.97.89.187
US
real 0m0.314s
user 0m0.259s
sys 0m0.053s
This is my first submission to Ruby Quiz, so don't expect ground-breaking techniques from this implementation.
Let me characterize my solution: Instead of looking up the csv file directly, I precompile it. The search can then be performed quickly. I think my solution will beat some others when it comes to massive lookups in the range of 100th of thousand lookups.
First, I wrote a random IP address generator. It can alternatively output the IP addresses as a space delimited string or each IP address on a separate line. On my bash submitting all IP address worked up to about 8 k entries until the argument list was too long, so I implemented that the addresses can be given on $stdin also. For a proper argument list on stdin, the entries should be separated by \n, so just add a second argument to rand_ip.rb to generate the list of ips -- the script doesn't care what the second parameter is, it just has to be there (i use "byline").
I use the script like this:
$ ./rand_ip.rb 100000 byline > ip-test-list
Second, I use a precompile run. I don't know how folks here can stay under 100 milliseconds without parsing the file beforehand. My precompilation generates a binary table that stores 10 bytes for each entry: 4 bytes each for start/end of address space and 2 bytes for the country code. Additionally, for saving memory, I squeeze memory areas with adjacent addresses in the same country. This saves more than 50% of the records while it doesn't significantly speed up the search (just about 1 step more in my binary search algorithm, see below). the packer (packr.rb) uses pack("NNa2") for building the binary table, and look, the addresses are stored in network byte order (i.e. big endian). The output table holds about 32 k Entries.
The packer doesn't accept any parameters, just call
$ ./packer.rb
Third comes the actual address search. I implemented two versions for comparison: One with a small RAM footprint that looks up everything in the packed table file. The other one reads the whole table into memory and performs all lookups in memory.
The algorithm is the same in both implementations: I use binary search over all entries until the ip address either matches one range or there is no match at all.
Comparison is sped up by making 4-letter string comparisons of the respective binary addresses. That way "search_ip >= addr_start" works even when the addresses are strings. This saves a lot of time because at search time there is no .to_i calculation avoided.
I run the searchers (search_file.rb, search_str.rb) for small numbers of random IP address using the command line:
$ time ./search_file.rb `./rand_ip.rb 3000` > /dev/null
For more addresses I use one of the following -- the last usage is to make successive runs with equal input:
$ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null
$ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb < ip-test-list > /dev/null
My results for 100000 lookups are:
$ time ./packer.rb
real 0m1.634s
user 0m1.576s
sys 0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list
real 0m0.797s
user 0m0.740s
sys 0m0.056s
$ time ./search_file.rb < ip-test-list > /dev/null
real 0m11.091s
user 0m9.673s
sys 0m1.420s
$ time ./search_str.rb < ip-test-list > /dev/null
real 0m7.131s
user 0m6.960s
sys 0m0.168s
btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
$ time ruby /dev/null
real 0m0.129s
user 0m0.120s
sys 0m0.012s
$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
Cheers,
- Matthias
--------------000706010906020103020108
Content-Type: text/plain;
name="rand_ip.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="rand_ip.rb"
#!/usr/bin/ruby
r=ARGV[0].to_i
EOL= ARGV[1] ? "\n" : " "
r.times {
print "#{rand(256)}.#{rand(256)}.#{rand(256)}.#{rand(256)}#{EOL}"
}
--------------000706010906020103020108
Content-Type: text/plain;
name="packer.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="packer.rb"
#!/usr/bin/ruby
# comment
last_start=nil
last_end=nil
last_country=nil
File.open("packed-ip.dat","wb") do |wfile|
IO.foreach("geo-ip.csv") do |line|
next if !(line =~ /^"/ )
s,e,d1,d2,co=line.delete!("\"").split(",")
s,e = s.to_i,e.to_i
if !last_start
# initialize with first entry
last_start,last_end,last_country = s,e,co
else
if s==last_end+1 and co==last_country
# squeeze if successive ranges have zero gap
last_end=e
else
# append last entry, remember new one
wfile << [last_start,last_end,last_country].pack("NNa2")
last_start,last_end,last_country = s,e,co
end
end
end
# print last entry
if last_start
wfile << [last_start,last_end,last_country].pack("NNa2")
end
end
--------------000706010906020103020108
Content-Type: text/plain;
name="search_file.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="search_file.rb"
#!/usr/bin/ruby
# take command line or stdin -- the latter has performance advantage for long lists
if ARGV[0]
arr=ARGV
else
arr=$stdin
end
# the binary table file is looked up with each request
File.open("packed-ip.dat","rb") do |rfile|
rfile.seek(0,IO::SEEK_END)
record_max=rfile.pos/10-1
arr.each { |argv|
# build a 4-char string representation of IP address
# in network byte order so it can be a string compare below
ipstr= argv.split(".").map {|x| x.to_i.chr}.join
# low/high water marks initialized
low,high=0,record_max
while true
mid=(low+high)/2 # binary search median
rfile.seek(10*mid) # one record is 10 byte, seek to position
str=rfile.read(8) # for range matching, we need only 8 bytes
# at comparison, values are big endian, i.e. packed("N")
if ipstr>=str[0,4] # is this IP not below the current range?
if ipstr<=str[4,4] # is this IP not above the current range?
puts rfile.read(2) # a perfect match, voila!
break
else
low=mid+1 # binary search: raise lower limit
end
else
high=mid-1 # binary search: reduce upper limit
end
if low>high # no entries left? nothing found
puts "no country"
break
end
end
}
end
--------------000706010906020103020108
Content-Type: text/plain;
name="search_str.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="search_str.rb"
#!/usr/bin/ruby
# read the binary table upfront
f=File.open("packed-ip.dat","rb")
table=f.read
record_max=table.length/10-1
f.close
# take command line or stdin -- the latter has performance advantage for long lists
if ARGV[0]
arr=ARGV
else
arr=$stdin
end
arr.each { |argv|
# build a 4-char string representation of IP address
# in network byte order so it can be a string compare below
ipstr= argv.split(".").map {|x| x.to_i.chr}.join
# low/high water marks initialized
low,high=0,record_max
while true
mid=(low+high)/2 # binary search median
# at comparison, values are big endian, i.e. packed("N")
if ipstr>=table[10*mid,4] # is this IP not below the current range?
if ipstr<=table[10*mid+4,4] # is this IP not above the current range?
puts table[10*mid+8,2] # a perfecct match, voila!
break
else
low=mid+1 # binary search: raise lower limit
end
else
high=mid-1 # binary search: reduce upper limit
end
if low>high # no entries left? nothing found
puts "no country"
break
end
end
}
--------------000706010906020103020108--