super-newbee Ruby regex help?

Aaron Reimann

This is pretty complex for me, considering that I am just now reading "Learn to
Program" by Chris Pine (a very basic book that teaches you how to program in
Ruby). I am somewhat good with PHP, but I want to move into RoR and want to
learn Ruby before I learn Rails.

Anyway, I found a real life situation where I think Ruby could do this
very quickly (and if I need to do it again, I can just run the script).
I need to remove some stuff from a text file. Simple huh? Here is
the site that I need the list from:

http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school

In that page there is one line of "code" that has all of the
links...here is part of it:
<a href="http://www.3proxy.com">3 Proxy</a> || <a
href="http://www.3proxy.net">3 Proxy</a> || <a
href="http://www.3proxy.org">3 Proxy</a>

I have taken just that line and saved that as a text file.

I need to strip everything else so that I wind up with this:
3proxy\.com
3proxy\.net
3proxy\.org
4proxy\.com

I will be taking that list (all 300 of them) and adding them to my
content filtering box. That way, all of these sites will be blocked.

Do you guys know of any sites that might have a similar situation where
I can see the code? or have any of you done something similar? I can
probably modify stuff to make it fit my needs, but stuff like
http://www.regular-expressions.info/ruby.html doesn't give me enough
info to start.

what i have right now is: file = File.open("list.txt","w")

lol

Sorry I'm a nubee... :)

thanks,
aaron
 
Vincent Fourmond

Hello !

OK, what you need is to extract the part 3proxy.com from the String
<a href="http://www.3proxy.com">3 Proxy</a>

For that, a RE like the following should do

/http:\/\/www\.([^"]+)/

You can read it this way: find substrings that start with "http://www."
(don't forget to escape each / in the RE, or Ruby will think the RE ends
there; you should also escape the dot, although in this case it
shouldn't matter much) and are followed by some text that doesn't contain
a double quote. The parentheses say you're interested in that part; you'll
be able to get what it matched from the $1 variable. Note that this part
will match as much as possible, so you'll actually get everything up to
the closing quote.
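To see what the capture group returns, here is a small sketch run against the sample line from the question. The method name `extract_hosts` is made up for this illustration and is not part of any library; the pattern is the one given above.

```ruby
# Hypothetical helper (not from the thread): returns whatever follows
# "http://www." up to the next double quote, for every match in the text.
def extract_hosts(text)
  # With one capture group, scan returns an array of one-element arrays,
  # so flatten turns [["a"], ["b"]] into ["a", "b"].
  text.scan(/http:\/\/www\.([^"]+)/).flatten
end

sample = '<a href="http://www.3proxy.com">3 Proxy</a> || ' \
         '<a href="http://www.3proxy.net">3 Proxy</a>'

p extract_hosts(sample)  # prints ["3proxy.com", "3proxy.net"]
```

Without a block, scan hands back the captures directly, so $1 is not needed for a quick check like this.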

Then a possible way to do what you want would be

proxies = []  # array where the proxies will be collected
# File.foreach yields one line at a time and closes the file when done
File.foreach('your_file_with_the_list_youre_reading') do |line|
  line.scan(/http:\/\/www\.([^"]+)/) do  # scan the line for the pattern
    proxies << $1  # add the content of $1 to your list
  end
end
p proxies

This should work...

Have a good time with Ruby !

Vince
 
William James

Aaron said:
what i have right now is: file = File.open("list.txt","w")

If the file already exists, you'll destroy it by using the "w" option.
Since some of the anchor tags span more than one line,
let's read the whole file at once:

p IO.read( 'list.txt' ).
  scan( %r{<a \s+ href="http://www\.([^"]*)"}x ).flatten
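To illustrate why reading the whole file matters here, the sketch below runs the same pattern on a string where an anchor tag is broken across two lines, as in the sample from the question. The method name `hosts_from_anchors` is hypothetical. With the x flag, literal whitespace in the pattern is ignored, so the \s+ is what matches the newline between `<a` and `href`.

```ruby
# Hypothetical helper: the pattern from above, applied to a whole string.
def hosts_from_anchors(text)
  text.scan(%r{<a \s+ href="http://www\.([^"]*)"}x).flatten
end

text = "<a href=\"http://www.3proxy.com\">3 Proxy</a> || <a\n" \
       "href=\"http://www.3proxy.net\">3 Proxy</a>"

# Scanning the whole string finds both links.
p hosts_from_anchors(text)

# Scanning line by line finds only the first link, because "<a" and
# "href" of the second tag end up on different lines.
p text.lines.flat_map { |line| hosts_from_anchors(line) }
```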
 
Aaron Reimann

Thank you guys. I have not tried all that has been suggested, but I
got this code emailed to me:

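###
require 'rubygems'
require 'mechanize'

url = "http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school"
# current mechanize releases name the class Mechanize; older ones used WWW::Mechanize
agent = Mechanize.new
page = agent.get(url)

# print every capture: whatever follows "http://www." up to the next quote
page.body.scan(/http:\/\/www\.([^"]+)/) do
  p $1
end
###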

I had to install the 'mechanize' gem, but it works...overall. I have
to figure out how to "write" the output into a text file. but this is
pretty cool.

I will be trying Vincent's version too.

thanks!
aaron
 
Cliff Cyphers

Aaron said:
I had to install the 'mechanize' gem, but it works...overall. I have
to figure out how to "write" the output into a text file. but this is
pretty cool.

Update filename and you are set.

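require 'rubygems'
require 'mechanize'

url = "http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school"
filename = "/tmp/tmp2.txt"
# current mechanize releases name the class Mechanize; older ones used WWW::Mechanize
agent = Mechanize.new
page = agent.get(url)

# the block form of File.open closes the file even if the scan raises
File.open(filename, "w") do |session_fd|
  page.body.scan(/http:\/\/www\.([^"]+)/) do
    session_fd.puts $1
  end
end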
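
The list in the original question has the dots escaped (3proxy\.com), which the scripts in this thread do not do. A minimal sketch of that last step, assuming the filtering box wants a backslash in front of each dot and nothing else; the method name `escape_dots` and the sample hosts are made up for illustration.

```ruby
# Hypothetical helper: put a backslash in front of every dot.
# A String pattern is matched literally, and the block form of gsub
# means the replacement text is taken as-is.
def escape_dots(host)
  host.gsub('.') { '\.' }
end

hosts = ["3proxy.com", "3proxy.net", "4proxy.com"]
hosts.each { |h| puts escape_dots(h) }
```

In the script above, that would mean passing $1 through escape_dots before writing it. Regexp.escape would also do the job, but it escapes other characters too, hyphens for example.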
 
Daniel Harple

Aaron said:
I had to install the 'mechanize' gem, but it works...overall. I have
to figure out how to "write" the output into a text file. but this is
pretty cool.

Mechanize has a method to get all the links for a Page:

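require "rubygems"
require "mechanize"

url = "http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school"
# current mechanize releases name the class Mechanize; older ones used WWW::Mechanize
# a.uri can raise on a malformed href, so map those to nil and drop them
links = Mechanize.new.get(url).links.map { |a| a.uri rescue nil }.compact
File.open('links.txt', 'w') { |f| f.puts(links) }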

Note that this saves the relative links as well, however.
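
One way to drop the relative links and keep just the host names is Ruby's standard URI library. This is only a sketch on plain strings rather than on Mechanize link objects; the method name `proxy_hosts` and the sample links are assumptions for illustration.

```ruby
require 'uri'

# Hypothetical helper: keep absolute http(s) links only and return their
# host names with a leading "www." removed.
def proxy_hosts(links)
  links.map do |link|
    uri = begin
      URI.parse(link)
    rescue URI::InvalidURIError
      nil  # skip anything that is not a parseable URI
    end
    # Relative links parse as URI::Generic, so they are skipped here.
    next unless uri.is_a?(URI::HTTP) && uri.host
    uri.host.sub(/\Awww\./, '')
  end.compact
end

links = ["http://www.3proxy.com", "/about", "http://www.3proxy.net/"]
p proxy_hosts(links)  # prints ["3proxy.com", "3proxy.net"]
```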

-- Daniel
 
