Parsing HTML

Martin Pfeffer

Hi,
my problem is that I need a file of German words, so I am trying to create
one by parsing HTML sites and writing the extracted words to a database. My
question is: what is the easiest way to extract text from HTML pages?
Thanks,
Martin
 
Stefan Schmiedl

Martin said:
my problem is that I need a file of German words, so I am trying to create
one by parsing HTML sites and writing the extracted words to a database. My
question is: what is the easiest way to extract text from HTML pages?

There's a /usr/share/dict/ngerman on my Debian box:
wc ngerman
308860 308860 3998536 ngerman

That is about 13 bytes per line including the newline, which tells me that
the average word length is about 12 (!) letters, assuming one byte per letter.
Unimaginable!
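
The arithmetic behind that estimate can be sketched in Ruby. This is only an illustration: `average_from_wc` is a made-up helper name, and it assumes one word per line, one byte per letter, and one newline byte per line, which is why 1 is subtracted from the bytes-per-line ratio.

```ruby
# Hypothetical helper: estimate the average word length from `wc` output.
# Assumes one word per line and a single-byte encoding, so each line
# contributes its letters plus exactly one newline byte.
def average_from_wc(lines, bytes)
  bytes.to_f / lines - 1
end

# Figures taken from the `wc ngerman` output quoted above.
average = average_from_wc(308_860, 3_998_536)
puts format('%.2f letters per word', average)
```

If the word list were stored as UTF-8, umlauts and ß would take two bytes each, so the true average would be slightly lower than this estimate.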

s.
 
Brian Schröder

If you don't mind senseless words like "img" that come from html markup:

--8<---
require 'open-uri'

# URI.open replaces the bare open(), which no longer fetches URLs as of Ruby 3.0.
URI.open('http://ruby.brian-schroeder.de').read.scan(/[-\wöäüß]+/i)
--8<---
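
If the markup words are a problem, one rough option is to strip the tags before scanning. The sketch below is an assumption-laden illustration, not a real HTML parser: `extract_words` is a made-up helper, the sample page is invented, regex-based tag stripping breaks on unusual markup, and character entities such as `&uuml;` are not decoded. It also swaps the explicit umlaut list for `\p{L}`, which matches any Unicode letter.

```ruby
# Hypothetical helper: drop markup, then collect the distinct words.
def extract_words(html)
  text = html
         .gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # remove script/style bodies
         .gsub(/<[^>]*>/, ' ')                       # remove remaining tags
  text.scan(/[\p{L}-]+/).uniq.sort
end

# Invented sample page; note that "img", "src" and "alt" do not survive.
html = '<html><body><p>Ein <img src="bild.png" alt="Bild"> schöner Tag</p></body></html>'
words = extract_words(html)
p words
```

The result is sorted by byte order, so capitalised words come before lowercase ones.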

If you have valid xhtml:

--8<---
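require 'rexml/document'
require 'open-uri'

include REXML

# URI.open replaces the bare open(), which no longer fetches URLs as of Ruby 3.0.
doc = Document.new(URI.open('http://ruby.brian-schroeder.de'))

# Visit every element ('//*'), take its direct text children, join them
# (join also flattens the nested arrays), then scan for words.
doc.elements.to_a('//*').
  map { |e| e.texts.map { |t| t.value } }.
  join(' ').
  scan(/[-\wöäüß]+/i).
  sort.
  uniq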
--8<---

hth,

Brian

PS: I'm sure the text extraction with REXML can be done in a nicer/more efficient way.
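
One possibly nicer variant, offered only as a sketch: ask XPath for the text nodes directly with `//text()` instead of walking every element. The inline XHTML string is invented so the example runs without network access, and `\p{L}` (any Unicode letter) is swapped in for the explicit umlaut list. It assumes the `rexml` gem is available, which is bundled with current Ruby versions.

```ruby
require 'rexml/document'

# Invented sample document; a real page would have to be well-formed XHTML.
xhtml = <<~XML
  <html>
    <body>
      <h1>Hallo Welt</h1>
      <p>Schöne <b>Grüße</b> aus Köln</p>
    </body>
  </html>
XML

doc = REXML::Document.new(xhtml)

# '//text()' selects every text node, so no per-element loop is needed.
words = REXML::XPath.match(doc, '//text()')
                    .map(&:value)
                    .join(' ')
                    .scan(/[\p{L}-]+/)
                    .uniq
                    .sort

p words
```

As with the original, this only works when the page parses as XML; ordinary tag-soup HTML will raise a parse error.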
 
Ben Giddings

Martin said:
my problem is that I need a file of German words, so I am trying to create
one by parsing HTML sites and writing the extracted words to a database. My
question is: what is the easiest way to extract text from HTML pages?

My "htmltokenizer" module (available on RAA and Rubyforge) is pretty
good at extracting text from HTML pages.

Ben
 
Alexander Kellett

Ben said:
My "htmltokenizer" module (available on RAA and Rubyforge) is pretty
good at extracting text from HTML pages.

aye. it rocks. thanks for that :)

Alex
 
