Separate body content from HTML

  • Thread starter Matthew Margolis

Matthew Margolis

I am currently working on a script that will parse lyrics on online
lyric pages. To get at the actual lyrics I need to take the HTML source
and somehow separate out all the <BODY> content. I would then put all
of the body content into a string and parse it to remove all image tags,
style tags and non-visible characters, leaving me with just text.

I am new to Ruby and regular expressions so I am having some trouble
getting the data between the <BODY> and </BODY> tags into a string.
Right now I have the entire page loaded into a string called data and I
run #scan with a regular expression and a block that prints out the
matches from #scan.
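
One way the extraction described above could look, as a minimal sketch rather than a definitive answer: it assumes the page is already in a string called data, as in the post, and that the document has a single body element. The helper name extract_body and the sample page are made up for illustration.

```ruby
# Hypothetical helper (not from the thread): returns the text between
# the opening <body ...> tag and </body>, or nil if no body is found.
def extract_body(html)
  # /m lets "." match newlines, /i ignores the capitalization of the tag,
  # and [^>]* skips any attributes on the opening tag.
  html[/<body[^>]*>(.*?)<\/body>/mi, 1]
end

# Made-up sample page standing in for the real downloaded source.
data = "<HTML><HEAD><TITLE>Song</TITLE></HEAD>\n" \
       "<BODY bgcolor=\"white\">\nFirst line<BR>\nSecond line\n</BODY></HTML>"

body = extract_body(data)
puts body unless body.nil?
```

String#[] with a capture index is used here instead of #scan because only the first match is wanted. A regular expression like this is fragile on malformed HTML, so a tokenizer or parser may be the safer choice for real pages.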

I guess I am just asking for a good regular expression (or other means)
of separating out the body content of an HTML document from the rest of
the source.

Thanks,
Matthew Margolis
 

Joao Pedrosa

Hi,
Take a look at the HTMLTokenizer module at RAA.
http://raa.ruby-lang.org/project/htmltokenizer/

Cheers,
Joao

 

Zachary P. Landau

Matthew,

I wrote some code that does exactly the same thing, and I did it with
regular expressions. It works, but it can get a little messy. You
might have better luck with an HTML tokenizer, as someone else said.
Usually the hardest part is finding out all the variations in the HTML
that gets returned. With a lot of dynamic sites you have to fetch many
different pages before you can see what the HTML will look like.

While writing lyrics plugins, one very difficult thing I ran into was
pages returning different content depending on my User-Agent string. For
example, sometimes the capitalization of the tags would differ between
browsers. Once, the content was completely different.
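
To illustrate the capitalization problem, here is a rough sketch of a cleanup step that matches tags case-insensitively, so <STYLE> and <style> are treated alike. This is not the musicextras code; the helper name strip_markup and the choice to turn <br> into a newline are assumptions made for the example.

```ruby
# Hypothetical helper (not from musicextras): reduces an HTML fragment
# to plain text, ignoring the capitalization of tag names.
def strip_markup(fragment)
  # Drop style elements together with their contents.
  text = fragment.gsub(/<style\b[^>]*>.*?<\/style>/mi, "")
  # Assumption: line breaks in lyrics are marked with <br>, so keep them.
  text = text.gsub(/<br\s*\/?>/i, "\n")
  # Remove every remaining tag, including image tags.
  text = text.gsub(/<[^>]+>/, "")
  # Remove non-printable characters, but keep newlines.
  text = text.gsub(/[^[:print:]\n]/, "")
  text.strip
end

# Made-up fragment with mixed-case tags and a stray control character.
sample = "<STYLE type=\"text/css\">p { color: red }</STYLE>" \
         "Hello<BR><img src=\"a.gif\">World\x07"
puts strip_markup(sample)
```

The /i flag only covers differences in case. If a site sends entirely different markup to different user agents, the patterns themselves still have to be adjusted for each variation.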

If you want to use some of my code to help your project along, you can
find it at http://kapheine.hypa.net/musicextras under the API docs (or
download it).

--
Zachary P. Landau

 

Matthew Margolis

Thank you Zachary. I am checking out the API docs right now.

-Matthew Margolis
 
