Separate body content from HTML

Matthew Margolis · Jul 5, 2004

I am currently working on a script that will parse lyrics on online
lyric pages. To get at the actual lyrics I need to take the HTML source
and somehow separate out all the <BODY> content. I would then put all
of the body content into a string and parse it to remove all image tags,
style tags, and non-visible characters, leaving me with just text.

I am new to Ruby and regular expressions so I am having some trouble
getting the data between the <BODY> and </BODY> tags into a string.
Right now I have the entire page loaded into a string called data and I
run #scan with a regular expression and a block that prints out the
matches from #scan.

I guess I am just asking for a good regular expression (or other means)
of separating out the body content of an HTML document from the rest of
the source.

Thanks,
Matthew Margolis

Joao Pedrosa · Jul 5, 2004

Hi,
Take a look at the HTMLTokenizer module on the RAA (Ruby Application Archive):
http://raa.ruby-lang.org/project/htmltokenizer/

Cheers,
Joao

Matthew Margolis · Jul 5, 2004

Excellent. Thank you very much.

-Matthew Margolis

Zachary P. Landau · Jul 6, 2004

Matthew,

I wrote some code that does exactly the same thing, and I did it with
regular expressions. It works, but it can get a little messy. You
might have better luck with an HTML tokenizer, as someone else said.
Usually the hardest part is finding all the variations in the HTML that
comes back. With a lot of dynamic sites you have to fetch all kinds of
information just to see what the HTML will look like.
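
As a rough illustration of the regular-expression route, here is a minimal sketch. It is not the musicextras code; the helper names extract_body and strip_markup and the sample page are made up for this example. It assumes the page is already in a string called data, as in the original question, and it will not cope with every malformed page.

```ruby
# Hypothetical helpers for illustration only.

# Return the text between <body ...> and </body>, or nil if there is no body.
# /i ignores tag case (<BODY> vs <body>); /m lets . match newlines.
def extract_body(html)
  match = html.match(/<body\b[^>]*>(.*?)<\/body\s*>/im)
  match && match[1]
end

# Strip markup from a body fragment, leaving plain text.
def strip_markup(fragment)
  # Drop script and style blocks together with their contents.
  text = fragment.gsub(/<(script|style)\b[^>]*>.*?<\/\1\s*>/im, '')
  text = text.gsub(/<br\s*\/?>/i, "\n")   # keep line breaks, which matter for lyrics
  text = text.gsub(/<[^>]+>/, '')          # drop remaining tags, including <img>
  text = text.gsub(/[^[:print:]\n]/, '')   # drop non-printable characters
  text.gsub(/[ \t]+/, ' ').strip           # collapse runs of spaces and tabs
end

# Made-up sample page standing in for a downloaded lyrics page.
data = <<~HTML
  <HTML><HEAD><TITLE>Song</TITLE></HEAD>
  <BODY bgcolor="white">
  <style>p { color: red; }</style>
  <img src="cover.jpg">First line<BR>Second line
  </BODY></HTML>
HTML

body = extract_body(data)
lyrics = strip_markup(body)
puts lyrics
```

Using match instead of scan returns only the first body element, which is normally all a page has. The non-greedy (.*?) stops at the first closing tag rather than the last one.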

While writing lyrics plugins, one very difficult thing I ran into was
pages returning different content depending on my User-Agent string. For
example, sometimes the capitalization of the tags would differ between
browsers. In one case the content was completely different.
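
One way to see what a site sends to different clients is to set the User-Agent header yourself and compare the results. The sketch below uses Ruby's standard net/http library. The helper names, the URL, and the agent string are placeholders invented for this example, not anything from a real lyrics site.

```ruby
require 'net/http'
require 'uri'

# Build a GET request that sends a chosen User-Agent header.
# Hypothetical helper; returns the parsed URI and the request object.
def build_request(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent
  [uri, request]
end

# Fetch the page body with the given User-Agent.
# This performs a real network call, so it is not invoked below.
def fetch_with_agent(url, user_agent)
  uri, request = build_request(url, user_agent)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).body
  end
end

# Placeholder URL and agent string, for illustration only.
uri, request = build_request('http://example.com/lyrics?id=1',
                             'Mozilla/5.0 (compatible; LyricsFetcher)')
puts request['User-Agent']
```

Calling fetch_with_agent twice with different agent strings and comparing the two results would show whether a site varies its markup by browser.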

If you want to use some of my code to help your project along, you can
find it at http://kapheine.hypa.net/musicextras under the API docs (or
download it).

--
Zachary P. Landau

Matthew Margolis · Jul 7, 2004

Thank you, Zachary. I am checking out the API docs right now.

-Matthew Margolis
