Matthew Margolis
I am currently working on a script that will parse lyrics from online
lyric pages. To get at the actual lyrics I need to take the HTML source
and somehow separate out all the <BODY> content. I would then put all
of the body content into a string and parse it to remove all image tags,
style tags, and non-visible characters, leaving me with just text.
I am new to Ruby and regular expressions, so I am having some trouble
getting the data between the <BODY> and </BODY> tags into a string.
Right now I have the entire page loaded into a string called data and I
run #scan with a regular expression and a block that prints out the
matches from #scan.
I guess I am just asking for a good regular expression (or other means)
of separating out the body content of an HTML document from the rest of
the source.
Thanks,
Matthew Margolis
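
For reference, here is a minimal sketch of one way the steps described above could look in Ruby, using only regular expressions. The helper names extract_body and strip_markup and the sample page in data are hypothetical, made up for illustration. It assumes the page has a single <BODY> element and reasonably well-formed tags; regular expressions are fragile on messy real-world HTML, so a proper HTML parser may be the safer choice if one is available.

```ruby
# Hypothetical helper: returns everything between <body ...> and </body>,
# or nil if no body element is found.
# /m lets "." match newlines; /i matches <BODY>, <body>, <Body>, and so on.
def extract_body(html)
  match = html.match(/<body\b[^>]*>(.*?)<\/body>/mi)
  match && match[1]
end

# Hypothetical helper: reduces an HTML fragment to plain text.
def strip_markup(fragment)
  # Drop <style> blocks together with their contents.
  text = fragment.gsub(/<style\b[^>]*>.*?<\/style>/mi, "")
  # Replace every remaining tag (including <img ...>) with a space.
  text = text.gsub(/<[^>]+>/, " ")
  # Remove non-printable characters, but keep newlines.
  text = text.gsub(/[^[:print:]\n]/, "")
  # Collapse runs of spaces and trim the ends.
  text.squeeze(" ").strip
end

# Sample input standing in for the page source held in the string "data".
data = "<html><head><title>Song</title></head>" \
       "<BODY bgcolor=\"white\"><style>p { color: red }</style>" \
       "<img src=\"logo.gif\">First line<br>second line</BODY></html>"

body = extract_body(data)
puts strip_markup(body) if body
```

Using match with a capture group instead of #scan returns the body directly as one string, and the non-greedy (.*?) stops at the first </BODY> rather than the last one.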