Parsing out text from in between HTML tags


tgwaltz

Hello -

I'm new to Perl and am having a tough time trying to complete a
theoretically simple statement. What I'm trying to do is write a very
simple search engine that searches an HTML file for a given
searchQuery. The way it's set up now is that if the searchQuery is
something like "java," every single page is a hit because the word
"javascript" is in the code in the form of the "<script
language="javascript">" etc. I want to specify that $searchQuery
should be surrounded like so:

">(anything)searchQuery(anything)<"

In other words, the searchQuery has to be in between two HTML tags.
Here's what I have at this point (the wrong way):

return unless ($fileName =~ /\Q$searchQuery\E/i);

Any help would be greatly appreciated!

Thanks,
TW
 

Tad J McClellan

> I'm new to Perl and am having a tough time trying to complete a
> theoretically simple statement.


What you want to do (parse a context-free language) is not
as simple as it seems. It is, in fact, pretty darn complex.

> What I'm trying to do is write a very
> simple search engine that searches an HTML file for a given
> searchQuery. The way it's set up now is that if the searchQuery is
> something like "java," every single page is a hit because the word
> "javascript" is in the code in the form of the "<script
> language="javascript">" etc.


Should it match the below, or should it not match the below?

<p>You can use <strong>javascript</strong> for client-side programming</p>

If it should not match, then you probably want word-boundaries (\b) in
your pattern.
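
A minimal sketch of the difference, assuming the text being searched is already in a scalar. The helper names matches_anywhere and matches_word are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the query appears anywhere, even inside a longer word.
sub matches_anywhere {
    my ($text, $searchQuery) = @_;
    return $text =~ /\Q$searchQuery\E/i ? 1 : 0;
}

# Hypothetical helper: true only if the query stands as a whole word.
# \b matches between a word character (\w) and a non-word character.
sub matches_word {
    my ($text, $searchQuery) = @_;
    return $text =~ /\b\Q$searchQuery\E\b/i ? 1 : 0;
}

my $line = 'You can use javascript for client-side programming';

for my $query ('java', 'javascript') {
    printf "%-10s anywhere=%d whole-word=%d\n",
        $query, matches_anywhere($line, $query), matches_word($line, $query);
}
```

Note that \b only behaves this way when the query starts and ends with a word character. A query such as "C++" ends in a non-word character, so the trailing \b would need different handling.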

> I want to specify that $searchQuery
> should be surrounded like so:
>
> ">(anything)searchQuery(anything)<"


If $searchQuery = 'HTML tags' then should it match or not match the below?

<p><acronym title="HyperText Markup Language">HTML</acronym>
tags have angle-brackets</p>

If it should match, then "anything" above does not really mean anything...

"HTML tags", "HTML&nbsp;tags" and "HTML\ntags" should probably all match...

> In other words, the searchQuery has to be in between two HTML tags.
                                               ^^^^^^^^^^^^^^^^^^^^^

That too is over-simplified.

> Here's what I have at this point (the wrong way):
>
> return unless ($fileName =~ /\Q$searchQuery\E/i);
                      ^^^^

Do you want to search the name or search the content?

If you want to search the content, then you have chosen an extremely
poor name for your variable...

Once you have culled the data to only its content (i.e. removed all markup)
and normalized it (e.g. folded whitespace), then you probably want something like:

... $file_content =~ /\b\Q$searchQuery\E\b/i ...
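
Here is a rough sketch of the normalize-then-match step, assuming the markup has already been removed and $file_content holds only text. The helpers normalize_text and content_matches are invented names, and only the literal &nbsp; entity is handled; a real entity decoder would cover the rest:

```perl
use strict;
use warnings;

# Hypothetical helper: fold "&nbsp;" and runs of whitespace into single spaces.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/&nbsp;/ /g;    # simplification: the only entity handled here
    $text =~ s/\s+/ /g;       # fold newlines, tabs and repeated spaces
    $text =~ s/^ | $//g;      # trim one leading or trailing space
    return $text;
}

# Hypothetical helper: whole-word, case-insensitive match on normalized text.
sub content_matches {
    my ($file_content, $searchQuery) = @_;
    my $content = normalize_text($file_content);
    my $query   = normalize_text($searchQuery);
    return $content =~ /\b\Q$query\E\b/i ? 1 : 0;
}

for my $sample ("HTML tags", "HTML&nbsp;tags", "HTML\ntags") {
    (my $shown = $sample) =~ s/\n/\\n/g;    # make the newline visible when printing
    print "$shown => ", content_matches($sample, 'HTML tags'), "\n";
}
```

Normalizing the query the same way as the content keeps a query typed with two spaces from silently failing.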

> Any help would be greatly appreciated!


Use a module that understands HTML for processing HTML data.

perldoc -q "remove HTML"

suggests a couple of modules that can help you (and there are many others as well).
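
As one possible illustration, here is a sketch using HTML::Parser from CPAN (not a core module, so it has to be installed). It collects only text events and skips anything inside <script> or <style>. The helper name html_to_text is made up, and the sketch assumes $html is an already-decoded character string rather than raw UTF-8 bytes:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $skip = 0;    # greater than 0 while inside <script> or <style>

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' is passed in lower case for both start and end events
        start_h => [ sub { $skip++ if $_[0] =~ /^(?:script|style)$/ }, 'tagname' ],
        end_h   => [ sub { $skip-- if $skip && $_[0] =~ /^(?:script|style)$/ }, 'tagname' ],
        # 'dtext' is the text with entities such as &nbsp; decoded
        text_h  => [ sub { $text .= $_[0] unless $skip }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;

    $text =~ tr/\xA0/ /;     # decoded &nbsp; becomes an ordinary space
    $text =~ s/\s+/ /g;      # fold whitespace
    $text =~ s/^ | $//g;     # trim
    return $text;
}

my $html = <<'END_HTML';
<html><head><script language="javascript">var x = 1;</script></head>
<body><p><acronym title="HyperText Markup Language">HTML</acronym>
tags have angle-brackets</p></body></html>
END_HTML

my $file_content = html_to_text($html);
for my $searchQuery ('java', 'HTML tags') {
    my $result = $file_content =~ /\b\Q$searchQuery\E\b/i ? 'hit' : 'miss';
    print "$searchQuery: $result\n";
}
```

Dropping tags this way joins neighbouring text with nothing in between. That suits inline elements like <acronym>, but text from adjacent block elements (two <p> paragraphs with no whitespace between them, say) would run together, so a real search tool would likely want to insert a space at block boundaries.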
 
