strip text up to a keyword?

kw

Is there a relatively simple way to strip/delete all text from a file,
up to a keyword? It has to work around linefeeds. I have the webpage
stored in a text variable. Basically I want to remove a bunch of
useless text, ads, scripts, etc from webpage html, and have the
important text content as the result...

$webpage=get_webpage("http://news.yahoo.com/news?tmpl=index&cid=716"); #done
$webpage=strip_html($webpage); #done
$webpage=strip_text_up_to("Top Stories",$webpage);
$webpage=strip_text_after("Top Stories Section",$webpage);


-note, my perl skills are pretty limited so a more simplistic approach
would help me
-note, I've searched and can't find anything to do this, if it exists
already please point me to it


Thanks!
 
Michael Budash

Is there a relatively simple way to strip/delete all text from a file,
up to a keyword? It has to work around linefeeds. I have the webpage
stored in a text variable. Basically I want to remove a bunch of
useless text, ads, scripts, etc from webpage html, and have the
important text content as the result...

$webpage=get_webpage("http://news.yahoo.com/news?tmpl=index&cid=716"); #done
$webpage=strip_html($webpage); #done
$webpage=strip_text_up_to("Top Stories",$webpage);

sub strip_text_up_to {
    my ($string, $webpage) = @_;
    # /s lets . match newlines; greedy .* removes through the LAST $string
    $webpage =~ s/^.*$string//s;
    return $webpage;
}
$webpage=strip_text_after("Top Stories Section",$webpage);

sub strip_text_after {
    my ($string, $webpage) = @_;
    # removes from the FIRST $string (inclusive) to the end of the text
    $webpage =~ s/$string.*$//s;
    return $webpage;
}

in order to understand this, type at the command line:

perldoc perlre

there you'll see that, in both cases, .* matches as much as possible
("greedy"), while .*? will match as little as possible ("non-greedy").
make your choice for your particular needs.
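a small sketch of the greedy vs. non-greedy difference, for anyone who wants to see it run. the sample text is made up for illustration; only the two regexes come from the discussion.

```perl
use strict;
use warnings;

# Hypothetical sample text: the keyword shows up on two different lines.
my $text = "ads\nTop Stories\nstory one\nTop Stories Section\nfooter\n";

# Greedy .* : removes everything through the LAST "Top Stories".
(my $greedy = $text) =~ s/^.*Top Stories//s;

# Non-greedy .*? : removes everything through the FIRST "Top Stories".
(my $lazy = $text) =~ s/^.*?Top Stories//s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

the greedy version leaves only what follows 'Top Stories Section', while the non-greedy one keeps the stories in between.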

note also that, in your specific case, there is no way for the script to
know _which_ occurrence of 'Top Stories' it's supposed to stop at, and
there are several, including the one in 'Top Stories Section'! so my
example may not help you much. though if you reverse the two subroutine
calls, the result may be closer to what you want...

you might wanna read up on 'lookahead' and 'negative lookahead' in the
docs ref above. could help you.
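one possible way lookahead could help here, as a sketch only. the page text and the variable names ($page, $start, $end, $body) are invented for the example; \Q...\E just keeps the markers from being read as regex syntax.

```perl
use strict;
use warnings;

# Hypothetical page text with the start marker also embedded in the end marker.
my $orig  = "menu\nTop Stories\nstory one\nstory two\nTop Stories Section\nfooter\n";
my $start = "Top Stories";
my $end   = "Top Stories Section";

# Lookahead: drop everything before the first start marker, but keep the marker.
(my $page = $orig) =~ s/^.*?(?=\Q$start\E)//s;
# Then drop everything from the end marker onward.
$page =~ s/\Q$end\E.*$//s;

# Negative lookahead: strip through the last "Top Stories" that is NOT
# immediately followed by " Section".
(my $body = $orig) =~ s/^.*\Q$start\E(?! Section)//s;

print "page: [$page]\n";
print "body: [$body]\n";
```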
-note, my perl skills are pretty limited so a more simplistic approach
would help me
-note, I've searched and can't find anything to do this, if it exists
already please point me to it

welcome to perl regular expressions!
 
Tad McClellan

kw said:
Is there a relatively simple way to strip/delete all text from a file,
up to a keyword?


No. There is a relatively simple way to strip/delete all text from a
string, up to a keyword though.

You did not say what to do when the keyword occurs more than once...


$string =~ s/.*keyword//s; # up to last keyword

$string =~ s/.*?keyword//s; # up to first keyword



How is get_webpage() different from LWP::Simple::get() ?

$webpage=strip_html($webpage); #done


I'd be willing to bet a dollar that strip_html() has bugs in it.

If you show it to us, we can show you some of its bugs.

In the meantime, you can try it with data like that shown for
the corresponding Frequently Asked Question:

How do I remove HTML from a string?

$webpage=strip_text_up_to("Top Stories",$webpage);


You can write strip_text_up_to() so that it will modify its
argument, then you wouldn't need to assign it back to itself:

strip_text_up_to("Top Stories",$webpage);

sub strip_text_up_to { # untested
my $keyword = quotemeta shift;
$_[0] =~ s/.*$keyword//s; # up to last keyword
}
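
A short sketch of why both details in that sub matter, assuming made-up sample data: $_[0] is an alias for the caller's variable, so the caller's string changes without an assignment, and quotemeta keeps characters such as parentheses in the keyword from being treated as regex syntax. The variables $webpage, $unquoted and $kw are only for illustration.

```perl
use strict;
use warnings;

sub strip_text_up_to {
    my $keyword = quotemeta shift;   # after the shift, $_[0] is the text argument
    $_[0] =~ s/.*$keyword//s;        # modifies the caller's variable through the alias
    return;
}

# Hypothetical data: the keyword contains regex metacharacters.
my $webpage = "junk\nPrice (USD)\nreal content\n";
strip_text_up_to("Price (USD)", $webpage);
print "quoted:   [$webpage]\n";

# Without quotemeta the parentheses form a group, so the pattern
# looks for "Price USD" and nothing is removed.
my $unquoted = "junk\nPrice (USD)\nreal content\n";
my $kw = "Price (USD)";
$unquoted =~ s/.*$kw//s;
print "unquoted: [$unquoted]\n";
```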

-note, my perl skills are pretty limited so a more simplistic approach
would help me


Processing HTML is not simple, that is the nature of that beast.

Consider doing simple tasks before moving on to complex tasks.
 
Anno Siegel

Tad McClellan said:

Well, there is, but the OP doesn't need it (because the text is in memory
anyway). This prints blocks from "BEGIN" to "END" from a file:

while ( <DATA> ) {
print if s/.*(?=BEGIN)// .. s/(?<=END).*//;
}
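
For experimenting with that loop without a real file, here is a sketch that reads the same way from an in-memory filehandle and collects the output in a string. The sample data and the names $data, $in and $out are assumptions for the demonstration; the condition is the one from the loop above.

```perl
use strict;
use warnings;

# Hypothetical input; the block of interest runs from BEGIN to END.
my $data = "preamble\nnoise BEGIN first\nmiddle\nlast END trailing\nafter\n";
my $out  = '';

open my $in, '<', \$data or die "cannot open in-memory handle: $!";
while ( <$in> ) {
    # The scalar-context range operator (flip-flop) turns true on the line
    # where the left s/// succeeds and false again after the line where the
    # right s/// succeeds. The substitutions also trim text before BEGIN
    # and after END on those two lines.
    $out .= $_ if s/.*(?=BEGIN)// .. s/(?<=END).*//;
}
close $in;

print $out;
```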

Anno
 
Tad McClellan

Anno Siegel said:
Well, there is, but the OP doesn't need it (because the text is in memory
anyway). This prints blocks from "BEGIN" to "END" from a file:

while ( <DATA> ) {
print if s/.*(?=BEGIN)// .. s/(?<=END).*//;
}


But that does not modify the file, as stated in the spec (which
was my point: the spec was imprecise).

Add a line and change a line of your code, _then_ I'll be wrong. :)


$^I = ''; # no backup!
while ( <> ) {
 
kw

$webpage =~ s/^.*$string//s;
$_[0] =~ s/.*$keyword//s;

Do these work through multiple linefeeds (if keyword is on the 100th
line of the file/variable)? Or do the previous lines do something to
help out? I think I have tried similar RE's that only work on one
line.
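
For anyone with the same question, a quick sketch of what the trailing /s changes, using made-up text. With /s the dot also matches newlines, so .* can run across lines; without it the match stays within a single line.

```perl
use strict;
use warnings;

# Hypothetical text with the keyword on the third line.
my $text = "line one\nline two\nsome keyword and the rest\n";

# With /s: . matches newlines, so everything through "keyword" is removed.
(my $with_s = $text) =~ s/.*keyword//s;

# Without /s: . stops at newlines, so only the start of the keyword's own
# line is removed and the earlier lines stay.
(my $without_s = $text) =~ s/.*keyword//;

print "with /s:    [$with_s]\n";
print "without /s: [$without_s]\n";
```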

How is get_webpage() different from LWP::Simple::get() ?

It's probably not, I was just trying to show the high level functions
I wanted to accomplish.

I'd be willing to bet a dollar that strip_html() has bugs in it.

If you show it to us, we can show you some of its bugs.

Probably true, all code has bugs :) FWIW, I didn't write this
myself, I found one of the suggestions in the FAQ or somewhere online.

You can write strip_text_up_to() so that it will modify its
argument, then you wouldn't need to assign it back to itself:

These are the things I try to avoid in perl, I like having the code a
little more obvious. I don't write perl every day, so when I come back
to this script in 6 months, I don't want to spend too much time
decrypting things like this. If it is functionally the same but
simpler (or looks more like C), that is all I care about for now :)


I'll try to pick "keywords" that are more unique.


Thanks for everyone's help! I'll try these ideas out this weekend.
 
Anno Siegel

Tad McClellan said:
But that does not modify the file, as stated in the spec (which
was my point: the spec was imprecise).

Aha. I didn't note your point, although your "No." was notably pointed.

Add a line and change a line of your code, _then_ I'll be wrong. :)


$^I = ''; # no backup!
while ( <> ) {

Okay, so the problem the "No." points out is the naive mind-set that takes
Perl's image of a file (an array of lines) for the reality of the file.
The usual editors reinforce the notion of a file as a sequence of lines
that can be changed at will, which is what makes it so common.

In reality, changing things in the middle of a file is possible, but if
the length of the file changes, this requires a complete rewrite of at
least everything that follows the change. This, again, means that there
is a moment when part of the file is only in memory, not on disk. That
part could get lost in a crash, so the operation isn't safe.

It is easier and safer (those rarely go together) to write a modified
copy of the original file, and rename that to the original after the
deed. This is what editors do to give you the illusion, and it is what
Perl does with your code above.
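
Done by hand, the copy-and-rename approach might look roughly like the sketch below. The helper name strip_file_up_to is hypothetical, and for brevity it slurps the whole file into memory rather than working line by line; File::Temp and File::Basename are core modules.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: rewrite $path so that only the text after the last
# occurrence of $keyword remains.
sub strip_file_up_to {
    my ($path, $keyword) = @_;

    # Read the whole original file.
    open my $in, '<', $path or die "cannot read $path: $!";
    my $text = do { local $/; <$in> };
    close $in;

    $text =~ s/.*\Q$keyword\E//s;

    # Write the modified copy next to the original (same directory, so the
    # rename stays on one filesystem), then rename it over the original.
    my ($out, $tmp) = tempfile(DIR => dirname($path), UNLINK => 0);
    print {$out} $text;
    close $out or die "cannot write $tmp: $!";
    rename $tmp, $path or die "cannot rename $tmp to $path: $!";
    return;
}

# Demonstration on a throwaway file.
my ($fh, $demo) = tempfile();
print {$fh} "header\nkeyword\nbody\n";
close $fh or die "cannot write $demo: $!";

strip_file_up_to($demo, "keyword");

open my $check, '<', $demo or die "cannot read $demo: $!";
my $result = do { local $/; <$check> };
close $check;
unlink $demo;

print "[$result]\n";
```

The original is only replaced once the copy has been written completely, which is the safety property described above.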

Anno

PS: Of course, Tad knows everything I wrote about files. It's this Usenet
way of ostensibly talking to the one you're replying to, but really writing
for a wider audience, to whom it may concern.
 
