strip text up to a keyword?

kw · Oct 9, 2003

Is there a relatively simple way to strip/delete all text from a file,
up to a keyword? It has to work around linefeeds. I have the webpage
stored in a text variable. Basically I want to remove a bunch of
useless text, ads, scripts, etc from webpage html, and have the
important text content as the result...

$webpage=get_webpage("http://news.yahoo.com/news?tmpl=index&cid=716"); #done
$webpage=strip_html($webpage); #done
$webpage=strip_text_up_to("Top Stories",$webpage);
$webpage=strip_text_after("Top Stories Section",$webpage);

-note, my perl skills are pretty limited so a more simplistic approach
would help me
-note, I've searched and can't find anything to do this, if it exists
already please point me to it

Thanks!

Michael Budash · Oct 9, 2003

Is there a relatively simple way to strip/delete all text from a file,
up to a keyword? It has to work around linefeeds. I have the webpage
stored in a text variable. Basically I want to remove a bunch of
useless text, ads, scripts, etc from webpage html, and have the
important text content as the result...

$webpage=get_webpage("http://news.yahoo.com/news?tmpl=index&cid=716"); #done
$webpage=strip_html($webpage); #done
$webpage=strip_text_up_to("Top Stories",$webpage);

sub strip_text_up_to {
    my ($string, $webpage) = @_;
    $webpage =~ s/^.*$string//s;
    return $webpage;
}

$webpage=strip_text_after("Top Stories Section",$webpage);

sub strip_text_after {
    my ($string, $webpage) = @_;
    $webpage =~ s/$string.*$//s;
    return $webpage;
}

in order to understand this, type at the command line:

perldoc perlre

there you'll see that, in both cases, .* matches as much as possible
("greedy"), while .*? will match as little as possible ("non-greedy").
make your choice for your particular needs.
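a small sketch of the difference, using a made-up string (not the real page) with the keyword on two separate lines:

```perl
use strict;
use warnings;

# Made-up sample text with the keyword on two different lines.
my $text = "ads\nTop Stories\nheadline one\nTop Stories\nheadline two\n";

# Greedy: .* grabs as much as it can, so the match ends at the LAST keyword.
(my $greedy = $text) =~ s/^.*Top Stories//s;

# Non-greedy: .*? grabs as little as it can, so the match ends at the FIRST keyword.
(my $lazy = $text) =~ s/^.*?Top Stories//s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

the greedy version keeps only what follows the second 'Top Stories'; the non-greedy one keeps everything after the first.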

note also that, in your specific case, there is no way for the script to
know _which_ occurrence of 'Top Stories' it's supposed to stop at, and
there are several, including the one in 'Top Stories Section' !! so my
example may not help you much. though if you reverse the two subroutine
calls, the results may be closer to desired...

you might wanna read up on 'lookahead' and 'negative lookahead' in the
docs ref above. could help you.
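here is a rough sketch of both ideas on a made-up one-line stand-in for the page (the real page text will differ, so treat the sample string as an assumption). it shows the original call order going wrong, the reversed order, and a negative lookahead that skips 'Top Stories' when ' Section' follows it:

```perl
use strict;
use warnings;

# Made-up stand-in for the stripped page text.
my $page = "ads Top Stories story one story two Top Stories Section footer";

# Original order: the greedy match runs to the keyword inside
# 'Top Stories Section', so the second substitution finds nothing to cut.
(my $naive = $page) =~ s/^.*Top Stories//s;
$naive =~ s/Top Stories Section.*$//s;

# Reversed order: cut the tail first, then the head.
(my $reordered = $page) =~ s/Top Stories Section.*$//s;
$reordered =~ s/^.*Top Stories//s;

# Negative lookahead: only accept 'Top Stories' when ' Section' does not follow.
(my $lookahead = $page) =~ s/^.*Top Stories(?! Section)//s;
$lookahead =~ s/Top Stories Section.*$//s;

print "naive:     [$naive]\n";
print "reordered: [$reordered]\n";
print "lookahead: [$lookahead]\n";
```

on this sample the last two approaches leave the same text; the first leaves ' Section footer'.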

-note, my perl skills are pretty limited so a more simplistic approach
would help me
-note, I've searched and can't find anything to do this, if it exists
already please point me to it

welcome to perl regular expressions!

Tad McClellan · Oct 10, 2003

kw said:
Is there a relatively simple way to strip/delete all text from a file,
up to a keyword?

No. There is a relatively simple way to strip/delete all text from a
string, up to a keyword though.

You did not say what to do when the keyword occurs more than once...

$string =~ s/.*keyword//s; # up to last keyword

$string =~ s/.*?keyword//s; # up to first keyword

$webpage=get_webpage("http://news.yahoo.com/news?tmpl=index&cid=716");

How is get_webpage() different from LWP::Simple::get() ?

$webpage=strip_html($webpage); #done

I'd be willing to bet a dollar that strip_html() has bugs in it.

If you show it to us, we can show you some of its bugs.

In the meantime, you can try it with data like that shown for
the corresponding Frequently Asked Question:

How do I remove HTML from a string?

$webpage=strip_text_up_to("Top Stories",$webpage);

You can write strip_text_up_to() so that it will modify its
argument, then you wouldn't need to assign it back to itself:

strip_text_up_to("Top Stories",$webpage);

sub strip_text_up_to { # untested
    my $keyword = quotemeta shift;
    $_[0] =~ s/.*$keyword//s; # up to last keyword
}
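
A quick way to see the aliasing at work, as a sketch with a made-up string: the elements of @_ are aliases to the caller's variables, so after the shift, $_[0] is the caller's $webpage itself. The keyword here deliberately contains regex metacharacters to show what quotemeta is for.

```perl
use strict;
use warnings;

sub strip_text_up_to {
    my $keyword = quotemeta shift;   # escape ( ) . so they match literally
    $_[0] =~ s/.*$keyword//s;        # $_[0] aliases the caller's variable
    return;
}

# Made-up sample text; the keyword contains parentheses and a dot.
my $webpage = "menu (beta)\nintro (approx.) total\ncontent\n";

strip_text_up_to('(approx.)', $webpage);   # no assignment needed
print $webpage;
```

After the call, $webpage holds only the text that followed '(approx.)'.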

-note, my perl skills are pretty limited so a more simplistic approach
would help me

Processing HTML is not simple, that is the nature of that beast.

Consider doing simple tasks before moving on to complex tasks.

Anno Siegel · Oct 10, 2003

Tad McClellan said:
No.

Well, there is, but the OP doesn't need it (because the text is in memory
anyway). This prints blocks from "BEGIN" to "END" from a file:

while ( <DATA> ) {
    print if s/.*(?=BEGIN)// .. s/(?<=END).*//;
}
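
To watch the range operator do its work, here is a sketch of the same loop run over a made-up input. An in-memory filehandle stands in for the file, and the result is collected in a string (an assumption made for the demo) instead of being printed line by line.

```perl
use strict;
use warnings;

# Made-up input; the in-memory filehandle stands in for a real file.
my $input = "junk\nmore junk BEGIN first\nmiddle\nlast END trailing\nafter\n";
open my $in, '<', \$input or die "cannot open in-memory file: $!";

my $output = '';
while ( <$in> ) {
    # The left side switches the range on and trims the text before BEGIN;
    # the right side switches it off and trims the text after END.
    $output .= $_ if s/.*(?=BEGIN)// .. s/(?<=END).*//;
}
close $in;

print $output;
```

Lines outside the block are skipped, and the first and last lines of the block are trimmed down to the markers.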

Anno

Tad McClellan · Oct 10, 2003

Anno Siegel said:
Well, there is, but the OP doesn't need it (because the text is in memory
anyway). This prints blocks from "BEGIN" to "END" from a file:

while ( <DATA> ) {
    print if s/.*(?=BEGIN)// .. s/(?<=END).*//;
}

But that does not modify the file, as stated in the spec (which
was my point: the spec was imprecise).

Add a line and change a line of your code, _then_ I'll be wrong.

$^I = ''; # no backup!
while ( <> ) {

kw · Oct 10, 2003

$webpage =~ s/^.*$string//s;

$_[0] =~ s/.*$keyword//s;

Do these work through multiple linefeeds (if keyword is on the 100th
line of the file/variable)? Or do the previous lines do something to
help out? I think I have tried similar RE's that only work on one
line.

How is get_webpage() different from LWP::Simple::get() ?

It's probably not, I was just trying to show the high level functions
I wanted to accomplish.

I'd be willing to bet a dollar that strip_html() has bugs in it.

If you show it to us, we can show you some of its bugs.

Probably true, all code has bugs. FWIW, I didn't write this
myself, I found one of the suggestions in the FAQ or somewhere online.

You can write strip_text_up_to() so that it will modify its
argument, then you wouldn't need to assign it back to itself:

These are the things I try to avoid in perl, I like having the code a
little more obvious. I don't write perl everyday, so when I come back
to this script in 6 months, I don't want to spend too much time
decrypting things like this. If it is functionally the same but
simpler (or looks more like C), that is all I care about for now.

I'll try to pick "keywords" that are more unique.

Thanks for everyone's help! I'll try these ideas out this weekend.

Anno Siegel · Oct 10, 2003

Tad McClellan said:
But that does not modify the file, as stated in the spec (which
was my point: the spec was imprecise).

Aha. I didn't note your point, although your "No." was notably pointed.

Add a line and change a line of your code, _then_ I'll be wrong.

$^I = ''; # no backup!
while ( <> ) {

Okay, so the problem the "No." points out is the naive mind-set that takes
Perl's image of a file (an array of lines) for the reality of the file.
The usual editors reinforce the notion of a file as a sequence of lines
that can be changed at will; that's what makes it so common.

In reality, changing things in the middle of a file is possible, but if
the length of the file changes, this requires a complete rewrite of at
least everything that follows the change. This, again, means that there
is a moment when part of the file is only in memory, not on disk. That
part could get lost in a crash, so the operation isn't safe.

It is easier and safer (those rarely go together) to write a modified
copy of the original file, and rename that to the original after the
deed. This is what editors do to give you the illusion, and it is what
Perl does with your code above.
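
Spelled out by hand, the copy-and-rename approach might look like the sketch below. The helper name strip_file_up_to and the scratch file are made up for illustration, and the whole file is slurped into memory, which suits a web page but not a huge file. Note that the replacement file gets the temp file's permissions, not the original's.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper, not from the thread: drop everything up to the last
# occurrence of $keyword in the file at $path.
sub strip_file_up_to {
    my ($keyword, $path) = @_;

    open my $in, '<', $path or die "cannot read $path: $!";
    my $text = do { local $/; <$in> };   # slurp the whole file
    close $in;

    $text =~ s/.*\Q$keyword\E//s;

    # Write the copy in the same directory so rename() stays on one filesystem.
    my ($out, $tmp) = tempfile(DIR => dirname($path));
    print {$out} $text;
    close $out or die "cannot write $tmp: $!";

    # The original stays intact on disk until this rename succeeds.
    rename $tmp, $path or die "cannot rename $tmp to $path: $!";
    return;
}

# Demo on a scratch file.
my ($fh, $demo) = tempfile();
print {$fh} "ads\nTop Stories\nreal content\n";
close $fh or die "cannot write $demo: $!";

strip_file_up_to('Top Stories', $demo);

open my $check, '<', $demo or die "cannot read $demo: $!";
my $result = do { local $/; <$check> };
close $check;
unlink $demo;

print $result;
```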

Anno

PS: Of course, Tad knows everything I wrote about files. It's this Usenet
way of ostensibly talking to the one you're replying to, but really writing
for a wider audience, to whom it may concern.

Tad McClellan · Oct 11, 2003

^^^^
^^^^ consider using some other name.

$webpage =~ s/^.*$string//s;


$_[0] =~ s/.*$keyword//s;


Do these work through multiple linefeeds

What happened when you tried it?
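
One way to try it, sketched on a made-up string with the keyword on the third line. The only difference between the two substitutions is the /s modifier, which lets "." match a newline as well.

```perl
use strict;
use warnings;

# Made-up text with the keyword on the third line.
my $text = "line one\nline two\nTop Stories here\nline four\n";

# With /s, "." also matches "\n", so .* can reach back across earlier lines.
(my $with_s = $text) =~ s/.*Top Stories//s;

# Without /s, "." stops at "\n", so only text on the keyword's own line is removed.
(my $without_s = $text) =~ s/.*Top Stories//;

print "with /s:    [$with_s]\n";
print "without /s: [$without_s]\n";
```

With /s, everything up to and including the keyword is removed. Without it, the two earlier lines survive and only the keyword itself is cut from its line.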
