.*? instead of .*

Zhidian Du

I read a program that strips tags from HTML:

while(<>){

s/<.*?>//gs

}

I am curious why it uses .*? and not .*


Another question: does while(<>) read one line at a time or several lines at a time?
How can I tell?


Thanks.

Z. Du
 
Iain Chalmers

I read a program that strips tags from HTML:

while(<>){

s/<.*?>//gs

}

I am curious why it uses .*? and not .*

Because .* will strip a _lot_ more than just tags - compare the outputs
of these:

perl -e '$_="<h1>a heading</h1>";s/<.*>//gs;print'

perl -e '$_="<h1>a heading</h1>";s/<.*?>//gs;print'
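For readers who prefer a script to one-liners, here is a minimal sketch of the same comparison. The variable names are only illustrative; the input string and both substitutions are the ones from the two commands above.

```perl
use strict;
use warnings;

my $html = '<h1>a heading</h1>';

# Greedy: .* runs on to the LAST '>' in the string, so one match spans
# everything from the '<' of <h1> to the '>' of </h1>.
(my $greedy = $html) =~ s/<.*>//gs;

# Non-greedy: .*? stops at the FIRST '>' after each '<', so each tag
# is matched and removed on its own.
(my $lazy = $html) =~ s/<.*?>//gs;

print "greedy: [$greedy]\n";   # greedy: []
print "lazy:   [$lazy]\n";     # lazy:   [a heading]
```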

Of course, this perfectly valid HTML comment:

some text <!-- this is a comment with a > in it --> some more text

breaks .*? too...
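A small sketch of that failure, using the comment line above as input (the variable names are made up for the example). The non-greedy match ends at the '>' inside the comment, so the tail of the comment is left behind in the output.

```perl
use strict;
use warnings;

my $text = 'some text <!-- this is a comment with a > in it --> some more text';

# .*? stops at the first '>', which here sits in the middle of the comment.
(my $stripped = $text) =~ s/<.*?>//gs;

print "[$stripped]\n";   # [some text  in it --> some more text]
```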

The _real_ answer is that you can't (reliably) use a regex to parse
(arbitrary) HTML. If you want to parse HTML, use a real parser that
knows about HTML. Start with HTML::Parser or similar.

cheers,

big
 
Default

I read a program that strips tags from HTML
s/<.*?>//gs
I am curious why it uses .*? and not .*

The ? makes the * non-greedy.
I'm not sure how to explain it exactly, though.

.* means zero or more of any character except newline, but prefer more.
.*? means the same, but don't prefer the "or more" part.

I think that regex means:
substitute
<
followed by zero or more non-newline chars
but don't prefer more over zero
followed by >

I am new to programming, so check with more experienced posters.
 
Gunnar Hjalmarsson

Zhidian said:
I read a program that strips tags from HTML:

while(<>){

s/<.*?>//gs

}

I am curious why it uses .*? and not .*

That is a FAQ.

perldoc -q greedy
Another question: does while(<>) read one line at a time or several
lines at a time? How can I tell?

Read about $. and $/ in

perldoc perlvar
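As a rough illustration of those two variables, the sketch below reads from an in-memory filehandle (an assumption made so the example needs no input file; with while(<>) the data would come from STDIN or the files named on the command line). With the default $/ of "\n" each read returns one line and $. counts them; with $/ undefined a single read returns everything.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for a real input file.
my $data = "first\nsecond\nthird\n";

open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @seen;
while (my $line = <$fh>) {      # one record per iteration; $/ defaults to "\n"
    chomp $line;
    push @seen, "$.: $line";    # $. is the line number of the last read
}
close $fh;
print "$_\n" for @seen;         # 1: first / 2: second / 3: third

# With $/ undefined ("slurp" mode) a single read returns the whole input.
open $fh, '<', \$data or die "cannot reopen in-memory file: $!";
my $all = do { local $/; <$fh> };
close $fh;
print length($all), " characters read in one go\n";
```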

Please learn to study the FAQ and the rest of the docs. The
documentation is part of the Perl distribution, and is also available
online, e.g. at

http://www.perldoc.com/

More about how to post here at
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
Tassilo v. Parseval

Also sprach Default@IO_Error_1011101.xyz:
The ? makes the * non-greedy.

Yup, greediness is the technical term here.
I'm not sure how to explain it exactly, though.

.* means zero or more of any character except newline, but prefer more.

Not quite. In the substitution

s/<.*?>//gs

the /s modifier will make the dot '.' also match newlines. So this
pattern then reads:

Substitute '<' followed by an arbitrary number of arbitrary characters
up to the _first_ occurrence of '>'.

If we have a greedy pattern

s/<.*>//gs

this becomes:

Substitute '<' followed by an arbitrary number of arbitrary characters
up to the _last_ occurrence of '>'.
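To make the effect of /s and of greediness visible side by side, here is a minimal sketch. The sample string is invented for the example and deliberately contains a tag that spans a newline.

```perl
use strict;
use warnings;

# Hypothetical input: the first tag is split across two lines.
my $sample = "<a\nhref='x'>link</a> and <b>bold</b>";

# Without /s the dot cannot cross the newline, so the split tag survives.
(my $lazy_no_s = $sample) =~ s/<.*?>//g;

# With /s the dot matches the newline and every tag is removed.
(my $lazy_s = $sample) =~ s/<.*?>//gs;

# Greedy with /s: one match from the first '<' to the last '>'.
(my $greedy_s = $sample) =~ s/<.*>//gs;

print "non-greedy, no /s: [$lazy_no_s]\n";   # keeps "<a\nhref='x'>"
print "non-greedy, /s:    [$lazy_s]\n";      # [link and bold]
print "greedy, /s:        [$greedy_s]\n";    # []
```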

Note that

s/<.*?>//gs;

could also be written as

s/<[^>]*>//g

This does the same but would read:

Substitute '<' followed by an arbitrary amount of characters _other_than_
'>'. So once it encounters a '>' the regex is done extracting the
substring.
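A quick check of that equivalence, as a sketch on an invented sample string. Note that the negated character class needs the * quantifier to consume a whole tag; a bare [^>] matches exactly one character, so it only removes one-letter tags such as <b>.

```perl
use strict;
use warnings;

# Hypothetical input with a newline between the tags.
my $src = "<h1>a <b>big</b>\nheading</h1>";

(my $lazy     = $src) =~ s/<.*?>//gs;     # non-greedy dot, /s so '.' matches "\n"
(my $negated  = $src) =~ s/<[^>]*>//g;    # [^>] already matches "\n"; no /s needed
(my $one_char = $src) =~ s/<[^>]>//g;     # no '*': exactly one char between < and >

print "same result\n" if $lazy eq $negated;   # same result
print "[$negated]\n";                         # "a big" and "heading" on two lines
print "[$one_char]\n";                        # only <b> was removed
```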

Tassilo
 
Tassilo v. Parseval

Also sprach Gunnar Hjalmarsson:
Zhidian Du wrote:

Read about $. and $/ in

perldoc perlvar

Please learn to study the FAQ and the rest of the docs.

perlvar.pod is probably not the expected spot to look for that.
Actually, it's pretty hard to figure out where to find information on

while (<>)

when you are new to Perl. My first guess was perlsyn.pod but it is in fact
to be found in perlop.pod. You might even need both: First perlsyn.pod
to find out that 'while' works on an expression (so it's not
while(LIST)) which is evaluated in scalar context and after that,
figuring out what '<>' does in scalar context with the help of
perlop.pod.
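The context difference can be seen directly with a small sketch. It uses an in-memory filehandle as a stand-in for <> (an assumption, so the example runs without input files); the readline operator behaves the same way on it. In scalar context it returns one line, in list context all remaining lines. perlop documents while (<>) as shorthand for while (defined($_ = <>)), which is a scalar-context read.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the program's input.
my $input = "one\ntwo\nthree\n";
open my $in, '<', \$input or die "cannot open in-memory file: $!";

my $first = <$in>;   # scalar context: exactly one line
my @rest  = <$in>;   # list context: every remaining line at once
close $in;

print "first: $first";                          # first: one
print "rest has ", scalar(@rest), " lines\n";   # rest has 2 lines
```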

Tassilo
 
