robic0 schrob:
Here's just something to bust Gunnar's balls: it's the
anti-greedy formula, if you can understand it...
$_ =
qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ELEMENT(.*?))|(?:ENTITY(.*?)))))>)|(.+?)/s;
OK, let's see:
The last (.+?) doesn't make sense because it's not followed by any
This regular expression is pre-compiled for use in another expression
(I put in the $_, but it's assigned a permanent name in use).
This
$RxParse = qr/(?:<(?:(..)|(..)|(..))>)|(.+?)/s;
              (   (  1  1|2  2|3  3) )|4   4
broken down is --
two outer possibilities separated by '|': one is $1, $2 or $3, the other is $4.
pattern, which means +? will never backtrack to consume more. It should
be equivalent to (.).
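The equivalence claimed here is easy to check. The following is a minimal sketch, assuming a made-up sample string and looking only at the trailing alternative in isolation: with nothing after it in the pattern, the lazy quantifier stops after one character, so both forms yield the same tokens.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $text = "ab<c";

# A lazy .+? with nothing following it is satisfied by a single character,
# so under /g it walks the string one character at a time, just like (.).
my @lazy   = $text =~ /(.+?)/sg;
my @single = $text =~ /(.)/sg;

print join("|", @lazy),   "\n";   # a|b|<|c
print join("|", @single), "\n";   # a|b|<|c
```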
The whole thing looks like a horribly broken regex for HTML parsing.
The whole thing is a high performance main parsing regexp used in a finished
XML 1.1 compliant parser. I say main because there are several subsequent regexps.
It kinda looks like this in a program line --
while ($$ref_parse_ln =~ /$RxParse/g) {}
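To make the loop shape concrete, here is a rough sketch of how a `while (... =~ /$Rx/g)` loop can drive a tokenizer. The pattern `$RxSketch`, the `tokenize` sub and the event strings are all assumptions made up for this illustration; the pattern is a much simplified stand-in (plain tags only, ASCII names) and not the real $RxParse.

```perl
use strict;
use warnings;

# Simplified stand-in pattern (assumption): a tag branch with groups 1-3,
# and a one-character fallback branch in group 4.
my $RxSketch = qr/(?:<(\/?)([A-Za-z_:][-\w:.]*)\s*(\/?)>)|(.+?)/s;

sub tokenize {
    my ($ref_parse_ln) = @_;
    my @events;
    my $text = '';
    while ($$ref_parse_ln =~ /$RxSketch/g) {
        if (defined $2) {
            # Tag branch matched: flush any pending character data first.
            if (length $text) { push @events, "text:$text"; $text = ''; }
            my $kind = $1 ? 'close' : $3 ? 'empty' : 'open';
            push @events, "$kind:$2";
        }
        else {
            # Fallback branch: collect one character at a time.
            $text .= $4;
        }
    }
    push @events, "text:$text" if length $text;
    return @events;
}

my $doc = "<a>hi<b/></a>";
print join(", ", tokenize(\$doc)), "\n";   # open:a, text:hi, empty:b, close:a
```

Because the fallback branch only ever takes one character, the character data has to be glued back together in the loop body, which is why the sketch keeps a `$text` buffer.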
It
produces weird results for input like '<META content=">foo">' or '<img
alt="foo"> this is not part of "foo">'. The last one is due to
inappropriate greediness.
I won't recommend any perldoc reading or any of that shit but
'<META content=">foo">'
poses a paradox that results in a non-parsing dilemma that logic can't cure.
Primarily, the default general entities
&amp;&gt;&lt;&apos;&quot;
&><'"
apply even in html META char data.
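For reference, the five predefined entities and their expansion can be sketched as a lookup table. The `%entity` hash and the `decode_predefined` sub are names invented for this illustration; the sketch handles only these five references, not numeric character references or declared entities.

```perl
use strict;
use warnings;

# The five predefined general entities and the characters they stand for.
my %entity = (amp => '&', gt => '>', lt => '<', apos => "'", quot => '"');

# Hypothetical helper: expand the predefined entity references in one pass,
# so that '&amp;lt;' becomes '&lt;' and is not expanded a second time.
sub decode_predefined {
    my ($s) = @_;
    $s =~ s/&(amp|gt|lt|apos|quot);/$entity{$1}/g;
    return $s;
}

print decode_predefined('&lt;META content=&quot;&gt;foo&quot;&gt;'), "\n";
# <META content=">foo">
```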
I have written, in Perl, a high-performance XML 1.1 compliant parser. META is not part of
XML. I have generalized my main regexp (with much penalty) to include META. The position
of META in the regexp overlaps HTML and XHTML because of closure.
This is because I am going to integrate HTML, XHTML and XML into a parser. I'm sitting on
a pure Perl 1.1 compliant parser on my hard drive right now. A highly tuned, high-performance,
1.1 compliant parser. There are many tools involved. I also want to do full Schema validation.
I also want to jam as many tools as possible into it. I also don't want to give it away, I'm
not into this for the glory!
The $Name above works out to this (I don't feel I'm giving anything away here, this is trivial) --
@UC_Nstart = (
"\\x{C0}-\\x{D6}",
"\\x{D8}-\\x{F6}",
"\\x{F8}-\\x{2FF}",
"\\x{370}-\\x{37D}",
"\\x{37F}-\\x{1FFF}",
"\\x{200C}-\\x{200D}",
"\\x{2070}-\\x{218F}",
"\\x{2C00}-\\x{2FEF}",
"\\x{3001}-\\x{D7FF}",
"\\x{F900}-\\x{FDCF}",
"\\x{FDF0}-\\x{FFFD}",
"\\x{10000}-\\x{EFFFF}",
);
@UC_Nchar = (
"\\x{B7}",
"\\x{0300}-\\x{036F}",
"\\x{203F}-\\x{2040}",
);
$Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
$Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
$Name = "(?:$Nstrt$Nchar*?)";
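A quick way to see what the assembled $Name accepts is to anchor it and try a few candidates. This is a sketch with shortened range lists (only two name-start ranges and one name-char entry are kept, as an assumption to keep it brief); `is_name` and the candidate strings are made up for the test.

```perl
use strict;
use warnings;

# Shortened lists (assumption): the full tables above have many more ranges.
my @UC_Nstart = ("\\x{C0}-\\x{D6}", "\\x{D8}-\\x{F6}");
my @UC_Nchar  = ("\\x{B7}");

# Same construction as above: the strings are spliced into character classes.
my $Nstrt = "[A-Za-z_:" . join('', @UC_Nstart) . "]";
my $Nchar = "[-\\w:\\." . join('', @UC_Nchar) . join('', @UC_Nstart) . "]";
my $Name  = "(?:$Nstrt$Nchar*?)";

# Hypothetical helper: true if the whole string is a single Name.
sub is_name {
    my ($s) = @_;
    return $s =~ /\A$Name\z/ ? 1 : 0;
}

for my $case (["svg:rect", "svg:rect"], ["_id", "_id"], ["1abc", "1abc"],
              ["-x", "-x"], ["\x{C9}cole", "U+00C9 + 'cole'"]) {
    my ($cand, $label) = @$case;
    printf "%-16s %s\n", $label, is_name($cand) ? "name" : "not a name";
}
```

With these shortened tables, `svg:rect`, `_id` and the candidate starting with U+00C9 are accepted, while `1abc` and `-x` are rejected because a digit or hyphen cannot start a Name. Note that the lazy `*?` only behaves because the pattern is anchored here; unanchored and with nothing after it, it would stop after the first character.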
I don't understand that but it's "icebergs".
I hope you can forgive a dyslexic, bad speller.
Hey thanks Lukas!
Any more questions, just let me know