Repost: Conditional look-ahead?

Steve Dunn

I'm stuck with a regular expression that closes tags if a closing tag
doesn't already exist. It's probably easier to demonstrate than explain, so
here goes:

I need to turn the following structure:

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1>

into the following:

<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they each have a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and no closing tag.

Any help very much appreciated,

Thanks,

Steve.
p.s. I'm using the .NET regex classes
 
Gunnar Hjalmarsson

Steve said:
I'm stuck with a regular expression that closes tags if a closing
tag doesn't already exist. It's probably easier to demonstrate
than explain, so here goes:

I need to turn the following structure:

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1>

into the follow:

<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they contains a closing tag,
TAG2 is closed because there's text and no closing tag, and
<FOOBAR> is closed because there's no text and closing tag.

I find it difficult to believe that you should keep struggling with a
regexp for this task. Have you studied any of the many modules for
dealing with HTML?
 
Ben Morrow

Steve Dunn said:
I'm stuck with a regular expression that closes tags if a closing tag
doesn't already exist. It's probably easier to demonstrate than explain, so
here goes:

I need to turn the following structure:

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1>

into the follow:

<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they contains a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and closing tag.

Firstly, this is not a task for a regex alone. In general it is a task
for an SGML -> XML translator, which is a thoroughly non-trivial
problem :(.

From your example, though, am I to take it that your content model is
never mixed; that is, that elements either contain other elements, or
text, but never both? If the answer to this is 'no', then you will
need to use a DTD to do what you wish, and probably a full SGML
parser.

I reckon the best way for you to proceed is to use XML::SAX, writing a
consumer class which just spits what it is fed back out again,
*unless* it sees the sequence of events 'start_tag, characters,
start_tag' without an end_tag, in which case it inserts one. You can
track that with a class-level variable. Unless...
p.s. I'm using the .NET regex classes

...you are in fact not writing Perl? If not, then, as I said, it's not
a task for a single regex so I don't think anyone here can (or rather
will) help you...

Ben
 
Steve Dunn

You find it difficult to believe? Well Gunnar, believe it! Actually, I've
not studied any of the HTML 'modules' as I don't class the issue as an HTML
parsing issue. I was just asking for assistance in a regular expression
that would close a tag based on the non-presence of said tag. I thought
this would have been covered by the conditional look-ahead functions of
regex.

Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

Steve.
 
Steve Dunn

Hi Ben,
As the title implied, I 'thought' this may be possible using conditional
look-aheads in regex. Of course, I could use a combination of regex and
native code, but I was looking for an elegant solution; a solution which I
thought possible with regex alone. Thanks for the feedback though.

Steve
 
Ben Morrow

Steve Dunn said:
Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

*PLONK*
 
Jeff 'japhy' Pinyan

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1> [should become]
<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they contains a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and closing tag.

Why does <FOOBAR> become <FOOBAR />, and not

<FOOBAR>
<TAG3>I'm ok</TAG3>
</FOOBAR>

You need a DTD for this. It's not possible without one.
 
Gunnar Hjalmarsson

Please don't top post!

Steve said:
You find it difficult to believe? Well Gunnar, believe it!
Actually, I've not studied any of the HTML 'modules' as I don't
class the issue as an HTML parsing issue.

Really? Now, after having read the other responses, I'm even more
convinced that you are asking how to accomplish 'mission impossible'.
Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine.

It was obviously wasted time.
If you don't like the question or can't offer any constructive
criticism, then ignore it!

I didn't see any code that I could criticize. But, like the others, I
offered some _advice_: Don't use regex, use a module.
You diblet!

I don't understand the meaning of that word (I'm not a native English
speaker). I suppose it's not anything nice, right?
 
Chris Mattern

Steve Dunn wrote:

Don't top post, please.
You find it difficult to believe? Well Gunnar, believe it! Actually, I've
not studied any of the HTML 'modules' as I don't class the issue as an HTML
parsing issue. I was just asking for assistance in a regular expression
that would close a tag based on the non-presence of said tag.

You are looking to locate and understand an HTML tag, and then determine if
it has the proper matching closing tag. If this isn't parsing HTML, then
what the devil is it?
I thought
this would have been covered by the conditional look-ahead functions of
regex.

Don't parse HTML with regexes.
Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

I sense a great disturbance in the Usenet--as if a thousand killfiles had
all been plonked at once...

Chris Mattern
 
Jay Tilton

: You find it difficult to believe? Well Gunnar, believe it!

Quoting Gunnar:

I find it difficult to believe that you should keep
struggling with a regexp for this task.

I read that to say "There should be a compelling reason for trying to do
this with a regular expression."

You evidently read it differently.

: Now it's my turn. I find it difficult to believe YOU struggle with
: questions such as mine. If you don't like the question or can't offer any
: constructive criticism,

An attempt to steer you towards using a tool more appropriate to the
task at hand is always constructive.

: then ignore it! You diblet!

Many besides Gunnar are likely to do some ignoring now.

Good luck.
 
Steve Dunn

Hi Jeff,
Thanks for the reply. Good question. There's a simple rule with the
files I'm processing (they're US Securities dissemination files). The rule
is that any tag on a line that has no characters following it and no
closing tag elsewhere in the text should be closed immediately.
That's why the FOOBAR tag is closed immediately: there's no closing
tag elsewhere in the text.

I don't know too much about DTDs, but based on this simple rule, a DTD
would appear to be overkill (and a regex would appear ideal).

I've got the regex to the point where it can recognise a tag on a line that
has characters following it, and closes that line, i.e.
<TAG1>characters
becomes
<TAG1>characters</TAG1>
For this, I use the following regex:
^<([^/].+?)>([\S ]+)
I then use an evaluator as part of the regex replace to output <$1>$2</$1>
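
As a rough illustration, the pattern above can be tried out in Python (an assumption made for this sketch; the post itself uses the .NET regex classes). The function name `close_line` is hypothetical, and Python writes the replacement `<$1>$2</$1>` as `<\1>\2</\1>`. The second call shows one limitation: a line that is already closed gets a second closing tag, because `[\S ]+` also consumes the existing `</TAG3>`.

```python
import re

# The pattern from the post, unchanged; MULTILINE lets ^ match at each line start.
PATTERN = re.compile(r"^<([^/].+?)>([\S ]+)", re.MULTILINE)

def close_line(text):
    # Equivalent of replacing each match with <$1>$2</$1>.
    return PATTERN.sub(r"<\1>\2</\1>", text)

print(close_line("<TAG1>characters"))     # <TAG1>characters</TAG1>
print(close_line("<TAG3>I'm ok</TAG3>"))  # <TAG3>I'm ok</TAG3></TAG3>
```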

So, the regex would look like:
opening <, any chars except '/', closing >, new line, any chars on any line
except start of line - </TAG1>
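
One possible reading of the rule above, sketched in Python under stated assumptions rather than as a definitive answer: a negative look-ahead with a back-reference, `(?!...</\1>)`, rejects any opening tag whose closing tag appears later in the text, and a replacement callback (similar in spirit to a regex evaluator) picks between `</NAME>` and ` />`. The names `UNCLOSED` and `close_tags` are hypothetical. Tags are assumed to start a line and carry no attributes, and the look-ahead only checks for a closing tag of the same name anywhere later, not for correct nesting.

```python
import re

SAMPLE = """<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1>"""

# Group 1: tag name. The look-ahead fails if </name> occurs anywhere later
# in the text ((?s:...) lets '.' cross line breaks). Group 2: text on the line.
UNCLOSED = re.compile(
    r"^<([^/<>\s][^<>]*)>(?!(?s:.*)</\1>)([^<\n]*)$",
    re.MULTILINE,
)

def close_tags(text):
    def fix(match):
        name, body = match.group(1), match.group(2)
        if body.strip():
            return f"<{name}>{body}</{name}>"  # text but no closing tag
        return f"<{name} />"                   # no text and no closing tag
    return UNCLOSED.sub(fix, text)

print(close_tags(SAMPLE))
```

On the five-line sample this leaves TAG1 and TAG3 alone, closes TAG2 after its text, and turns <FOOBAR> into <FOOBAR />. It says nothing about whether the result is valid for the real files, which is where the DTD question raised elsewhere in the thread comes in.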

Thanks again,

Steve

Jeff 'japhy' Pinyan said:
<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1> [should become]
<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they contains a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and closing tag.

Why does <FOOBAR> become <FOOBAR />, and not

<FOOBAR>
<TAG3>I'm ok</TAG3>
</FOOBAR>

You need a DTD for this. It's not possible without one.

--
Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
 
Helgi Briem

Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

You persist in ignoring correct and valuable advice from skilled
programmers and then get rude with them. Go and get stuffed
then, you top-posting loser.

*PLONK*
 
