Repost: Conditional look-ahead?

Steve Dunn · Oct 29, 2003

I'm stuck with a regular expression that closes tags if a closing tag
doesn't already exist. It's probably easier to demonstrate than explain, so
here goes:

I need to turn the following structure:

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1>

into the following:

<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they each have a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and no closing tag.

Any help very much appreciated,

Thanks,

Steve.
p.s. I'm using the .NET regex classes

Gunnar Hjalmarsson · Oct 29, 2003

Steve said:
I'm stuck with a regular expression that closes tags if a closing
tag doesn't already exist. It's probably easier to demonstrate
than explain, so here goes:

I need to turn the following structure:

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1>

into the following:

<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they each have a closing tag,
TAG2 is closed because there's text and no closing tag, and
<FOOBAR> is closed because there's no text and no closing tag.

I find it difficult to believe that you should keep struggling with a
regexp for this task. Have you studied any of the many modules for
dealing with HTML?

Ben Morrow · Oct 29, 2003

Steve Dunn said:
I'm stuck with a regular expression that closes tags if a closing tag
doesn't already exist. It's probably easier to demonstrate than explain, so
here goes:

I need to turn the following structure:

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1>

into the following:

<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they each have a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and no closing tag.

Firstly, this is not a task for a regex alone. In general it is a task
for an SGML -> XML translator, which is a thoroughly non-trivial
problem.

From your example, though, am I to take it that your content model is
never mixed; that is, that elements either contain other elements, or
text, but never both? If the answer to this is 'no', then you will
need to use a DTD to do what you wish, and probably a full SGML
parser.

I reckon the best way for you to proceed is to use XML::SAX, writing a
consumer class which just spits what it is fed back out again,
*unless* it sees the sequence of events 'start_tag, characters,
start_tag' without an end_tag, in which case it inserts one. You can
track that with a class-level variable. Unless...

p.s. I'm using the .NET regex classes

...you are in fact not writing Perl? If not, then, as I said, it's not
a task for a single regex so I don't think anyone here can (or rather
will) help you...

Ben
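
The event-filter idea above (echo everything back out, but insert an end tag when 'start_tag, characters, start_tag' is seen without an end_tag) can be sketched roughly as follows. This is only an illustration in Python, not XML::SAX: the tokenizer, the event names and the function `close_after_text` are all made up for the sketch, and it assumes tags carry no attributes. It deliberately leaves a bare tag such as <FOOBAR> alone, since that case is not covered by the sequence described.

```python
import re

# Hypothetical tokenizer: a tag without attributes, a run of text, or a stray '<'.
TOKEN = re.compile(r"<(/?)([^/>\s]+)>|([^<]+|<)")

def events(text):
    """Yield ('start', name), ('end', name) or ('chars', text) events."""
    for m in TOKEN.finditer(text):
        slash, name, chars = m.groups()
        if chars is not None:
            yield ("chars", chars)
        elif slash:
            yield ("end", name)
        else:
            yield ("start", name)

def close_after_text(text):
    """Echo the events back, closing a tag when start, chars, start is seen."""
    out = []
    last_start = None  # most recent start tag with no end tag seen since
    text_at = None     # index in `out` of the non-blank text that followed it
    for kind, value in events(text):
        if kind == "start":
            if text_at is not None:
                # start, characters, start: close the earlier tag,
                # keeping any trailing whitespace after the new end tag.
                body = out[text_at]
                stripped = body.rstrip()
                out[text_at] = stripped + "</" + last_start + ">" + body[len(stripped):]
            last_start, text_at = value, None
            out.append("<" + value + ">")
        elif kind == "chars":
            if last_start is not None and value.strip():
                text_at = len(out)
            out.append(value)
        else:  # an end tag resets the tracking state
            last_start, text_at = None, None
            out.append("</" + value + ">")
    return "".join(out)

sample = "<TAG1>\n<TAG2>foo\n<FOOBAR>\n<TAG3>I'm ok</TAG3>\n</TAG1>\n"
print(close_after_text(sample))
```

On the sample from the original post this closes TAG2 after "foo" and passes everything else through unchanged. Deciding what to do with <FOOBAR> still needs either a rule about the rest of the document or a DTD.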

Steve Dunn · Oct 29, 2003

You find it difficult to believe? Well Gunnar, believe it! Actually, I've
not studied any of the HTML 'modules' as I don't class the issue as an HTML
parsing issue. I was just asking for assistance in a regular expression
that would close a tag based on the non-presence of said tag. I thought
this would have been covered by the conditional look-ahead functions of
regex.

Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

Steve.

Steve Dunn · Oct 29, 2003

Hi Ben,
As the title implied, I 'thought' this may be possible using conditional
look-aheads in regex. Of course, I could use a combination of regex and
native code, but I was looking for an elegant solution; a solution which I
thought possible with regex alone. Thanks for the feedback though.

Steve

Ben Morrow · Oct 29, 2003

Steve Dunn said:
Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

*PLONK*

Jeff 'japhy' Pinyan · Oct 29, 2003

<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1> [should become]
<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they each have a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and no closing tag.

Why does <FOOBAR> become <FOOBAR />, and not

<FOOBAR>
<TAG3>I'm ok</TAG3>
</FOOBAR>

You need a DTD for this. It's not possible without one.

Gunnar Hjalmarsson · Oct 29, 2003

Please don't top post!

Steve said:
You find it difficult to believe? Well Gunnar, believe it!
Actually, I've not studied any of the HTML 'modules' as I don't
class the issue as an HTML parsing issue.

Really? Now, after having read the other responses, I'm even more
convinced that you are asking how to accomplish a 'mission impossible'.

Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine.

It was obviously wasted time.

If you don't like the question or can't offer any constructive
criticism, then ignore it!

I didn't see any code that I could criticize. But, like others, I
offered some _advice_: Don't use a regex, use a module.

You diblet!

I don't understand the meaning of that word (I'm not a native English
speaker). I suppose it's not anything nice, right?

Chris Mattern · Oct 29, 2003

Steve Dunn wrote:

Don't top post, please.

You find it difficult to believe? Well Gunnar, believe it! Actually, I've
not studied any of the HTML 'modules' as I don't class the issue as an HTML
parsing issue. I was just asking for assistance in a regular expression
that would close a tag based on the non-presence of said tag.

You are looking to locate and understand an HTML tag, and then determine if
it has the proper matching closing tag. If this isn't parsing HTML, then
what the devil is it?

I thought
this would have been covered by the conditional look-ahead functions of
regex.

Don't parse HTML with regexes.

Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

I sense a great disturbance in the Usenet--as if a thousand killfiles had
all been plonked at once...

Chris Mattern

Jay Tilton · Oct 29, 2003

: You find it difficult to believe? Well Gunnar, believe it!

Quoting Gunnar:

I find it difficult to believe that you should keep
struggling with a regexp for this task.

I read that to say "There should be a compelling reason for trying to do
this with a regular expression."

You evidently read it differently.

: Now it's my turn. I find it difficult to believe YOU struggle with
: questions such as mine. If you don't like the question or can't offer any
: constructive criticism,

An attempt to steer you towards using a tool more appropriate to the
task at hand is always constructive.

: then ignore it! You diblet!

Many besides Gunnar are likely to do some ignoring now.

Good luck.

Steve Dunn · Oct 30, 2003

Hi Jeff,
Thanks for the reply. Good question. There's a simple rule with the
files I'm processing (they're US Securities dissemination files). The rule
is that any tag on a line that has no characters following it and
no closing tag elsewhere in the text should be closed immediately.
That's why the FOOBAR tag is closed immediately: there's no closing
tag elsewhere in the text.

I don't know too much about DTDs, but based on this simple rule, a DTD
would appear to be overkill (and a regex would appear ideal).

I've got the regex to the point where it can recognise a tag on a line that
has characters following it, and closes that tag, i.e.
<TAG1>characters
becomes
<TAG1>characters</TAG1>
For this, I use the following regex:
^<([^/].+?)>([\S ]+)
I then use an evaluator as part of the regex replace to output <$1>$2</$1>

So, the remaining regex would look like:
opening <, any chars except '/', closing >, new line, then any chars on any
line, as long as no later line starts with the closing tag (e.g. </TAG1>)

Thanks again,

Steve
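
For what it's worth, the two rules described above do not need a conditional construct; a negative look-ahead containing a backreference seems to be enough. Below is a minimal sketch in Python rather than .NET. The names `TEXT_TAG`, `LONE_TAG` and `close_tags` are invented for the illustration, the first pattern is a tightened variant of the regex quoted above (it stops at the next '<' so already-closed lines are skipped), and the whole thing assumes one tag per line, no attributes, and plain "\n" line endings.

```python
import re

# Rule 1: an opening tag followed by text, with nothing else on the line.
# [^<\n]+ stops at the next '<', so "<TAG3>I'm ok</TAG3>" does not match.
TEXT_TAG = re.compile(r"^<([^/>\s]+)>([^<\n]+)$", re.MULTILINE)

# Rule 2: an opening tag alone on its line, with no </NAME> anywhere later.
# The negative look-ahead scans the rest of the text for the closing tag.
LONE_TAG = re.compile(r"^<([^/>\s]+)>[ \t]*$(?![\s\S]*</\1>)", re.MULTILINE)

def close_tags(text: str) -> str:
    """Apply rule 1, then rule 2, and return the rewritten text."""
    text = TEXT_TAG.sub(r"<\1>\2</\1>", text)
    return LONE_TAG.sub(r"<\1 />", text)

sample = "<TAG1>\n<TAG2>foo\n<FOOBAR>\n<TAG3>I'm ok</TAG3>\n</TAG1>\n"
print(close_tags(sample))
```

On the sample from the first post this prints the target structure: TAG2 is closed after "foo", <FOOBAR> becomes <FOOBAR />, and TAG1 and TAG3 are left alone. Multiline anchors and a backreference inside a negative look-ahead are also available in the .NET regex classes, where the replacement would be written with $1 rather than \1. It remains a line-oriented heuristic for this particular file format, not a general parser.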

Jeff 'japhy' Pinyan said:
<TAG1>
<TAG2>foo
<FOOBAR>
<TAG3>I'm ok</TAG3>
</TAG1> [should become]
<TAG1>
<TAG2>foo</TAG2>
<FOOBAR />
<TAG3>I'm ok</TAG3>
</TAG1>

So, TAG1 and TAG3 are left alone, as they each have a closing tag, TAG2 is
closed because there's text and no closing tag, and <FOOBAR> is closed
because there's no text and no closing tag.

Why does <FOOBAR> become <FOOBAR />, and not

<FOOBAR>
<TAG3>I'm ok</TAG3>
</FOOBAR>

You need a DTD for this. It's not possible without one.

--
Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)

Helgi Briem · Oct 30, 2003

Now it's my turn. I find it difficult to believe YOU struggle with
questions such as mine. If you don't like the question or can't offer any
constructive criticism, then ignore it! You diblet!

You persist in ignoring correct and valuable advice from skilled
programmers and then get rude with them. Go and get stuffed
then, you top-posting loser.

*PLONK*
