Yet another Java regex problem

bauer

Hi,
there's a DocBook XML file which I want to modify. The file contains
something like
....
<mediaobject>
<imageobject>
<imagedata fileref="PathToImage" format="ImgFormat"/>
</imageobject>
</mediaobject>
....
I just want to match the whole <mediaobject> element and prepend one
line containing the PathToImage as an XML comment, like this:
<!-- PathToImage -->

My input to the matcher is the whole file as is. First I tried to get a
regex to match the whole thing

content = content.replaceFirst(
    "<mediaobject>" +
    "\\s*<imageobject>" +
    "\\s*<imagedata fileref=\".*\".*/>" +
    "\\s*</imageobject>" +
    "\\s*</mediaobject>",
    "<!-- Test -->"
);

But when I use a backref (like \0 for the whole match, or \1 if I put
parentheses around the filename) in the replacement string, like this:
"<!-- Test -->\0"
I just get
<!-- Test --> followed by a square character that cannot be displayed here.

The strange thing is that when I use exactly the same pattern with
Pattern.compile(regex).matcher(str).replaceAll(repl)
nothing matches, even though the Java API documentation for
String.replaceAll() says the two are equivalent.

I tried Pattern.MULTILINE and Pattern.DOTALL in every combination. I
tried using .* instead of \\s and even used \r?\n? for the line
endings ... nothing works.

Can anyone please help me?

_

Tom
 

TechBookReport

Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)"? You
can then use a replacement along the lines of "<!-- PathToImage
-->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL when
building the pattern.

Hope that helps.

Pan
======================================================================
TechBookReport Java http://www.techbookreport.com/JavaIndex.html
 

bauer

TechBookReport said:
Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)"? You
can then use a replacement along the lines of "<!-- PathToImage
-->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL when
building the pattern.

Hope that helps.

Not really ... this results in the same problem I already described.
Instead of \1\2\3 being substituted with the matching groups, I only get
this special char (it looks like a square and cannot be displayed here).
Btw, I noticed that you used $1$2$3. This is Perl, right? In Java it
would be \1\2\3, or am I wrong?

You can try it yourself. Save the following content to a file:
<chapter>
<title>Chapter 1</title>
<sect1>
<title>Section 1</title>
<para>
Test Test Test Test Test Test Test Test Test
</para>
<mediaobject>
<imageobject>
<imagedata fileref="image.svg" format="SVG"/>
</imageobject>
</mediaobject>
<para>
Test Test Test Test Test Test Test Test Test
</para>
</sect1>
</chapter>

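Read this file with
// needs java.io.BufferedReader, java.io.File, java.io.FileReader, java.io.IOException
public String readPlain( File file ) throws IOException
{
    StringBuilder content = new StringBuilder();
    try ( BufferedReader brd = new BufferedReader( new FileReader( file ) ) )
    {
        String line;
        // readLine() strips the line terminator, so append one again
        while ( ( line = brd.readLine() ) != null )
            content.append( line ).append( "\r\n" );
    }
    return content.toString();
}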

and then apply a
content = Pattern.compile( "(<mediaobject)(.*)(</mediaobject>)",
        Pattern.MULTILINE|Pattern.DOTALL )
    .matcher( content )
    .replaceAll( "<!-- Test -->\1\2\3" );

_

Tom
 

bauer

Damn Java regex !!! It is $1$2$3. That was the point. I used the wrong
syntax for backrefs. But the Java 1.4.2 API documentation for
java.util.regex.Pattern says:

Back references
\n Whatever the nth capturing group matched

So what is going on ... ?!?
 

TechBookReport

Did you escape the backslashes? Also, the funny square character is
probably the \r\n you are using. Try
System.getProperty("line.separator") instead.

Pan

 

bauer

TechBookReport said:
Did you escape the backslashes? Also, the funny square character is
probably the \r\n you are using. Try
System.getProperty("line.separator") instead.
No, the funny square char is not the \r\n, because if it were it would
appear on every line, independent of the regex code. I'm on Windows and
the app runs only on this system, but you are right, I had better use
getProperty("line.separator").
I guess the funny square is some control character (\1 = 0x01?) that I
get when I use \1 without escaping the backslash.
But that doesn't matter anymore, my problem is solved. Thanks for your
help.
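
For anyone reading this later, here is a minimal sketch of what seems to be going on. In a Java string literal, "\1" is an octal escape for the single character U+0001, so the regex engine never sees a backslash at all and simply inserts that control character. The class and method names below are made up for illustration; the pattern is the one suggested earlier in the thread.

```java
import java.util.regex.Pattern;

public class BackrefEscapeDemo {
    // Pattern from the thread; DOTALL lets '.' match line breaks too.
    private static final Pattern MEDIA =
        Pattern.compile("(<mediaobject)(.*)(</mediaobject>)", Pattern.DOTALL);

    // "\1" is an octal escape: the replacement ends in the char U+0001,
    // which is copied into the result literally. No group is substituted.
    static String withOctalEscape(String input) {
        return MEDIA.matcher(input).replaceAll("<!-- Test -->\1");
    }

    // $1$2$3 is the group syntax that replacement strings understand.
    static String withDollarGroups(String input) {
        return MEDIA.matcher(input).replaceAll("<!-- Test -->$1$2$3");
    }

    public static void main(String[] args) {
        String input = "<mediaobject>x</mediaobject>";
        String wrong = withOctalEscape(input);
        // Code point of the last character of the "wrong" result
        System.out.println((int) wrong.charAt(wrong.length() - 1));
        System.out.println(withDollarGroups(input));
    }
}
```

Escaping the backslash ("\\1") would not help in the replacement string either: there a backslash only makes the following character literal, so you would get a plain "1".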
 

Alan Moore

TechBookReport said:
Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)"? You
can then use a replacement along the lines of "<!-- PathToImage
-->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL when
building the pattern.

If there can be more than one mediaobject element in a document, you
need to use a reluctant dot-star:

"<mediaobject.*?</mediaobject>"

Otherwise, it will match everything from the first opening tag to the
last closing tag. Even if there's only one such element, it will
probably be more efficient this way.
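
A small sketch of the difference, assuming a document with two mediaobject elements (the class name, helper method and sample string are invented for the demonstration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {
    // Counts how many separate matches the regex finds in the input.
    // DOTALL is set so that '.' also matches line breaks.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "<mediaobject>a</mediaobject>\n"
                   + "<para/>\n"
                   + "<mediaobject>b</mediaobject>";

        // Greedy: runs from the first opening tag to the last closing tag.
        System.out.println(countMatches("<mediaobject.*</mediaobject>", doc));   // 1
        // Reluctant: stops at the nearest closing tag, once per element.
        System.out.println(countMatches("<mediaobject.*?</mediaobject>", doc));  // 2
    }
}
```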

You don't really need to use capturing parentheses, since you're
re-inserting the whole match; just use $0:

// (?s) turns on DOTALL so that . also matches line breaks
str = str.replaceAll("(?s)<mediaobject.*?</mediaobject>",
                     "<!-- PathToImage -->$0");
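
If the comment should contain the actual fileref value rather than a fixed text, one capturing group around the attribute value is enough. The following is only a sketch under a few assumptions: every mediaobject contains an imagedata element whose first attribute is fileref (as in the sample file), and the class and method names are hypothetical. If an element had no imagedata, the reluctant .*? could run on into the next mediaobject, so a real XML parser would be the safer tool.

```java
import java.util.regex.Pattern;

public class FilerefCommentDemo {
    // Group 1 captures the fileref value; (?s) lets '.' cross line breaks.
    private static final Pattern MEDIA = Pattern.compile(
        "(?s)<mediaobject>.*?<imagedata\\s+fileref=\"([^\"]*)\".*?</mediaobject>");

    // Prepends "<!-- path -->" on its own line before each matched element.
    // $1 is the captured path, $0 is the whole match.
    static String prependPathComment(String content) {
        return MEDIA.matcher(content).replaceAll("<!-- $1 -->\n$0");
    }

    public static void main(String[] args) {
        String content =
              "<para/>\n"
            + "<mediaobject>\n"
            + "<imageobject>\n"
            + "<imagedata fileref=\"image.svg\" format=\"SVG\"/>\n"
            + "</imageobject>\n"
            + "</mediaobject>\n";
        System.out.print(prependPathComment(content));
    }
}
```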


The JDK regex package uses the same syntax as Perl WRT
backreferences--"\n" within the regex and "$n" in the replacement
string--except that it uses $0 instead of $& for the whole match, and
doesn't emulate the other dollar-plus-punctuation variables: $`, $',
and $+.
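
To make the two syntaxes concrete, here is a short sketch (class and method names are made up): \1 inside the regex refers back to what group 1 matched, while $1 and $0 are only meaningful in the replacement string.

```java
public class BackrefSyntaxDemo {
    // "\\1" in the regex: matches the same word again after a space.
    // "$1" in the replacement: keeps a single copy of that word.
    static String collapseRepeats(String text) {
        return text.replaceAll("\\b(\\w+) \\1\\b", "$1");
    }

    // "$0" in the replacement stands for the whole match.
    static String bracket(String text, String regex) {
        return text.replaceAll(regex, "[$0]");
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeats("hello hello world")); // hello world
        System.out.println(bracket("abc", "b"));                  // a[b]c
    }
}
```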
 
