negative regexes.

Roedy Green

Here is a regex pattern that looks for unquoted URLs.

Pattern.compile( "(href|src)[ ]*=[ ]*([!#\\$%&\\(\\)\\+,\\-\\./0-9:;=\\?@\\[\\]\\^_`a-z\\{\\|\\}~]+)[ ]", Pattern.CASE_INSENSITIVE );

that looks for things that look like this:

href=abc
href = xyz>
src=http://someplace.com/picture.jpg

However I want to avoid finding things like this:
src="http://someplace.com/picture.jpg"

I read tutorials and docs and decided this SHOULD work.

Pattern.compile( "(href|src)[ ]*=[ ]*(?!")([!#\\$%&\\(\\)\\+,\\-\\./0-9:;=\\?@\\[\\]\\^_`a-z\\{\\|\\}~]+)[ ]", Pattern.CASE_INSENSITIVE );

It behaves just like the first one. It filters out nothing. What am I
doing wrong?

Are there tools to help debug regexes? They either work or they don't.
I have discovered nothing equivalent to a trace or debugging dumps.
All you can do is experiment on smaller strings to figure out how the
operators work.

Perhaps what the programming community needs is some exclusion
examples that work and some that don't, with notes on why.
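
A few exclusion experiments of the kind asked for might look like the sketch below. It takes the pattern as displayed in this post at face value and makes several assumptions: the long value class is shortened to [a-z0-9:/._-], the trailing class is written as [ >], and the class and method names are invented for illustration. It suggests two notes: a value class that contains no double quote already rejects quoted values, so putting (?!") in front of it changes nothing; the lookahead only matters when the following subpattern could itself consume a quote.

```java
import java.util.regex.Pattern;

final class LookaheadNotes {
    // Shortened stand-in for the long value class in the post (an assumption);
    // like the original, it contains neither '"' nor a space.
    static final String VALUE = "([a-z0-9:/._-]+)";

    // No lookahead: the value class alone decides what may follow the '='.
    static final Pattern PLAIN =
        Pattern.compile("(href|src) *= *" + VALUE + "[ >]", Pattern.CASE_INSENSITIVE);
    // Same pattern with a negative lookahead for a double quote.
    static final Pattern GUARDED =
        Pattern.compile("(href|src) *= *(?!\")" + VALUE + "[ >]", Pattern.CASE_INSENSITIVE);

    // Looser value patterns, where the value itself could swallow a quote.
    static final Pattern LOOSE = Pattern.compile("(href|src)=(\\S+)");
    static final Pattern LOOSE_GUARDED = Pattern.compile("(href|src)=(?!\")(\\S+)");

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String quoted = "src=\"http://someplace.com/picture.jpg\">";
        String bare = "src=http://someplace.com/picture.jpg>";
        for (Pattern p : new Pattern[] { PLAIN, GUARDED, LOOSE, LOOSE_GUARDED }) {
            System.out.println(p.pattern());
            System.out.println("  quoted: " + found(p, quoted) + "  bare: " + found(p, bare));
        }
    }
}
```

In this sketch PLAIN and GUARDED agree on both inputs, while LOOSE and LOOSE_GUARDED disagree on the quoted one, which is the only case where the lookahead does any work.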
--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
Roedy Green

> All of programming is "they either work or they don't". There aren't
> any other choices.

What I meant to say was regexes are black boxes. All you have are the
results to try to get them working. With ordinary programs you can
trace, or make them dump out intermediate results so you can peek into
the "white box" and see the details of them working.

I gave you the strings I am trying to accept and one I am trying to
reject. I figured wrapping it up in an SSCCE would just distract from
someone solving it by eye. But if you want a wrapper program to
insert that Pattern in, please see
http://mindprod.com/jgloss/regex.html#FINDING
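
One way to peek inside the black box, sketched here as an assumption rather than anything from the linked page, is to cut the regex into pieces and test ever longer prefixes of it with Matcher.lookingAt(), printing how far each prefix gets. The class name, method name and the shortened value class are all hypothetical. Each cumulative prefix has to be a valid regex on its own, so pieces must not split a group or a character class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTrace {
    /**
     * Builds the regex piece by piece and reports the first piece at which
     * the growing pattern stops matching at the start of the input.
     * Returns the number of leading pieces that still matched.
     */
    static int piecesMatched(String input, String... pieces) {
        StringBuilder soFar = new StringBuilder();
        int ok = 0;
        for (String piece : pieces) {
            soFar.append(piece);
            Matcher m = Pattern.compile(soFar.toString(), Pattern.CASE_INSENSITIVE)
                               .matcher(input);
            if (!m.lookingAt()) {
                System.out.println("FAILS after adding piece " + (ok + 1) + ": " + piece);
                break;
            }
            System.out.println("ok: " + soFar + "  matched \"" + m.group() + "\"");
            ok++;
        }
        return ok;
    }

    public static void main(String[] args) {
        // Shortened value class; the real one from the post could be pasted in instead.
        piecesMatched("src=\"http://someplace.com/picture.jpg\"",
                "(href|src)", " *= *", "(?!\")", "[a-z0-9:/._-]+");
    }
}
```

Because lookingAt() anchors at the start of the input, this only traces a match attempt that begins at position 0; it is a crude dump, not a real trace of the engine's backtracking.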
 
markspace

> But if you want a wrapper program to
> insert that Pattern in, please see

Uh Roedy, you're the one asking us for free help. If you are too lazy
to produce an actual working example, please don't expect me to cobble
one together for you.

SSCCE. With the test vectors you used, and the output, and explain what
output you were expecting to see.

Otherwise, I personally am not going to give this a second look.
 
Martin Gregorie

> Are there tools to help debug regexes? They either work or they don't. I
> have discovered nothing equivalent to a trace of debugging dumps. All
> you can do is experiments on smaller strings to figure out how the
> operators work.

Try this online tester for Java regexes:

http://www.fileformat.info/tool/regex.htm

But really, it's easy enough to write a small Java program that accepts a
regex from the command line, compiles it and then tests it by reading a
file, applying the regex to each line and outputting the results to
stdout. I haven't done that yet for Java, but I have for gawk, Perl and
Python.
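
A minimal sketch of such a program in Java might look like the following. It is not an existing tool: the class name, the usage string, the output format and the choice of CASE_INSENSITIVE are all assumptions made for illustration. It takes a regex and a file name on the command line, applies the regex to each line with find(), and prints the span and groups of every hit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTester {
    /** Describes every match of p in line: its span, its text and its groups. */
    static List<String> report(Pattern p, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            StringBuilder sb = new StringBuilder();
            sb.append('[').append(m.start()).append(',').append(m.end()).append(") ");
            sb.append(m.group());
            for (int g = 1; g <= m.groupCount(); g++) {
                sb.append(" | g").append(g).append('=').append(m.group(g));
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java RegexTester <regex> <file>");
            return;
        }
        Pattern p = Pattern.compile(args[0], Pattern.CASE_INSENSITIVE);
        try (BufferedReader in =
                 Files.newBufferedReader(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                List<String> hits = report(p, line);
                System.out.println(line);
                System.out.println(hits.isEmpty()
                        ? "  no match"
                        : "  " + String.join("\n  ", hits));
            }
        }
    }
}
```

Passing the regex on the command line also sidesteps Java string escaping, so the pattern is typed with single backslashes, subject to whatever quoting the shell requires.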
 
Arne Vajhøj

I would tend to think that the tool is sitting 40 cm in front
of the monitor.

The best trick to find problems in code is to write readable
code that does not look like something written for an
obfuscated coding contest.

For the specific problem look below for inspiration.

Arne

====================

import java.util.regex.Pattern;

public class UnQuotedAttribs {
    // p1: the pattern from the question without the lookahead, ending in '>'.
    private static Pattern p1 =
        Pattern.compile("(href|src)[ ]*=[ ]*([!#\\$%&\\(\\)\\+,\\-\\./0-9:;=\\?@\\[\\]\\^_`a-z\\{\\|\\}~]+)[>]",
            Pattern.CASE_INSENSITIVE);
    public static boolean isProblem1(String htmlfrag) {
        return p1.matcher(htmlfrag).matches();
    }

    // p2: the same pattern with the negative lookahead for a double quote.
    private static Pattern p2 =
        Pattern.compile("(href|src)[ ]*=[ ]*(?!\")([!#\\$%&\\(\\)\\+,\\-\\./0-9:;=\\?@\\[\\]\\^_`a-z\\{\\|\\}~]+)[>]",
            Pattern.CASE_INSENSITIVE);
    public static boolean isProblem2(String htmlfrag) {
        return p2.matcher(htmlfrag).matches();
    }

    // p3: a more readable pattern; the value may contain anything but quotes.
    private static Pattern p3 =
        Pattern.compile("(href|src)\\s*=\\s*[^'\"]*\\s*>",
            Pattern.CASE_INSENSITIVE);
    public static boolean isProblem3(String htmlfrag) {
        return p3.matcher(htmlfrag).matches();
    }

    // p4: same pattern as p3, applied after decoding &quot; to a real quote.
    private static Pattern p4 =
        Pattern.compile("(href|src)\\s*=\\s*[^'\"]*\\s*>",
            Pattern.CASE_INSENSITIVE);
    public static boolean isProblem4(String htmlfrag) {
        return p4.matcher(htmlfrag.replace("&quot;", "\"")).matches();
    }

    // Prints the fragment followed by the verdict of each of the four checks.
    private static void test(String htmlfrag) {
        System.out.println(htmlfrag);
        System.out.println(isProblem1(htmlfrag));
        System.out.println(isProblem2(htmlfrag));
        System.out.println(isProblem3(htmlfrag));
        System.out.println(isProblem4(htmlfrag));
    }

    public static void main(String[] args) {
        test("href=\"foobar.html\">");
        test("href = \"foobar.html\" >");
        test("href='foobar.html'>");
        test("href = 'foobar.html' >");
        test("href=foobar.html>");
        test("href = foobar.html >");
        test("href=&quot;foobar.html&quot;>");
        test("href = &quot;foobar.html&quot; >");
    }
}
 
Roedy Green

> No, that's not what I'm suggesting you provide. I'm suggesting you post
> a SSCCE. A real SSCCE. Posted here.

In this case an SSCCE will not help you one bit. All it will do is
confirm what I have said. If you can't solve it by eyeballing the
regex, you can't solve it.

You are playing mother may I, not making serious requests. The SSCCE I
gave you will let you test any Regex search. All you have to do is plop
the two strings to test in place. If you don't want to do that, then
it is unlikely you want to help, just jerk my chain.

You are not obligated to help. In turn I am not obligated to let you
jerk me around.
 
Roedy Green

> Some problems, though, can be far easier to solve with a fragment of scanning
> code than with a regex. Yours would seem to be a fine example.

From a practical point of view, that is what I did. The code is
working, by excluding the strings I don't want programmatically,
rather than trying to get the Regex to do it. However, it bugs me that
I don't understand how (?! works when I thought I did. I want to
figure it out primarily so I can document it at
http://mindprod.com/jgloss/regex.html
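
The post does not show the working code, so the following is only a guess at the general shape of "match loosely, then exclude programmatically". The class name, method name and the loose pattern are all hypothetical. The regex grabs whatever non-space run follows the equals sign, and plain Java then discards values that begin with a quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnquotedFinder {
    // Deliberately loose: capture the run of non-space, non-'>' characters after '='.
    private static final Pattern ATTR =
        Pattern.compile("(href|src)\\s*=\\s*([^\\s>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the href/src values that do not start with a single or double quote. */
    static List<String> unquotedValues(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            String value = m.group(2);
            char first = value.charAt(0);   // group 2 is never empty because of '+'
            if (first == '"' || first == '\'') {
                continue;                   // quoted value: excluded in code, not in the regex
            }
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=abc> <img src=\"http://someplace.com/picture.jpg\"> <a href = xyz>";
        System.out.println(unquotedValues(html));
    }
}
```

Keeping the exclusion in code means the test can be stepped through in a debugger, at the price of the rule no longer living in the pattern itself.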

Originally I did everything without regexes, mainly because they
weren't invented yet, and I got in the habit of parsing by hand. Now I
find them especially good in two situations:

1. When there are many variant patterns that can be combined. Hand
parsing ties itself in knots.

2. When the pattern will likely change later. You can change the
regex without changing any of the logic. Hand parsers are quite rigid.

I don't want to create a Bulgarian phrase book that sends others
astray.


 
Roedy Green

> By the way, any particular reason you're using "[ ]" instead of "\s"?

In a day I use three different Regex schemes, rotating rapidly:

1. Java/IntelliJ

2. Funduc Search and Replace for bulk search replace over many files

3. Visual Slick Edit editor

All three use different Regex schemes. This is a ruddy nuisance. I
wish there were a way to make the text editor and search/replace use
Java regexes. I have asked the authors and have been turned down. I have
already written a primitive multifile regex bulk search with Java
syntax. See http://mindprod.com/products1.html#EXTRACT

I find myself tending toward syntax that works the same way in all
three, even if non optimal for the scheme I am currently using.

I will do a global replace on my Java works to use \s.
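
For reference, the practical difference in Java between the two spellings can be seen with a small experiment. The class name and sample strings below are made up. "[ ]" matches only the space character, while "\s" also matches tab, newline, carriage return, form feed and vertical tab.

```java
import java.util.regex.Pattern;

final class SpaceVsWhitespace {
    // "[ ]" accepts only the space character around the '='.
    static final Pattern ONLY_SPACES = Pattern.compile("href[ ]*=[ ]*abc");
    // "\\s" accepts any of space, tab, newline, CR, form feed, vertical tab.
    static final Pattern ANY_WHITESPACE = Pattern.compile("href\\s*=\\s*abc");

    public static void main(String[] args) {
        String[] samples = { "href = abc", "href\t=\tabc", "href =\nabc" };
        for (String s : samples) {
            System.out.println(ONLY_SPACES.matcher(s).matches()
                    + " " + ANY_WHITESPACE.matcher(s).matches());
        }
    }
}
```

So the global replace is not purely cosmetic: patterns that used to reject attributes split across lines or padded with tabs will start accepting them.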
 
Roedy Green

> You write "it filters out nothing". But what does that mean?

Exactly that. It has no effect. The regex filters the same strings
with it or without. The (?!") might as well not even be there.
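
One general way a negative lookahead can end up having no effect is backtracking: a greedy quantifier in front of it gives back a character, the lookahead is then tested one position early, and whatever follows happily matches the quote. Whether that is what happened here cannot be told from the post; with a value class that contains neither a space nor a quote it would not produce a match. The sketch below uses invented names and a deliberately loose ".+" value to show the effect and two common remedies.

```java
import java.util.regex.Pattern;

final class LookaheadBacktracking {
    // Greedy \s* can give the space back, so the lookahead is retried in
    // front of the space, succeeds, and .+ then matches the space and the quote.
    static final Pattern GREEDY = Pattern.compile("=\\s*(?!\")(.+)");
    // Possessive \s*+ never gives back what it matched.
    static final Pattern POSSESSIVE = Pattern.compile("=\\s*+(?!\")(.+)");
    // Alternatively, let the lookahead skip the whitespace itself.
    static final Pattern LOOKAHEAD_FIRST = Pattern.compile("=(?!\\s*\")\\s*(.+)");

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String quoted = "src = \"x\"";
        String bare = "src = x";
        for (Pattern p : new Pattern[] { GREEDY, POSSESSIVE, LOOKAHEAD_FIRST }) {
            System.out.println(p.pattern()
                    + "  quoted: " + found(p, quoted) + "  bare: " + found(p, bare));
        }
    }
}
```

In this sketch only GREEDY still finds the quoted input; the other two reject it while accepting the bare one.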
 
Arne Vajhøj

May I suggest that you read:

http://mindprod.com/jgloss/sscce.html

One of the C's stands for compilable.

The lines you posted do not compile in Java.

So it is not an SSCCE.

Arne
 
