regex capability

Roedy Green

Consider a string like this:

Support DDR2 1066/800/667/533/400 DDR2 SDRAM

Is it possible to compose a regex that will peel out those numbers for
you each in its own field, or do you have to extract the string
"1066/800/667/533/400" and use split?

The various things I have tried just grab the last number.
 
Roedy Green

Easiest is to just use split. You can always do a regex of the type
"(\\d+)/((\\d+)/)?((\\d+)/)?((\\d+)/)?" but that's just pointlessly
complicated. There's no reason why you should use a regex when "normal"
string parsing is simpler and easier to read.
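
For reference, the extract-then-split route might look something like the sketch below. The class name SplitDemo, the method name speeds, and the pattern used to locate the slash-separated run are assumptions made for illustration; nobody in the thread posted this code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: locate the slash-separated run, then split it.
class SplitDemo {
    // A run of digits followed by at least one "/digits" group.
    private static final Pattern SPEEDS = Pattern.compile("\\d+(?:/\\d+)+");

    static List<Integer> speeds(String line) {
        List<Integer> result = new ArrayList<>();
        Matcher m = SPEEDS.matcher(line);
        if (m.find()) {
            // "normal" string parsing does the rest
            for (String field : m.group().split("/")) {
                result.add(Integer.parseInt(field));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667, 533, 400]
        System.out.println(speeds("Support DDR2 1066/800/667/533/400 DDR2 SDRAM"));
    }
}
```

The lone "2" in "DDR2" is skipped because the pattern insists on at least one slash after the first number.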

(xxx|yyy)+ seems to generate only one group item, no matter how many
repetitions there are. That strikes me as a bug, but likely someone
can explain why it is a feature or inevitability.
 
Eric Sosman

(xxx|yyy)+ seems to generate only one group item, no matter how many
repetitions there are. That strikes me as a bug, but likely someone
can explain why it is a feature or inevitability.

A (section of a) regex matches a (section of a) string, and the
Matcher machinery can tell you what substring was matched. The
machinery has no provision for doing further processing on that
matched substring, like saying "Oh, your regex didn't match a
string this time, but an array of strings."

You could, perhaps, cook up substitutes for Pattern and Matcher
to do such a thing. But I'm not sure you'd want to, because it
could make the API rather complicated. For example, consider a
fanex (for "fancy expression," like "regular expression" only
more so) along the lines of "(pat1)(pat2)" where "pat1" and "pat2"
can match and return arrays of substrings. The FancyMatcher says
"I matched five substrings." So you call group(3) to get the
third of them -- was it matched by "pat1" or by "pat2"? Yes, you
could invent an API to deal with this -- maybe FancyMatcher returns
a tree of nodes that point to other nodes and/or to substrings --
but I'm not confident this would be an unqualified improvement.
 
Robert Klemme

I think normal practice (in Perl, and Java) would be repeated
use of a fairly simple regexp.

In Java, I use

while(matcher.find()) {
...
}

The key is that Matcher is stateful.
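
A minimal, self-contained sketch of that loop, assuming the interesting substring has already been isolated (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of repeated find() calls on one Matcher.
class FindLoopDemo {
    static List<String> numbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        // The Matcher remembers its position: each find() resumes
        // where the previous match ended.
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667, 533, 400]
        System.out.println(numbers("1066/800/667/533/400"));
    }
}
```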

And for added security a two-level approach could be taken:

// untested
// needs java.util.regex.Pattern and java.util.regex.Matcher
Pattern whole = Pattern.compile("Support DDR2 (\\d+(?:/\\d+)*) DDR2 SDRAM");
Pattern number = Pattern.compile("\\d+");

Matcher m = whole.matcher(input);

if ( m.matches() ) {
    // second pass: pull each number out of the validated section
    for (m = number.matcher(m.group(1)); m.find();) {
        int x = Integer.parseInt(m.group());
    }
}
else {
    // error?
}

Kind regards

robert
 
David Lamb

For example, consider a
fanex (for "fancy expression," like "regular expression" only
more so) along the lines of "(pat1)(pat2)" where "pat1" and "pat2"
can match and return arrays of substrings. The FancyMatcher says
"I matched five substrings." So you call group(3) to get the
third of them -- was it matched by "pat1" or by "pat2"? Yes, you
could invent an API to deal with this -- maybe FancyMatcher returns
a tree of nodes that point to other nodes and/or to substrings --
but I'm not confident this would be an unqualified improvement.

Matching a pattern to generate a tree sounds a lot like a full-blown
context-free parser.
 
Jim Gibson

Roedy Green said:
(xxx|yyy)+ seems to generate only one group item, no matter how many
repetitions there are. That strikes me as a bug, but likely someone
can explain why it is a feature or inevitability.

The "feature" is that the number of capture groups is equal to the
number of capturing parenthesis pairs. If the above regular expression
results in multiple matches, each match is captured and stored into the
single capture buffer. After the match is finished, only the last
captured substring remains in the single capture buffer.
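
The behaviour described above is easy to observe with java.util.regex. The following is only a small sketch; the pattern and the names LastCaptureDemo and lastCapture are invented for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: one pair of capturing parentheses, many repetitions.
class LastCaptureDemo {
    // The outer group is non-capturing, so only (\d+) counts as a group.
    private static final Pattern REPEATED = Pattern.compile("(?:(\\d+)/?)+");

    static String lastCapture(String text) {
        Matcher m = REPEATED.matcher(text);
        // Every repetition overwrites group 1; only the last one survives.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        Matcher m = REPEATED.matcher("1066/800/667/533/400");
        System.out.println(m.groupCount());                       // prints 1
        System.out.println(lastCapture("1066/800/667/533/400"));  // prints 400
    }
}
```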
 
markspace

if ( m.matches() ) {
for (m = number.matcher(m.group(1)); m.find();) {
int x = Integer.parseInt(m.group());
}


Why re-invent the wheel?


import java.io.StringReader;
import java.util.Scanner;

public class ScannerTest {
    public static void main(String[] args) {
        StringReader in = new StringReader(
                "Support DDR2 100/200/300/400 DDR2 SDRAM");

        Scanner scanner = new Scanner(in);
        // treat every run of non-digit characters as a delimiter
        scanner.useDelimiter( "[^0-9]+" );
        while( scanner.hasNextInt() ) {
            System.out.println( scanner.nextInt() );
        }
    }
}


(Lightly tested.)
 
Paul Cager

if ( m.matches() ) {
for (m = number.matcher(m.group(1)); m.find();) {
int x = Integer.parseInt(m.group());
}

Why re-invent the wheel?

public class ScannerTest {
     public static void main(String[] args) {
         StringReader in = new StringReader(
                 "Support DDR2 100/200/300/400 DDR2 SDRAM");

         Scanner scanner = new Scanner(in);
         scanner.useDelimiter( "[^0-9]+" );
         while( scanner.hasNextInt() ) {
             System.out.println( scanner.nextInt() );
         }
     }

}

(Lightly tested.)

$ java ScannerTest
2
100
200
300
400
2
 
Robert Klemme

In this case I just wanted to demonstrate the strategy of first
checking the overall validity of the input and extracting the
interesting part, and then ripping that interesting part apart.
Whether a Scanner or another Matcher is used for the second step
wasn't that important to me. Also, the thread is called "regex
capability". :)

But, of course, your approach using the Scanner is perfectly
compatible with the two-step strategy, as Patricia also pointed
out. :)

public class ScannerTest {
      public static void main(String[] args) {
          StringReader in = new StringReader(
                  "Support DDR2 100/200/300/400 DDR2 SDRAM");
          Scanner scanner = new Scanner(in);
          scanner.useDelimiter( "[^0-9]+" );
          while( scanner.hasNextInt() ) {
              System.out.println( scanner.nextInt() );
          }
      }
}

(Lightly tested.)

$ java ScannerTest
2
100
200
300
400
2

This is a nice illustration of the case for a strategy I often use in
this sort of situation, combining tools using each to do the jobs it
does best.

For example, a regular expression match could pull out the
"100/200/300/400" substring, and a Scanner could extract the integers
from that. More generally, it could be split and then each of the split
results processed some other way.
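
A rough sketch of that combination, reusing the whole-line pattern posted earlier in the thread. The class name RegexThenScanner and the method name speeds are placeholders, and the code is offered as an illustration rather than a tested recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical combination: a regex validates and extracts, a Scanner parses.
class RegexThenScanner {
    private static final Pattern LINE =
            Pattern.compile("Support DDR2 (\\d+(?:/\\d+)*) DDR2 SDRAM");

    static List<Integer> speeds(String line) {
        List<Integer> result = new ArrayList<>();
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return result;   // line does not have the expected shape
        }
        // group(1) is just "100/200/300/400", so the "2" in "DDR2"
        // never reaches the Scanner.
        try (Scanner scanner = new Scanner(m.group(1))) {
            scanner.useDelimiter("/");
            while (scanner.hasNextInt()) {
                result.add(scanner.nextInt());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [100, 200, 300, 400]
        System.out.println(speeds("Support DDR2 100/200/300/400 DDR2 SDRAM"));
    }
}
```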

I generally prefer scanning over splitting in those cases. The
difference might be negligible here, but suppose the original pattern
changes (e.g. because we want to allow "@" as a separator instead of,
or in addition to, "/"). With the split approach two patterns need to
change, while with scanning for integers (pattern \d+) only the master
pattern needs to change. Also, with scanning it is clear what I want
(positively defining the matched portion), while with splitting it is
less clear (negatively defining what I do not want, the separator),
which leaves a lot of room for what is returned from _between_
separators.
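
To make the maintenance argument concrete, here is a sketch in which "@" has been admitted as a separator. Only the master pattern mentions the separators; the inner \d+ pattern is untouched. The "@" variant of the input line and the class name SeparatorChangeDemo are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the two-level approach with "/" or "@" separators.
class SeparatorChangeDemo {
    // Master pattern: the only place that knows about the separators.
    private static final Pattern WHOLE =
            Pattern.compile("Support DDR2 (\\d+(?:[/@]\\d+)*) DDR2 SDRAM");
    // Inner pattern: positively defines what we want, unchanged.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static List<Integer> speeds(String line) {
        List<Integer> result = new ArrayList<>();
        Matcher whole = WHOLE.matcher(line);
        if (whole.matches()) {
            Matcher number = NUMBER.matcher(whole.group(1));
            while (number.find()) {
                result.add(Integer.parseInt(number.group()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667]
        System.out.println(speeds("Support DDR2 1066@800/667 DDR2 SDRAM"));
    }
}
```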

Kind regards

robert
 
markspace

In this case I just wanted to demonstrate the strategy of first
checking the overall validity of the input and extracting the
interesting part, and then ripping that interesting part apart.
Whether a Scanner or another Matcher is used for the second step
wasn't that important to me. Also, the thread is called "regex
capability". :)

Fair enough. :)

But, of course, your approach using the Scanner is perfectly
compatible with the two-step strategy, as Patricia also pointed
out. :)


Don't forget too that Scanner can do other things besides use
delimiters. It has methods like skip() and findInLine() that ignore
delimiters and could be used to build a simple parser. You can also
change the delimiters on the fly to extract different sections of text.

A simple change to my example above:

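import java.io.StringReader;
import java.util.Scanner;

public class ScannerTest {
    public static void main(String[] args) {
        StringReader in = new StringReader(
                "Support DDR2 100/200/300/400 DDR2 SDRAM");

        Scanner scanner = new Scanner(in);
        // skip past the fixed prefix; findInLine() ignores delimiters
        scanner.findInLine( "Support DDR2" );
        // the numbers are separated by spaces or slashes
        scanner.useDelimiter( "[ /]+" );
        while( scanner.hasNextInt() ) {
            System.out.println( scanner.nextInt() );
        }
    }
}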
 
