Regex: Any character in character class

Sebastian

I want to match any sequence of characters, including line breaks, in a
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not
permissible everywhere. I cannot write [.]* because dot loses its
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl.
line-breaks). Is there a better way?

-- Sebastian
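
A minimal sketch of the behaviour described in the question, assuming java.util.regex. The class and method names (AnyCharDemo, matchesAll) are made up for illustration. It shows that a plain dot stops at line breaks, that [.] only matches a literal dot, and that [\s\S] matches every character because \s and \S are complements.

```java
import java.util.regex.Pattern;

class AnyCharDemo {
    // Without DOTALL, '.' does not match line terminators.
    static final Pattern DOT = Pattern.compile(".*");
    // \s and \S are complements, so the class covers every character.
    static final Pattern ANY = Pattern.compile("[\\s\\S]*");
    // Inside a character class the dot is just a literal '.'.
    static final Pattern LITERAL_DOT = Pattern.compile("[.]*");

    // Hypothetical helper: true if the pattern matches the entire input.
    static boolean matchesAll(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";
        System.out.println(matchesAll(DOT, text));          // false
        System.out.println(matchesAll(ANY, text));          // true
        System.out.println(matchesAll(LITERAL_DOT, text));  // false
        System.out.println(matchesAll(LITERAL_DOT, "...")); // true
    }
}
```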
 
Arne Vajhøj

> I want to match any sequence of characters, including line breaks, in a
> suffix of a multi-line string.
>
> I do not want to use Pattern.DOTALL, because line breaks are not
> permissible everywhere. I cannot write [.]* because dot loses its
> special meaning inside a character class.
>
> I have come up with [\S\s]*
> as meaning any sequence of non-whitespace or whitespace (incl.
> line-breaks). Is there a better way?

Do you always want to accept line breaks or not? If not then when?

Arne
 
Arved Sandstrom

>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne

Good question.

I take it the suffix is a generic last-N characters of the string
(Assumption #1). I take it that line breaks are OK in the suffix, not
necessarily so in the rest of the string (Assumption #2).

If you don't mind me asking, why don't you just grab the suffix, the
last N characters, with substring()? That *is* your match.

AHS
 
Sebastian

On 31.01.2013 04:27, Arne Vajhøj wrote:
>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne

The string I want to match basically has two parts (a "protocol" and a
"selection expression"). I want to allow line breaks anywhere in the
selection expression, but not in the protocol.

-- S.
 
Lew

Sebastian said:
> The string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

How do you tell which part is which?
 
Arne Vajhøj

> On 31.01.2013 04:27, Arne Vajhøj wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?
>>
>> Do you always want to accept line breaks or not? If not then when?
>
> The string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

Do you have a separator between the two parts like colon in URL's?

If yes then something like:

[.]+:[.|\n]+

Arne
 
markspace

> [.]+:[.|\n]+


Watch out for this. +, being greedy, will match a : in the selection
expression (the 2nd part) if : is allowed in the second part.

The reluctant modifier might be a better idea here:

.+?:[.|\n]+

Note that I don't think the initial brackets [] were needed. Also we're
yet again starting to see the problem with regex: it always evolves into
something that looks like your cat walked across the keyboard.
 
Arne Vajhøj

>> [.]+:[.|\n]+
>
> Watch out for this. +, being greedy, will match a : in the selection
> expression (the 2nd part) if : is allowed in the second part.
>
> The reluctant modifier might be a better idea here:
>
> .+?:[.|\n]+
>
> Note that I don't think the initial brackets [] were needed. Also we're
> yet again starting to see the problem with regex: it always evolves into
> something that looks like your cat walked across the keyboard.

You are absolutely right.

Non greedy.

No square brackets for first part.

And also round brackets for the last part.

.+?:(.|\n)+

I think I must have set a new world record. 3 bugs in 12 characters.

:-(

Arne
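
The three fixes can be checked with a short sketch. The capturing groups, the class name GreedyDemo and the sample input are assumptions added for illustration; only the pattern shapes come from the posts above. It shows that the bracketed form matches literal dots and pipes only, and that the greedy first part swallows a colon that belongs to the second part.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Greedy first part: backtracks only as far as the LAST usable colon.
    static final Pattern GREEDY = Pattern.compile("(.+):((?:.|\\n)+)");
    // Reluctant first part: stops at the FIRST colon.
    static final Pattern RELUCTANT = Pattern.compile("(.+?):((?:.|\\n)+)");
    // Original form: [.] is a literal dot, [.|\n] is dot, pipe or newline.
    static final Pattern BRACKETED = Pattern.compile("[.]+:[.|\\n]+");

    // Returns the first group if the whole input matches, otherwise null.
    static String firstPart(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "proto:sel:ect\nmore";
        System.out.println(firstPart(GREEDY, input));                // proto:sel
        System.out.println(firstPart(RELUCTANT, input));             // proto
        System.out.println(BRACKETED.matcher(input).matches());      // false
        System.out.println(BRACKETED.matcher("..:.|\n").matches());  // true
    }
}
```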
 
Robert Klemme

> On 31.01.2013 04:27, Arne Vajhøj wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?

Yes.

>> Do you always want to accept line breaks or not? If not then when?
>
> The string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

Of course you can use DOTALL - as an embedded flag:

package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Dotty {

    // (?s:...) switches DOTALL on for the enclosed group only.
    private static final Pattern PAT =
        Pattern.compile("proto.*(?s:sel.*)");

    public static void main(String[] args) {
        test("protoPselS");
        test("protoPPselS\nS");
        test("protoP\nPselS\nS");
    }

    public static void test(final CharSequence cs) {
        System.out.println("cs=\"" + cs + "\"");
        final Matcher m = PAT.matcher(cs);

        if (m.matches()) {
            System.out.println("Match: \"" + m.group() + "\"");
        } else {
            System.out.println("Mismatch");
        }

        System.out.println();
    }

}

Kind regards

robert
 
Sebastian

On 01.02.2013 23:13, Arne Vajhøj wrote:
> [snip]
> And also round brackets for the last part.
>
> .+?:(.|\n)+
>
> I think I must have set a new world record. 3 bugs in 12 characters.
>
> :-(
>
> Arne

Here's a concrete example:

SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]


The second part is everything after the first comma. I was using
(.+?),[\s\S]+

Arne's suggestion modified for my needs (comma as separator, and I only
want to capture the first part as a group) will work fine as well:
(.+?),(?:.|\n)+

Can't say though that I find anything to prefer the one to the other.
Perhaps the second looks even more like the result of a cat walk...

-- Sebastian
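
One difference between the two does show up with Windows line breaks: in java.util.regex the dot also excludes \r, so (?:.|\n) cannot step over a "\r\n" sequence, while [\s\S] can. A small sketch, assuming the example string above; the class name SuffixDemo and the helper protocol() are hypothetical, and the third pattern uses the embedded (?s:...) flag suggested earlier in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SuffixDemo {
    static final Pattern CHAR_CLASS = Pattern.compile("(.+?),[\\s\\S]+");
    static final Pattern ALTERNATION = Pattern.compile("(.+?),(?:.|\\n)+");
    static final Pattern EMBEDDED_FLAG = Pattern.compile("(.+?),(?s:.+)");

    // Returns the part before the first comma, or null if there is no match.
    static String protocol(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String head = "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,";
        String unix = head + "\ncompany:company]";
        String windows = head + "\r\ncompany:company]";

        System.out.println(protocol(CHAR_CLASS, unix));        // SCA:LIST
        System.out.println(protocol(ALTERNATION, unix));       // SCA:LIST
        System.out.println(protocol(EMBEDDED_FLAG, unix));     // SCA:LIST

        System.out.println(protocol(CHAR_CLASS, windows));     // SCA:LIST
        System.out.println(protocol(ALTERNATION, windows));    // null
        System.out.println(protocol(EMBEDDED_FLAG, windows));  // SCA:LIST
    }
}
```

A repeated alternation such as (?:.|\n)+ is also commonly reported to overflow the stack on very long inputs in Java, which would be another reason to lean towards the character class or the embedded flag.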
 
markspace

> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]

For something this simple you might want to consider just String::split().

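import java.util.Arrays;

String test =
    "SCA:LIST,select[werks_s:default_plant],values[bukrs:bukrs,company:company]";
// A limit of 2 splits at the first comma only; the remainder stays intact.
String[] parse = test.split( ",\\s*", 2 );
System.out.println( Arrays.toString( parse ) );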

This could be faster since the second half of the regex, (?:.|\n)+,
doesn't have to execute.
 
Arne Vajhøj

> On 01.02.2013 23:13, Arne Vajhøj wrote:
>> [snip]
>> And also round brackets for the last part.
>>
>> .+?:(.|\n)+
>>
>> I think I must have set a new world record. 3 bugs in 12 characters.
>>
>> :-(
>
> Here's a concrete example:
>
> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]
>
> The second part is everything after the first comma. I was using
> (.+?),[\s\S]+
>
> Arne's suggestion modified for my needs (comma as separator, and I only
> want to capture the first part as a group) will work fine as well:
> (.+?),(?:.|\n)+
>
> Can't say though that I find anything to prefer the one to the other.
> Perhaps the second looks even more like the result of a cat walk...

It is not unusual that there is more than one regex that
does the job.

Arne
 
Lew

Arne said:
> Sebastian said:
>> Arne Vajhøj wrote:
>>> [snip]
>>> And also round brackets for the last part.
>>>
>>> .+?:(.|\n)+
>>>
>>> I think I must have set a new world record. 3 bugs in 12 characters.
>>> :-(
>> Here's a concrete example:
>>
>> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
>> company:company]
>> The second part is everything after the first comma. I was using

You mean 'expression.substring(expression.indexOf(',') + 1)'?
(modulo the usual error checks, of course)

>> (.+?),[\s\S]+
>> Arne's suggestion modified for my needs (comma as separator, and I only
>> want to capture the first part as a group) will work fine as well:

You mean 'expression.substring(0, expression.indexOf(','))'?

If all you need to do is split a string on a comma, why use regexes at all?

> It is not unusual that there is more than one regex that
> does the job.

It is not unusual that there is more than one non-regex that does the job.
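
A sketch of the indexOf/substring approach with "the usual error checks" spelled out. The class name, method name and the choice of IllegalArgumentException are assumptions for illustration. The checks mirror what the regexes in this thread enforce: both parts non-empty, and no line break in the first part.

```java
class SplitAtComma {
    // Returns { protocol, selection } split at the first comma.
    static String[] splitAtFirstComma(String expression) {
        int comma = expression.indexOf(',');
        if (comma <= 0 || comma == expression.length() - 1) {
            throw new IllegalArgumentException("need text on both sides of a comma");
        }
        String protocol = expression.substring(0, comma);
        if (protocol.indexOf('\n') >= 0 || protocol.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("line break in protocol part");
        }
        // The selection part is kept verbatim, line breaks included.
        return new String[] { protocol, expression.substring(comma + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtFirstComma(
            "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\ncompany:company]");
        System.out.println(parts[0]); // SCA:LIST
        System.out.println(parts[1]);
    }
}
```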
 
Arne Vajhøj

> If all you need to do is split a string on a comma, why use regexes at all?
>
> It is not unusual that there is more than one non-regex that does the job.

True.

But less surprising.

Arne
 
Gene Wirchenko

> [snip]
> I think I must have set a new world record. 3 bugs in 12 characters.
>
> :-(

I may be able to save your honour. <G>

IBM had bugs in a one-instruction program only two bytes long. The
program was IEFBR14, and you can read about it on Wikipedia. There
was a series of corrections which resulted in a program several times
larger.

Sincerely,

Gene Wirchenko
 
