Should the '\w' class match non-ASCII letters?

jb

I've been working with Patterns for a while, and the following thing
baffled me: the \w class doesn't seem to include non-ASCII letters (well, at
least not Polish ones :().

Javadoc seems to say nothing about it.

Here's the test:

import java.util.regex.*;

class rTest {
    public static void main(String[] args) {
        System.out.println("Regexp: '\\w+'");
        Pattern pat;
        Matcher m;
        pat = Pattern.compile("\\w+");
        m = pat.matcher("a");
        System.out.println("Matches 'a' " + m.matches());
        m = pat.matcher("\u015b");
        System.out.println("Matches '\u015b' " + m.matches());
        m = pat.matcher("a");
        System.out.println("Matches 'a' " + m.matches());
        m = pat.matcher("¶");
        System.out.println("Matches '¶' " + m.matches());
    }
}

It prints (on my system):
Regexp: '\w+'
Matches 'a' true
Matches '¶' false
Matches 'a' true
Matches '¶' false

The question is: is this a bug, or is it according to the specs? And is
there an elegant way to include all (Polish) letters in a character
class?
 
Jussi Piitulainen

Eric said:
jb said:
I've been working with Patterns for a while, and following thing
baffled: \w class doesnt seem to include non ascii letters (well at
least not polish ones :().
Javadoc seems to say nothing about it.

The Javadoc for 1.6 says

Predefined character classes
...
\w A word character: [a-zA-Z_0-9]

It says the same for 1.4.2 already.

I've never tried \p{prop} before, but I did now, and \p{L} appears to
match Finnish non-ASCII letters, so I guess it would work for Polish,
too. It is described in Javadoc for Pattern in 1.4.2 under headings
"Classes for Unicode blocks and categories" and "Unicode support".
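
For anyone who wants to try this, here is a minimal sketch of the difference. The class name UnicodeWordDemo and the sample word are made up for illustration. The third pattern assumes Java 7 or later, where Pattern.UNICODE_CHARACTER_CLASS makes \w itself Unicode-aware; that flag does not exist in the 1.4.2 and 1.6 versions discussed above. Non-ASCII letters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // \w as documented in the Javadoc: ASCII only, [a-zA-Z_0-9]
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // \p{L} matches any character in the Unicode "letter" category
    static final Pattern UNICODE_LETTERS = Pattern.compile("\\p{L}+");
    // Assumption: Java 7+, where this flag widens \w to Unicode word characters
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the whole input matches the pattern
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Polish word "swieto" spelled with s-acute (U+015B) and e-ogonek (U+0119)
        String polish = "\u015bwi\u0119to";
        System.out.println(matches(ASCII_WORD, polish));      // false
        System.out.println(matches(UNICODE_LETTERS, polish)); // true
        System.out.println(matches(UNICODE_WORD, polish));    // true
    }
}
```

Note that \p{L} covers letters only; if digits and the underscore are also wanted, something like [\p{L}\p{Nd}_] is probably closer to what \w does for ASCII.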
 
jb

Eric said:
Javadoc seems to say nothing about it.

The Javadoc for 1.6 says

Predefined character classes
...
\w A word character: [a-zA-Z_0-9]

Well, I assumed that my chars are between a and z; in the alphabet they are :).

Jussi said:
I've never tried \p{prop} before, but I did now, and \p{L} appears to
match Finnish non-ASCII letters, so I guess it would work for Polish,
too. It is described in Javadoc for Pattern in 1.4.2 under headings
"Classes for Unicode blocks and categories" and "Unicode support".

Thanks, it works :).
 
Joshua Cranmer

jb said:
Well I assumed that my chars are between a-z, in alphabet they are :).

Brief note: the range a-z refers to all characters c such that 'a' <= c
and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will
match all uppercase letters, all digits, the lowercase letters 'a', 'b',
'c', 'd', 'e', and 'f', as well as the punctuation characters in the
following string ":;<=>?@[\\]^_`", which is probably not what would be
intended.
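
A small sketch to check the above, assuming nothing beyond java.util.regex. The class RangeDemo and its helper methods are made-up names for illustration. It lists every ASCII character that a given character class accepts, so the contents of [0-f] can be inspected directly.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // Collects every ASCII character (0..127) that the character class matches
    static String asciiMatchedBy(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Keeps only the characters that are neither letters nor digits
    static String punctuationOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a-acute (U+00E1) has a higher code than 'z' (U+007A)
        System.out.println("\u00e1".matches("[a-z]"));                 // false
        System.out.println(asciiMatchedBy("[0-f]"));
        System.out.println(punctuationOnly(asciiMatchedBy("[0-f]"))); // :;<=>?@[\]^_`
    }
}
```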
 
Mike Schilling

bugbear said:
Joshua said:
jb said:
Well I assumed that my chars are between a-z, in alphabet they are
:).

Brief note: the range a-z refers to all characters c such that 'a'
<= c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range
0-f will match all uppercase letters, all digits, the lowercase
letters 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
characters in the following string ":;<=>?@[\\]^_`", which is
probably not what would be intended.

Agreed. ASCII tricks (for which ASCII was, in part, designed)
don't work well in the new world of Unicode, or even Latin-1.

Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
the same thing in all of them.
 
Tom Anderson

bugbear said:
Joshua said:
jb wrote:
Well I assumed that my chars are between a-z, in alphabet they are
:).

Brief note: the range a-z refers to all characters c such that 'a' <=
c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f
will match all uppercase letters, all digits, the lowercase letters
'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
characters in the following string ":;<=>?@[\\]^_`", which is probably
not what would be intended.

Agreed. ASCII tricks (for which ASCII was, in part, designed)
don't work well in the new world of Unicode, or even Latin-1.

Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
mean the same thing in all of them.

Exactly. In ASCII, the numerical order of the code points is the same as
the collating sequence of the letters, so things like a-z mean what they
look like. In Latin-1 and Unicode, this is no longer true: a-z looks like
it should include á, but it actually doesn't.
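
To make the distinction concrete, here is a rough sketch comparing code point order with collation order. It assumes the JDK's java.text.Collator with the Locale.ENGLISH rules; the class name CollationDemo and its methods are invented for the example, and other locales may order accented letters differently.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollationDemo {
    // Plain String ordering compares UTF-16 code units, i.e. numeric order
    static List<String> sortByCodePoint(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Collator ordering follows language rules (assumption: English rules)
    static List<String> sortByCollation(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Collator.getInstance(Locale.ENGLISH));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("z");
        words.add("\u00e1"); // a-acute
        words.add("a");
        // Numeric order puts a-acute after z; collation puts it next to a
        System.out.println(sortByCodePoint(words));
        System.out.println(sortByCollation(words));
    }
}
```

Regex ranges such as a-z use the first kind of ordering, which is why they cannot be relied on to follow the alphabet outside ASCII.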

tom
 
Joshua Cranmer

Mike said:
Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
the same thing in all of them.

I think bugbear was referring to the fact that in the English language
as defined by ASCII (excluding borrowed accents), the statements "char
is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in
many scripts, that is not true (e.g., à).
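
As a sketch of that point: the range test and a Unicode-aware test agree on ASCII letters but disagree on à. The class LowercaseDemo and its method names are hypothetical; Character.isLowerCase and the \p{Ll} category are standard JDK features.

```java
class LowercaseDemo {
    // The ASCII-era test: is the char numerically between 'a' and 'z'?
    static boolean inAsciiRange(char c) {
        return 'a' <= c && c <= 'z';
    }

    // The Unicode-aware test from java.lang.Character
    static boolean isLowercaseLetter(char c) {
        return Character.isLowerCase(c);
    }

    // The same idea as a regex: \p{Ll} is the "lowercase letter" category
    static boolean matchesLowercaseCategory(char c) {
        return String.valueOf(c).matches("\\p{Ll}");
    }

    public static void main(String[] args) {
        char[] samples = { 'q', '\u00e0', 'Q' }; // q, a-grave, Q
        for (char c : samples) {
            System.out.println((int) c + ": range=" + inAsciiRange(c)
                    + " isLowerCase=" + isLowercaseLetter(c)
                    + " \\p{Ll}=" + matchesLowercaseCategory(c));
        }
    }
}
```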
 
