Should the '\w' class match non-ASCII letters?

jb · Aug 26, 2008

I've been working with Patterns for a while, and the following thing
baffled me: the \w class doesn't seem to include non-ASCII letters (well,
at least not Polish ones).

Javadoc seems to say nothing about it.

Here's the test:

import java.util.regex.*;

class rTest {
    public static void main(String[] args) {
        System.out.println("Regexp: '\\w+'");
        Pattern pat = Pattern.compile("\\w+");
        Matcher m;

        // First pair: the Polish letter written as a Unicode escape.
        m = pat.matcher("a");
        System.out.println("Matches 'a' " + m.matches());
        m = pat.matcher("\u015b");
        System.out.println("Matches '\u015b' " + m.matches());

        // Second pair: the same letter typed literally in the source.
        m = pat.matcher("a");
        System.out.println("Matches 'a' " + m.matches());
        m = pat.matcher("ś");
        System.out.println("Matches 'ś' " + m.matches());
    }
}

It prints (on my system):
Regexp: '\w+'
Matches 'a' true
Matches 'ś' false
Matches 'a' true
Matches 'ś' false

The question is whether this is buggy behaviour or according to the
spec, and whether there is an elegant way to include all (Polish)
letters in a class.

Jussi Piitulainen · Aug 26, 2008

Eric said:
jb said:

I've been working with Patterns for a while, and the following thing
baffled me: the \w class doesn't seem to include non-ASCII letters (well,
at least not Polish ones).
Javadoc seems to say nothing about it.

The Javadoc for 1.6 says

Predefined character classes
...
\w A word character: [a-zA-Z_0-9]

It says the same for 1.4.2 already.

I've never tried \p{prop} before, but I did now, and \p{L} appears to
match Finnish non-ASCII letters, so I guess it would work for Polish,
too. It is described in Javadoc for Pattern in 1.4.2 under headings
"Classes for Unicode blocks and categories" and "Unicode support".

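A minimal sketch of the \p{L} suggestion above, comparing it with the default \w. The class and method names (UnicodeLetterDemo, matchesAll) are made up for illustration. The last line uses Pattern.UNICODE_CHARACTER_CLASS, which was only added in Java 7, so it is an assumption that a newer JDK is available; it does not exist in the 1.4.2 and 1.6 releases discussed here.

```java
import java.util.regex.Pattern;

public class UnicodeLetterDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesAll(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String s = "\u015b"; // Polish s with acute, written as an escape

        // Default word class is ASCII only: [a-zA-Z_0-9].
        System.out.println(matchesAll("\\w+", 0, s));

        // Unicode category L covers letters of any script.
        System.out.println(matchesAll("\\p{L}+", 0, s));

        // A hand-built "Unicode word" class: letters, decimal digits, underscore.
        System.out.println(matchesAll("[\\p{L}\\p{Nd}_]+", 0, "za\u017c\u00f3\u0142\u0107_1"));

        // Java 7 and later only: make the predefined classes Unicode-aware.
        System.out.println(matchesAll("\\w+", Pattern.UNICODE_CHARACTER_CLASS, s));
    }
}
```

Writing the letters as escapes keeps the example independent of the source file encoding, which may matter when compiling on a system with a non-UTF-8 default.
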
jb · Aug 26, 2008

Eric said:
Javadoc seems to say nothing about it.

The Javadoc for 1.6 says

Predefined character classes
...
\w A word character: [a-zA-Z_0-9]

Well, I assumed that my chars are between a-z; in the alphabet they are.

Jussi said:
I've never tried \p{prop} before, but I did now, and \p{L} appears to
match Finnish non-ASCII letters, so I guess it would work for Polish,
too. It is described in Javadoc for Pattern in 1.4.2 under headings
"Classes for Unicode blocks and categories" and "Unicode support".

Thanks, it works.

Joshua Cranmer · Aug 26, 2008

jb said:
Well, I assumed that my chars are between a-z; in the alphabet they are.

Brief note: the range a-z refers to all characters c such that 'a' <= c
and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will
match all uppercase letters, all digits, the lowercase letters 'a', 'b',
'c', 'd', 'e', and 'f', as well as the punctuation characters in the
following string ":;<=>?@[\\]^_`", which is probably not what would be
intended.
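
The point about ranges can be checked with a short sketch. The class and helper names (RangeDemo, inClass) are hypothetical; only Pattern.matches is standard API.

```java
import java.util.regex.Pattern;

public class RangeDemo {
    // Hypothetical helper: true if the single character c is in the class.
    static boolean inClass(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // Ranges compare code points: '0' is 0x30 and 'f' is 0x66.
        for (char c : new char[] {'5', 'Z', 'c', '@', '_', 'g'}) {
            System.out.println(c + " in [0-f]: " + inClass("[0-f]", c));
        }
        // U+00E1 (a with acute) lies far above 'z' (0x7A).
        System.out.println("[a-z] matches U+00E1: " + inClass("[a-z]", '\u00e1'));
    }
}
```

Everything in the loop except 'g' falls inside the range, because 'g' is 0x67, one past 'f'.
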

Mike Schilling · Aug 28, 2008

bugbear said:
Joshua said:

jb said:

Well, I assumed that my chars are between a-z; in the alphabet they are.

Brief note: the range a-z refers to all characters c such that 'a'
<= c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range
0-f will match all uppercase letters, all digits, the lowercase
letters 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
characters in the following string ":;<=>?@[\\]^_`", which is
probably not what would be intended.

Agreed. ASCII tricks (for which ASCII was, in part, designed)
don't work well in the new world of UNICODE, or even Latin-1

Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
the same thing in all of them.

Tom Anderson · Aug 28, 2008

bugbear said:
bugbear said:

Joshua said:

jb wrote:
Well, I assumed that my chars are between a-z; in the alphabet they are.

Brief note: the range a-z refers to all characters c such that 'a' <=
c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f
will match all uppercase letters, all digits, the lowercase letters
'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
characters in the following string ":;<=>?@[\\]^_`", which is probably
not what would be intended.

Agreed. ASCII tricks (for which ASCII was, in part, designed)
don't work well in the new world of UNICODE, or even Latin-1

Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
mean the same thing in all of them.

Exactly. In ASCII, the numerical order of the code points is the same as
the collating sequence of the letters, so things like a-z mean what they
look like. In Latin-1 and Unicode, this is no longer true: a-z looks like
it should include á, but it actually doesn't.

tom
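
The difference between code point order and collating order can be illustrated with java.text.Collator. This is only a sketch: the class and method names (CollationDemo, byCodePoint, byCollation) are invented for the example, and the choice of Locale.ENGLISH is an assumption; other locales may order accented letters differently.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    // Code point order: compares the UTF-16 values of the two strings.
    static int byCodePoint(String a, String b) {
        return a.compareTo(b);
    }

    // Collation order for a locale, roughly as a dictionary would sort.
    static int byCollation(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }

    public static void main(String[] args) {
        String aAcute = "\u00e1"; // a with acute

        // By code point the accented letter sorts after z ...
        System.out.println(byCodePoint(aAcute, "z") > 0);

        // ... but a collator places it among the other a's, before z.
        System.out.println(byCollation(aAcute, "z", Locale.ENGLISH) < 0);
    }
}
```

Regex character ranges follow the first ordering, never the second, which is why a-z excludes accented letters.
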

Joshua Cranmer · Aug 28, 2008

Mike said:
Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
the same thing in all of them.

I think bugbear was referring to the fact that in the English language
as defined by ASCII (excluding borrowed accents), the statements "char
is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in
many scripts, that is not true (e.g., à).
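
The two statements can be compared directly in a small sketch. The class and helper names (LowercaseDemo, inAsciiRange) are hypothetical; Character.isLowerCase and Character.isLetter are the standard library methods.

```java
public class LowercaseDemo {
    // The ASCII-era test: a plain range comparison on the char value.
    static boolean inAsciiRange(char c) {
        return 'a' <= c && c <= 'z';
    }

    public static void main(String[] args) {
        char aGrave = '\u00e0'; // a with grave

        // The range test rejects the accented letter ...
        System.out.println(inAsciiRange(aGrave));

        // ... while the Unicode-aware check accepts it.
        System.out.println(Character.isLowerCase(aGrave));

        // The same holds for the Polish letter from the original question.
        System.out.println(Character.isLetter('\u015b'));
    }
}
```
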
