Regex for unicode letter characters

schickb · Jan 11, 2009

I need a regex that will match strings containing only Unicode letter
characters (not including numeric characters or the _ character). I was
surprised to find that the 're' module does not already include a special
character class for this (Python 2.6). Or did I miss something?

It seems like this would be a very common need. Is the following the
only way to generate the character class (based on an old post by
Martin v. Löwis)?

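import unicodedata, sys

def letters():
    # Build a regex character class covering every Unicode letter (category L*).
    start = end = None
    result = []
    for index in xrange(sys.maxunicode + 1):
        c = unichr(index)
        if unicodedata.category(c)[0] == 'L':
            # Extend the current run of letters, or open a new one.
            if start is None:
                start = end = c
            else:
                end = c
        elif start:
            # A non-letter ends the run: emit a single character or a range.
            if start == end:
                result.append(start)
            else:
                result.append(start + "-" + end)
            start = None
    return u'[' + u''.join(result) + u']'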

Seems rather cumbersome.
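
For readers on Python 3, the same idea might look roughly like the sketch below. It assumes Python 3.4 or later (for `fullmatch`); the names `letter_class`, `LETTERS_ONLY` and `is_letters_only` are made up for illustration and are not part of the 're' module. It also flushes a run that would reach the last code point, which the loop above never needs in practice.

```python
import re
import sys
import unicodedata


def letter_class():
    """Return a regex character class matching any code point in a Unicode L* category."""
    ranges = []
    start = end = None
    for code in range(sys.maxunicode + 1):
        if unicodedata.category(chr(code)).startswith("L"):
            if start is None:
                start = code  # open a new run of letters
            end = code
        elif start is not None:
            ranges.append((start, end))  # a non-letter closes the run
            start = None
    if start is not None:  # flush a run that reaches the last code point
        ranges.append((start, end))
    parts = []
    for lo, hi in ranges:
        parts.append(chr(lo) if lo == hi else chr(lo) + "-" + chr(hi))
    return "[" + "".join(parts) + "]"


# Scanning every code point takes a moment, so build the pattern once.
LETTERS_ONLY = re.compile(letter_class() + "+")


def is_letters_only(text):
    """True if text is non-empty and consists solely of Unicode letters."""
    return LETTERS_ONLY.fullmatch(text) is not None
```

With this, `is_letters_only("L\u00f6wis")` is true, while strings such as `"abc123"` or `"foo_bar"` are rejected.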

-Brad

MRAB · Jan 11, 2009

schickb said:
I need a regex that will match strings containing only Unicode letter
characters (not including numeric characters or the _ character). I was
surprised to find that the 're' module does not already include a special
character class for this (Python 2.6). Or did I miss something?

It seems like this would be a very common need. Is the following the
only way to generate the character class (based on an old post by
Martin v. Löwis)?

[snip]
Basically, yes.

The re module was last worked on in 2003 (remember it's all voluntary!).
Such omissions should be addressed in Python 2.7.
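
If a regex is not strictly required, the same test can probably be done by checking each character's category directly. This is only a sketch written for Python 3; `is_all_letters` is a hypothetical helper name, and treating the empty string as a non-match is an assumption about what is wanted.

```python
import unicodedata


def is_all_letters(text):
    """True if text is non-empty and every character is in a Unicode L* category."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("L") for ch in text
    )
```

This skips building the large character class, at the cost of not being usable inside a bigger pattern.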

Steve Holden · Jan 11, 2009

MRAB said:
[snip]
Basically, yes.

The re module was last worked on in 2003 (remember it's all voluntary!).
Such omissions should be addressed in Python 2.7.

By "should be" do you mean "ought to be (but I have no intention of
helping)", "are expected to be (but someone else will be doing the
work)", "it's on my list and I am expecting to get finished in time for
2.7 integration" or something else?

regards
Steve

MRAB · Jan 11, 2009

Steve said:
By "should be" do you mean "ought to be (but I have no intention of
helping)", "are expected to be (but someone else will be doing the
work)", "it's on my list and I am expecting to get finished in time for
2.7 integration" or something else?

The third one.

Steve Holden · Jan 11, 2009

MRAB said:
The third one.

Well, that's good news. Let me know if you need help.

regards
Steve
