schickb
I need a regex that will match strings containing only Unicode letter
characters (no numeric characters and no _ character). I was surprised
to find that the 're' module does not already include a special character
class for this (Python 2.6). Or did I miss something?
It seems like this would be a very common need. Is the following the
only way to generate the character class (based on an old post by
Martin v. Löwis)?
import unicodedata, sys

def letters():
    # Collect runs of consecutive code points whose category starts with 'L'.
    start = end = None
    result = []
    for index in xrange(sys.maxunicode + 1):
        c = unichr(index)
        if unicodedata.category(c)[0] == 'L':
            if start is None:
                start = end = c
            else:
                end = c
        elif start:
            # A run just ended: emit it as a single character or as a range.
            if start == end:
                result.append(start)
            else:
                result.append(start + "-" + end)
            start = None
    return u'[' + u''.join(result) + u']'
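
For illustration only, here is a minimal sketch of how such a generated class might be used to test whole strings. It is not part of the original code: the names letter_class, ONLY_LETTERS and is_all_letters are hypothetical, and the try/except shim is an assumption added so the same sketch runs on both Python 2 and Python 3.

```python
import re
import sys
import unicodedata

try:  # Python 2 names, as used in the post
    unichr, xrange
except NameError:  # Python 3: chr and range play the same roles (assumption)
    unichr, xrange = chr, range


def letter_class():
    """Build a character class covering every code point in an L* category."""
    ranges = []
    start = end = None
    for index in xrange(sys.maxunicode + 1):
        c = unichr(index)
        if unicodedata.category(c).startswith('L'):
            if start is None:
                start = c
            end = c
        elif start is not None:
            # A run of letters just ended: store it as "x" or "x-y".
            ranges.append(start if start == end else start + u'-' + end)
            start = None
    if start is not None:
        # Flush a run that reaches the very last code point.
        ranges.append(start if start == end else start + u'-' + end)
    return u'[' + u''.join(ranges) + u']'


# Hypothetical name. \A and \Z anchor the match to the whole string;
# a plain $ would also accept a trailing newline.
ONLY_LETTERS = re.compile(u'\\A' + letter_class() + u'+\\Z')


def is_all_letters(text):
    """Return True if text is non-empty and consists only of letters."""
    return ONLY_LETTERS.match(text) is not None
```

Building the class walks every code point, so the sketch compiles the pattern once at import time and reuses it rather than calling letter_class() per match.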
Seems rather cumbersome.
-Brad