re.finditer() skips unicode into selection

akshay.ksth

I am using the following Highlighter class to get spell checking working on my QTextEdit.

class Highlighter(QSyntaxHighlighter):
    pattern = ur'\w+'

    def __init__(self, *args):
        QSyntaxHighlighter.__init__(self, *args)
        self.dict = None

    def setDict(self, dict):
        self.dict = dict

    def highlightBlock(self, text):
        if not self.dict:
            return
        text = unicode(text)
        format = QTextCharFormat()
        format.setUnderlineColor(Qt.red)
        format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)
        unicode_pattern = re.compile(self.pattern, re.UNICODE | re.LOCALE)

        for word_object in unicode_pattern.finditer(text):
            if not self.dict.spell(word_object.group()):
                print word_object.group()
                self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)

But whenever I pass unicode values into my QTextEdit, re.finditer() does not seem to pick them up.

When I pass "I am a नेपाली" into the QTextEdit, the output is like this:

I I I a I am I am I am a I am a I am a I am a I am a I am a I am a I am a

It is completely ignoring the unicode. What might be the issue? I am new to PyQt and regex. I'm using Python 2.7 and PyQt4.
 
Terry Reedy

> It is completely ignoring the unicode.

The whole text is unicode. It is ignoring the non-ascii, as you asked it
to with re.LOCALE.

With 3.3.2:

import re

pattern = re.compile(r'\w+', re.LOCALE)
text = "I am a नेपाली"

for word in pattern.finditer(text):
    print(word.group())

I
am
a
Delete ', re.LOCALE' and the following are also printed:
न
प
ल

There is an issue on the tracker about the vowel marks in नेपाली being
mis-seen as word separators, but that is another issue.

Lesson: when you do not understand output, simplify code to see what
changes. Separating re issues from framework issues is a big step in
that direction.

 
MRAB

In Python 2.7, the re module has a somewhat limited idea of what a
"word" character is. It recognises 'DEVANAGARI LETTER NA' as a letter,
but 'DEVANAGARI VOWEL SIGN E' as a diacritic. The pattern ur'(?u)\w+'
will therefore split "नेपाली" into 3 parts.
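
To see why the word falls apart, it can help to look at the Unicode general category of each code point. The following is a minimal sketch in Python 3 (the thread itself uses Python 2.7), using only the standard unicodedata module; the variable names are illustrative. Letters have categories starting with "L", while the vowel signs are combining marks ("Mn" or "Mc"), which the re module's \w does not treat as word characters.

```python
import unicodedata

# "नेपाली" spelled out with escapes so the code points are unambiguous.
word = "\u0928\u0947\u092a\u093e\u0932\u0940"

# Pair each code point's official name with its general category.
# "Lo" = other letter, "Mn" = non-spacing mark, "Mc" = spacing combining mark.
categories = [(unicodedata.name(ch), unicodedata.category(ch)) for ch in word]

for name, category in categories:
    print(category, name)
```

Every second code point is a mark rather than a letter, so a pattern that only accepts letters and digits stops after each base consonant.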

The LOCALE flag is for locale-sensitive 1-byte per character
bytestrings. It's rarely useful.

The UNICODE flag is for dealing with Unicode strings, which is what you
need here. You shouldn't be using both at the same time!
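
As a rough illustration of how the flags differ, here is a small Python 3 sketch (an assumption on my part, since the thread targets Python 2.7, where ur'' literals and the UNICODE flag are needed). In Python 3, str patterns are Unicode-aware by default, re.ASCII restricts \w to ASCII, and from Python 3.6 onwards re.LOCALE is rejected for str patterns.

```python
import re

# "I am a नेपाली" with the Devanagari word written as escapes.
text = "I am a \u0928\u0947\u092a\u093e\u0932\u0940"

# re.UNICODE is accepted for str patterns but redundant in Python 3.
unicode_words = re.findall(r"\w+", text, re.UNICODE)

# re.ASCII limits \w to [a-zA-Z0-9_], so the Devanagari word is skipped.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# On Python 3.6+, re.LOCALE is only valid with bytes patterns.
try:
    re.compile(r"\w+", re.LOCALE)
except ValueError as exc:
    locale_error = str(exc)
else:
    locale_error = None

print(unicode_words)
print(ascii_words)
print(locale_error)
```

The ASCII-only result corresponds to what the original highlighter printed: only "I", "am" and "a" are found.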

There's an alternative regex implementation at:

http://pypi.python.org/pypi/regex

It's a drop-in replacement for the re module, but with a lot of
additions, including better handling of Unicode.
 
darpan6aya

Thanks MRAB, your suggestion worked. But then it raised an error:

'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

I corrected this by encoding it to 'utf-8'. The code looks like this now.

pattern = ur'(?u)\w+'

def __init__(self, *args):
    QSyntaxHighlighter.__init__(self, *args)
    self.dict = None

def setDict(self, dict):
    self.dict = dict

def highlightBlock(self, text):
    if not self.dict:
        return
    text = unicode(text)
    format = QTextCharFormat()
    format.setUnderlineColor(Qt.red)
    format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)

    unicode_pattern = re.compile(self.pattern, re.UNICODE)

    for word_object in unicode_pattern.finditer(text):
        if not self.dict.spell(word_object.group().encode('utf-8')):
            print word_object.group().encode('utf-8')
            self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)

The problem now is that all the vowel signs are separated from the root word, so that if you type मेरो, the म and the े are printed separately (the े appears as a box instead). What am I doing wrong?

Like this.

मेरो नाम रà¥à¤ªà¤¾ हो।
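
If switching regex libraries is not an option, one possible workaround is to find word boundaries by hand and count combining marks as part of a word. The sketch below is only that, a sketch under stated assumptions: it is written for Python 3 with the standard unicodedata module, and is_word_char and word_spans are hypothetical helper names, not part of re, regex, or PyQt. It yields a start offset and a length for each word, the same shape of arguments that setFormat() receives in the highlighter above.

```python
import unicodedata


def is_word_char(ch):
    """Hypothetical helper: letters, digits, underscore and combining marks."""
    return ch.isalnum() or ch == "_" or unicodedata.category(ch).startswith("M")


def word_spans(text):
    """Hypothetical helper: yield (start, length, word) for each run of word characters."""
    start = None
    for index, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = index
        elif start is not None:
            yield start, index - start, text[start:index]
            start = None
    if start is not None:
        yield start, len(text) - start, text[start:]


# "मेरो नाम" written with escapes so the code points are unambiguous.
sample = "\u092e\u0947\u0930\u094b \u0928\u093e\u092e"
spans = list(word_spans(sample))
print(spans)
```

Because marks such as े and ो are accepted, each word comes back whole instead of being split after every consonant. The offsets count Python code points, which should line up with positions in the text block for characters in the Basic Multilingual Plane such as Devanagari.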
 
darpan6aya

Here's a screenshot: http://i41.tinypic.com/35002rr.png
 
darpan6aya

Thanks MRAB, your alternative regex implementation worked flawlessly.
It works now.
 
