T
tinkerbarbet
Hi
I've read around quite a bit about Unicode and python's support for
it, and I'm still unclear about how it all fits together in certain
scenarios. Can anyone help clarify?
* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)
* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)? This seems inevitable given that
standard library modules such as re don't declare an encoding,
presumably because in that case I don't see any non-ASCII chars in the
source.
* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say
myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash
then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to do make all my regex patterns unicode strings, with u""?
I've been trying to understand this for a while so any clarification
would be a great help.
Tim
I've read around quite a bit about Unicode and python's support for
it, and I'm still unclear about how it all fits together in certain
scenarios. Can anyone help clarify?
* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)
* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)? This seems inevitable given that
standard library modules such as re don't declare an encoding,
presumably because in that case I don't see any non-ASCII chars in the
source.
* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say
myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash
then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to do make all my regex patterns unicode strings, with u""?
I've been trying to understand this for a while so any clarification
would be a great help.
Tim