non-english characters

Daniel Bretoi · Dec 17, 2003

How do I match non-English alphabetic characters, such as the German
double-s (ß)?

db

Yukihiro Matsumoto · Dec 17, 2003

Hi,

In message "non-english characters"

|how do I match non-english alphabetical characters? Such as the german
|double-s ? (ß)

Which encoding do you wish to use?

matz.

Daniel Bretoi · Dec 17, 2003

|Which encoding do you wish to use?

I'm not sure. How can I find out what the Germans use? And once I know
that part, how do I use it?

db

Yukihiro Matsumoto · Dec 17, 2003

Hi,

In message "Re: non-english characters"

|I'm not sure, how can I find out what the germans use? and once I know
|that part, how do I use it?

Ask somebody around you to find out. Then, if you're going to use
Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with the
-Ku option. If you use ISO-8859-* or any other single-byte encoding,
you don't have to do anything special.

matz.
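
As a rough sketch of the UTF-8 route: the -Ku switch belongs to the Ruby 1.8 era of this thread. The example below instead assumes a current Ruby (2.4 or later), where a script saved as UTF-8 needs no switch. The helper names `all_letters?` and `letter_words` are made up for illustration and are not part of Ruby.

```ruby
# Assumes Ruby 2.4+ and a source file saved as UTF-8.

# Hypothetical helper: true if text consists only of letters, in any script.
def all_letters?(text)
  text.match?(/\A\p{L}+\z/)
end

# Hypothetical helper: extract the runs of letters from a sentence.
def letter_words(text)
  text.scan(/\p{L}+/)
end

puts all_letters?("Straße")    # ß counts as a letter
puts all_letters?("Straße 7")  # the space and the digit do not
p letter_words("Die Straße ist groß")
```

`\p{L}` is the Unicode "letter" property, so it covers ß and the umlauts without listing them by hand.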

messju mohr · Dec 17, 2003

|Ask somebody around you to find out. Then if you're going to use
|Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with -Ku
|option. If you use ISO-8859-* or any other single byte encoding, you
|don't have to do anything special.

hmm.

regexp works fine for me with unicode. either with "ruby -Ku" on
startup or with the /u as regexp-option.

but with ISO-8859-* (1 or 15 in my case) i don't get \w to match
accented characters.

no big deal, i'm just curious what i'm doing wrong here. i'm using
ruby-1.8.1 from debian testing.

Yukihiro Matsumoto · Dec 17, 2003

Hi,

In message "Re: non-english characters"

|but with ISO-8859-* (1 or 15 in my case) i don't get \w to match
|accented characters.

That's a restriction: \w is defined as [a-zA-Z0-9_].
This restriction will be removed in Ruby 1.9 by using ISO-8859-*
specific encodings.

matz.
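
For readers on a current Ruby rather than 1.8, here is a hedged sketch of one way around the \w restriction. It assumes Ruby 2.4 or later, where \w still matches only [a-zA-Z0-9_] while the POSIX bracket [[:alpha:]] is Unicode-aware. The ISO-8859-1 bytes are first transcoded to UTF-8. The helper names are hypothetical.

```ruby
# Assumes Ruby 2.4+; the thread itself discusses Ruby 1.8.

# Hypothetical helper: interpret raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# \w is ASCII-only in current Ruby: letters, digits, underscore.
def ascii_word?(text)
  text.match?(/\A\w+\z/)
end

# [[:alpha:]] is Unicode-aware and accepts ß, é and similar letters.
def alphabetic?(text)
  text.match?(/\A[[:alpha:]]+\z/)
end

word = latin1_to_utf8("Stra\xDFe")  # 0xDF is ß in ISO-8859-1
puts word
puts ascii_word?(word)
puts alphabetic?(word)
```

Transcoding at the input boundary keeps the matching code independent of which single-byte charset the data arrived in.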

Robert Klemme · Dec 17, 2003

messju mohr said:
hmm.

regexp works fine for me with unicode. either with "ruby -Ku" on
startup or with the /u as regexp-option.

but with ISO-8859-* (1 or 15 in my case) i don't get \w to match
accented characters.

I guess \w is defined in terms of ASCII - and there you don't have "ß", "é"
and similar chars.

Regards

robert

messju mohr · Dec 17, 2003

|I guess \w is defined in terms of ASCII - and there you don't have "ß", "é"
|and similar chars.

yes, it looks like i got confused by the PCRE library which treats \w
according to the current locale. too-many-languages error.

Ara.T.Howard · Dec 17, 2003

|yes, it looks like i got confused by the PCRE library which treats \w
|according to the current locale. too-many-languages error.

depends on your definition of 'treats' and 'locale' ;-)

-bash-2.05b$ cat /etc/redhat-release
Red Hat Enterprise Linux WS release 3 (Taroon)

-bash-2.05b$ perl -v | head -2 # why so much output!

This is perl, v5.8.0 built for i386-linux-thread-multi

-bash-2.05b$ ruby -v
ruby 1.6.8 (2002-12-24) [i386-linux-gnu]

# BROKEN "TREATMENT" OF LOCALE
-bash-2.05b$ export LANG=en_US.UTF-8
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc

# THIS IS OK
-bash-2.05b$ export LANG=en_US
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
abc
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc

definitely need to examine output carefully where regexes and locale are in
effect - probably better off using ruby since matz presumably has more
experience with multibyte chars than ol' larry!

-a

messju mohr · Dec 17, 2003

1. i was talking about ISO-8859-* character sets and already said that
UTF-8 works for me.

2. your example works fine for me with
"This is perl, v5.8.2 built for i386-linux-thread-multi" (from
debian unstable)

3. i meant the PCRE library from
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ . it's meant
to be *perl compatible* but it is not the actual implementation in
the perl-interpreter, AFAIK

no need to convince me to use ruby over perl

greetings
messju

Josef 'Jupp' SCHUGT · Dec 18, 2003

Hi!

* Daniel Bretoi; 2003-12-17, 19:03 UTC:

I'm not sure, how can I find out what the germans use? and once I
know that part, how do I use it?

For German you can use an awful lot of different encodings :-| Take a
look at the charsets listed at http://dwd.da.ru/charsets/index.html

Most likely ISO 8859-1, ISO 8859-15, or UTF-8 is used, but ISO 8859-2
is also in use. The ISO charsets have umlauts and ß in identical
positions. So the question reduces to UTF-8 vs. ISO 8859 (the Windows
code pages one would consider are ISO 8859 charsets with additional
characters in the 128..159 region, which is unused by the ISO 8859
charsets).

Josef 'Jupp' SCHUGT
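
The claim about identical positions can be checked with a small sketch, assuming a current Ruby whose built-in transcoders cover these charsets. The constant and helper names are made up for illustration.

```ruby
# Byte values of the German special letters in the ISO 8859 charsets.
GERMAN_BYTES = {
  0xC4 => "Ä", 0xD6 => "Ö", 0xDC => "Ü",
  0xE4 => "ä", 0xF6 => "ö", 0xFC => "ü",
  0xDF => "ß"
}.freeze

CHARSETS = %w[ISO-8859-1 ISO-8859-2 ISO-8859-15 Windows-1252].freeze

# Hypothetical helper: decode one byte in the given charset to a UTF-8 string.
def decode_byte(byte, charset)
  [byte].pack("C").force_encoding(charset).encode("UTF-8")
end

GERMAN_BYTES.each do |byte, expected|
  decoded = CHARSETS.map { |charset| decode_byte(byte, charset) }
  puts format("0x%02X %s %s", byte, expected, decoded.uniq == [expected])
end

# The 128..159 region differs: Windows-1252 puts the euro sign at 0x80,
# where ISO 8859-1 has only a control character.
puts decode_byte(0x80, "Windows-1252")
```

So for German text the practical choice is between UTF-8 and any one of the single-byte charsets, as the post says.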
