non-english characters

Daniel Bretoi

How do I match non-English alphabetic characters, such as the German
double-s (ß)?

db
 
Yukihiro Matsumoto

Hi,

In message "non-english characters"

|how do I match non-english alphabetical characters? Such as the german
|double-s ? (ß)

Which encoding do you wish to use?

matz.
 
Daniel Bretoi

|Hi,
|
|In message "non-english characters"
|
||how do I match non-english alphabetical characters? Such as the german
||double-s ? (ß)
|
|Which encoding do you wish to use?

I'm not sure. How can I find out what the Germans use? And once I know
that part, how do I use it?

db
 
Yukihiro Matsumoto

Hi,

In message "Re: non-english characters"

|I'm not sure, how can I find out what the germans use? and once I know
|that part, how do I use it?

Ask somebody around you to find out. Then, if you're going to use
Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with the -Ku
option. If you use ISO-8859-* or any other single-byte encoding, you
don't have to do anything special.

matz.
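
A minimal sketch of the two cases described above, written for a current Ruby (1.9 or later) rather than the Ruby 1.8 discussed in this thread. That version is an assumption: in current Ruby each string carries its own encoding and source files default to UTF-8, so no -Ku switch appears here. The variable names are illustrative only.

```ruby
# Assumes Ruby 1.9 or later; the -Ku switch belongs to the Ruby 1.8 era.

# UTF-8 case: the source file is UTF-8, so a literal ß in the pattern works.
utf8_word  = "Straße"
utf8_index = utf8_word =~ /ß/   # character index of the match, or nil

# Single-byte case: transcode the text to ISO-8859-1 and build the pattern
# from a string in the same encoding, so both sides agree on the bytes.
latin1_word    = utf8_word.encode("ISO-8859-1")
latin1_pattern = Regexp.new("ß".encode("ISO-8859-1"))
latin1_index   = latin1_word =~ latin1_pattern

puts "UTF-8:      match at #{utf8_index.inspect}, #{utf8_word.bytesize} bytes"
puts "ISO-8859-1: match at #{latin1_index.inspect}, #{latin1_word.bytesize} bytes"
```

The match index is counted in characters in both cases; only the byte length differs, because ß takes two bytes in UTF-8 and one in ISO-8859-1.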
 
messju mohr

|Hi,
|
|In message "Re: non-english characters"
|
||I'm not sure, how can I find out what the germans use? and once I know
||that part, how do I use it?
|
|Ask somebody around you to find out. Then if you're going to use
|Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with -Ku
|option. If you use ISO-8859-* or any other single byte encoding, you
|don't have to do anything special.
|
|matz.

hmm.

regexp works fine for me with unicode. either with "ruby -Ku" on
startup or with the /u as regexp-option.

but with ISO-8859-* (1 or 15 in my case) i don't get \w to match
accented characters.

no big deal, i'm just curious what i'm doing wrong here. i'm using
ruby-1.8.1 from debian testing.
 
Yukihiro Matsumoto

Hi,

In message "Re: non-english characters"

|but with ISO-8859-* (1 or 15 in my case) i don't get \w to match
|accented characters.

That's a restriction: for single-byte encodings, the \w character class
is defined as [a-zA-Z0-9_]. This restriction will be removed in Ruby 1.9
by using ISO-8859-* specific encodings.

matz.
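
For readers on a current Ruby, a small sketch of how the character classes behave there. It assumes Ruby 2.4 or later (for `match?`) and UTF-8 strings, neither of which is the Ruby 1.8 setup discussed in this thread. In current Ruby, \w is documented as [a-zA-Z0-9_], while the POSIX bracket class [[:alpha:]] and the Unicode property \p{L} also cover letters such as ß and é. The `CLASSES` table and `matching_classes` helper are hypothetical names used only for illustration.

```ruby
# Assumes Ruby 2.4+ and UTF-8 strings (not the Ruby 1.8 setup in the thread).
CLASSES = {
  '\w'          => /\A\w+\z/,           # ASCII letters, digits, underscore
  '[[:alpha:]]' => /\A[[:alpha:]]+\z/,  # POSIX bracket class, Unicode-aware
  '\p{L}'       => /\A\p{L}+\z/         # Unicode "Letter" property
}.freeze

# Returns the labels of all classes whose pattern matches the whole word.
def matching_classes(word)
  CLASSES.select { |_label, pattern| pattern.match?(word) }.keys
end

%w[Strasse Straße café].each do |word|
  puts "#{word}: #{matching_classes(word).join(', ')}"
end
```

So when accented letters should count as word characters, [[:alpha:]] or \p{L} is usually the safer choice than \w.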
 
Robert Klemme

messju mohr said:
|hmm.
|
|regexp works fine for me with unicode. either with "ruby -Ku" on
|startup or with the /u as regexp-option.
|
|but with ISO-8859-* (1 or 15 in my case) i don't get \w to match
|accented characters.

I guess \w is defined in terms of ASCII - and there you don't have "ß", "é"
and similar chars.

Regards

robert
 
messju mohr

|I guess \w is defined in terms of ASCII - and there you don't have "ß", "é"
|and similar chars.

yes, it looks like i got confused by the PCRE library which treats \w
according to the current locale. too-many-languages error. :)
 
Ara.T.Howard

|yes, it looks like i got confused by the PCRE library which treats \w
|according to the current locale. too-many-languages error. :)

depends on your definition of 'treats' and 'locale' ;-)

-bash-2.05b$ cat /etc/redhat-release
Red Hat Enterprise Linux WS release 3 (Taroon)

-bash-2.05b$ perl -v | head -2 # why so much output!

This is perl, v5.8.0 built for i386-linux-thread-multi

-bash-2.05b$ ruby -v
ruby 1.6.8 (2002-12-24) [i386-linux-gnu]

# BROKEN "TREATMENT" OF LOCALE
-bash-2.05b$ export LANG=en_US.UTF-8
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc

# THIS IS OK
-bash-2.05b$ export LANG=en_US
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
abc
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc


definitely need to examine output carefully wherever regexes and locale are
in effect - probably better off using ruby since matz presumably has more
experience with multibyte chars than ol' larry!

-a
 
messju mohr

||yes, it looks like i got confused by the PCRE library which treats \w
||according to the current locale. too-many-languages error. :)
|
|depends on your definition of 'treats' and 'locale' ;-)
|
|-bash-2.05b$ cat /etc/redhat-release
|Red Hat Enterprise Linux WS release 3 (Taroon)
|
|-bash-2.05b$ perl -v | head -2 # why so much output!
|
|This is perl, v5.8.0 built for i386-linux-thread-multi
|
|-bash-2.05b$ ruby -v
|ruby 1.6.8 (2002-12-24) [i386-linux-gnu]
|
|# BROKEN "TREATMENT" OF LOCALE
|-bash-2.05b$ export LANG=en_US.UTF-8
|-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
|-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
|abc
|
|# THIS IS OK
|-bash-2.05b$ export LANG=en_US
|-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
|abc
|-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
|abc
|
|definitely need to examine output carefully where regexes and locale are in
|effect - probably better off using ruby since matz presumably has more
|experience with multibyte chars than 'ol larry!

1. i was talking about ISO-8859-* character sets and already said that
UTF-8 works for me.

2. your example works fine for me with
"This is perl, v5.8.2 built for i386-linux-thread-multi" (from
debian unstable)

3. i meant the PCRE library from
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ . it's meant
to be *perl compatible*, but it is not the actual implementation in
the perl interpreter, AFAIK


no need to convince me to use ruby over perl :)

greetings
messju
 
Josef 'Jupp' SCHUGT

Hi!

* Daniel Bretoi; 2003-12-17, 19:03 UTC:
|I'm not sure, how can I find out what the germans use? and once I
|know that part, how do I use it?

For German you can use an awful lot of different encodings :-| Take a
look at the charsets listed at http://dwd.da.ru/charsets/index.html

Most likely ISO 8859-1, ISO 8859-15, or UTF-8 is used, but ISO 8859-2
is also in use. The ISO charsets have the umlauts and ß in identical
positions, so the question reduces to UTF-8 vs. ISO 8859. (The Windows
codepages one would consider are ISO 8859 charsets with additional
characters in the 128..159 region, which is unused by the ISO 8859
charsets.)

Josef 'Jupp' SCHUGT
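
The byte positions mentioned above can be checked with a short sketch, assuming a current Ruby (1.9 or later) with its built-in transcoders. The `guess_encoding` helper is hypothetical and only a heuristic: it treats input that is valid UTF-8 as UTF-8 and falls back to ISO-8859-1 otherwise, so it cannot tell the ISO 8859 charsets or the Windows codepages apart.

```ruby
# Assumes Ruby 1.9 or later; constant and method names are illustrative only.
GERMAN_CHARS = %w[ä ö ü Ä Ö Ü ß].freeze
ISO_CHARSETS = %w[ISO-8859-1 ISO-8859-2 ISO-8859-15].freeze

# Print each character's single-byte position in the ISO charsets next to
# its multi-byte UTF-8 form.
GERMAN_CHARS.each do |char|
  iso  = ISO_CHARSETS.map { |name| format("0x%02X", char.encode(name).bytes.first) }
  utf8 = char.bytes.map { |byte| format("0x%02X", byte) }
  puts "#{char}  ISO 8859: #{iso.join(' ')}  UTF-8: #{utf8.join(' ')}"
end

# The euro sign in Windows-1252 sits in the 128..159 region.
euro_byte = "€".encode("Windows-1252").bytes.first

# Heuristic (an assumption, not a reliable detector): valid UTF-8 is taken
# as UTF-8, anything else is assumed to be ISO-8859-1.
def guess_encoding(raw)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? Encoding::UTF_8 : Encoding::ISO_8859_1
end

puts guess_encoding("Straße")
puts guess_encoding("Straße".encode("ISO-8859-1"))
```

The heuristic works reasonably for German text because a lone ISO 8859 byte such as 0xDF followed by an ASCII letter is not a valid UTF-8 sequence.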
 
