Perl: Read French

micropentium

Hi,

I have a Perl script that needs to read plain text from a database that
may contain French. My script fails to interpret the French characters,
but those characters look OK in the database.

My question is: How does Perl handle Unicode?

Many thanks
 
Jürgen Exner

micropentium said:
I have a Perl script that needs to read plain text from a database that
may contain French. My script fails to interpret the French characters,
but those characters look OK in the database.

My question is: How does Perl handle Unicode?

Just fine, no problems.

Maybe that database is not in Unicode but in some other character set?
Have you tried ISO-8859-1 or -15 or Windows-1252?
Those are the most likely candidates but there are others, too.

Or maybe you are simply using the wrong encoding, e.g. UTF-16 when the
database returns UTF-8?
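
To make the point about picking the right encoding concrete, here is a minimal sketch using the core Encode module. The variable $bytes is a made-up stand-in for a value fetched from the database, and UTF-8 is only an assumption about what the database returns; substitute whichever encoding it really uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw bytes, standing in for a value fetched from the database.
# They are the UTF-8 encoding of "café" (é is the two bytes 0xC3 0xA9).
my $bytes = "caf\xC3\xA9";

# Decode the bytes into Perl text, naming the encoding we assume they are in.
my $text = decode('UTF-8', $bytes);

# Decoding the same bytes with the wrong encoding does not fail,
# it just produces the wrong characters.
my $wrong = decode('ISO-8859-1', $bytes);

# Encode on the way out; a UTF-8 terminal is assumed here.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%s has %d characters\n", $text,  length $text;    # 4 characters
printf "%s has %d characters\n", $wrong, length $wrong;   # 5 characters
```

If the output shows two odd characters where one accented letter should be, the bytes were probably decoded with the wrong encoding, or not decoded at all.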

jue
 
micropentium

Jürgen Exner said:
Just fine, no problems.

Maybe that database is not in Unicode but in some other character set?
Have you tried ISO-8859-1 or -15 or Windows-1252?
Those are the most likely candidates but there are others, too.

Or maybe you are simply using the wrong encoding, e.g. UTF-16 when the
database returns UTF-8?

jue

Hi JE,

I am actually a newbie to Perl and not familiar with Perl's Unicode
processing. Would you mind providing a small piece of code on
Unicode handling, so I can use it as a starting point?

Cordially,
 
Helmut Richter

That's exaggerated.

It is the user who has to keep track which of his strings are meant as
bytes and which are meant as text characters. The details are explained
in http://perldoc.perl.org/perlunitut.html .

Problems may arise when subroutines of unknown modules are used and it is not
specified which kind of strings are expected.

This should be seriously considered as a possible source of problems.
The code used in the data must be known; it cannot be inferred from the
contents read.

(That's the theory. In practice, it is highly improbable that a string of
bytes is meant as anything other than UTF-8 if it is a correct UTF-8 string.)
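
That practical rule can be sketched as follows: try a strict UTF-8 decode first and fall back to a single-byte encoding only if it fails. The helper name decode_guess and the choice of Windows-1252 as the fallback are assumptions made for illustration; this is a heuristic, not a substitute for knowing the real encoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict UTF-8 first, Windows-1252 as the fallback.
sub decode_guess {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # silently substituting replacement characters.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Not valid UTF-8, so assume a single-byte Western European encoding.
    return decode('cp1252', $bytes);
}

binmode(STDOUT, ':encoding(UTF-8)');
print decode_guess("caf\xC3\xA9"), "\n";   # valid UTF-8 bytes
print decode_guess("caf\xE9"), "\n";       # a lone 0xE9 is not valid UTF-8
```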

micropentium said:
I am actually a newbie to Perl and not familiar with Perl's Unicode
processing. Would you mind providing a small piece of code on
Unicode handling, so I can use it as a starting point?

You should start with thoroughly understanding the tutorial cited above and
then understand other people's code.
 
Jim Gibson

micropentium said:
I am actually a newbie to Perl and not familiar with Perl's Unicode
processing. Would you mind providing a small piece of code on
Unicode handling, so I can use it as a starting point?

Check out the documentation that comes with Perl:

perldoc perlunicode
 
Jürgen Exner

Helmut Richter said:
That's exaggerated.

Well, each and every Perl text is in Unicode already. So there really
_is_ no problem. The problems appear when you start mucking around and
interfacing with other character sets and encodings. Then you really
need to keep track of whether you have (Perl) text (in Unicode) or some
binary data in some other format, and when and how to convert between
those. Not to mention using the right encoding settings when reading
from such files, as was discussed very recently here.
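
For the file case, a minimal sketch of "the right encoding settings" might look like this. The scratch file and the choice of ISO-8859-1 are assumptions for the example; the point is only that the encoding is named on the filehandle, so the conversion between bytes and text happens at the boundary.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file (deleted at exit) that stands in for a real data file.
my ($out, $file) = tempfile(UNLINK => 1);

# Writing: the layer encodes Perl text into ISO-8859-1 bytes on the way out.
binmode($out, ':encoding(ISO-8859-1)');
print {$out} "fran\x{E7}ais\n";    # \x{E7} is the character c-cedilla
close($out) or die "Cannot close $file: $!";

# Reading: name the file's encoding so the layer decodes bytes into text.
open(my $in, '<:encoding(ISO-8859-1)', $file) or die "Cannot read $file: $!";
my $line = <$in>;
close($in);
chomp $line;

# Printing: encode again, this time into whatever the terminal expects
# (UTF-8 is assumed here).
binmode(STDOUT, ':encoding(UTF-8)');
print "$line has ", length($line), " characters\n";
```

Inside the program $line is plain text; only the filehandles know about ISO-8859-1 or UTF-8.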

On the plus side there are some really great conversion tools and years
ago it was Perl that helped me to save a very large software product by
being able to automatically convert text into numerous local email
encodings.

Helmut Richter said:
It is the user who has to keep track which of his strings are meant as
bytes and which are meant as text characters. The details are explained
in http://perldoc.perl.org/perlunitut.html .

Yikes! The term "string" usually implies text, therefore may I rephrase
that as "... has to keep track which of his scalars are meant to contain
binary data (e.g. pictures, hex dumps, file images, yenc-encoded data,
shift-JIS encoded email, ...) and which are meant as text"? This way
you can avoid the awkward "byte string".

Helmut Richter said:
Problems may arise when subroutines of unknown modules are used and it is not
specified which kind of strings are expected.

It should (emphasis being on should) be clear if they expect binary data
or text.

Helmut Richter said:
You should start with thoroughly understanding the tutorial cited above and
then understand other people's code.

Thanks, should have mentioned that myself.

jue
 
