Reading Text File Encoding and converting to Perl's internal UTF-8 encoding

sln

Need help from Unicode gurus or anybody with some knowledge on the subject.

I may have a text (character) file that I have just opened, but I don't know its
encoding, so I can't open it with any encoding attribute.

It appears to me that, assuming a text file, there may be an encoding mark at the start
of the file (or none): a sort of BOM, a sequence of octets that marks what its encoding is.

Given that I might be passed a file descriptor only (I am a module) and I rewind the position
to the start of the file, is there any way I can tell the encoding? If I could, and
it's not utf8, I could decode() the rest of the file as octets, i.e. decode it in place in memory,
create a decoded temp file, or possibly re-open it with the proper encoding.

I think the encoding is one of the usual 8/16/32-bit UTFs, but with many locales (chars).

I am still sketchy on where to find a list of encoding markers that would let me find out
this information, and on the methods available for analysis and transformation.

I know Perl has a massive 'use Encode' lib; nevertheless, this is what I need to do to finalize
a module I'm working on.

Thanks for the help.
-sln
 
Robert Billing

Given that I might be passed a file descriptor only (I am a module) and I rewind the position
to the start of the file, is there any way I can tell the encoding? If I could, and
it's not utf8, I could decode() the rest of the file as octets, i.e. decode it in place in memory,
create a decoded temp file, or possibly re-open it with the proper encoding.

As I understand it, and I have just written some Perl code that happily
mixes two dozen languages in one web page, there isn't a really good way
of doing what you want. Part of the reason for this is that given a big
block of text encoded as plain ASCII, the same text in UTF8 is exactly,
bit for bit, the same. It's only when you introduce "wide" characters in
other alphabets that UTF8 does anything.
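
To make that point concrete, here is a minimal sketch using the core Encode module. The sample strings and variable names are made up for illustration; it only shows that pure ASCII text yields the same octets under both encodings, while one non-ASCII character is enough to make them differ.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample strings (not from any real file).
my $ascii_text = "plain ASCII text";
my $wide_text  = "caf\x{e9}";    # ends in U+00E9, e with acute accent

# Pure ASCII: encoding as ASCII or as UTF-8 gives identical octets.
my $same = encode('ascii', $ascii_text) eq encode('UTF-8', $ascii_text);

# One non-ASCII character: Latin-1 uses one octet for it, UTF-8 uses two.
my $differ = encode('ISO-8859-1', $wide_text) ne encode('UTF-8', $wide_text);

print "ASCII identical: ",   ($same   ? "yes" : "no"), "\n";
print "non-ASCII differs: ", ($differ ? "yes" : "no"), "\n";
```

So a file that happens to contain only ASCII gives no evidence either way about which of those encodings its author had in mind.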

In some cases it may be possible to make an intelligent guess at the
encoding, but no more.
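
Where a byte order mark is present, the guess can at least start from it. The following is a rough sketch, not a definitive implementation: `sniff_bom` is a hypothetical helper name, it assumes the handle is seekable (a plain file, not a pipe), and it only recognises the UTF-8/16/32 marks. A file without a BOM simply returns nothing.

```perl
use strict;
use warnings;

# Byte order marks, longest first so that UTF-32LE (FF FE 00 00)
# is not mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: returns (encoding name, BOM length),
# or an empty list when no BOM is found or the handle cannot be rewound.
sub sniff_bom {
    my ($fh) = @_;
    binmode $fh;                  # read raw octets
    seek $fh, 0, 0 or return;     # rewind; fails on pipes and sockets
    my $head = '';
    my $n = read $fh, $head, 4;   # a BOM is at most 4 octets long
    return unless defined $n;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        if (substr($head, 0, length $bom) eq $bom) {
            seek $fh, length $bom, 0;   # leave the position just past the BOM
            return ($name, length $bom);
        }
    }
    seek $fh, 0, 0;               # no BOM: leave the position at the start
    return;
}
```

If a name comes back, the caller could then switch the handle over with something like `binmode($fh, ":encoding($name)")` and read the rest as characters. When nothing comes back, all that remains is heuristics such as the Encode::Guess module, and those are guesses, as said above.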

Incidentally, and somewhat off-topic, is there anyone else for whom the
letters UTF automatically mean 'use the force'?

--
I am Robert Billing, Christian, author, inventor, traveller, cook and
animal lover. "It burned me from within. It quickened; I was with book
as a woman is with child."

Quality e-books for portable readers: http://www.alex-library.com
 
sln

As I understand it, and I have just written some Perl code that happily
mixes two dozen languages in one web page, there isn't a really good way
of doing what you want. Part of the reason for this is that given a big
block of text encoded as plain ASCII, the same text in UTF8 is exactly,
bit for bit, the same. It's only when you introduce "wide" characters in
other alphabets that UTF8 does anything.

In some cases it may be possible to make an intelligent guess at the
encoding, but no more.

Incidentally, and somewhat off-topic, is there anyone else for whom the
letters UTF automatically mean 'use the force'?

I'm sorry, 'I exist and therefore I am' doesn't seem to work.

-sln
 
