Codecs

  • Thread starter Ivan Van Laningham

Ivan Van Laningham

Hi All--
As far as I can tell, after looking only at the documentation (and not
searching peps etc.), you cannot query the codecs to give you a list of
registered codecs, or a list of possible codecs it could retrieve for
you if you knew enough to ask for them by name.

Why not? It seems to me that if I want to try to read an unknown file
using an exhaustive list of possible encodings, the best place to keep
the most current list is the codec registry itself, not in the
documentation for the codec module.

Metta,
Ivan
----------------------------------------------
Ivan Van Laningham
God N Locomotive Works
http://www.andi-holmes.com/
http://www.foretec.com/python/workshops/1998-11/proceedings.html
Army Signal Corps: Cu Chi, Class of '70
Author: Teach Yourself Python in 24 Hours
 

Martin v. Löwis

Ivan said:
Hi All--
As far as I can tell, after looking only at the documentation (and not
searching peps etc.), you cannot query the codecs to give you a list of
registered codecs, or a list of possible codecs it could retrieve for
you if you knew enough to ask for them by name.

Why not?

There are several answers to that question. Which of them is true,
I don't know. In order of likelihood:
1. When the API was designed, that functionality was forgotten.
It was not possible to add it later on (because of 2).
2. Registration builds on the notion of lookup functions. The
lookup function gets a codec name, and either succeeds in
finding the codec, or raises an exception.
Now, a lookup function, in principle, might not "know" in
advance what codecs it supports, or the number of encodings
it supports might not be finite. So asking such a lookup
function for the complete list of codecs might not be
implementable.
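
To make point 2 concrete, here is a minimal sketch of how a lookup (search) function plugs into the registry in current Python. The name demo_latin and the function demo_search are made up for illustration. Note that in today's codecs module the search function itself returns None for a name it does not know, and it is codecs.lookup that raises LookupError once every registered function has declined.

```python
import codecs

def demo_search(name):
    """Hypothetical search function that recognizes a single codec name."""
    # The registry hands over a normalized (lowercased) name.
    if name == "demo_latin":
        # Reuse the built-in Latin-1 codec under the made-up name.
        return codecs.lookup("latin-1")
    # None means "not mine"; the registry then asks the next search function.
    return None

codecs.register(demo_search)

# The registry can resolve the name once asked for it...
text = b"caf\xe9".decode("demo_latin")

# ...but it has no way to ask demo_search what names it would accept.
try:
    codecs.lookup("no_such_codec_for_demo")
except LookupError as exc:
    print("lookup failed:", exc)
```

The only question the registry can put to demo_search is "do you know this name?", which is why there is nothing to enumerate.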

As an example of a lookup function that doesn't know what
encodings it supports, look at my iconv module. This module
provides all codecs that iconv_open(3) supports, yet there
is no standard way to query the iconv library in advance
for a list of all supported codecs.

As an example for a lookup function that supports an infinite
number of codecs, consider the (theoretical) encrypt/password
encoding, which encrypts a string with a password, and the
password is part of the codec name. Each password defines
a new encoding, and there is an infinite number of them.
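
A toy version of that idea, as a sketch only: the toyxor_ prefix and the function toyxor_search are invented here, and repeating-key XOR is not real encryption. The key is carried in the codec name as hex digits, because the registry lowercases names before passing them on. Any hex string of any length is a valid key, so the set of names this one function accepts has no upper bound.

```python
import codecs

PREFIX = "toyxor_"  # hypothetical naming scheme, not a real codec family

def toyxor_search(name):
    """Hypothetical search function: 'toyxor_<hex key>' for any hex key."""
    if not name.startswith(PREFIX):
        return None
    try:
        key = bytes.fromhex(name[len(PREFIX):])
    except ValueError:
        return None  # not valid hex, so not one of ours
    if not key:
        return None

    def transform(data, errors="strict"):
        # XOR with a repeating key is its own inverse, so the same
        # function serves as both encoder and decoder.
        data = bytes(data)
        out = bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
        return out, len(data)

    return codecs.CodecInfo(encode=transform, decode=transform, name=name)

codecs.register(toyxor_search)

# bytes-to-bytes codecs are reached through codecs.encode / codecs.decode
secret = codecs.encode(b"hello", "toyxor_2a")
plain = codecs.decode(secret, "toyxor_2a")
other = codecs.encode(b"hello", "toyxor_2a2b2c")
```

Each distinct key yields a distinct codec, so a "list all codecs" call on this function would never finish.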

Now, if 1) had been considered, it might have been possible
to design the API in a way that didn't support all cases that
the current API supports. Alas, somebody must have misplaced
the time machine.
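
Within those limits, a partial list can still be put together for the one search function that ships with Python, by walking the modules of the standard encodings package. This is a best-effort sketch, assuming a current Python 3; the helper name stdlib_codec_names is made up, and the result says nothing about codecs supplied by any other registered search function.

```python
import codecs
import encodings
import pkgutil

def stdlib_codec_names():
    """Best-effort list of codecs provided by the stdlib 'encodings' package."""
    names = set()
    for module in pkgutil.iter_modules(encodings.__path__):
        try:
            names.add(codecs.lookup(module.name).name)
        except LookupError:
            # Helper modules (e.g. "aliases") and codecs that are not
            # available on this platform are simply skipped.
            pass
    return sorted(names)

names = stdlib_codec_names()
print(len(names), "codecs found in the standard encodings package")
```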

Regards,
Martin
 

John Machin

Ivan said:
It seems to me that if I want to try to read an unknown file
using an exhaustive list of possible encodings ...


Supposing such a list existed:

What do you mean by "unknown file"? That the encoding is unknown?

Possibility 1:
You are going to try to decode the file from "legacy" to Unicode --
until the first 'success' (defined how?)? But the file could be decoded
by *several* codecs into Unicode without an exception being raised. Just
a simple example: the encodings ['iso-8859-' + x for x in '12459']
each define a character for *all* 256 possible byte values, so decoding
with any of them can never fail.
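
A quick check of that claim, as a sketch using only the five encodings named above: every one of them decodes every possible byte without complaint, yet they do not agree on the result, so "no exception" says nothing about which one is right.

```python
# Every possible byte value, once each.
data = bytes(range(256))

candidates = ["iso-8859-" + x for x in "12459"]

# None of these calls raises, whatever the input bytes are.
decoded = {enc: data.decode(enc) for enc in candidates}

# Count how many different Unicode strings came out of the same bytes.
distinct = len(set(decoded.values()))
print(len(decoded), "codecs succeeded,", distinct, "different results")
```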

There are various language-guessing algorithms based on e.g. frequency
of ngrams ... try Google.

Possibility 2:
You "know" the file is in a Unicode-encoding e.g. utf-8, have
successfully decoded it to Unicode, and are going to try to encode the
file in a "legacy" encoding but you don't know which one is appropriate?
Sorry, same "But".
 
