Possible bug with stability of mimetypes.guess_* function output

Johannes Bauer

Hi group,

I'm using Python 3.3.2+ (default, Oct 9 2013, 14:50:09) [GCC 4.8.1] on
linux and have found what is very peculiar behavior at best and a bug at
worst. It regards the mimetypes module and in particular the
guess_all_extensions and guess_extension functions.

I've found that these do not return stable output. Running the same
command repeatedly returns one of two results:

$ python3 -c 'import mimetypes;
print(mimetypes.guess_all_extensions("text/html"),
mimetypes.guess_extension("text/html"))'
['.htm', '.html', '.shtml'] .htm

$ python3 -c 'import mimetypes;
print(mimetypes.guess_all_extensions("text/html"),
mimetypes.guess_extension("text/html"))'
['.html', '.htm', '.shtml'] .html

So guess_extension(x) seems to always return guess_all_extensions(x)[0].

Curiously, "shtml" is never the first element. The other two are mixed
with a probability of around 50% which leads me to believe they're
internally managed as a set and are therefore affected by the
(relatively new) nondeterministic hashing function initialization.

I don't know if stable output is guaranteed for these functions, but it
sure would be nice. Messes up a whole bunch of things otherwise :-/

Please let me know if this is a bug or expected behavior.
Best regards,
Johannes

--
At least not publicly!
Ah, the latest and to this day most ingenious stroke of our great
cosmologists: the secret prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 
Asaf Las

dictionary. same for v3.3.3 as well.

You could try querying with the sequence below:

import mimetypes
mimetypes.init()
mimetypes.guess_extension("text/html")

I got only 'htm' for 5 consecutive attempts.

/Asaf
 
Asaf Las

Btw, I only noticed this after my own post:
the example usage calls mimetypes.init()
before calling the module functions.
 
Johannes Bauer

Asaf said:
You could try querying with the sequence below:

import mimetypes
mimetypes.init()
mimetypes.guess_extension("text/html")

I got only 'htm' for 5 consecutive attempts.

Doesn't change anything. With this:

#!/usr/bin/python3
import mimetypes
mimetypes.init()
print(mimetypes.guess_extension("application/msword"))

And a call like this:

$ for i in `seq 100`; do ./x.py ; done | sort | uniq -c

I get

35 .doc
24 .dot
41 .wiz

Regards,
Johannes

 
Peter Otten

Asaf said:
You could try querying with the sequence below:

import mimetypes
mimetypes.init()
mimetypes.guess_extension("text/html")

I got only 'htm' for 5 consecutive attempts.

As Johannes mentioned, this depends on the hash seed:

$ PYTHONHASHSEED=0 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.html
$ PYTHONHASHSEED=1 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.htm
$ PYTHONHASHSEED=2 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.shtml

You never see ".shtml" as the guessed extension because it is not in the
original mimetypes.types_map dict, but instead programmatically read from a
file like /etc/mime.types and then added to a list of extensions.
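To make that ordering argument concrete, here is a simplified model of it. This is not the actual mimetypes source; build_extension_lists and its arguments are hypothetical names. Built-in entries are registered first, in whatever order the dict happens to iterate, and file-derived entries are appended afterwards, so they can never end up at index 0 when a built-in extension exists for the same type.

```python
def build_extension_lists(builtin_map, file_entries):
    """Return {mime_type: [extensions]} in registration order.

    builtin_map  -- dict of extension -> MIME type (stands in for types_map)
    file_entries -- iterable of (extension, mime_type) pairs, standing in
                    for lines parsed from a file like /etc/mime.types
    """
    inverse = {}
    # Step 1: built-in entries, in the iteration order of the dict.
    for ext, mime_type in builtin_map.items():
        exts = inverse.setdefault(mime_type, [])
        if ext not in exts:
            exts.append(ext)
    # Step 2: file entries are appended afterwards; duplicates are skipped.
    for ext, mime_type in file_entries:
        exts = inverse.setdefault(mime_type, [])
        if ext not in exts:
            exts.append(ext)
    return inverse


from_file = [(".shtml", "text/html"), (".htm", "text/html")]
one_order = build_extension_lists(
    {".htm": "text/html", ".html": "text/html"}, from_file)
other_order = build_extension_lists(
    {".html": "text/html", ".htm": "text/html"}, from_file)
```

Whichever way the two built-in entries are ordered, ".shtml" stays last in this model, which matches what Johannes observed.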

Johannes,
I'd like the guessed extension to be consistent, too, but even if that is
rejected, the current behaviour should be documented.

Please file a bug report.
 
Asaf Las

Peter said:
You never see ".shtml" as the guessed extension because it is not in the
original mimetypes.types_map dict, but instead programmatically read from a
file like /etc/mime.types and then added to a list of extensions.

Since mimetypes.py reads from a whole bunch of files, repeatability could
only be achieved at the level of a particular machine:

"/etc/mime.types",
"/etc/httpd/mime.types",
"/etc/httpd/conf/mime.types",
"/etc/apache/mime.types",
"/etc/apache2/mime.types",
"/usr/local/etc/httpd/conf/mime.types",
"/usr/local/lib/netscape/mime.types",
"/usr/local/etc/httpd/conf/mime.types",
"/usr/local/etc/mime.types"
 
Peter Otten

Asaf said:
Since mimetypes.py reads from a whole bunch of files, repeatability could
only be achieved at the level of a particular machine.

At least the mimetypes already defined in the module could easily produce
the same guessed extension consistently.
 
Asaf Las

Peter said:
At least the mimetypes already defined in the module could easily produce
the same guessed extension consistently.

IMHO one workaround for the OP could be to supply his own map file to init(),
thus ensuring an unambiguous mapping across every platform and distribution.
I guess some libraries already do that. Or write a wrapper that processes the
list of all guesses to eliminate ambiguity as far as needed.
That is in case the bug report gets rejected.
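The wrapper idea could look roughly like this. It is a sketch under assumptions: PREFERRED and pick_extension are hypothetical names, and the table entries are illustrative choices, not anything the mimetypes module prescribes. Only guess_all_extensions() is taken from the module.

```python
import mimetypes

# Hypothetical preference table: one fixed extension per MIME type that
# the application cares about. The entries here are only examples.
PREFERRED = {
    "text/html": ".html",
    "application/msword": ".doc",
}


def pick_extension(mime_type, preferred=PREFERRED):
    """Return an unambiguous extension for mime_type, or None."""
    candidates = mimetypes.guess_all_extensions(mime_type)
    choice = preferred.get(mime_type)
    if choice in candidates:
        # The table settles the ambiguity explicitly.
        return choice
    if len(candidates) == 1:
        # Nothing to disambiguate.
        return candidates[0]
    # Unknown type, or several candidates without a stated preference:
    # leave the decision to the caller.
    return None
```

Returning None for the ambiguous case is a design choice of this sketch; raising an exception or falling back to a sorted pick would work just as well.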
 
Peter Otten

Asaf said:
IMHO one workaround for the OP could be to supply his own map file to init(),
thus ensuring an unambiguous mapping across every platform and distribution.
I guess some libraries already do that. Or write a wrapper that processes the
list of all guesses to eliminate ambiguity as far as needed.
That is in case the bug report gets rejected.

You also have to set mimetypes.types_map and mimetypes.common_types to an
empty dict (or an OrderedDict).
 
Asaf Las

Peter said:
You also have to set mimetypes.types_map and mimetypes.common_types to an
empty dict (or an OrderedDict).

Hmmm, yes. Then the quickest workaround is to get the list of all guesses,
sort it, and use the element at index 0.
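A minimal sketch of that workaround, assuming nothing beyond the documented guess_all_extensions() function (stable_guess_extension is a made-up name):

```python
import mimetypes


def stable_guess_extension(mime_type, strict=True):
    """Return the lexicographically smallest known extension, or None."""
    candidates = mimetypes.guess_all_extensions(mime_type, strict=strict)
    # Sorting removes any dependence on the order in which the
    # extensions were registered.
    return sorted(candidates)[0] if candidates else None
```

The result is deterministic for a given set of mime.types files, but the choice is arbitrary: ".htm" sorts before ".html", so the shorter spelling wins whenever both are known.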
 
