smcardle
Hi All,
I have recently been working on a project where we needed to detect
the mime type of files. This in itself is not too hard when you
consider the available choice of libraries that make a good guess at
the mime type based on the file extension.
However, in my case we had a CMS that stored images (of any type) in
its repository and for some unknown reason renamed them all with
a .img extension. As the images consisted of ICON, JPEG, GIF and SVG
mime types, we needed another way to detect mime types over and above
extension matching (which only works if the extension can be mapped
at all; some files, such as Makefiles, don't have an extension).
On Unix the OS has a utility called file which makes a good guess at
the mime type of a file. Mime type detection is not bulletproof but
under normal conditions it should be pretty close. Anyway, the file
command uses a couple of text files, available on all flavors of
Unix, that contain the matching rules. These files are called magic
and magic.mime.
Having looked at other solutions, I didn't find one that really
fitted my needs, so I wrote a new one called mime-util. I have put
this useful little utility on SourceForge at
http://sourceforge.net/projects/mime-util if anybody is interested.
This little utility uses two methods to detect the mime type of a
file. First, it tries to match the file extension and, if found, will
report the registered list of mime types for that extension. This
method can be modified by placing a mime properties file on your
classpath, which lets you override any of my mappings and even add
new mappings for extensions I did not add to the internal property
file. Secondly, if it is unable to determine the mime type from the
file extension, i.e. it is not registered or the file has no
extension, it will use the parsed version of the Unix magic.mime file
(on Windows it will use the internally supplied copy). This file
contains rules that describe magic numbers expected at known offsets
into files; the utility checks them and reports the first match.
Again, you can provide your own version of this file on the
classpath, which lets you change the existing matches or even add new
ones without actually changing the Unix magic.mime file itself.
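
To make the two-step lookup described above concrete, here is a
minimal sketch in Java. It is not mime-util's actual API: the class
TwoStageMimeSketch, the MagicRule type, the detect method and the tiny
hard-coded tables are all hypothetical and chosen only for
illustration. It also returns a single mime type rather than a list,
and falls back to application/octet-stream when nothing matches, which
is my assumption rather than something the utility is known to do.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: extension lookup first, magic-number rules second.
final class TwoStageMimeSketch {

    /** One magic rule: these bytes are expected at this offset. */
    static final class MagicRule {
        final int offset;
        final byte[] magic;
        final String mimeType;

        MagicRule(int offset, byte[] magic, String mimeType) {
            this.offset = offset;
            this.magic = magic;
            this.mimeType = mimeType;
        }

        boolean matches(byte[] data) {
            if (offset + magic.length > data.length) {
                return false; // file too short to contain the magic number
            }
            for (int i = 0; i < magic.length; i++) {
                if (data[offset + i] != magic[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Illustrative extension table; a real one would be loaded from a
    // properties file so that entries can be overridden.
    static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("gif", "image/gif");
        EXTENSIONS.put("jpg", "image/jpeg");
        EXTENSIONS.put("svg", "image/svg+xml");
    }

    // Illustrative magic rules, checked in order; the first match wins.
    static final MagicRule[] RULES = {
        new MagicRule(0, "GIF8".getBytes(StandardCharsets.US_ASCII), "image/gif"),
        new MagicRule(0, new byte[] {(byte) 0xFF, (byte) 0xD8}, "image/jpeg"),
    };

    static String detect(String fileName, byte[] content) {
        // Step 1: cheap lookup by file extension, if there is one.
        int dot = fileName.lastIndexOf('.');
        if (dot >= 0 && dot < fileName.length() - 1) {
            String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            String byExtension = EXTENSIONS.get(ext);
            if (byExtension != null) {
                return byExtension;
            }
        }
        // Step 2: extension missing or unregistered, so try the magic rules.
        for (MagicRule rule : RULES) {
            if (rule.matches(content)) {
                return rule.mimeType;
            }
        }
        return "application/octet-stream"; // assumed fallback for "unknown"
    }

    public static void main(String[] args) {
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        // ".img" is not registered, so detection falls through to the rules.
        System.out.println(detect("logo.img", gif));
    }
}
```

With a renamed file such as logo.img the extension step finds nothing,
so the content is checked against the rules instead, which is the
situation the CMS images above were in.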
You can use the methods to force only the second match, i.e. magic
number matching, on all files if you want, but the intention is to
provide a fast utility that does a best effort guess. In my tests I
have been able to achieve a 100% match on a wide range of files using
the more expensive magic number matching and over 90% match using a
mixture of both extension and magic number matching. For me this was
sufficient and did not require me to make too many changes or
additions to compensate for the existing magic.mime file. I did
however create an extension to the rules which allows a fuzzy match
to occur, i.e. "I think this information should be somewhere within
the first 1K of data in this file".
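
One way to read the fuzzy match idea is "look for the pattern anywhere
in the first 1K instead of at one fixed offset". The sketch below shows
that reading in Java. It is only my interpretation: the class
FuzzyMagicSketch, the containsWithinWindow method and the exact
1024-byte window are hypothetical names and choices, not the rule
syntax or code that mime-util uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a "somewhere in the first 1K" match.
final class FuzzyMagicSketch {

    static final int WINDOW = 1024; // assumed size of the search window

    /**
     * Returns true if the pattern occurs entirely within the first
     * WINDOW bytes of data, at any offset.
     */
    static boolean containsWithinWindow(byte[] data, byte[] pattern) {
        if (pattern.length == 0) {
            return false; // an empty pattern is treated as "no match"
        }
        // Last start position at which the whole pattern still fits.
        int lastStart = Math.min(data.length, WINDOW) - pattern.length;
        for (int start = 0; start <= lastStart; start++) {
            int i = 0;
            while (i < pattern.length && data[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // An SVG file often has an XML prolog before the <svg element,
        // so a fixed-offset rule would not find it.
        byte[] svg = "<?xml version=\"1.0\"?>\n<svg xmlns=\"http://www.w3.org/2000/svg\"/>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] pattern = "<svg".getBytes(StandardCharsets.US_ASCII);
        System.out.println(containsWithinWindow(svg, pattern));
    }
}
```

A simple scan like this is fine for a 1K window; it costs more than a
fixed-offset comparison, which fits the point above that magic number
matching is the more expensive of the two methods.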
Anyway, if anybody wants to use it, it's there under an Apache 2.0
license, i.e. FREE for ALL, and if you have any requests for changes,
additions etc. please feel free to comment.