Identifying File type by reading files

hokiegal99 · Dec 26, 2003

This is not really a Python-centric question; however, I am using
Python to solve this problem (as of now), so I thought it appropriate
to pose the question here.

I have some functions that search for files that contain certain
strings. If a file found to contain these strings does not already
have a filename extension (such as '.doc' or '.xls'), the function
appends the matching extension and renames the file. So, if a file
named 'report' was found to contain the string 'Microsoft' and the
string 'Word.Document.' (notice the '.' at the end of both words) and
it does not already have an extension, it would be renamed to
'report.doc'.
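
The behavior described above could be sketched roughly as follows. This is not the poster's actual code, which is not shown in the thread; the names rename_if_untyped and MARKERS are hypothetical, and the marker table only contains the one example given.

```python
import os

# Hypothetical marker table: extension -> byte strings that must ALL be present.
# Only the '.doc' example from the post is listed here.
MARKERS = {
    '.doc': (b'Microsoft', b'Word.Document.'),
}

def rename_if_untyped(path, markers=MARKERS):
    """Append an extension to `path` if it has none and all markers match.

    Returns the new path, or the original path if nothing was renamed.
    """
    _, ext = os.path.splitext(path)
    if ext:
        return path  # already has an extension; leave it alone
    with open(path, 'rb') as f:
        data = f.read()
    for new_ext, needles in markers.items():
        if all(needle in data for needle in needles):
            new_path = path + new_ext
            os.rename(path, new_path)
            return new_path
    return path  # no marker set matched
```

Called on a file named 'report' that contains both strings, this would rename it to 'report.doc'; a file that already has an extension is returned untouched.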

These functions work very well on most files (98% guessed correctly).
However, I would like the functions to be more precise (100%). So,
what should I look for in a file to determine whether it is an MS
Word file, an Excel file, a PDF file, etc.? Below is a list of some
of the strings I use to ID files, but I can't help but think there
must be a more precise way of doing this. I know of the Unix 'file'
command. It is not very useful to me, as it doesn't distinguish
between MS Office documents... all .xls, .doc and .ppt files are just
MS documents to it.

Are there certain sets of binary data that are unique to file types
and would be a better way of identifying them? For example, on line N
of an MS doc file, beginning at position X, a binary string that is L
digits in length and that begins with B and ends with E will *ALWAYS*
be present... someone tell me that I'm not dreaming and that
something like the above example exists???

A few of my string searches today:

import os

# root and fname come from the surrounding directory walk.
# Read the file once; bytes.find() returns -1 when the marker is absent.
with open(os.path.join(root, fname), 'rb') as f:
    data = f.read()
doc = data.find(b'Word.Document.')
xls = data.find(b'Excel.Sheet.')
pdf = data.find(b'PDF-1.')
jpg = data.find(b'JFIF')

Any suggestions or information that better describes how to positively
ID files without the possibility of mistake would be very helpful to
me. As of now, some of my files, though not many (~2%), are given the
wrong extension. The logic of the functions is such that they append
every extension that probably applies to the file, so at that point it
is a simple process of elimination to determine which extension is
actually the correct one. Normally, I never have more than 2 unique
extensions attached to the same file.

Thank you!!!

Robin Munn · Dec 26, 2003

Glutton for punishment, aren't you?

Seriously, that is a non-trivial problem. If that's what you're trying
to do, though, the file format documentation at http://www.wotsit.org/
may be useful to you. Good luck!
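
As a rough illustration of what such format documentation gives you: many formats start with a fixed byte signature (a "magic number"), which is the kind of always-present binary string asked about above. The sketch below is only an assumption-laden example, not something posted in the thread. The names SIGNATURES, OLE2_MARKERS and sniff are hypothetical, the leading-byte values are the commonly published ones for PDF, JPEG, gzip and the OLE2 container used by legacy MS Office files, and the 'Word.Document.' / 'Excel.Sheet.' strings are the ones from the original post.

```python
# Leading-byte signatures ("magic numbers") for a few common formats.
SIGNATURES = [
    (b'%PDF-', 'pdf'),
    (b'\xff\xd8\xff', 'jpg'),
    (b'\x1f\x8b', 'gz'),
    (b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1', 'ole2'),  # legacy MS Office container
]

# Marker strings from the original post, used only to tell OLE2 files apart.
OLE2_MARKERS = [
    (b'Word.Document.', 'doc'),
    (b'Excel.Sheet.', 'xls'),
]

def sniff(data):
    """Return a short type label for `data` (bytes), or None if unknown."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            if label != 'ole2':
                return label
            # All legacy Office files share the container signature,
            # so fall back to the string markers to narrow it down.
            for marker, ext in OLE2_MARKERS:
                if marker in data:
                    return ext
            return 'ole2'  # an Office-style container, exact type unclear
    return None

def sniff_file(path):
    """Read `path` in binary mode and classify its contents with sniff()."""
    with open(path, 'rb') as f:
        return sniff(f.read())
```

Checking the leading bytes first means a plain text file that merely mentions 'Word.Document.' is no longer mistaken for a Word file. It is still not a 100% guarantee: a Word file with an embedded Excel sheet, for instance, may contain both markers, and here the first match in OLE2_MARKERS wins.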

hokiegal99 · Dec 27, 2003

WOW. That's a great site. Thanks for the info!!!

Nicolas Favre-Félix · Dec 27, 2003

Try getting the source code of the open-source 'file' program.
Here are some examples of its output:

$ file sqlite.bin
sqlite.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for
GNU/Linux 2.2.5, statically linked, stripped

$ file doc.pdf
doc.pdf: PDF document, version 1.4

$ file archive.tar.gz
archive.tar.gz: gzip compressed data, from Unix

$ file music.mp3
music.mp3: MP3, 128 kBits, 44.1 kHz, JStereo

If you plan to run your script in a UNIX environment, you could take
this approach. If it must be portable, you can compile a version of
file.c for each platform and call it with popen.
Or, if you have a lot of free nights to spend on it, you can study the
C source code and port it to Python.
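
Calling 'file' from Python might look something like the sketch below. It uses the standard subprocess module (Python 3.7 or later) as the modern stand-in for popen; the function name file_type is hypothetical, and the code assumes a 'file' executable is on the PATH. The -b option asks 'file' for the brief description without the leading filename.

```python
import subprocess

def file_type(path, command='file'):
    """Return the description printed by `file` for `path`, or None on failure."""
    try:
        result = subprocess.run(
            [command, '-b', path],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # The command is missing, not executable, or exited with an error.
        return None
    return result.stdout.strip()
```

For the doc.pdf example above, file_type('doc.pdf') would be expected to return something like 'PDF document, version 1.4', depending on the installed version of 'file'.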

Good luck!
