Identifying File type by reading files

hokiegal99

This is not really a Python-centric question. However, I am using
Python to solve this problem (as of now), so I thought it appropriate to
pose the question here.

I have some functions that search for files that contain certain
strings. If a file found to contain these strings does not already
have a filename extension (such as '.doc' or '.xls'), the function
appends the matching extension and renames the file. So, if a file
named 'report' was found to have the string 'Microsoft' and the string
'Word.Document.' (notice the '.' at the end of both words) and it does
not already have an extension, then a rename would take place that
would name the file 'report.doc'.

These functions work very well on most files (98% guessed correctly).
However, I would like the functions to be more precise (100%). So,
what should I look for in a file to determine whether or not it is a
MS Word file or an Excel file or a PDF file, etc.? Below is a
list of some of the strings I use to ID files, but I can't help but
wonder whether there is a more precise way of doing this. I know of
the Unix 'file' command. It is not very useful for me as it doesn't
distinguish between MS Office documents... all .xls, .doc and .ppt
files are just MS documents to it.

Are there certain sets of binary data that are unique to files that
would be a better way of identifying them? For example, on the N line
of a MS doc file beginning at position X a binary string that is L
digits in length that begins with B and ends with E will *ALWAYS* be
present... someone tell me that I'm not dreaming and that something
like the above example exists???

A few of my string searches today:

import os

# Read the file once, then search the raw bytes for each marker.
# find() returns the offset of the first match, or -1 if there is none.
with open(os.path.join(root, fname), 'rb') as f:
    data = f.read()
doc = data.find(b'Word.Document.')
xls = data.find(b'Excel.Sheet.')
pdf = data.find(b'PDF-1.')
jpg = data.find(b'JFIF')

Any suggestions or information that better describes how to positively
ID files w/o the possibility of mistake would be very helpful to me. As
of now, some of my files, though not many (~ 2%) will be given the
wrong extension, but the logic of the functions is such that they
append any extension that probably applies to the file so at that
point it is a simple process of elimination to determine which
extension is actually the correct one. Normally, I never have more
than 2 unique extensions attached to the same file.

Thank you!!!
 
Robin Munn

hokiegal99 said:
Are there certain sets of binary data that are unique to files that
would be a better way of identifying them? For example, on the N line
of a MS doc file beginning at position X a binary string that is L
digits in length that begins with B and ends with E will *ALWAYS* be
present... someone tell me that I'm not dreaming and that something
like the above example exists???

A few of my string searches today:

import os

# Read the file once, then search the raw bytes for each marker.
# find() returns the offset of the first match, or -1 if there is none.
with open(os.path.join(root, fname), 'rb') as f:
    data = f.read()
doc = data.find(b'Word.Document.')
xls = data.find(b'Excel.Sheet.')
pdf = data.find(b'PDF-1.')
jpg = data.find(b'JFIF')

Any suggestions or information that better describes how to positively
ID files w/o the possibility of mistake would be very helpful to me. As
of now, some of my files, though not many (~ 2%) will be given the
wrong extension, but the logic of the functions is such that they
append any extension that probably applies to the file so at that
point it is a simple process of elimination to determine which
extension is actually the correct one. Normally, I never have more
than 2 unique extensions attached to the same file.

Glutton for punishment, aren't you? :)

Seriously, that is a non-trivial problem. If that's what you're trying
to do, though, the file format documentation at http://www.wotsit.org/
may be useful to you. Good luck!
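
To give a rough idea of what such format documentation tends to boil
down to, here is a minimal Python sketch. It assumes the commonly
documented leading signatures ('%PDF-' for PDF, FF D8 FF for JPEG, and
the 8-byte OLE2 compound-document header shared by legacy .doc, .xls
and .ppt files), and it tries to tell the Office formats apart by
looking for the stream names stored as UTF-16LE inside the container.
The names SIGNATURES, OLE2_STREAMS and guess_extension are made up for
this illustration, and the stream-name search is a heuristic, not a
real parse of the OLE2 directory.

```python
# Leading-byte signatures for a few common formats (illustrative, not exhaustive).
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\x1f\x8b", ".gz"),
]

# Legacy MS Office files are OLE2 compound documents and share this header.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Stream names that usually identify the application; OLE2 stores
# directory entry names as UTF-16LE. Checked in this order.
OLE2_STREAMS = [
    ("WordDocument", ".doc"),
    ("Workbook", ".xls"),
    ("PowerPoint Document", ".ppt"),
]


def guess_extension(data):
    """Return a guessed extension for the raw bytes, or None if unknown."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    if data.startswith(OLE2_MAGIC):
        # Heuristic: search the whole container for a known stream name.
        for stream, ext in OLE2_STREAMS:
            if stream.encode("utf-16-le") in data:
                return ext
    return None
```

Anchoring the check at offset 0 instead of searching the whole file
should avoid most false hits from strings that merely appear somewhere
in the content. The Office part can still guess wrong, for instance on
a Word file with an embedded spreadsheet, or on older Excel files that
use a different stream name, so treat the result as a guess.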
 
Nicolas Favre-Félix

Try to get the source code of the GNU program 'file'.
Here are some examples:

$ file sqlite.bin
sqlite.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for
GNU/Linux 2.2.5, statically linked, stripped

$ file doc.pdf
doc.pdf: PDF document, version 1.4

$ file archive.tar.gz
archive.tar.gz: gzip compressed data, from Unix

$ file music.mp3
music.mp3: MP3, 128 kBits, 44.1 kHz, JStereo

If you plan to run your script in a UNIX environment, you could go
this way. If it must be portable, you can compile different versions of
file.c and call them with popen.
Or, if you have a lot of free nights to spend on it, you can study the
C source code and convert it to Python.
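
Calling the program from Python could look roughly like the sketch
below. It uses subprocess.run (the modern replacement for popen) and
assumes a 'file' executable is on the PATH; the function name
file_description and the file_cmd parameter are made up for this
example. The '-b' flag asks 'file' to print only the description
without the leading filename.

```python
import subprocess


def file_description(path, file_cmd="file"):
    """Return the one-line description printed by `file`, or None on failure."""
    try:
        result = subprocess.run(
            [file_cmd, "-b", path],
            capture_output=True,  # collect stdout/stderr instead of printing
            text=True,            # decode the output to str
            check=True,           # raise if the command exits with an error
        )
    except (OSError, subprocess.CalledProcessError):
        # The command is missing, not executable, or reported a failure.
        return None
    return result.stdout.strip()
```

The returned text is whatever the installed version of 'file' prints,
so any matching on it (for example, looking for 'PDF document') depends
on that version's wording.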

Good luck!
 
