Problem using the tarfile module to open *.tar.gz files - unreliable?

m_ahlenius

Hi,

I am relatively new to doing serious work in Python. I am using it to
access a large number of log files. Some of the logs get corrupted,
and I need to detect that when processing them. This code seems to
work for quite a few of the logs (all with the same structure). It also
correctly identifies some corrupt logs, but it flags others as corrupt
when they are not.

Example error message from the code below:

Could not open the log file: '/disk/7-29-04-02-01.console.log.tar.gz'
Exception: CRC check failed 0x8967e931 != 0x4e5f1036L

When I manually examine the supposedly corrupt log file and run
"tar -xzvof /disk/7-29-04-02-01.console.log.tar.gz" on it, it opens
just fine.

Is there anything wrong with how I am using this module? (extra code
removed for clarity)

if tarfile.is_tarfile( file ):
    try:
        xf = tarfile.open( file, "r:gz" )
        for locFile in xf:
            logfile = xf.extractfile( locFile )
            validFileFlag = True
            # iterate through each log file, grab the first and the last lines
            lines = iter( logfile )
            # NB: lines.next() raises StopIteration on an empty member,
            # and the bare "except Exception" below would report that
            # as a corrupt archive
            firstLine = lines.next()
            for nextLine in lines:
                ....
                continue

            logfile.close()
        ...
        xf.close()
    except Exception, e:
        validFileFlag = False
        msg = ("\nCould not open the log file: " + repr(file) +
               " Exception: " + str(e) + "\n")
else:
    validFileFlag = False
    lTime = extractFileNameTime( file )
    msg = ">>>>>>> Warning " + file + " is NOT a valid tar archive\n"
print msg
 
Dave Angel

m_ahlenius said:
I am using it to access a large number of log files. Some of the logs
get corrupted, and I need to detect that when processing them. [...]
It also correctly identifies some corrupt logs, but it flags others as
corrupt when they are not. [...] When I manually examine the supposedly
corrupt log file and run "tar -xzvof /disk/7-29-04-02-01.console.log.tar.gz"
on it, it opens just fine.
I haven't used tarfile, but this feels like a problem with the Win/Unix
line endings. I'm going to assume you're running on Windows, which
could trigger the problem I'm going to describe.

You use 'file' to hold something, but don't show us what. In fact, it's
a lousy name, since it's already a Python builtin. But if it's holding
a fileobj that you've separately opened, then you need to change that
open to use mode 'rb'.

The problem, if I've guessed right, is that occasionally you'll
accidentally encounter a 0d0a sequence in the middle of the (binary)
compressed data. If you're on Windows, and use the default 'r' mode,
it'll be changed into a 0a byte. Thus corrupting the checksum, and
eventually the contents.

DaveA
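
To make Dave's point concrete, here is a minimal sketch of the two ways
of handing tarfile its input (the path name is made up). Passing a
filename string lets the library open the file itself; passing a fileobj
means the caller must open it in binary mode:

import tarfile

# Passing a path: tarfile opens the file itself, in binary mode.
xf = tarfile.open("/disk/example.console.log.tar.gz", "r:gz")
xf.close()

# Passing an already-open file object: it must be opened with 'rb',
# otherwise on Windows any 0d0a in the compressed stream gets mangled.
f = open("/disk/example.console.log.tar.gz", "rb")
xf = tarfile.open(fileobj=f, mode="r:gz")
xf.close()
f.close()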
 
m_ahlenius

Dave Angel said:
I haven't used tarfile, but this feels like a problem with the Win/Unix
line endings. I'm going to assume you're running on Windows, which
could trigger the problem I'm going to describe. [...] If it's holding
a fileobj that you've separately opened, then you need to change that
open to use mode 'rb'.

Hi,

Thanks for the comments - I'll change the variable name.

I am running this on Linux, so I don't think it's a Windows issue. If
that's the case, is the 0d0a sequence still an issue?

'mark
 
m_ahlenius

Oh, and as for what's currently stored in the file var: it's just the
unopened pathname of the target file I want to open.
 
Dave Angel

m_ahlenius said:
Oh, and as for what's currently stored in the file var: it's just the
unopened pathname of the target file I want to open.
No, on Linux, there should be no such problem. And I have to assume
that if you pass the filename as a string, the library would use 'rb'
anyway. It's just if you pass a fileobj, AND are on Windows.

Sorry I wasted your time, but nobody else had answered, and I hoped it
might help.

DaveA
 
Peter Otten

m_ahlenius said:
Oh, and as for what's currently stored in the file var: it's just the
unopened pathname of the target file I want to open.

Random questions:

What python version are you using?
If you have other python versions around, do they exhibit the same problem?
If you extract and recompress your data using the external tool, does the
resulting file cause problems in Python, too?
If so, can you reduce data size and put a small demo online for others to
experiment with?

Peter
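
Since the reported failure is a gzip CRC error, one way to approach
Peter's third question is to test the gzip layer on its own, separately
from the tar structure. A minimal sketch (the path is a placeholder):

import gzip

# Reading the stream to EOF forces GzipFile to verify the stored CRC;
# a damaged stream raises IOError ("CRC check failed").
g = gzip.open("/disk/suspect.console.log.tar.gz", "rb")
try:
    while g.read(65536):
        pass
    print "gzip layer looks OK"
except IOError, e:
    print "gzip layer is corrupt:", e
g.close()

If this passes but tarfile still raises, the problem is in the tar
structure rather than the compression.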
 
m_ahlenius

Dave Angel said:
No, on Linux, there should be no such problem. [...] Sorry I wasted
your time, but nobody else had answered, and I hoped it might help.

Hi Dave

Thanks for responding - you were not wasting my time; you helped me
become aware of other potential issues.

Appreciate it much.

It's just weird that it works for most files and even finds corrupt
ones, yet some of the ones it marks as corrupt seem to be OK.

thanks

'mark
 
m_ahlenius

Peter Otten said:
What python version are you using? [...] If so, can you reduce data
size and put a small demo online for others to experiment with?

Hi,

I am using Python 2.6.5.

Unfortunately I don't have other versions installed, so it's hard to
test with a different version.

As for the log compression, it's a bit hard to test. Right now I may
process 100+ of these logs per night, and will get maybe 5 which are
reported as corrupt (typically a bad CRC) and 2 which are reported as a
bad tar archive. This morning I checked each of the 7 reported
problem files by manually opening them with "tar -xzvof", and they were
all indeed corrupt. Sigh.

Unfortunately, due to the nature of our business, I can't post the data
files online; I hope you can understand. But I really appreciate your
suggestions.

The thing that gets me is that it seems to work just fine for most
files but not for others. Labeling normal files as corrupt hurts us,
as we then skip getting any log data from those files.

appreciate all your help.

'mark
 
Peter Otten

m_ahlenius said:
Right now I may process 100+ of these logs per night, and will get
maybe 5 which are reported as corrupt (typically a bad CRC) and 2 which
are reported as a bad tar archive. This morning I checked each of the
7 reported problem files by manually opening them with "tar -xzvof",
and they were all indeed corrupt. [...]
So many corrupted files? I'd say you have to address the problem with your
infrastructure first.
The thing that gets me is that it seems to work just fine for most
files but not for others. [...]

I've written an autocorruption script,

import sys
import subprocess
import tarfile

def process(source, dest, data):
    # Flip each bit of the archive in turn and look for a mutation
    # that tarfile rejects but the external tar still accepts.
    for pos in range(len(data)):
        for bit in range(8):
            new_data = (data[:pos] +
                        chr(ord(data[pos]) ^ (1 << bit)) +
                        data[pos+1:])
            assert len(data) == len(new_data)
            out = open(dest, "wb")  # binary mode: this is compressed data
            out.write(new_data)
            out.close()
            try:
                t = tarfile.open(dest)
                for f in t:
                    t.extractfile(f)
            except Exception, e:
                # tarfile choked; does the system tar cope?
                if 0 == subprocess.call(["tar", "-xf", dest]):
                    return pos, bit

if __name__ == "__main__":
    source, dest = sys.argv[1:]
    data = open(source, "rb").read()
    print process(source, dest, data)

and I can indeed construct an archive that is rejected by tarfile, but not
by tar. My working hypothesis is that the python library is a bit stricter
in what it accepts...

Peter
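
For anyone wanting to try it, the script would be invoked along these
lines (the script name is a placeholder; pass a known-good archive and
a scratch path the script may overwrite):

python autocorrupt.py good.tar.gz scratch.tar.gz

It prints the first (position, bit) whose flip makes tarfile raise
while the external tar still extracts cleanly, or None if no single-bit
corruption separates the two.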
 
m_ahlenius

Peter Otten said:
I've written an autocorruption script, [...] and I can indeed construct
an archive that is rejected by tarfile, but not by tar. My working
hypothesis is that the python library is a bit stricter in what it
accepts...

Thanks - that's cool.

A friend of mine was suggesting that he's seen similar behaviour when
he uses Perl on these types of files, when the OS (Unix) has not
finished writing them. We have an rsync process which syncs these files
down from our servers, and they arrive at somewhat random times. So it's
conceivable, I think, that this process could be trying to open a file
while it is still being written. I know it sounds like a stretch, but
my guess is that it's a possibility. I could verify that by comparing
the timestamps of the errors in my log with the mod times on the
original files.

'mark
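
This is a plausible lead: tarfile would see a truncated gzip stream and
report a CRC or structure error even though the fully-delivered file is
fine. A minimal sketch of the kind of guard mark describes (the helper
name is made up):

import os
import time

def looks_complete(path, wait=2.0):
    # Heuristic: if size and mtime hold still for a short interval,
    # assume the transfer that produced the file has finished.
    before = os.stat(path)
    time.sleep(wait)
    after = os.stat(path)
    return (before.st_size, before.st_mtime) == (after.st_size, after.st_mtime)

Note, though, that a stock rsync writes to a temporary name and renames
the file into place when done, so half-written files should normally
only be visible under options such as --inplace.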
 
