OW Ghim Siong
Hi all,
I have a big file, 1.5 GB in size, with about 6 million lines of
tab-delimited data. I have to perform some filtering on the data and
keep only the good rows; after filtering, about 5.5 million rows remain.
As you might have guessed, I have to read the file in batches, and I
did so using .readlines(100000000). After reading each batch, I split
each line (a string) into a list using .split("\t") and then check
several conditions; if all conditions are satisfied, I store the list
into a matrix.
The code is as follows:
-----Start------
a = open("bigfile")
matrix = []
while True:
    lines = a.readlines(100000000)
    for line in lines:
        data = line.split("\t")
        if several_conditions_are_satisfied:
            matrix.append(data)
    print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
    if len(lines) == 0:
        break
-----End-----
Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368
Traceback (most recent call last):
MemoryError
The peak memory usage shown in the Task Manager exceeds 2 GB, which
results in the MemoryError.
However, if I modify the code to store a list of strings rather than
a list of lists, by changing the append statement above to
matrix.append("\t".join(data)), then I do not run out of memory.
Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368
Number of lines read: 453455 matrix.__sizeof__: 12552984
Number of lines read: 432440 matrix.__sizeof__: 14122132
Number of lines read: 432921 matrix.__sizeof__: 15887424
Number of lines read: 464259 matrix.__sizeof__: 17873376
Number of lines read: 450875 matrix.__sizeof__: 20107572
Number of lines read: 458552 matrix.__sizeof__: 20107572
Number of lines read: 453261 matrix.__sizeof__: 22621044
Number of lines read: 413456 matrix.__sizeof__: 22621044
Number of lines read: 166464 matrix.__sizeof__: 25448700
Number of lines read: 0 matrix.__sizeof__: 25448700
In this case, the peak memory usage according to the Task Manager is about 1.5 GB.
Does anyone know why there is such a big difference in memory usage
between storing the matrix as a list of lists and storing it as a list
of strings? According to __sizeof__, though, the values are the same in
both cases. Is there any method by which I can store all the
information as a list of lists? I have tried creating a matrix of
equivalent size and it only uses 35 MB of memory, but I am not sure
why, when using the code above, the memory usage shoots up so fast and
exceeds 2 GB.
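An aside on the measurement itself: if I understand correctly,
__sizeof__ on a list is shallow, i.e. it reports only the outer
pointer array and not the objects the list references, which would
explain why both versions report the same values. A minimal sketch
illustrating the per-object overhead of a split row (the exact numbers
are platform-dependent):
-----Start------
import sys

row = "a\tb\tc\td\te"
as_string = row               # one string object per row
as_list = row.split("\t")     # a list object plus five small strings

print "string:", sys.getsizeof(as_string)
print "list (shallow):", sys.getsizeof(as_list)
# counting the element strings as well shows where the extra memory goes
print "list + elements:", sys.getsizeof(as_list) + sum(sys.getsizeof(s) for s in as_list)
-----End-----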
Any advice is greatly appreciated.
Regards,
Jinxiang