Rich_Elswick
Hi all,
I am parsing large data sets (62 gigs in one file). I can parse them out
into smaller files fine with Perl, which is what we have to do anyway
(i.e. hex data becomes ASCII .csv files of the different decoded
variables). I am working with CAN data, for those who know about
Controller Area Networks, collected by Vector CANalyzer.
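For reference, the parse step itself is just a line-by-line pass over the raw log, so it never needs much memory. A minimal sketch is below; decode_line() is a stand-in for our actual hex-to-signal decoding, not the real code:

#!/usr/bin/perl
use strict;
use warnings;

# Stream the raw CANalyzer log line by line so memory stays flat
# no matter how big the input file gets.
my ($in, $out) = @ARGV;
open my $raw, '<', $in  or die "Can't read $in: $!";
open my $csv, '>', $out or die "Can't write $out: $!";
print {$csv} "time,signal,value\n";

while (my $line = <$raw>) {
    chomp $line;
    # decode_line() is hypothetical: it would return one or more
    # [time, signal, value] records decoded from the raw hex frame.
    my @records = decode_line($line);
    print {$csv} join(',', @$_), "\n" for @records;
}
close $raw;
close $csv;

sub decode_line {
    my ($line) = @_;
    # ... real hex-to-engineering-units conversion goes here ...
    return ();
}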
After they are parsed out, the largest data file (one file becomes ~100
smaller files) is about 2 gigs right now, but who knows how large it
could become in the future. I then use GDGraph to run through the data
files and rapidly generate some .png files for review (I have issues
with this as well and will post those questions some other time). I run
this on the whole batch of ~100 files, going through each file one at a
time, using a batch program to call a separate Perl program for each
GDGraph run, because GDGraph loads the entire data set into memory
before graphing it. This limits the method to data files smaller than
~20 megs, based on system memory. I suppose I could add memory to the
machine, but that 1. costs money, 2. means requesting it from IT (not
easy), and 3. still doesn't work with a 2 gig file. A sketch of what
the per-file graphing step looks like follows.
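To make the memory problem concrete, here is roughly what each per-file run does (a minimal sketch of typical GD::Graph::lines usage; the two-column CSV layout is an assumption, not our real format). Both @x and @y must be fully built before plot() is called, which is why memory scales with file size:

#!/usr/bin/perl
use strict;
use warnings;
use GD::Graph::lines;

# Read the whole CSV into arrays, then hand them to GD::Graph.
# Every sample lives in memory before the graph is drawn.
my ($csv_file, $png_file) = @ARGV;
my (@x, @y);
open my $fh, '<', $csv_file or die "Can't read $csv_file: $!";
<$fh>;                                        # skip header line
while (<$fh>) {
    chomp;
    my ($time, $value) = (split /,/)[0, 1];   # assumed column layout
    push @x, $time;
    push @y, $value;
}
close $fh;

my $graph = GD::Graph::lines->new(1024, 768);
my $gd    = $graph->plot([ \@x, \@y ]) or die $graph->error;

open my $png, '>', $png_file or die "Can't write $png_file: $!";
binmode $png;
print {$png} $gd->png;
close $png;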
I was wondering two things:
1. Is there a better way of graphing this data, which uses less memory?
2. What is everyone else out there using?
Please, no comments about just sampling the data (one line out of every
five or so) and graphing the sampled data; we have already considered
this, and it may end up being how we resolve the issue.
Thanks,
Rich Elswick
Test Engineer
Cobasys LLC
http://www.cobasys.com