hi
I was hoping for a few pointers on how best to go about
processing large CSV files.
The files:
- a typical file has 100K-500K records
- approx. 150 chars per line (10 'fields')
- so file sizes of 15MB-75MB.
The processing:
- summarise numerical fields based on conditions applied to other
fields
- the resulting summary is a table: its column headers are the list
of unique values of one field, and its row headers are decided by
conditions on other fields (this may include lookups, exclusions,
reclassifications). A rough sketch of the structure I'm picturing
follows.
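To give an idea of the output, here's the kind of structure I'm
picturing for holding the summary. The field values and the choice
of a nested map are just assumptions on my part:

#include <map>
#include <string>

int main()
{
    // rows keyed by a classification derived from several fields,
    // columns keyed by the unique values of one field,
    // cells accumulating a numeric total
    std::map<std::string, std::map<std::string, double>> summary;

    // hypothetical entries, just to show the shape of the table
    summary["Reclassified-A"]["East"] += 123.45;
    summary["Excluded"]["West"] += 67.89;
}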
Taking into account the size of the files and the number of
operations required on each record...
What, if any, of these considerations do I need to take into
account:
-is the file read in line by line, or in one go? (there's a rough
sketch of the line-by-line approach I'm imagining at the end of
this list)
+if in one go, would there be issues with available memory?
+if it's line by line, is there a significant difference in the
time taken to process? (from my limited personal experience with
VBA, reading/writing a cell at a time in a spreadsheet is far
slower than reading/writing in 'batches')
+or would it be an idea to read a limited number in one go?
e.g. deal with 20,000 records at a time in memory
I suppose this question demonstrates a lack of experience with C++,
but hey, that's why I'm posting in the learner's forum.
-however much of the file is read, is it worth writing a bespoke
solution, or looking for a parser/class that's already been written
for CSV files?
+perhaps there is a module that I can import?
+since the CSV files are *supposed* to be of a standard format,
would there be much to gain in writing something specific to them?
The aim would be to reduce processing time.
-data types... should I read the value fields as floating-point
numbers (range approx. +/- 500000.00)?
+will using floating-point data types save memory?
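For what it's worth, here is a very rough sketch of the line-by-line
approach I was imagining. The file name, the field positions and the
split-on-plain-commas are all assumptions on my part (a proper CSV
parser would also handle quoted fields, embedded commas, etc.):

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// split one CSV line on commas (no quoting/escaping handled;
// this assumes the files really are plain comma-separated)
std::vector<std::string> split(const std::string& line)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        fields.push_back(field);
    return fields;
}

int main()
{
    std::ifstream in("data.csv");   // hypothetical file name
    std::map<std::string, std::map<std::string, double>> summary;

    std::string line;
    while (std::getline(in, line))  // one record at a time
    {
        std::vector<std::string> f = split(line);
        if (f.size() < 10)
            continue;               // skip malformed records

        // made-up field positions: column key from field 0,
        // row key from field 1 (real code would apply the
        // lookups/exclusions/reclassifications here),
        // numeric value in field 9
        const std::string& col = f[0];
        const std::string& row = f[1];
        double value = std::stod(f[9]);

        summary[row][col] += value; // only totals stay in memory
    }

    for (const auto& r : summary)
        for (const auto& c : r.second)
            std::cout << r.first << ',' << c.first << ','
                      << c.second << '\n';
}

The idea being that only the running totals are kept in memory, so
whether the file is 15MB or 75MB shouldn't matter much. Is that the
right way to think about it?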
As anyone reading this can probably tell, I'm still quite new to the
language and am missing some of the basics, which I've had a bit of
trouble finding answers to.
Any advice would be much appreciated!
Brz