Best way to parse delimited data from a file.

DeveloperDave · Feb 18, 2010

Ok, so I'm sure this is a common task, but I have been unable to find
a good algorithm for it. Currently I just have some horrible-looking
nested 'if' blocks, so hopefully someone can help me out with a more
elegant solution.

I have a large binary file. Essentially the binary file contains 1 or
more paragraphs. Each paragraph starts with a four byte sequence:

0x4F 0x67 0x67 0x53

In my C++ application I open a fstream to the file in binary mode. I
want to iterate through the file, and pass each paragraph into a new
class which will know what to do with it.

I have been using a read() to copy a chunk of data into a buffer, then
iterate through the buffer checking if each character matches 0x4F.
If it does, I then check to see if the next char matches 0x67 and so
on. This gives me four nested if blocks. I also have to deal with
buffering the data.

Can anyone offer a better solution written in native C/C++?

Thanks

DeveloperDave · Feb 18, 2010

Something like the following?

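switch (state)
{
case Contents:
    if (ch == 0x4F)
        state = MaybeParagraph1;
    break;
case MaybeParagraph1:
    if (ch == 0x67)
        state = MaybeParagraph2;
    else
        /* a repeated 0x4F may itself start a marker */
        state = (ch == 0x4F) ? MaybeParagraph1 : Contents;
    break;
... ... ...
case MaybeParagraph3:
    if (ch == 0x53)
    {
        /* handle start of new paragraph (process previous paragraph,
           which will be empty at the start of the file) */
        state = Contents;
    }
    else
        /* without this branch a mismatch would leave the state stuck */
        state = (ch == 0x4F) ? MaybeParagraph1 : Contents;
    break;
}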

Ah, so like a state machine (doh, I should have remembered that). I
suppose the other half of the question is: what is the best way to get
the data out of the stream? Should I copy it into a buffer and
iterate through it, or should I be using something like seekg/peek to
inspect the data on the stream and then read off the entire
paragraph once I know the start/end positions?

Cheers

red floyd · Feb 18, 2010

Use a buffer, std::vector<char> should suffice.

It's binary data, so std::vector<unsigned char> might be better.
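
As a rough illustration of the buffer suggestion, here is a minimal sketch that loads a whole file into a std::vector<unsigned char>. The function name read_all_bytes is made up for this example, and it assumes the file fits comfortably in memory and that a C++11 compiler is available.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): reads the whole file at `path`
// into memory. Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> read_all_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return {};
    // istreambuf_iterator reads raw characters and does not skip whitespace,
    // which matters for binary data.
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
    return data;
}
```

An empty result is ambiguous here (missing file or empty file); a real program would probably want to report the difference.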

Stefan Ram · Feb 18, 2010

DeveloperDave said:
0x4F 0x67 0x67 0x53
I have been using a read() to copy a chunk of data into a buffer, then
iterate through the buffer checking if each character matches 0x4F.

What if a chunk ends with 0x4F 0x67, and the next chunk
starts with 0x67 0x53?

Otherwise, you could use a slight variant of the strstr
algorithm or use repeated calls of strstr for each
0-terminated section of a chunk.
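
One way to cope with a marker split across two chunks is to keep the match state between read() calls instead of restarting it for every buffer. The sketch below is only an illustration: the class name MarkerScanner and its feed() interface are invented for this example, and the simple fallback on mismatch relies on 0x4F appearing only as the first byte of the marker.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the thread): finds the 4-byte marker in data
// that arrives in chunks, remembering partial matches between chunks.
class MarkerScanner
{
public:
    // Feed the next chunk. The absolute offset of every marker completed in
    // this chunk is appended to `found`.
    void feed(const unsigned char* data, std::size_t size,
              std::vector<std::uint64_t>& found)
    {
        static const unsigned char marker[4] = {0x4F, 0x67, 0x67, 0x53};
        for (std::size_t i = 0; i < size; ++i, ++offset_)
        {
            const unsigned char ch = data[i];
            if (ch == marker[matched_])
            {
                if (++matched_ == 4)
                {
                    found.push_back(offset_ - 3); // offset of the 0x4F byte
                    matched_ = 0;
                }
            }
            else
            {
                // 0x4F occurs only at the start of the marker, so after a
                // mismatch the current byte can at most begin a new match.
                matched_ = (ch == marker[0]) ? 1 : 0;
            }
        }
    }

private:
    std::size_t matched_ = 0;  // marker bytes matched so far
    std::uint64_t offset_ = 0; // absolute offset of the next byte to examine
};
```

The caller would create one MarkerScanner per file and pass each buffer filled by read() to feed() in order; the reported offsets are relative to the first byte ever fed.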

Maxim Yegorushkin · Feb 18, 2010

For avoiding 4 if-s, see memcmp() or std::equal().
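
A small sketch of what that could look like, assuming the data is already in one contiguous buffer. The names is_marker_at and find_markers are invented for this example. The second function uses std::search, which was not mentioned above; it is roughly std::equal tried at each position.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

static const unsigned char kMarker[4] = {0x4F, 0x67, 0x67, 0x53};

// Hypothetical helper: true if the marker starts at `p`. One std::equal call
// replaces the four nested ifs.
bool is_marker_at(const unsigned char* p, const unsigned char* end)
{
    return end - p >= 4 && std::equal(kMarker, kMarker + 4, p);
}

// Hypothetical helper: offsets of every marker in a fully loaded buffer.
std::vector<std::size_t> find_markers(const std::vector<unsigned char>& buf)
{
    std::vector<std::size_t> offsets;
    auto it = buf.begin();
    while ((it = std::search(it, buf.end(),
                             std::begin(kMarker), std::end(kMarker))) != buf.end())
    {
        offsets.push_back(static_cast<std::size_t>(it - buf.begin()));
        it += 4; // this marker cannot overlap itself, so skip past it
    }
    return offsets;
}
```

Each paragraph then runs from one reported offset up to the next one (or to the end of the buffer for the last paragraph).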

For avoiding buffering, use memory mapping to map the whole file into
memory (sorry not native C/C++, but platform-specific!). If the file is too
large to fit in the process address space, just switch to a 64-bit system!

Brilliant answer! Quite true that a 64-bit operating system allows you
to do I/O in an awesome novel way.

For reading one can map the file into memory and return two iterators to
the range mapped.

When writing, one can map the file into memory again and write directly
into memory without having to issue a write() syscall. Resizing a
memory-mapped file takes munmap(), ftruncate(), then mmap() again (i.e.
resizing invalidates iterators, but not offsets, just like for std::vector<>).

And the file size does not matter much (unless it's larger than 2^64; in
practice the limit is lower than 2^64 because system shared libraries
may get mapped in the middle of the 64-bit process address space) because
Unix'es (and probably Windoze) do demand paging, that is, mapped memory
only consumes physical memory when it is touched. For example, if you
mmap() a 64 GB file, no physical memory gets consumed until you start
accessing that memory.

For more details pls see: http://en.wikipedia.org/wiki/Demand_paging
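
For the read-only case, the "map the file and return two iterators" idea might look like the sketch below. It is POSIX-only (open, fstat, mmap, munmap), the class name MappedFile is made up for this example, and error handling is reduced to leaving the range empty.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper (not from the thread) around a read-only mapping.
// On any failure, or for an empty file, begin() == end().
class MappedFile
{
public:
    explicit MappedFile(const char* path)
    {
        const int fd = ::open(path, O_RDONLY);
        if (fd == -1)
            return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0)
        {
            void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                             PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED)
            {
                data_ = static_cast<const unsigned char*>(p);
                size_ = static_cast<std::size_t>(st.st_size);
            }
        }
        ::close(fd); // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile()
    {
        if (data_)
            ::munmap(const_cast<unsigned char*>(data_), size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const unsigned char* begin() const { return data_; }
    const unsigned char* end() const { return data_ + size_; }
    std::size_t size() const { return size_; }

private:
    const unsigned char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

The [begin(), end()) range can be handed to any search routine that works on a contiguous byte range, so no chunking or boundary handling is needed.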
