find a pattern in binary file

vizzz

Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I could write a small algorithm that scans the whole file for a match,
but the file is huge and I'm worried about performance.
Is there an STL function for a fast search?
Andrea
 
Kai-Uwe Bux

vizzz said:
Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I could write a small algorithm that scans the whole file for a match,
but the file is huge and I'm worried about performance.
Is there an STL function for a fast search?

You could try std::search() with istreambuf_iterator< unsigned char >.

However:

(a) It is not clear that you will get good performance. Some implementations
are not really all that good with stream iterators.

(b) I am not sure whether search() is allowed to use backtracking
internally, in which case you cannot use it with stream iterators. You
should check.

(c) Even if search finds an occurrence, it reports the result as an
iterator. I do not know of a convenient way to convert that into an offset.


Maybe, rolling your own is not all that bad. You could read the file in
chunks (keeping the last three characters from the previous block) and use
std::search() on the blocks. With the right blocksize, this could be really
fast.


If your OS allows memory mapping of the file, you could do that and use
std::search() with unsigned char * on the whole thing. That could be the
fastest way, but it leaves the realm of standard C++.


Best

Kai-Uwe Bux
 
Ivan

Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I could write a small algorithm that scans the whole file for a match,
but the file is huge and I'm worried about performance.
Is there an STL function for a fast search?
Andrea

Hmmm... I had a look at this and ran across a simple problem. How do
you read a binary file and just echo the hex for each byte to the screen?
The issue is that the C++ read function doesn't return the number of bytes
read... so on the last read into a buffer, how do you know how many
characters to print?

Thanks,
Ivan Novick
http://www.mycppquiz.com
 
Kai-Uwe Bux

Ivan said:
Hmmm... I had a look at this and ran across a simple problem. How do
you read a binary file and just echo the hex for each byte to the screen?

#include <iostream>
#include <ostream>
#include <fstream>
#include <iterator>
#include <iomanip>
#include <algorithm>
#include <cassert>

// Function object that prints each byte as two hex digits,
// line_length bytes per output line.
class print_hex {

    std::ostream * ostr_ptr;
    unsigned int line_length;
    unsigned int index;

public:

    print_hex ( std::ostream & str_ref, unsigned int length )
        : ostr_ptr( &str_ref )
        , line_length ( length )
        , index ( 0 )
    {}

    void operator() ( unsigned char ch ) {
        ++index;
        if ( index >= line_length ) {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << '\n';
            index = 0;
        } else {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << ' ';
        }
    }

};

int main ( int argn, char ** args ) {
    assert( argn == 2 );
    // Binary mode, so that no newline translation alters the bytes.
    std::ifstream in ( args[1], std::ios::binary );
    std::for_each( std::istreambuf_iterator< char >( in ),
                   std::istreambuf_iterator< char >(),
                   print_hex( std::cout, 25 ) );
    std::cout << '\n';
}

The issue is that the C++ read function doesn't return the number of bytes
read... so on the last read into a buffer how do you know how many
characters to print?

Have a look at readsome().



Best

Kai-Uwe Bux
 
James Kanze

Kai-Uwe Bux wrote:
You could try std::search() with istreambuf_iterator< unsigned char >.

That's very problematic. istreambuf_iterator< unsigned char >
will expect a basic_streambuf< unsigned char >, which isn't
defined by the standard (and you're not allowed to define it).
A number of implementations do provide a generic version of
basic_streambuf, but since the standard doesn't say what the
generic version should do, they tend to differ. (I remember
sometime back someone posting in fr.comp.lang.c++ that he had
problems because g++ and VC++ provide incompatible generic
versions.)

It would, I suppose, be possible to use istream_iterator<
unsigned char >, provided the file was opened in binary mode,
and you reset skipws. I have my doubts about the performance of
this solution, but it's probably worth a try---if the
performance turns out to be acceptable, you won't get much
simpler.

Except, of course, that search requires forward iterators, and
won't (necessarily) work with input iterators.

[...]
Maybe, rolling your own is not all that bad. You could read
the file in chunks (keeping the last three characters from the
previous block) and use std::search() on the blocks. With the
right blocksize, this could be really fast.

A lot depends on other possible constraints. He didn't say, but
his example was to look for 0x650A1010, not the sequence 0x65,
0x0A, 0x10, 0x10. If what he is really looking for is a four
byte word, correctly aligned, then as long as the block size is
a multiple of 4, he could use search() with an
iterator::value_type of uint32_t. For arbitrary positions and
sequences, on the other hand, some special handling might be
necessary for cases where the sequence spans a block boundary.
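
To make the aligned case concrete, here is a small sketch, assuming the block has already been read into a buffer of uint32_t and that only 4-byte-aligned positions count as matches. The function name find_aligned_word is hypothetical. The needle is built with memcpy from the bytes in file order, so the comparison does not depend on the host's byte order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the byte offset of the first 4-byte-aligned occurrence of the
// four bytes in `pat`, or -1 if there is none. (Hypothetical helper.)
long long find_aligned_word(const std::vector<std::uint32_t>& words,
                            const unsigned char (&pat)[4])
{
    // Build the needle from the bytes as they appear in the file, so it has
    // the same in-memory layout as the words in the buffer.
    std::uint32_t needle;
    std::memcpy(&needle, pat, sizeof needle);

    const auto hit = std::find(words.begin(), words.end(), needle);
    if (hit == words.end()) return -1;
    return static_cast<long long>(hit - words.begin()) * 4;
}
```

An occurrence that starts at an unaligned offset is deliberately not found by this version; that is the trade-off for comparing one word at a time.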

When I had to do something similar, I reserved a guard zone in
front of my buffer, and used a Boyer-Moore (BM) search in the buffer. When
the BM search would have taken me beyond the end of the buffer,
I copied the last N bytes of the buffer into the end of the
guard zone before reading the next block, and started my next
search from them. This would probably make keeping track of the
offset a bit tricky (I didn't need the offset), and for the best
performance on the system I was using then, I had to respect
alignment of the buffer as well, which also added some extra
complexity. (But I got the speed we needed:).)
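
For readers who have not seen this family of algorithms, below is a sketch of the Horspool simplification of Boyer-Moore over a single in-memory buffer. It is not the code described above: there is no guard zone, no alignment handling, and no block refill, and the name bmh_find is invented for the example.

```cpp
#include <array>
#include <cstddef>

// Boyer-Moore-Horspool search. Returns the index of the first match of
// pat[0..m) in text[0..n), or n if there is none. (Hypothetical helper.)
std::size_t bmh_find(const unsigned char* text, std::size_t n,
                     const unsigned char* pat, std::size_t m)
{
    if (m == 0) return 0;
    if (m > n) return n;

    // shift[b]: how far the window may move when its last byte is b.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[pat[i]] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        // Compare right to left inside the current window.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;
        if (j == 0) return pos;
        pos += shift[text[pos + m - 1]];
    }
    return n;
}
```

The gain over a plain scan comes from the shift table: when the last byte of the window does not occur in the pattern, the window jumps by the whole pattern length.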
If your OS allows memory mapping of the file, you could do
that and use std::search() with unsigned char * on the whole
thing. That could be the fastest way, but will leave the realm
of standard C++.

If the entire file will fit into memory, perhaps just reading it
all into memory, and then using std::search, would be an
appropriate solution. Or perhaps not: it's often faster to use
a somewhat smaller buffer, and manage the "paging" yourself.
 
James Kanze

#include <iostream>
#include <ostream>
#include <fstream>
#include <iterator>
#include <iomanip>
#include <algorithm>
#include <cassert>
class print_hex {
std::ostream * ostr_ptr;
unsigned int line_length;
unsigned int index;

print_hex ( std::ostream & str_ref, unsigned int length )
: ostr_ptr( &str_ref )
, line_length ( length )
, index ( 0 )
{}
void operator() ( unsigned char ch ) {
++index;
if ( index >= line_length ) {
(*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
<< (unsigned int)(ch) << '\n';
index = 0;
} else {
(*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
<< (unsigned int)(ch) << ' ';

Wouldn't it be preferable to set the formatting flags in the
constructor? I'd also provide an "indent" argument; if index
were 0, I'd output indent spaces, otherwise a single space---or
perhaps the best solution would be to provide a start of line
and a separator string to the constructor, then:

(*ostr_ptr)
    << (inLineCount == 0 ? startString : separString)
    << std::setw( 2 ) << (unsigned int)( ch ) ;
++ inLineCount ;
if ( inLineCount == lineLength ) {
    (*ostr_ptr) << endString ;
    inLineCount = 0 ;
}

(This supposes that hex and fill were set in the constructor.)
Given the copying that's going on, I'd also simulate move
semantics, so that the final destructor could do something like:

if ( inLineCount != 0 ) {
(*ostr_ptr) << endString ;
}
}
}
};

int main ( int argn, char ** args ) {
assert( argn == 2 );
std::ifstream in ( args[1] );
std::for_each( std::istreambuf_iterator< char >( in ),
std::istreambuf_iterator< char >(),
print_hex( std::cout, 25 ) );

Unless you're doing something relatively generic, with support
for different separators, etc., this really looks like a case of
for_each abuse.
std::cout << '\n';

Which results in one new line too many if the number of elements
just happened to be an exact multiple of the line length.

About the only real use for this sort of output I've found is
debugging or experimenting, but there, I use it often enough
that I've a generic Dump<T> class (and a generic function which
returns it, for automatic type deduction), so that I can write
things like:

std::cout << dump( someObject ) << std::endl ;

The code that ends up getting called in the << operator is:

IOSave saver( dest ) ;
dest.fill( '0' ) ;
dest.setf( std::ios::hex, std::ios::basefield ) ;
char const* baseStr = "" ;
if ( (dest.flags() & std::ios::showbase) != 0 ) {
    baseStr = "0x" ;
    dest.unsetf( std::ios::showbase ) ;
}
unsigned char const* const
    end = myObj + sizeof( T ) ;
for ( unsigned char const* p = myObj ; p != end ; ++ p ) {
    if ( p != myObj ) {
        dest << ' ' ;
    }
    dest << baseStr << std::setw( 2 ) << (unsigned int)( *p ) ;
}

(Note that there's extra code there to support my personal
preference: a "0x" with a small x, even if std::ios::uppercase
is specified.)
Have a look at readsome().

Yes, have a look at it. Read its specification very carefully.
Because if you do, you'll realize that it is absolutely
worthless here.

The function he's looking for is istream::gcount(), which
returns the number of bytes read by the last unformatted read.
His basic loop would be:

while ( input.read( &buffer[ 0 ], buffer.size() ) ) {
    process( buffer.begin(), buffer.end() ) ;
}
process( buffer.begin(), buffer.begin() + input.gcount() ) ;

(But IMHO, istream really isn't appropriate for binary; if I'm
really working with a binary file, I'll drop down to the system
API.)
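
Applied to the original hex-dump question, the read()/gcount() pattern could look like the sketch below. The function name hex_dump, the 4096-byte buffer, and the per_line parameter are assumptions made for the example; the input stream is expected to have been opened in binary mode.

```cpp
#include <cstddef>
#include <iomanip>
#include <ios>
#include <istream>
#include <ostream>
#include <vector>

// Writes every byte of `in` to `out` as two hex digits, per_line bytes per
// line. (Hypothetical helper; assumes per_line > 0.)
void hex_dump(std::istream& in, std::ostream& out, std::size_t per_line = 16)
{
    std::vector<char> buffer(4096);
    std::size_t column = 0;

    // Remember the caller's formatting state and restore it at the end.
    const std::ios_base::fmtflags saved_flags = out.flags();
    const char saved_fill = out.fill('0');
    out << std::hex;

    do {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        // gcount() is the number of bytes read, also for the final short read.
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < got; ++i) {
            if (column != 0) out << ' ';
            out << std::setw(2)
                << static_cast<unsigned int>(
                       static_cast<unsigned char>(buffer[i]));
            if (++column == per_line) { out << '\n'; column = 0; }
        }
    } while (in);

    if (column != 0) out << '\n';   // finish a partial last line only
    out.flags(saved_flags);
    out.fill(saved_fill);
}
```

Because the trailing newline is written only for a partial line, an input whose length is an exact multiple of per_line does not get an extra blank line.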
 
James Kanze

"vizzz" wrote in message (e-mail address removed)...
Check out boost::regex

Which requires a forward iterator, and so can't be used on data
in a file (for which he'll have at best an input iterator).

Also, if he's only looking for a fixed string, it's likely to be
significantly slower than some other algorithms.
 
vizzz

Which requires a forward iterator, and so can't be used on data
in a file (for which he'll have at best an input iterator).

Also, if he's only looking for a fixed string, it's likely to be
significantly slower than some other algorithms.

Maybe explaining my goal will help.
JPEG 2000 files (jp2) contain several boxes, each made of a 4-byte length,
a 4-byte type, and then the data.
I must check whether a box exists by searching the file (boxes
can be anywhere in the whole file) for the box type (e.g. 0x650A1010).
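
Given that layout, one alternative to a raw byte search is to walk the boxes and compare only the type fields. The sketch below rests on assumptions that go beyond what is stated here: both fields are taken to be big-endian, the length is taken to include the 8 header bytes, the special length values 0 and 1 are not handled, and boxes nested inside other boxes are not visited. The names read_be32 and has_top_level_box are invented for the example, and the file is assumed to be in memory already.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a 32-bit big-endian value (assumed byte order of the box fields).
static std::uint32_t read_be32(const unsigned char* p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Walks the top-level boxes of an in-memory file and reports whether one of
// them has the given type. (Hypothetical helper, see assumptions above.)
bool has_top_level_box(const std::vector<unsigned char>& file,
                       std::uint32_t wanted_type)
{
    std::size_t pos = 0;
    while (file.size() - pos >= 8) {             // room for length + type
        const std::uint32_t length = read_be32(&file[pos]);
        const std::uint32_t type = read_be32(&file[pos + 4]);
        if (type == wanted_type) return true;
        if (length < 8) break;                   // special or invalid length
        if (length > file.size() - pos) break;   // truncated or corrupt box
        pos += length;                           // jump to the next box
    }
    return false;
}
```

Unlike a plain search, this does not report a hit when the four bytes merely happen to occur inside some box's data.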
 
Kai-Uwe Bux

James said:
Ivan said:
Hmmm... I had a look at this and ran across a simple
problem. How do you read a binary file and just echo the
hex for each byte to the screen? [snip]
The issue is the c++ read function doesn't return number of
bytes read... so on the last read into a buffer how do you
know how many characters to print?
Have a look at readsome().

Yes, have a look at it. Read its specification very carefully.
Because if you do, you'll realize that it is absolutely
worthless here.

I reread it again. I fail to see why it's worthless. Obviously, I am missing
something.
The function he's looking for is istream::gcount(), which
returns the number of bytes read by the last unformatted read.
His basic loop would be:

while ( input.read( &buffer[ 0 ], buffer.size() ) ) {
    process( buffer.begin(), buffer.end() ) ;
}
process( buffer.begin(), buffer.begin() + input.gcount() ) ;

On the other hand, that looks very clean.


Best

Kai-Uwe
 
Mirco Wahab

vizzz said:
Maybe explaining my goal will help.
JPEG 2000 files (jp2) contain several boxes, each made of a 4-byte length,
a 4-byte type, and then the data.
I must check whether a box exists by searching the file (boxes
can be anywhere in the whole file) for the box type (e.g. 0x650A1010).

What is the largest file size and on which system
do you want this to happen?

The C-memchr is, on modern compilers, very very
fast (it does 8 byte alignment on the pointer,
scans 32 or 64 bit at a time by bit ops and so on.)

You can't simply beat that one. Read the file
as a block (fread after stat(), ftell/SEEK_END)
or in chunks and find the first byte (and compare
the rest).

Otherwise, you could give memcmp() a shot
http://www.cplusplus.com/reference/clibrary/cstring/memcmp.html
Maybe it's optimized as heavily as memchr() is.
I haven't looked into this, but I know that memchr()
gets about double the speed of the
naive implementation: if(*p == *q) ...

But if you can't slurp the whole file
into memory at once, you of course have to
deal with the possibility of a pattern
split across a read-block boundary.
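
The "find the first byte, then compare the rest" idea can be sketched as follows, assuming the data is already in one contiguous buffer. The name find_bytes is hypothetical, and whether this beats other approaches would have to be measured.

```cpp
#include <cstddef>
#include <cstring>

// Returns a pointer to the first occurrence of pat[0..m) in data[0..n),
// or nullptr if there is none. (Hypothetical helper.)
const unsigned char* find_bytes(const unsigned char* data, std::size_t n,
                                const unsigned char* pat, std::size_t m)
{
    if (m == 0 || m > n) return nullptr;

    const unsigned char* p = data;
    const unsigned char* const last_start = data + (n - m);  // last possible match start
    while (p <= last_start) {
        // Let memchr find the next candidate for the first pattern byte.
        p = static_cast<const unsigned char*>(
            std::memchr(p, pat[0],
                        static_cast<std::size_t>(last_start - p) + 1));
        if (p == nullptr) return nullptr;
        // Candidate found: compare the whole pattern.
        if (std::memcmp(p, pat, m) == 0) return p;
        ++p;
    }
    return nullptr;
}
```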

Regards

M.
 
Kai-Uwe Bux

James said:
Wouldn't it be preferable to set the formatting flags in the
constructor?
Yup.

I'd also provide an "indent" argument; if index
were 0, I'd output indent spaces, otherwise a single space---or
perhaps the best solution would be to provide a start of line
and a separator string to the constructor, then:

Good idea.

(*ostr_ptr)
    << (inLineCount == 0 ? startString : separString)
    << std::setw( 2 ) << (unsigned int)( ch ) ;
++ inLineCount ;
if ( inLineCount == lineLength ) {
    (*ostr_ptr) << endString ;
    inLineCount = 0 ;
}

(This supposes that hex and fill were set in the constructor.)
Given the copying that's going on, I'd also simulate move
semantics, so that the final destructor could do something like:

if ( inLineCount != 0 ) {
(*ostr_ptr) << endString ;
}
}
}
};

int main ( int argn, char ** args ) {
assert( argn == 2 );
std::ifstream in ( args[1] );
std::for_each( std::istreambuf_iterator< char >( in ),
std::istreambuf_iterator< char >(),
print_hex( std::cout, 25 ) );

Unless you're doing something relatively generic, with support
for different separators, etc., this really looks like a case of
for_each abuse.

Actually, with regard to for_each, I am growing more and more comfortable
using it. Of all algorithms, for_each seems the most silly; on the other
hand it is also the one that has the largest potential for specialized
versions that take advantage of internal knowledge about the underlying
sequence. E.g., I can easily imagine a special version for iterators into a
deque (where for_each would iterate over pages and within each page would
use a very fast loop using T* where it can skip the test for reaching a
page end). Similar optimizations should be possible for stream iterators.

Which results in one new line too many if the number of elements
just happened to be an exact multiple of the line length.

You are making up specs :)

But seriously: you are right, of course.

About the only real use for this sort of output I've found is
debugging or experimenting, but there, I use it often enough
that I've a generic Dump<T> class (and a generic function which
returns it, for automatic type deduction), so that I can write
things like:

std::cout << dump( someObject ) << std::endl ;
[snip]

Hm, I never had a use for hex dumping objects. But, maybe I should try that
out.


Best

Kai-Uwe Bux
 
James Kanze

James said:
Ivan wrote:
Hmmm... I had a look at this and ran across a simple
problem. How do you read a binary file and just echo the
hex for each byte to the screen? [snip]
The issue is the c++ read function doesn't return number of
bytes read... so on the last read into a buffer how do you
know how many characters to print?
Have a look at readsome().
Yes, have a look at it. Read its specification very carefully.
Because if you do, you'll realize that it is absolutely
worthless here.
I reread it again. I fail to see why it's worthless.
Obviously, I am missing something.

It will read a maximum of streambuf::in_avail characters. If
there are no characters in the buffer, streambuf::in_avail calls
showmanyc. And by default, all showmanyc does is return 0. An
implementation of filebuf may do more, if the system supports
some means of finding out exactly how many characters are in the
file, but it's not required to. Which means that basically,
readsome() may stop (returning 0 characters read) as soon as
there are no more characters in the buffer.
 
James Kanze

Kai-Uwe Bux wrote:

[...]
Actually, with regard to for_each, I am growing more and more
comfortable using it.

I'm actually pretty comfortable using it too. Regretfully, we
seem to be a minority, and the programmers having to maintain my
code find it "unnatural", and that it hurts readability, to move
the contents of a loop out into a separate class. Unless that
class is in some way "reusable", i.e. it represents some more
general application.

[...]
You are making up specs :)

You started it:). You decided that he needed newlines in the
sequence to begin with. (OK: somebody did say something about
megabytes somewhere. But maybe he has a very, very wide
screen.)
But seriously: you are right, of course.
About the only real use for this sort of output I've found is
debugging or experimenting, but there, I use it often enough
that I've a generic Dump<T> class (and a generic function which
returns it, for automatic type deduction), so that I can write
things like:
std::cout << dump( someObject ) << std::endl ;

Hm, I never had a use for hex dumping objects. But, maybe I
should try that out.

I didn't really, for the longest time (which is why it isn't at
my site---I only added it to the library very recently). Even
now, most of its use is for "experimenting": for trying to guess
the representation of some type in an undocumented format, for
example.

On the other hand, if I ever find time to write up an article on
how to correctly use iostream, I'll probably include it, because
it is a good example of how to handle arbitrary formatting for
any possible type.
 
James Kanze

What is the largest file size and on which system do you want
this to happen?
The C-memchr is, on modern compilers, very very fast (it does
8 byte alignment on the pointer, scans 32 or 64 bit at a time
by bit ops and so on.)

Maybe. I'm not familiar with the jpeg format, but somehow, I'd
be a bit surprised if the 4 byte value isn't required to be
aligned. And if it's aligned, treating the buffer as an array
of uint32_t, and using std::find, will almost certainly be
significantly faster than memchr.
You can't simply beat that one.

Actually, you almost always can.
 
Ivan

(But IMHO, istream really isn't appropriate for binary; if I'm
really working with a binary file, I'll drop down to the system
API.)

That's exactly what I was thinking, but I wasn't sure if it was just
my lack of C++ knowledge that made it a pain to read binary data with
istream.

Thanks,
Ivan Novick
http://www.mycppquiz.com/
 
