regular expression too big

P

Peter Schrammel

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter
 
P

Peter Schrammel

Jeffrey said:
You could optimize the regex a little for size, e.g. by factoring out
common prefixes:

(b(l(a|ub)|ar)|foo)...

Thought of that.

Of course, that will only help if the | alternatives have a reasonable
amount of redundancy. Alternatively, you could just break the whole
thing into multiple expressions. Instead of

if /first_part|second_part/ =~ text

You could try:

if /first_part/ =~ text or /second_part/ =~ text

Yes, that was my next thought but where to split? Just count the bytes
and splitt near 1 <<16?


Why is there a limitation at all? I implemented the same thing in perl
and it no complains ...
Is the regexp engine of perl that much better?

Thanks for the reply
 
L

Louis J Scoras

This doesn't really help with the actual question of getting past
regex size limits, but are you sure that regexen are correct solution
to this problem? Unless I'm mistaken, the above match is going to be
horribly, painfully slow; the issue you're running into is probably an
indication that you might want to look elsewhere.

I don't want to make any silly claims without doing any benchmarking
on your data, but I would imagine that even doing an efficient search
on a sorted array of your words would give you better performance than
the regex search. You can move up from there into hashes or other
data structures for this sort of thing.
 
K

Ken Bloom

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter

Maybe a trie would be useful?
http://rubyforge.org/projects/trie/
(or there's another trie at http://kzk9.net/software/miscprograms/ruby/)

--Ken
 
P

Peter Schrammel

When you post an question like this, it is always a good idea to
reveal the
purpose as well as the method.

First of all: Thanks for the replies, I think I have enough input to
chew on.

And to reveal the purpose:
I'd like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

The comparison with Perl was just a comparison of the regexp engines and
not the language itself. It was just that until now I had never to think
much about the underlying hardware .... because ruby did that for me. So
I was surprised by the "too big exception".

Peter
 
K

Ken Bloom

If you are going to compare strings to strings as well as strings to
regexes, maybe a two-tiered scheme would be better. First tier, simple
textual comparison using a hash of strings. Second tier, regexes.

This removes the ambiguity that a particular data string might pass a string
comparison but fail any of the provided regexes, or the opposite. It will
also be much faster than a gigantic list of strings and regexes, all
presented to the regex engine as though they were all regular expressions,
even though some are simple string comparisons.

I think this (e.g speed improvement) would have been true in Perl also.

His regex isn't anchored to the beginning and end of the string. This makes
hashes useless for the kind of comparison he seems to want to perform.

--Ken
 
G

Gabriele Marrone

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd
like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes
for
regexes. Is there a way around this (without going down to 2000
words) ?

I friend of mine told me to suggest you to system() a perl script
which does the job :p

However, since he was just kidding, I think I'll suggest you
something like this:


class String
def include_at_least_one_of?(words)
words.find { |w| include? w }
end
end

my_words = %w(bla blub foo bar)

"zump blub asd".include_at_least_one_of? my_words # => "blub"
"nah none included".include_at_least_one_of? my_words # => nil


You probably don't want to hack core classes in order to do this, but
you get the idea.
 
J

James Edward Gray II

class String
def include_at_least_one_of?(words)
words.find { |w| include? w }
end
end

Just a tiny style related suggestion. I would use any?() instead of
find() there.

James Edward Gray II
 
P

Peter Schrammel

Paul said:
Yes, unless he is matching whole words, he is stuck with regexes. There is
very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.
Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.
The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 ...
This went well until I tested with some 100 words. Now I have the 'big
regexp problem' problem. The solution has to work with regexps as words
as well like: word1(the|a|an)word11|regexp2|...
So the currently working solutions are:
-use ruby 1.9 (the performance is far below perl, but I have to
benchmark this.)
-loop through the words/regexp and match each of them on the string
(don't ask for performance here).
-perhaps I'll realy pipe the problem into a
perl-implemented-matching-server (not that bad idea)

peter
 
S

Simon Strandgaard

Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.
The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 ...

Is it exact word matching?

if so then you can split the input-string on blankspace,
so you get an array of words. Then you can sort the array of words

irb(main):001:0> a = %w(a b c e f g)
=> ["a", "b", "c", "e", "f", "g"]
irb(main):002:0> b = %w(b d f)
=> ["b", "d", "f"]
irb(main):003:0> a & b
=> ["b", "f"]
 
P

Peter Schrammel

Paul said:
You are not being very clear. Do you mean to match entire words, letter for
letter, beginning to end, at least some of the time? If so, for those
cases, you can use a hash table that is preloaded with the words as keys.
That will be fast.

no its a substring match kind of /hello/.match("dear customer id like to
say hello to ...."). I have to find 70000 words in a large textfile:
if lotofwords.match(bigtext) .....
I need a trigger to classify the bitext depending on wether any of the
word (later regexps) are in it or not. Tries work well for substring
matching in this case but fail with regexps.
Also precompile the regexes before use. You probably already know this, but
I thought I would mention it anyway.

yes, done that.
Another option is C++, which has a readily available regexp library. The way
I would go about this is to design the entire thing in Ruby, then, if the
speed was not acceptable, recreate it in C++. This gives you the advantage
of speedy development in Ruby, followed by speedy execution.

If this is a full-on language analysis problem, you really should be using
Prolog or Lisp anyway. If the problem really is as complex as you are
hinting at, you may not be using the right language, or even class of
language.
yes, but I wanted to try it with "my favorit" language first before
doing something wild. It's a rails application so I'd have to call
lisp/perl/.... programms, interfacing them (calling on each test would
be a mess because reading somethousand lines every time is a performance
killer.)

Anyway, thanks for the replies I'll give a perl interface a try (due to
some other reasons like robust html parsing ...).
 
B

brabuhr

Jeffrey said:
Have you seen:
"Converts a list of words to a regular expression with minimum
backtracking by joining words with common prefixes. It is a port
of the Perl module MakeRegex.pm by Hakan Kjellerstrand with
some improvements."
http://raa.ruby-lang.org/project/makeregex/

YMMV; I have never used it on anything like the scale you are.

In a little bit of testing here, it goes to long after about 8,000
words.
require 'makeregex'

20.times do |n|
words = IO.readlines("/usr/share/dict/words")[0..(2 ** n)]

start = Time.now

r = Regexp.make(words)

finish = Time.now

puts "Took #{finish - start} seconds to convert #{words.size} words into a regex #{r.size} bytes long."

"FOO".match(r)
end

Took 0.000372 seconds to convert 2 words into a regex 20 bytes long.
Took 0.000285 seconds to convert 3 words into a regex 25 bytes long.
Took 0.000359 seconds to convert 5 words into a regex 51 bytes long.
Took 0.000493 seconds to convert 9 words into a regex 86 bytes long.
Took 0.000973 seconds to convert 17 words into a regex 157 bytes long.
Took 0.001773 seconds to convert 33 words into a regex 285 bytes long.
Took 0.005386 seconds to convert 65 words into a regex 491 bytes long.
Took 0.00823 seconds to convert 129 words into a regex 933 bytes long.
Took 0.019234 seconds to convert 257 words into a regex 1876 bytes long.
Took 0.042557 seconds to convert 513 words into a regex 3856 bytes long.
Took 0.09146 seconds to convert 1025 words into a regex 7807 bytes long.
Took 0.196851 seconds to convert 2049 words into a regex 15669 bytes long.
Took 0.399155 seconds to convert 4097 words into a regex 32325 bytes long.
Took 0.968776 seconds to convert 8193 words into a regex 64671 bytes long.
foo:14:in `match': regular expression too big:
/(?:1(?:0(?:80\n|\-point\n|th\n)|...
 
B

brabuhr

In a little bit of testing here, it goes to long after about 8,000
words.
require 'makeregex'

20.times do |n|
words = IO.readlines("/usr/share/dict/words")[0..(2 ** n)]

Of course, that isn't the best sample of words to test with.

Here is a revised version (fails with a smaller word count):
require 'makeregex'

class Array
def sample n
a = []; s = self.size
n.times do
a << self[rand(s)].chomp
end
a
end
end

words = IO.readlines("/usr/share/dict/words")

20.times do |n|
w = words.sample(2 ** n)

start = Time.now

r = Regexp.make(w)

finish = Time.now

puts "Took #{finish - start} seconds to convert #{w.size} words into a regex #{r.size} bytes long."

"FOO".match(r)
end

Took 9.9e-05 seconds to convert 1 words into a regex 9 bytes long.
Took 0.000163 seconds to convert 2 words into a regex 18 bytes long.
Took 0.000169 seconds to convert 4 words into a regex 42 bytes long.
Took 0.000299 seconds to convert 8 words into a regex 80 bytes long.
Took 0.000618 seconds to convert 16 words into a regex 176 bytes long.
Took 0.001366 seconds to convert 32 words into a regex 324 bytes long.
Took 0.003021 seconds to convert 64 words into a regex 689 bytes long.
Took 0.007129 seconds to convert 128 words into a regex 1391 bytes long.
Took 0.019143 seconds to convert 256 words into a regex 2740 bytes long.
Took 0.149488 seconds to convert 512 words into a regex 5295 bytes long.
Took 0.304324 seconds to convert 1024 words into a regex 10236 bytes long.
Took 0.307832 seconds to convert 2048 words into a regex 19755 bytes long.
Took 0.404071 seconds to convert 4096 words into a regex 37308 bytes long.
foo:26:in `match': regular expression too big:
/(?:A(?:ME|c(?:arapis|kley|mispon)|...
 
G

Giles Bowkett

If this is a full-on language analysis problem, you really should be using
yes, but I wanted to try it with "my favorit" language first before
doing something wild. It's a rails application so I'd have to call
lisp/perl/.... programms, interfacing them (calling on each test would
be a mess because reading somethousand lines every time is a performance
killer.)

Anyway, thanks for the replies I'll give a perl interface a try (due to
some other reasons like robust html parsing ...).

there's got to be a better way to do this. I recently attempted to
implement a particle simulation algorithm for arranging nodes in a
graph in Rails, before I came to my senses.

solving this problem with regexes sounds like a nightmare on the
maintenance front. I think it's possible that you're using the wrong
language but more likely that you're just using the wrong technique.
 
B

brabuhr

Irrespective of whether regex the best solution for your needs, it seems
Oniguruma will improve the situation somewhat with respect to large
regular expressions.

I built a local version of 1.8.5 with the oniguruma engine:
http://raa.ruby-lang.org/project/oniguruma/

And re-ran (a slight variation of) my test program:

[~]$ ruby foo
Using the <undefined> regex engine.
Converted a list of 1 words into a regex 8 bytes long.
Converted a list of 2 words into a regex 36 bytes long.
Converted a list of 4 words into a regex 48 bytes long.
Converted a list of 8 words into a regex 73 bytes long.
Converted a list of 16 words into a regex 173 bytes long.
Converted a list of 32 words into a regex 352 bytes long.
Converted a list of 64 words into a regex 718 bytes long.
Converted a list of 128 words into a regex 1415 bytes long.
Converted a list of 256 words into a regex 2656 bytes long.
Converted a list of 512 words into a regex 5210 bytes long.
Converted a list of 1024 words into a regex 10105 bytes long.
Converted a list of 2048 words into a regex 19432 bytes long.
Converted a list of 4096 words into a regex 37509 bytes long.
@_@

[~]$ /usr/local/bin/ruby foo
Using the Oniguruma regex engine.
Converted a list of 1 words into a regex 11 bytes long.
Converted a list of 2 words into a regex 16 bytes long.
Converted a list of 4 words into a regex 38 bytes long.
Converted a list of 8 words into a regex 97 bytes long.
Converted a list of 16 words into a regex 185 bytes long.
Converted a list of 32 words into a regex 359 bytes long.
Converted a list of 64 words into a regex 686 bytes long.
Converted a list of 128 words into a regex 1387 bytes long.
Converted a list of 256 words into a regex 2715 bytes long.
Converted a list of 512 words into a regex 5264 bytes long.
Converted a list of 1024 words into a regex 10074 bytes long.
Converted a list of 2048 words into a regex 19439 bytes long.
Converted a list of 4096 words into a regex 37452 bytes long.
Converted a list of 8192 words into a regex 71931 bytes long.
Converted a list of 16384 words into a regex 135572 bytes long.
Converted a list of 32768 words into a regex 253027 bytes long.
Converted a list of 65536 words into a regex 461607 bytes long.
Converted a list of 131072 words into a regex 808171 bytes long.
Converted a list of 262144 words into a regex 1326345 bytes long.
Converted a list of 479625 words into a regex 1873539 bytes long.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,999
Messages
2,570,243
Members
46,836
Latest member
login dogas

Latest Threads

Top