Irrespective of whether a regex is the best solution for your needs, it
seems Oniguruma will improve the situation somewhat with respect to large
regular expressions.
I built a local version of Ruby 1.8.5 with the Oniguruma engine:
http://raa.ruby-lang.org/project/oniguruma/
And re-ran (a slight variation of) my test program:
[~]$ ruby foo
Using the <undefined> regex engine.
Converted a list of 1 words into a regex 8 bytes long.
Converted a list of 2 words into a regex 36 bytes long.
Converted a list of 4 words into a regex 48 bytes long.
Converted a list of 8 words into a regex 73 bytes long.
Converted a list of 16 words into a regex 173 bytes long.
Converted a list of 32 words into a regex 352 bytes long.
Converted a list of 64 words into a regex 718 bytes long.
Converted a list of 128 words into a regex 1415 bytes long.
Converted a list of 256 words into a regex 2656 bytes long.
Converted a list of 512 words into a regex 5210 bytes long.
Converted a list of 1024 words into a regex 10105 bytes long.
Converted a list of 2048 words into a regex 19432 bytes long.
Converted a list of 4096 words into a regex 37509 bytes long.
@_@
[~]$ /usr/local/bin/ruby foo
Using the Oniguruma regex engine.
Converted a list of 1 words into a regex 11 bytes long.
Converted a list of 2 words into a regex 16 bytes long.
Converted a list of 4 words into a regex 38 bytes long.
Converted a list of 8 words into a regex 97 bytes long.
Converted a list of 16 words into a regex 185 bytes long.
Converted a list of 32 words into a regex 359 bytes long.
Converted a list of 64 words into a regex 686 bytes long.
Converted a list of 128 words into a regex 1387 bytes long.
Converted a list of 256 words into a regex 2715 bytes long.
Converted a list of 512 words into a regex 5264 bytes long.
Converted a list of 1024 words into a regex 10074 bytes long.
Converted a list of 2048 words into a regex 19439 bytes long.
Converted a list of 4096 words into a regex 37452 bytes long.
Converted a list of 8192 words into a regex 71931 bytes long.
Converted a list of 16384 words into a regex 135572 bytes long.
Converted a list of 32768 words into a regex 253027 bytes long.
Converted a list of 65536 words into a regex 461607 bytes long.
Converted a list of 131072 words into a regex 808171 bytes long.
Converted a list of 262144 words into a regex 1326345 bytes long.
Converted a list of 479625 words into a regex 1873539 bytes long.