Help requested -- regexp

  • Thread starter Srinivas Jonnalagadda

Srinivas Jonnalagadda

I have a text file with each line representing a 'record'. The line
has tab-separated 'field=value' values. *Note*: Not all fields are
mandatory in all records.

This text file is rather large, and is auto-dumped from another
application.

Now, I am trying to provide a quick interface to my users to query
this file. The interface I am looking at is something along the lines of

(field1 > 4) and ((field4 < 2.25) or (field3 > 8.0))

to be typed in the query shell.

I had first written something similar to:

class Field

  ...

  def gen_regexp(cond)
    regexp = "(md = /#{@name}=(.+?)/.match(_line); md and "
    regexp += cond.gsub(/#{@name}/, 'md[1].' + @converter_method)  # 'to_i' or 'to_f'
    regexp += ')'
  end

  ...

end
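
For example, for a single-field condition like field1 > 4 (with
@converter_method set to 'to_i'), this produces a fragment roughly like:

(md = /field1=(.+?)/.match(_line); md and md[1].to_i > 4)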

This 'generated' regexp would then be substituted in the place of the
corresponding parenthesized 'condition'. Once all such substitutions
are complete, a query block is generated as:

query_blk = eval("lambda { |_line| #{final_regexp} }")

This query block is then used in a conventional 'select' on the lines.
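
A minimal sketch of that step, with a made-up file name (for a dump
this large one would probably stream the file with File.foreach rather
than slurp it with readlines):

matches = File.readlines("dump.txt").select(&query_blk)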

It worked for the likes of the example above, but started having
problems for clauses with multiple fields in each 'condition', like:

(field1 > 4) and ((field4 < 2.25) or (field3 + field8 > 8.0))

since the above substitution logic is at an individual field level.

Suggestions please. Thanks!

Best regards,

JS
 

James Edward Gray II

Srinivas Jonnalagadda said:

I have a text file with each line representing a 'record'. The line
has tab-separated 'field=value' values. *Note*: Not all fields are
mandatory in all records.

Well, that's crying out to be a Hash, right?

Hash[*line.split("\t").map { |f| f.split("=") }.flatten]
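
For a line like "field1=4\tfield3=9.5" that gives
{"field1"=>"4", "field3"=>"9.5"}, with the values still strings to be
converted as needed.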

Srinivas Jonnalagadda said:

This text file is rather large, and is auto-dumped from another
application.

You may not want to preload it then, if the users don't mind waiting
for query() (or whatever) to check it line by line.

Srinivas Jonnalagadda said:

Now, I am trying to provide a quick interface to my users to query
this file. The interface I am looking at is something along the lines of

(field1 > 4) and ((field4 < 2.25) or (field3 > 8.0))

to be typed in the query shell.

You could use irb for the "shell" and Ruby's block syntax for the
query. If you want the fields to work as you show them above, use a
little method_missing magic.
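
Something along these lines, as a rough sketch (the Record name and the
number-conversion rules here are just placeholders, not tested against
your data):

class Record
  def initialize(line)
    # tab-separated 'field=value' pairs; optional fields simply have no key
    @fields = Hash[*line.chomp.split("\t").map { |f| f.split("=", 2) }.flatten]
  end

  def method_missing(name, *args)
    value = @fields[name.to_s]
    return super if value.nil?              # unknown or missing field
    case value
    when /\A-?\d+\z/      then value.to_i   # looks like an integer
    when /\A-?\d+\.\d+\z/ then value.to_f   # looks like a float
    else value                              # leave anything else a string
    end
  end
end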

Srinivas Jonnalagadda said:

I had first written something similar to:

class Field

  ...

  def gen_regexp(cond)
    regexp = "(md = /#{@name}=(.+?)/.match(_line); md and "
    regexp += cond.gsub(/#{@name}/, 'md[1].' + @converter_method)  # 'to_i' or 'to_f'
    regexp += ')'
  end

  ...

end

This 'generated' regexp would then be substituted in the place of the
corresponding parenthesized 'condition'. Once all such substitutions
are complete, a query block is generated as:

query_blk = eval("lambda { |_line| #{final_regexp} }")

This query block is then used in a conventional 'select' on the lines.

It worked for the likes of the example above, but started having
problems for clauses with multiple fields in each 'condition', like:

(field1 > 4) and ((field4 < 2.25) or (field3 + field8 > 8.0))

since the above substitution logic is at an individual field level.

If you want to express complete relationships, just use Ruby code as
described above.
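
For instance, with the Record sketch above, the multi-field clause that
breaks the substitution approach is just an ordinary expression inside
the block (file name made up; records missing a queried field are
skipped):

matches = []
File.foreach("dump.txt") do |line|
  r = Record.new(line)
  begin
    if r.field1 > 4 and (r.field4 < 2.25 or r.field3 + r.field8 > 8.0)
      matches << line
    end
  rescue NoMethodError
    # a sparse record without one of the queried fields
  end
end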

Srinivas Jonnalagadda said:

Suggestions please. Thanks!

Those are my best thoughts. Hope it helps.

James Edward Gray II
 

William Ramirez

Personally, and maybe because I'm a product of different experiences, I'd
approach this problem differently. I see your problem and it just screams to
me "database".

Is there a reason you couldn't parse the text file and upload it to some
sort of SQL database and then let your users query that?
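
For what it's worth, a rough sketch with the sqlite3 gem (the table
layout, column names and file names are invented for illustration; the
real schema would mirror the dump's fields):

require "sqlite3"

db = SQLite3::Database.new("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records (field1 INTEGER, field3 REAL, field4 REAL)")

File.foreach("dump.txt") do |line|
  row = Hash[*line.chomp.split("\t").map { |f| f.split("=", 2) }.flatten]
  db.execute("INSERT INTO records (field1, field3, field4) VALUES (?, ?, ?)",
             [row["field1"], row["field3"], row["field4"]])
end

Your users could then be given a thin front end over plain SELECTs.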

 

Srinivas Jonnalagadda

William said:
Personally, and maybe because I'm a product of different experiences, I'd
approach this problem differently. I see your problem and it just screams to
me "database".

Is there a reason you couldn't parse the text file and upload it to some
sort of SQL database and then let your users query that?

Yes. Each dump file is between 800 MB and 1.4 GB. I had indeed set up
a database to hold this data. Here are a few reasons why I tried the
approach that I did:

1. The 'load' (with all relational constraints turned off) was still
taking an enormous amount of time (of the order of 3-4 hours per dump
file).

2. The dump file's schema changes rather frequently. This introduced
frequent DBA overhead to keep the database schema synchronized.

3. The sparse nature of the data (not all fields being mandatory in
all records) has resulted in a high storage overhead (about 2.5X).

4. SQL queries on the resulting database are not a serious option
since my users do not know SQL, and would not be able to interpret
any error diagnostics.

5. When wrapped with objects in Ruby, the same queries take 6-10X
longer (as compared to the regular expression approach). Profiling
shows this to be mostly because of object creation/initialization
overhead.

And, to answer James' question -- it was indeed a hash that was
employed in each object.

So, I sought a solution that worked faster, and ended up with the
current regular expression approach.

Hope that clarifies your question.

Best regards,

JS
 
