Help requested -- regexp

  • Thread starter Srinivas Jonnalagadda

Srinivas Jonnalagadda

I have a text file with each line representing a 'record'. The line
has tab-separated 'field=value' values. *Note*: Not all fields are
mandatory in all records.

This text file is rather large, and is auto-dumped from another
application.

Now, I am trying to provide a quick interface to my users to query
this file. The interface I am looking at is something along the lines of

(field1 > 4) and ((field4 < 2.25) or (field3 > 8.0))

to be typed in the query shell.

I had first written something similar to:

class Field

  ...

  def gen_regexp(cond)
    regexp = "(md = /#{@name}=(.+?)/.match(_line); md and "
    regexp += cond.gsub(/#{@name}/, 'md[1].' + @converter_method)  # 'to_i' or 'to_f'
    regexp += ')'
  end

  ...

end
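
For example, for a single-field condition like field1 > 4 (with
@converter_method set to 'to_i'), this produces a fragment roughly like:

(md = /field1=(.+?)/.match(_line); md and md[1].to_i > 4)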

This 'generated' regexp would then be substituted in the place of the
corresponding parenthesized 'condition'. Once all such substitutions
are complete, a query block is generated as:

query_blk = eval("lambda { |_line| #{final_regexp} }")

This query block is then used in a conventional 'select' on the lines.
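
A minimal sketch of that step, with a made-up file name (for a dump
this large one would probably stream the file with File.foreach rather
than slurp it with readlines):

matches = File.readlines("dump.txt").select(&query_blk)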

It worked for the likes of the example above, but started having
problems for clauses with multiple fields in each 'condition', like:

(field1 > 4) and ((field4 < 2.25) or (field3 + field8 > 8.0))

since the above substitution logic is at an individual field level.

Suggestions please. Thanks!

Best regards,

JS
 

James Edward Gray II

Srinivas Jonnalagadda said:

I have a text file with each line representing a 'record'. The line
has tab-separated 'field=value' values. *Note*: Not all fields are
mandatory in all records.

Well, that's crying out to be a Hash, right?

Hash[*line.split("\t").map { |f| f.split("=") }.flatten]
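
For a line like "field1=4\tfield3=9.5" that gives
{"field1"=>"4", "field3"=>"9.5"}, with the values still strings to be
converted as needed.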

Srinivas Jonnalagadda said:

This text file is rather large, and is auto-dumped from another
application.

You may not want to preload it then, if the users don't mind waiting
for query() (or whatever) to check it line by line.

Srinivas Jonnalagadda said:

Now, I am trying to provide a quick interface to my users to query
this file. The interface I am looking at is something along the lines of

(field1 > 4) and ((field4 < 2.25) or (field3 > 8.0))

to be typed in the query shell.

You could use irb for the "shell" and Ruby's block syntax for the
query. If you want the fields to work as you show them above, use a
little method_missing magic.
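
Something along these lines, as a rough sketch (the Record name and the
number-conversion rules here are just placeholders, not tested against
your data):

class Record
  def initialize(line)
    # tab-separated 'field=value' pairs; optional fields simply have no key
    @fields = Hash[*line.chomp.split("\t").map { |f| f.split("=", 2) }.flatten]
  end

  def method_missing(name, *args)
    value = @fields[name.to_s]
    return super if value.nil?              # unknown or missing field
    case value
    when /\A-?\d+\z/      then value.to_i   # looks like an integer
    when /\A-?\d+\.\d+\z/ then value.to_f   # looks like a float
    else value                              # leave anything else a string
    end
  end
end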

Srinivas Jonnalagadda said:

I had first written something similar to:

class Field

  ...

  def gen_regexp(cond)
    regexp = "(md = /#{@name}=(.+?)/.match(_line); md and "
    regexp += cond.gsub(/#{@name}/, 'md[1].' + @converter_method)  # 'to_i' or 'to_f'
    regexp += ')'
  end

  ...

end

This 'generated' regexp would then be substituted in the place of the
corresponding parenthesized 'condition'. Once all such substitutions
are complete, a query block is generated as:

query_blk = eval("lambda { |_line| #{final_regexp} }")

This query block is then used in a conventional 'select' on the lines.

It worked for the likes of the example above, but started having
problems for clauses with multiple fields in each 'condition', like:

(field1 > 4) and ((field4 < 2.25) or (field3 + field8 > 8.0))

since the above substitution logic is at an individual field level.

If you want to express complete relationships, just use Ruby code as
described above.
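
For instance, with the Record sketch above, the multi-field clause that
breaks the substitution approach is just an ordinary expression inside
the block (file name made up; records missing a queried field are
skipped):

matches = []
File.foreach("dump.txt") do |line|
  r = Record.new(line)
  begin
    if r.field1 > 4 and (r.field4 < 2.25 or r.field3 + r.field8 > 8.0)
      matches << line
    end
  rescue NoMethodError
    # a sparse record without one of the queried fields
  end
end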

Srinivas Jonnalagadda said:

Suggestions please. Thanks!

Those are my best thoughts. Hope it helps.

James Edward Gray II
 

William Ramirez

Personally, and maybe because I'm a product of different experiences, I'd
approach this problem differently. I see your problem and it just screams to
me "database".

Is there a reason you couldn't parse the text file and upload it to some
sort of SQL database and then let your users query that?
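
For what it's worth, a rough sketch with the sqlite3 gem (the table
layout, column names and file names are invented for illustration; the
real schema would mirror the dump's fields):

require "sqlite3"

db = SQLite3::Database.new("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records (field1 INTEGER, field3 REAL, field4 REAL)")

File.foreach("dump.txt") do |line|
  row = Hash[*line.chomp.split("\t").map { |f| f.split("=", 2) }.flatten]
  db.execute("INSERT INTO records (field1, field3, field4) VALUES (?, ?, ?)",
             [row["field1"], row["field3"], row["field4"]])
end

Your users could then be given a thin front end over plain SELECTs.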

 

Srinivas Jonnalagadda

William said:
Personally, and maybe because I'm a product of different experiences, I'd
approach this problem differently. I see your problem and it just screams to
me "database".

Is there a reason you couldn't parse the text file and upload it to some
sort of SQL database and then let your users query that?

Yes. Each dump file is between 800 MB and 1.4 GB. I had indeed set up
a database to hold this data. Here are a few reasons why I tried the
approach that I did:

1. The 'load' (with all relational constraints turned off) was still
taking an enormous amount of time (of the order of 3-4 hours per dump
file).

2. The dump file's schema changes rather frequently. This introduced
frequent DBA overhead to keep the database schema synchronized.

3. The sparse nature of the data (not all fields being mandatory in
all records) has resulted in a high storage overhead (about 2.5X).

4. SQL queries on the resulting database are not a serious option
since my users do not know SQL, and would not be able to interpret
any error diagnostics.

5. When wrapped with objects in Ruby, the same queries take 6-10X
longer (as compared to the regular expression approach). Profiling
shows this to be mostly because of object creation/initialization
overhead.

And, to answer James' question -- it was indeed a hash that was
employed in each object.

So, I sought a solution that worked faster, and ended up with the
current regular expression approach.

Hope that clarifies your question.

Best regards,

JS
 
