joh12005
Hello,
here is a problem I ran into; I would like to solve it with Python,
even though I still have no clue how to go about it.
I have many small "text" files, so to speed up processing I copied
them all into one huge file, adding a kind of XML separator:
<file name="...">
[content]
</file>
The content is tab-separated data (columns); the data are strings.
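For reference, a minimal sketch of how such a concatenated file could be read back into per-file blocks of rows, assuming the separators look exactly as above and never appear inside the content itself (the function and variable names are illustrative):

```python
import re

# Assumption: every block is exactly <file name="...">\n...\n</file>.
FILE_RE = re.compile(r'<file name="(?P<name>[^"]*)">\n(?P<body>.*?)\n</file>',
                     re.DOTALL)

def iter_files(text):
    """Yield (name, rows) pairs; each row is a list of column strings."""
    for m in FILE_RE.finditer(text):
        rows = [line.split('\t') for line in m.group('body').split('\n')]
        yield m.group('name'), rows

sample = '<file name="a.txt">\nfoo\tbar\nbaz\tqux\n</file>\n'
for name, rows in iter_files(sample):
    print(name, rows)   # a.txt [['foo', 'bar'], ['baz', 'qux']]
```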
Now here comes the tricky part for me:
I would like to be able to create matching rules, using regular
expressions. A rule should match data on one line (the smallest data
unit for me) or on a set of lines, for example:
if on this line the first column matches this regexp and the second
column matches that one,
and on the following line the third column matches,
-> trigger something.
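One possible representation of such a rule, just as a sketch: a list of per-line dicts mapping a column index to a compiled regexp (the concrete patterns below are made up):

```python
import re

# Hypothetical rule: one dict per line; every listed column must match.
rule = [
    {0: re.compile(r'^foo'), 1: re.compile(r'\d+')},  # on this line
    {2: re.compile(r'^bar')},                         # on the following line
]

def rule_matches(rule, lines):
    """True if each per-line dict matches the corresponding row of columns."""
    if len(lines) < len(rule):
        return False
    for patterns, row in zip(rule, lines):
        for col, pat in patterns.items():
            if col >= len(row) or not pat.search(row[col]):
                return False
    return True
```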
So, here is what I tried:
- gather all the rules,
- build some kind of analyzer for each rule,
- keep the length L of the longest rule,
- read the huge file line by line,
- inside each "file", build all the line subsets of length <= L,
- for each analyzer, check whether it matches any of the subsets,
- if it does, trigger the action.
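The steps above, as a naive sketch (this is roughly the slow version; the rules are shown as (length, predicate) pairs standing in for whatever one analyzer does, and the data is made up):

```python
import re

def scan_naive(rows, rules):
    """rows: rows of one <file> block; rules: (n_lines, predicate) pairs,
    where predicate(window) says whether the rule matches there.
    Tries every rule at every start line -- the slow nested-loop part."""
    hits = []
    L = max(n for n, _ in rules)            # length of the longest rule
    for start in range(len(rows)):
        window = rows[start:start + L]      # subset of length <= L
        for i, (n, pred) in enumerate(rules):
            if len(window) >= n and pred(window[:n]):
                hits.append((start, i))     # "trigger something"
    return hits

# Illustrative rules and data (entirely made up):
rules = [
    (1, lambda w: re.search(r'^ERR', w[0][0]) is not None),
    (2, lambda w: w[0][1] == 'open' and w[1][1] == 'close'),
]
rows = [['ERR1', 'open'], ['x', 'close'], ['ok', 'idle']]
print(scan_naive(rows, rules))   # [(0, 0), (0, 1)]
```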
My trouble is with this step:
"for each analyzer, check whether it matches any of the subsets".
It is really too slow: I have many, many rules, and it is a for loop
inside a for loop, with yet another for loop over the subset's lines
inside each rule. I need to speed this up; do you have any ideas?
I am thinking of keeping only single-line rules and tracking whether
each rule is an "ending" one (which triggers something) or a "must
continue" one, but this is still unclear to me for now...
Some sort of dict with regexp keys could also have been a great
thing...
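The "only single-line rules" idea could look roughly like this: compile each multi-line rule into a chain of per-line steps and carry a set of partial matches from one line to the next, so every line is scanned once instead of once per window. This is only a guess at the approach, with illustrative names:

```python
import re

# Sketch: a rule is a chain of per-line steps; a step is a
# {column: regexp} dict.  While scanning, we keep "partial matches":
# (rule_id, next_step) pairs that survived the previous line.

def step_matches(step, row):
    return all(col < len(row) and pat.search(row[col])
               for col, pat in step.items())

def scan_incremental(rows, rules):
    """rules: list of step chains.  Returns (line_number, rule_id) hits."""
    hits = []
    partial = []                           # (rule_id, step_index) pairs
    for lineno, row in enumerate(rows):
        nxt = []
        # advance partial matches, and also start every rule at step 0
        for rid, step_i in partial + [(r, 0) for r in range(len(rules))]:
            if step_matches(rules[rid][step_i], row):
                if step_i + 1 == len(rules[rid]):
                    hits.append((lineno, rid))     # "ending" step: trigger
                else:
                    nxt.append((rid, step_i + 1))  # "must continue"
        partial = nxt
    return hits
```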
(It would actually also be great if I could use some kind of regexp
operator saying "skip the content of 0 to n lines before matching",
as if, in the example above, I had replaced "on the following line"
with "skip at least 2 lines and match the third column on the next
line". That would be great, but I really have no idea how to even
start thinking about that.)
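One way such a "skip k to n lines" operator might be expressed, again only as a sketch: allow a skip marker between per-line steps and try every allowed offset (simple and recursive, not fast; all names and patterns are invented):

```python
import re

# A rule is a list of steps: either a {column: regexp} dict, or a
# ('skip', lo, hi) marker meaning "skip between lo and hi lines here".

def matches_with_skips(rule, lines):
    """Try-all-offsets matcher; returns True if the rule fits at lines[0]."""
    if not rule:
        return True
    step, rest = rule[0], rule[1:]
    if isinstance(step, tuple) and step[0] == 'skip':
        _, lo, hi = step
        return any(matches_with_skips(rest, lines[k:])
                   for k in range(lo, hi + 1) if k <= len(lines))
    if not lines:
        return False
    ok = all(col < len(lines[0]) and pat.search(lines[0][col])
             for col, pat in step.items())
    return ok and matches_with_skips(rest, lines[1:])

# "match col 0, skip at least 2 (at most 4) lines, then match col 2":
rule = [{0: re.compile('^start')}, ('skip', 2, 4), {2: re.compile('end')}]
```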
Many thanks to anybody who can help,
best