Some Regexp

Dmitry N Orlov

I want to get an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
........

The array must consist of "some text, tabs, CRLF etc 1", "some text,
tabs, CRLF etc 2"

Can you help me?
 
Dmitry V. Sabanin

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.
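
A self-contained sketch of the same idea, assuming the surrounding newlines are unwanted and can be removed with `strip`. The helper name `extract_blocks` and the sample string are made up for illustration:

```ruby
# Hypothetical helper: returns the text between each opener/closer pair.
def extract_blocks(data, opener, closer)
  # Regexp.escape (alias: Regexp.quote) protects any regexp metacharacters
  # in the delimiters; /m lets "." match newlines; ".*?" is non-greedy, so
  # each match stops at the nearest closer.
  pattern = /#{Regexp.escape(opener)}(.*?)#{Regexp.escape(closer)}/m
  # scan returns one single-element array per match; unpack and trim it.
  data.scan(pattern).map { |(body)| body.strip }
end

sample = "TEXT1\nsome text 1\nTEXT2\nTEXT1\nsome text 2\nTEXT2\n"
p extract_blocks(sample, "TEXT1", "TEXT2")
# prints ["some text 1", "some text 2"]
```

Note that `strip` also removes leading and trailing tabs or spaces inside a block, which may or may not be wanted.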
 
Hugh Sasse Staff Elec Eng

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.

Interesting. This doesn't cope with nesting though. I've just
tried that out. Is there a good way to do that with scan?

Hugh
 
Robert Klemme

Hugh Sasse Staff Elec Eng said:
Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.

Interesting. This doesn't cope with nesting though. I've just
tried that out. Is there a good way to do that with scan?

Do you mean nesting of TEXT1...TEXT2 sections within each other? That
can't be done with regexps. You need a context free parser for that.

Cheers

robert
 
Robert Klemme

Dmitry V. Sabanin said:
Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

or:

array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map{|x|x[0]}

Directly reading from a file:

IO.read("file.txt").scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map{|x|x[0]}

Unfortunately, these solutions require reading the whole file into memory
at once. If TEXT1 and TEXT2 are guaranteed to always be on a line by
themselves, you can apply more efficient but slightly more complex solutions.

Regards

robert
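
One possible shape for such a line-by-line solution, sketched under the assumption that each delimiter sits alone on its own line. The helper name `read_blocks` is hypothetical, and a `StringIO` stands in for the file so the example runs on its own:

```ruby
require "stringio"

# Hypothetical helper: collects blocks from any IO-like object one line at
# a time, so the whole input never has to be held in memory as one string.
def read_blocks(io, opener = "TEXT1", closer = "TEXT2")
  blocks = []
  current = nil                      # nil means "outside a block"
  io.each_line do |line|
    case line.chomp                  # chomp drops "\n" or "\r\n"
    when opener
      current = String.new           # start a fresh, mutable block
    when closer
      blocks << current.chomp if current
      current = nil
    else
      current << line if current     # text outside blocks is ignored
    end
  end
  blocks
end

input = StringIO.new("TEXT1\nline a\nline b\nTEXT2\nTEXT1\nline c\nTEXT2\n")
p read_blocks(input)
# prints ["line a\nline b", "line c"]
```

With a real file the call would look like `File.open("file.txt") { |f| read_blocks(f) }`.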
 
Michael campbell

Dmitry said:
I want to get an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
........

The array must consist of "some text, tabs, CRLF etc 1", "some text,
tabs, CRLF etc 2"


You could do something like this (until, as I understand it, this
feature is to be removed):


arr = Array.new()
while (line = gets)
  arr << line if (line =~ /TEXT1/) .. (line =~ /TEXT2/)
end


(You can shorten that up even more, but I think this gets the point across.)
 
Hugh Sasse Staff Elec Eng

Do you mean nesting of TEXT1...TEXT2 sections within each other? That
Yes.

can't be done with regexps. You need a context free parser for that.

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
"(an (example))". The delimiters being one character only couldn't
make a difference here, or could it?
Cheers

robert

Hugh
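
For comparison, a sketch of how a balanced single-character pair can be matched in Ruby itself, assuming Ruby 1.9 or later, whose regexp engine supports subexpression calls (`\g<name>`). This is not Lua's `%b()`, only a rough equivalent; the names `BALANCED`, `paren` and `first_balanced` are made up for illustration:

```ruby
# A named group that calls itself: an opening paren, then any mix of
# non-paren characters or nested balanced groups, then a closing paren.
BALANCED = /(?<paren>\((?:[^()]|\g<paren>)*\))/

# Hypothetical helper: returns the first balanced "(...)" run, or nil.
def first_balanced(str)
  str[BALANCED]   # String#[] with a regexp returns the whole match or nil
end

p first_balanced("see (an (example)) here")
# prints "(an (example))"
```

The self-reference is what takes this beyond a classical regular expression: the engine effectively keeps a call stack while matching.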
 
ts

H> Is this a limitation inherent in regexps, or just as they are
H> implemented now? I ask, because Lua (which uses % where we use \
H> for things like \w, \d etc) has %bxy which matches a balanced x y
H> pair and its contents. Eg %b() would match the whole of
H> "(an (example))". The delimiters being one character only couldn't
H> make a difference here, or could it?

Yes: it can work precisely because it uses only *one* character for each
delimiter.


Guy Decoux
 
Hugh Sasse Staff Elec Eng

H> Is this a limitation inherent in regexps, or just as they are
H> implemented now? I ask, because Lua (which uses % where we use \
H> for things like \w, \d etc) has %bxy which matches a balanced x y
H> pair and its contents. Eg %b() would match the whole of
H> "(an (example))". The delimiters being one character only couldn't
H> make a difference here, or could it?

Yes: it can work precisely because it uses only *one* character for each
delimiter.

OK, I'm probably going to regret asking this (because of the
complexity of Deterministic Finite Automata theory) but:

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?
Guy Decoux
Hugh
 
ts

H> If the delimiters were string constants, not regexps, and therefore
H> of constant length, how would a length greater than one cause this
H> to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants; this just makes the implementation a little more complex,
whereas it's really easy when you have only one character.

This is why you generally see it implemented like this (with
single-character delimiters).


Guy Decoux
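
To illustrate the string-constant case, here is a minimal sketch that extracts outermost blocks delimited by fixed multi-character strings, using a plain depth counter instead of a single regexp. It assumes both delimiters are non-empty and that neither is a prefix of the other; the helper name `outer_blocks` is hypothetical:

```ruby
# Hypothetical helper: returns the contents of each outermost
# opener...closer block, leaving any nested blocks inside the text.
def outer_blocks(data, opener, closer)
  token = Regexp.union(opener, closer)   # matches either literal string
  blocks = []
  depth = 0
  start = nil
  pos = 0
  while (m = token.match(data, pos))
    if m[0] == opener
      start = m.end(0) if depth.zero?    # content begins after the opener
      depth += 1
    elsif depth > 0                      # stray closers are ignored
      depth -= 1
      blocks << data[start...m.begin(0)] if depth.zero?
    end
    pos = m.end(0)                       # continue after this delimiter
  end
  blocks
end

p outer_blocks("/* a /* b */ c */ x /* d */", "/*", "*/")
# prints [" a /* b */ c ", " d "]
```

The counter is the part a classical regular expression cannot express; the delimiter length only changes how the next token is found.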
 
Simon Strandgaard

I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
"(an (example))". The delimiters being one character only couldn't
make a difference here, or could it?

The %bxy feature seems nice, I had to look it up in Lua's manual:
http://www.lua.org/manual/5.0/manual.html#5.3


Maybe I should add it to my regexp engine ?

--
Simon Strandgaard


BTW: I have just released regexp-engine 0.6
http://raa.ruby-lang.org/list.rhtml?name=regexp
 
Hugh Sasse Staff Elec Eng

H> If the delimiters were string constants, not regexps, and therefore
H> of constant length, how would a length greater than one cause this
H> to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants; this just makes the implementation a little more complex,
whereas it's really easy when you have only one character.

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don't think nesting
is actually respected in C comment blocks, but that's another story.)

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else? :)
This is why you generally see it implemented like this (with
single-character delimiters).

"The simplest thing that could possibly work" :)
Guy Decoux

Thank you,
Hugh
 
ts

H> String constants are probably the most common case. They are the
H> frequently-asked-for case: for C comment blocks, bounded by /* and
H> */, from faqs about regexps that I have seen. (I don't think nesting
H> is actually respected in C comment blocks, but that's another story.)

I don't really agree with you: you generally want to parse (), [], <> more
often than string constants.

H> I think it would be really useful to have this, even restricted to
H> string constants. If I put this up as an RCR would there be
H> support, or have I overlooked something else? :)

I may be wrong, but I think their use would be very limited, and if you
introduce this feature someone will later try to parse HTML or XML with
it, and a regexp is not suited to that.


Guy Decoux
 
Simon Strandgaard

data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten


How about this ?

--
Simon Strandgaard


server> ruby a.rb
["some text, tabs, CRLF etc 1"]
["some text, tabs, CRLF etc 2"]
["some text, tabs, CRLF etc 3"]
server> cat a.rb
text=<<EOT
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
EOT
re = /TEXT1$.+?(^.*?$).+?TEXT2/m
text.scan(re){|match| p match }
server>
 
Ara.T.Howard

Date: 3 Dec 2003 02:28:24 -0800
From: Dmitry N Orlov
Newsgroups: comp.lang.ruby
Subject: Some Regexp

I want to get an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
.......

The array must consist of "some text, tabs, CRLF etc 1", "some text,
tabs, CRLF etc 2"

Can you help me?

all fields are separated by either

  TEXT2
  TEXT1

or

  TEXT1

as a special case

/tmp > cat foo.rb
txt = <<-txt
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
txt

p(txt.split(%r/(?:TEXT1)|(?:TEXT2$)?TEXT1/iom)[1..-1])


/tmp > ruby foo.rb
["\n some text, tabs, CRLF etc 1\n TEXT2\n ", "\n some text, tabs, CRLF etc 2\n TEXT2\n ", "\n some text, tabs, CRLF etc 3\n TEXT2\n"]


note that the first field is dropped, since it is empty.

-a
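
Each field in the output above still ends with its TEXT2 marker. A hedged variation on the split idea, assuming both markers always sit on lines of their own (optionally indented), treats every marker line as a separator and throws away the empty fields. The helper name `split_blocks` is hypothetical:

```ruby
# Hypothetical helper: split on whole lines that are exactly TEXT1 or TEXT2
# (leading/trailing spaces or tabs allowed), then drop the empty fields that
# appear before the first marker and between a TEXT2 and the next TEXT1.
def split_blocks(text)
  text.split(/^[ \t]*(?:TEXT1|TEXT2)[ \t]*\r?$\n?/)
      .map(&:strip)
      .reject(&:empty?)
end

txt = "TEXT1\nsome text 1\nTEXT2\nTEXT1\nsome text 2\nTEXT2\n"
p split_blocks(txt)
# prints ["some text 1", "some text 2"]
```

This sketch does not check that markers are properly paired, and a block containing only whitespace is dropped along with the separators.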
 
Hugh Sasse Staff Elec Eng

H> String constants are probably the most common case. They are the
H> frequently-asked-for case: for C comment blocks, bounded by /* and
H> */, from faqs about regexps that I have seen. (I don't think nesting
H> is actually respected in C comment blocks, but that's another story.)

I don't really agree with you: you generally want to parse (), [], <> more
often than string constants.

I may be wrong about the frequency, but that one comes up in more
than one regexp faq, IIRC.
H> I think it would be really useful to have this, even restricted to
H> string constants. If I put this up as an RCR would there be
H> support, or have I overlooked something else? :)

I may be wrong, but I think their use would be very limited, and if you
introduce this feature someone will later try to parse HTML or XML with
it, and a regexp is not suited to that.

Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt's carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where "worse is better"
sometimes.
Guy Decoux

Hugh
 
ts

H> I may be wrong about the frequency, but that one comes up in more
H> than one regexp faq, IIRC.

Because many people use regexps even where they are not suited (HTML and XML
are good examples of this).

H> Sometimes an imperfect solution is better than none.

Sometimes regexps are not suited to the job, and you must use another tool
rather than trying to add features that will only give you problems.


p.s. : a regexp engine is stupid, never forget it :)


Guy Decoux
 
Robert Klemme

Michael campbell said:
You could do something like this (until, as I understand it, this
feature is to be removed):


arr = Array.new()
while (line = gets)
  arr << line if (line =~ /TEXT1/) .. (line =~ /TEXT2/)
end

No, that doesn't work, since each line becomes a new entry in the
array, but the OP wanted the text of each block to be in one string.

robert
 
Robert Klemme

ts said:
H> If the delimiters were string constants, not regexps, and therefore
H> of constant length, how would a length greater than one cause this
H> to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants; this just makes the implementation a little more complex,
whereas it's really easy when you have only one character.

IMHO this is not fully correct: the regexp engine of Lua must have a
special hack to support nesting (and apparently that for single chars
only). You can't do that with regexp engines that stay on the grounds of
regular languages, because finite automata can't count. (Ok, they *can*
count to a certain limit, but then you have to code the count into the
states which quite soon gets very messy.)

So, normally regexps can't nest unless the regexp engine at hand has a
special hack for this implemented, which catapults the set of recognizable
languages out of the regular domain. :)

Regards

robert
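
The remark about counting "to a certain limit" can be made concrete. The sketch below builds an ordinary regexp, with no recursion or engine extensions, that accepts balanced parentheses only up to a fixed nesting depth; every extra level has to be spelled out in the pattern. The helper name `bounded_paren_regexp` is made up for illustration:

```ruby
# Hypothetical helper: builds a regexp that matches strings whose
# parentheses are balanced and nested at most max_depth levels deep.
def bounded_paren_regexp(max_depth)
  inner = "[^()]*"                          # depth 0: no parens at all
  max_depth.times do
    # One more level: plain characters, or a parenthesised group whose
    # contents satisfy the pattern for the previous depth.
    inner = "(?:[^()]|\\(#{inner}\\))*"
  end
  Regexp.new("\\A#{inner}\\z")
end

depth2 = bounded_paren_regexp(2)
p depth2.match?("a(b(c)d)e")   # prints true
p depth2.match?("((()))")      # prints false: three levels deep
p depth2.source.length         # the pattern grows with every level
```

However large `max_depth` is chosen, some deeper input will always be rejected, which is the sense in which a finite automaton cannot count.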
 
Kaspar Schiess

Hello,
Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt's carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where "worse is better"
sometimes.

Excuse me for barging into your conversation, but why not talk about
the correct solution to the [nested tags] problem, which in this case is
a simple, textbook-style solution using racc (or ryacc, last time I
looked)?

file:
    # empty production
  |
    file textblock
  ;

textblock:
    TEXT1 othertext TEXT2
  ;

othertext:
    # empty production
  |
    othertext TEXTLINE
  ;

Of course that's just for the parsing, and I realise that there is more
than one way to write this. What I am trying to say is that if the book
says this kind of tool is the best fit, then why talk about imperfect,
half-baked solutions? The 'perfect' solution is not that far away!

kaspar
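
For readers without racc at hand, the grammar above can also be followed by hand. The sketch below is not racc output, only a hand-written recursive-descent equivalent, assuming a tokenizer that classifies whole lines; all method names (`tokenize`, `parse_file`, `parse_textblock`, `parse_othertext`, `take_token`) are made up for illustration:

```ruby
# Tokenizer: each line becomes [:TEXT1], [:TEXT2] or [:TEXTLINE, line].
def tokenize(text)
  text.each_line.map do |line|
    case line.strip
    when "TEXT1" then [:TEXT1]
    when "TEXT2" then [:TEXT2]
    else [:TEXTLINE, line]
    end
  end
end

# file: <empty> | file textblock
def parse_file(tokens)
  blocks = []
  blocks << parse_textblock(tokens) until tokens.empty?
  blocks
end

# textblock: TEXT1 othertext TEXT2
def parse_textblock(tokens)
  take_token(tokens, :TEXT1)
  body = parse_othertext(tokens)
  take_token(tokens, :TEXT2)
  body
end

# othertext: <empty> | othertext TEXTLINE
def parse_othertext(tokens)
  body = String.new
  body << tokens.shift[1] while tokens.first && tokens.first[0] == :TEXTLINE
  body
end

# Consumes one token of the expected kind or reports a syntax error.
def take_token(tokens, kind)
  token = tokens.shift
  unless token && token[0] == kind
    raise ArgumentError, "expected #{kind}, got #{token.inspect}"
  end
  token
end

p parse_file(tokenize("TEXT1\nline a\nline b\nTEXT2\nTEXT1\nline c\nTEXT2\n"))
# prints ["line a\nline b\n", "line c\n"]
```

As in the grammar, text outside a TEXT1...TEXT2 pair is a syntax error here, and nesting would need an extra alternative in `othertext` that allows a `textblock`.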
 
