Some Regexp

Dmitry N Orlov · Dec 3, 2003

I want to get an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
........

The array must consist of "some text, tabs, CRLF etc 1", "some text,
tabs, CRLF etc 2"

Can you help me?

Dmitry V. Sabanin · Dec 3, 2003

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.
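
The results above still carry the newlines that surround each block, while the question asked for the bare text. A minimal sketch of one way to trim them, assuming the sample data from the question; the names `open_tag`, `close_tag`, and `blocks` are illustrative and not from the original post (they also avoid shadowing Kernel#open):

```ruby
# Sample input copied from the original question.
data = <<EOT
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
EOT

open_tag  = "TEXT1"
close_tag = "TEXT2"

# Non-greedy match between the two markers; /m lets "." match newlines.
pattern = /#{Regexp.escape(open_tag)}(.*?)#{Regexp.escape(close_tag)}/m

# scan returns one-element arrays (one capture group), so flatten,
# then strip the leading and trailing whitespace from each block.
blocks = data.scan(pattern).flatten.map { |s| s.strip }
p blocks
```

Note that `strip` also removes any leading or trailing tabs and spaces that belong to the block, which may or may not be wanted.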

Hugh Sasse Staff Elec Eng · Dec 3, 2003

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.

Interesting. This doesn't cope with nesting though. I've just
tried that out. Is there a good way to do that with scan?

Hugh

Robert Klemme · Dec 3, 2003

Hugh Sasse Staff Elec Eng said:
Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.

Interesting. This doesn't cope with nesting though. I've just
tried that out. Is there a good way to do that with scan?

Do you mean nesting of TEXT1...TEXT2 sections within each other? That
can't be done with regexps. You need a context free parser for that.

Cheers

robert

Robert Klemme · Dec 3, 2003

Dmitry V. Sabanin said:
Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

or:

array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map{|x|x[0]}

Directly reading from a file:

IO.read("file.txt").scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map{|x|x[0]}

Unfortunately these solutions require reading in the whole file at once.
If it is guaranteed that TEXT1 and TEXT2 are always on a line by themselves,
you can apply more efficient but slightly more complex solutions.
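
A minimal sketch of such a line-by-line approach, assuming the markers always sit alone on their lines. The helper name `read_blocks` is hypothetical, and `StringIO` stands in for a real file here:

```ruby
require "stringio"

# Hypothetical helper: reads one line at a time and collects the text
# found between a line equal to open_tag and a line equal to close_tag.
def read_blocks(io, open_tag = "TEXT1", close_tag = "TEXT2")
  blocks  = []
  current = nil                    # nil while we are outside a block
  io.each_line do |line|
    case line.chomp                # chomp also removes a trailing CRLF
    when open_tag
      current = []                 # start collecting a new block
    when close_tag
      blocks << current.join if current
      current = nil
    else
      current << line if current   # ignore lines outside any block
    end
  end
  blocks
end

sample = StringIO.new("TEXT1\nline a\nline b\nTEXT2\nTEXT1\nline c\nTEXT2\n")
p read_blocks(sample)
```

With a real file, the same helper could be called as `File.open("file.txt") { |f| read_blocks(f) }`, so only the current block is held in memory while it is being read.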

Regards

robert

Michael campbell · Dec 3, 2003

Dmitry said:
I want to get array from file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
........

The array must consist of "some text, tabs, CRLF etc 1", "some text,
tabs, CRLF etc 2"

You could do something like this (until, as I understand it, this
feature is to be removed):

arr = Array.new()
while (line = gets)
  arr << line if (line =~ /TEXT1/) .. (line =~ /TEXT2/)
end

(You can shorten that up even more, but I think this gets the point across.)

Hugh Sasse Staff Elec Eng · Dec 3, 2003

Do you mean nesting of TEXT1...TEXT2 sections within each other?

Yes.

That can't be done with regexps. You need a context free parser for that.

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
"(an (example))". The delimiters being one character only couldn't
make a difference here, or could it?

Cheers

robert

Hugh
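
On the question of whether this is inherent or an implementation matter: Ruby versions from 1.9 onward (well after this thread) support subexpression calls with `\g<name>`, which let a pattern refer to itself. A minimal sketch under that assumption; the constant names are illustrative only:

```ruby
# Balanced single-character delimiters, similar in spirit to Lua's %b().
# The named group "paren" calls itself for each nested pair.
BALANCED_PARENS = /(?<paren>\((?:[^()]|\g<paren>)*\))/

# The same idea with multi-character string delimiters. The negative
# lookahead stops "." from consuming the start of either marker.
BALANCED_TEXT = /(?<block>TEXT1(?:(?!TEXT1|TEXT2).|\g<block>)*TEXT2)/m

paren_match = "say (an (example)) now"[BALANCED_PARENS]
text_match  = "x TEXT1 a TEXT1 b TEXT2 c TEXT2 y"[BALANCED_TEXT]

p paren_match  # "(an (example))"
p text_match   # "TEXT1 a TEXT1 b TEXT2 c TEXT2"
```

Such patterns go beyond regular languages in the formal sense, so this is an engine extension rather than a counterexample to the theory.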

ts · Dec 3, 2003

H> Is this a limitation inherent in regexps, or just as they are
H> implemented now? I ask, because Lua (which uses % where we use \
H> for things like \w, \d etc) has %bxy which matches a balanced x y
H> pair and its contents. Eg %b() would match the whole of
H> "(an (example))". The delimiters being one character only couldn't
H> make a difference here, or could it?

Yes, it is precisely because it uses only *one* character for each delimiter
that it can make this work.

Guy Decoux

Hugh Sasse Staff Elec Eng · Dec 3, 2003

H> Is this a limitation inherent in regexps, or just as they are
H> implemented now? I ask, because Lua (which uses % where we use \
H> for things like \w, \d etc) has %bxy which matches a balanced x y
H> pair and its contents. Eg %b() would match the whole of
H> "(an (example))". The delimiters being one character only couldn't
H> make a difference here, or could it?

Yes, it is precisely because it uses only *one* character for each delimiter
that it can make this work.

OK, I'm probably going to regret asking this (because of the
complexity of Deterministic Finite Automata theory) but:

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

Guy Decoux

Hugh

ts · Dec 3, 2003

H> If the delimiters were string constants, not regexps, and therefore
H> of constant length, how would a length greater than one cause this
H> to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants, but this makes the implementation a little more complex,
whereas it's really easy to do when you have only one character.

This is why generally you see it implemented like this (delimiter with
only one character)

Guy Decoux

Simon Strandgaard · Dec 3, 2003

I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
"(an (example))". The delimiters being one character only couldn't
make a difference here, or could it?

The %bxy feature seems nice, I had to look it up in Lua's manual:
http://www.lua.org/manual/5.0/manual.html#5.3

Maybe I should add it to my regexp engine ?

--
Simon Strandgaard

BTW: I have just released regexp-engine 0.6
http://raa.ruby-lang.org/list.rhtml?name=regexp

Hugh Sasse Staff Elec Eng · Dec 3, 2003

H> If the delimiters were string constants, not regexps, and therefore
H> of constant length, how would a length greater than one cause this
H> to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants, but this makes the implementation a little more complex,
whereas it's really easy to do when you have only one character.

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don't think nesting
is actually respected in C comment blocks, but that's another story.)

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else?

This is why generally you see it implemented like this (delimiter with
only one character)

"The simplest thing that could possibly work"

Guy Decoux

Thank you,
Hugh

ts · Dec 3, 2003

H> String constants are probably the most common case. They are the
H> frequently-asked-for case: for C comment blocks, bounded by /* and
H> */, from faqs about regexps that I have seen. (I don't think nesting
H> is actually respected in C comment blocks, but that's another story.)

I don't really agree with you: you generally want to parse (), [], <> more
often than string constants.

H> I think it would be really useful to have this, even restricted to
H> string constants. If I put this up as an RCR would there be
H> support, or have I overlooked something else?

I may be wrong, but I think its use would be very limited, and if
you introduce this feature someone will later try to parse HTML or XML
with it, and a regexp is not suited to that.

Guy Decoux

Simon Strandgaard · Dec 3, 2003

data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

How about this ?

--
Simon Strandgaard

server> ruby a.rb
["some text, tabs, CRLF etc 1"]
["some text, tabs, CRLF etc 2"]
["some text, tabs, CRLF etc 3"]
server> cat a.rb
text=<<EOT
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
EOT
re = /TEXT1$.+?(^.*?$).+?TEXT2/m
text.scan(re){|match| p match }
server>

Ara.T.Howard · Dec 3, 2003

Date: 3 Dec 2003 02:28:24 -0800
From: Dmitry N Orlov <[email protected]>
Newsgroups: comp.lang.ruby
Subject: Some Regexp

I want to get array from file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
.......

The array must consist of "some text, tabs, CRLF etc 1", "some text,
tabs, CRLF etc 2"

Can You help me?

all fields are separated by either
TEXT2
TEXT1
or
TEXT1
as a special case

/tmp > cat foo.rb
txt = <<-txt
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
txt

p(txt.split(%r/(?:TEXT1)|(?:TEXT2$)?TEXT1/iom)[1..-1])

/tmp > ruby foo.rb
["\n some text, tabs, CRLF etc 1\n TEXT2\n ", "\n some text, tabs, CRLF etc 2\n TEXT2\n ", "\n some text, tabs, CRLF etc 3\n TEXT2\n"]

note that the first field is dropped, since it is empty.
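
In the output above each field still ends with the TEXT2 marker. A possible variation, offered only as a sketch: split on a line holding either marker, then discard the whitespace-only pieces. The variable names and the `TEXT[12]` pattern are assumptions made for this sample data:

```ruby
txt = <<EOT
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
EOT

# A marker line: optional indentation, TEXT1 or TEXT2, optional trailing blanks.
marker = /^[ \t]*TEXT[12][ \t]*$/

# Splitting on every marker leaves empty or newline-only pieces between
# adjacent markers; strip each piece and drop the ones that end up empty.
fields = txt.split(marker).map { |s| s.strip }.reject { |s| s.empty? }
p fields
```

Unlike the scan-based answers, this treats both markers as plain separators, so it does not check that they are properly paired.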

-a

Hugh Sasse Staff Elec Eng · Dec 3, 2003

H> String constants are probably the most common case. They are the
H> frequently-asked-for case: for C comment blocks, bounded by /* and
H> */, from faqs about regexps that I have seen. (I don't think nesting
H> is actually respected in C comment blocks, but that's another story.)

I don't really agree with you: you generally want to parse (), [], <> more
often than string constants.

I may be wrong about the frequency, but that one comes up in more
than one regexp faq, IIRC.

H> I think it would be really useful to have this, even restricted to
H> string constants. If I put this up as an RCR would there be
H> support, or have I overlooked something else?

I may be wrong, but I think its use would be very limited, and if
you introduce this feature someone will later try to parse HTML or XML
with it, and a regexp is not suited to that.

Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt's carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where "worse is better"
sometimes.

Guy Decoux

Hugh

ts · Dec 3, 2003

H> I may be wrong about the frequency, but that one comes up in more
H> than one regexp faq, IIRC.

Because many people use regexps even where they are not suited (HTML and XML
are good examples of this).

H> Sometimes an imperfect solution is better than none.

Sometimes regexps are not suited, and you must use another tool rather
than trying to add features which will give you only problems.

p.s. : a regexp engine is stupid, never forget it

Guy Decoux

Robert Klemme · Dec 3, 2003

Michael campbell said:
You could do something like this (until, as I understand it, this
feature is to be removed):

arr = Array.new()
while (line = gets)
  arr << line if (line =~ /TEXT1/) .. (line =~ /TEXT2/)
end

No, that doesn't work, since each line becomes a new entry in the
array. But the OP wanted the text of each block to be in one string.
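
A sketch of how the flip-flop idea might be adapted so that each block ends up in a single string, assuming the flip-flop operator is available in the Ruby version at hand. The helper name `blocks_via_flip_flop` is hypothetical, and `StringIO` stands in for the input stream:

```ruby
require "stringio"

# Hypothetical helper: uses the flip-flop operator to detect the span from
# a TEXT1 line to the next TEXT2 line, and builds one string per span.
def blocks_via_flip_flop(io)
  blocks = []
  io.each_line do |line|
    if (line =~ /TEXT1/)..(line =~ /TEXT2/)
      if line =~ /TEXT1/
        blocks << String.new       # opening marker: start a new block
      elsif line !~ /TEXT2/
        blocks.last << line        # ordinary line inside the block
      end                          # closing marker: nothing to append
    end
  end
  blocks
end

sample = StringIO.new("TEXT1\nline a\nline b\nTEXT2\nTEXT1\nline c\nTEXT2\n")
p blocks_via_flip_flop(sample)
```

The marker lines themselves are skipped, so only the text between them is kept.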

robert

Robert Klemme · Dec 3, 2003

ts said:
H> If the delimiters were string constants, not regexps, and therefore
H> of constant length, how would a length greater than one cause this
H> to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants, but this makes the implementation a little more complex,
whereas it's really easy to do when you have only one character.

IMHO this is not fully correct: the regexp engine of Lua must have a
special hack to support nesting (and apparently that for single chars
only). You can't do that with regexp engines that stay on the grounds of
regular languages, because finite automata can't count. (Ok, they *can*
count to a certain limit, but then you have to code the count into the
states which quite soon gets very messy.)

So, normally regexps can't nest unless the regexp engine at hand has a
special hack for this implemented, which catapults the set of recognizable
languages out of the regular domain.
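
The counting that a finite automaton cannot do is easy to do outside the regexp. A minimal sketch, assuming string-constant delimiters: the regexp only locates the delimiters, and an explicit depth counter tracks nesting. The helper name `outermost_blocks` is hypothetical:

```ruby
# Hypothetical helper: returns the contents of each outermost
# open_tag ... close_tag block, leaving nested blocks inside the text.
def outermost_blocks(text, open_tag, close_tag)
  blocks = []
  depth  = 0
  start  = nil
  # Regexp.union escapes the strings, so the delimiters are matched literally.
  text.scan(Regexp.union(open_tag, close_tag)) do
    match = Regexp.last_match
    if match[0] == open_tag
      start = match.end(0) if depth.zero?   # content begins after the marker
      depth += 1
    elsif depth > 0                         # ignore unmatched closing markers
      depth -= 1
      blocks << text[start...match.begin(0)] if depth.zero?
    end
  end
  blocks
end

p outermost_blocks("TEXT1 a TEXT1 b TEXT2 c TEXT2 TEXT1 d TEXT2", "TEXT1", "TEXT2")
p outermost_blocks("/* a /* b */ c */", "/*", "*/")
```

A block that is opened but never closed is silently dropped in this sketch; a real tool would probably want to report it.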

Regards

robert

Kaspar Schiess · Dec 3, 2003

Hello,

Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt's carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where "worse is better"
sometimes.

Excuse me for barging into your conversation, but why don't you talk about
the correct solution to the [nested tags] problem, which in this case is
a textbook-simple solution using racc (or ryacc, last time I
looked).

file:
    # empty production
  |
    file textblock
  ;

textblock:
    TEXT1 othertext TEXT2
  ;

othertext:
    # empty production
  |
    othertext TEXTLINE
  ;

Of course that's just for the parsing. And I realise that there is more
than one way to write this. What I am trying to say is that if the book
says you had best use this kind of tool, then why talk about imperfect
half-baked solutions? The 'perfect' solution is not that far away!

kaspar
