Question concerning multiline regexps and best practice

Oliver Andrich · Dec 6, 2005

Hi,

I am currently moving an application from Python to Ruby for a training
purpose and to learn Ruby. Inside this application I am parsing text
files delivered by news agencies. These follow more or less a
specification developed by the IPTC consortium. But back to the
question.

At the moment I use a concatenation of strings as the input for the
Regexp, but I asked myself whether I could use a HERE document, as it
would make things a lot clearer without all these single and double
quotes around the strings. Sadly, inside a HERE document the \n at the
end of each line becomes part of the regexp. Is it possible to write a
HERE document, or something like it, that can contain an explicit \n
but where the line breaks of the source itself are skipped?

Or maybe there is an even better way to do it. I am even thinking about
writing a bunch of methods to parse all the stuff without a regex.

Best regards,
Oliver

ako... · Dec 6, 2005

unless i do not understand the question, a regex option that allows
multiline matches might be used.

konstantin

Oliver Andrich · Dec 6, 2005

ako... said:
unless i do not understand the question, a regex option that allows
multiline matches might be used.

Well, to better describe what I am currently dealing with, I post a
snippet of python code with the regex.

import re

msg_rx = re.compile(
    "^\x01?" +
    "(?P<srcid>[a-zA-Z]{3,4})(?P<msgnum>\\d{3,4}) " +
    "(?P<prio>\\d) " +
    "(?P<department>[a-zA-Z]{1,3}) " +
    "(?P<wordcnt>\\d{1,4}) " +
    "(?P<optional>.*)\r\n*" +
    "(?P<keywords>.*)\r\n*" +
    "\x02" +
    "(?:(?P<headline>.*)=\\s*\r\n)?" +
    "(?P<text>.*)" +
    "\x03.*" +
    "(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2}) " +
    "(?P<mon>[a-zA-Z]{3}) " +
    "(?P<year>\\d{2})",
    re.S
)

This little "baby" does the job. As Ruby doesn't have named groups in
regexps, I have to add comment lines (?#...) to document the individual
groups. This would clutter the thing even more. Now Ruby has these nice
HERE documents, %r{...} and so on. I would be happy if I could achieve
something like this:

msg_rx = %r{
^\x01?
(?# comment for the line)
([a-zA-Z]{3,4})(?P<msgnum>\\d{3,4})\s
(\\d)\s
(?# comment for the line)
([a-zA-Z]{1,3})\s
(?# comment for the line)
(\\d{1,4})\s
(.*)\r\n*
(.*)\r\n*
\x02
(?# comment for the line)
(?:(.*)=\\s*\r\n)?
(?# comment for the line)
(.*)
\x03.*
(?# comment for the line)
(\\d{2})(\\d{2})(\\d{2})\s
(?# comment for the line)
([a-zA-Z]{3})\s
(?# comment for the line)
(\\d{2})
}

This looks a lot cleaner to me, but sadly the "\n" at the end of each
line is in my way. I could strip them, but it would be nicer if that
just "happened".

Hopefully, this makes my question a little clearer.

Best regards, Oliver

Edward Faulkner · Dec 6, 2005

But sadly inside a HERE document, the \n at the end of a line are
used by the regexp. Is it possible to write a HERE document or
something like that with \n inside, but afterwards the \n the source
are skipped?

The cleanest solution is to make a regular expression that can work
regardless of the presence of newlines. You probably want multiline
mode. I'd need to see your specific example.

Or you could strip out the newlines with mystring.gsub("\n", "")
(gsub rather than sub, since sub replaces only the first newline).

regards,
Ed
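
For reference, a minimal sketch of what multiline mode does in Ruby: the /m option lets "." match a newline, which is the counterpart of the re.S flag in the Python snippet. The sample string and variable names are made up for illustration.

```ruby
# Hypothetical two-line input, only for illustration.
text = "headline\nbody"

plain     = /headline.body/    # by default "." does not match "\n"
multiline = /headline.body/m   # with /m, "." also matches "\n"

puts plain.match(text).inspect
puts multiline.match(text)[0].inspect
```

Note that /m only changes how the pattern treats newlines in the input. It does nothing about line breaks inside the pattern's own source, which is the problem raised in the original question.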

ako... · Dec 6, 2005

the best that i could come up with is to remove new lines from your
regexp:

(ms_rx = <<END).gsub!(/\n/, '')
line one
line two
END

ako... · Dec 7, 2005

the best that i could come up with:

(var = <<TARGET).gsub!(/\n/, '')
line one
line two
TARGET

puts var
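
The heredoc approach can be taken one step further by compiling the stripped string with Regexp.new. The following is only a sketch: the three-group pattern, the names source and time_rx, and the input "061530" are assumptions chosen for illustration. The quoted delimiter (<<'PATTERN') is used so that backslashes reach the regexp engine unchanged.

```ruby
# Build the pattern source from a heredoc. The quoted delimiter disables
# backslash processing, so \d is passed through as written.
source = <<'PATTERN'.gsub(/\n/, '')
(\d{2})
(\d{2})
(\d{2})
PATTERN

# Compile the joined source into a Regexp object.
time_rx = Regexp.new(source)

m = time_rx.match("061530")
puts m.captures.inspect
```

Leading spaces on the heredoc lines would become part of the pattern, so this variant only works when the lines start in the first column.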

gabriele renzi · Dec 7, 2005

Oliver Andrich wrote:

ako... wrote:

This looks a lot cleaner to me, but sadly the "\n" at the end of the
lines are in my way. I could strip them, but if it would just
"happen" it would be nicer.

Use the /x switch; it should work even with %r literals:

>> rgx=%r[
foo #foo
bar #bar
]x
=> /
foo #foo
bar #bar
/x
>> m=rgx.match "foobar"
=> #<MatchData:0x...>
>> m[0]
=> "foobar"
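
One consequence of /x worth keeping in mind: because unescaped whitespace in the pattern is ignored, a literal blank has to be written explicitly. A small sketch, with made-up variable names:

```ruby
# In extended mode, unescaped whitespace and "# ..." comments inside the
# pattern are dropped, so a literal blank must be spelled out.
loose   = /foo bar/x      # behaves like /foobar/
strict  = /foo \s bar/x   # \s matches one whitespace character
escaped = /foo\ bar/x     # a backslash-escaped space stays literal

puts loose.match("foobar")[0]
puts loose.match("foo bar").inspect
puts strict.match("foo bar")[0]
puts escaped.match("foo bar")[0]
```

This is why the separators between fields are written as \s in the extended version of the pattern.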

William James · Dec 7, 2005

Oliver said:
This little "baby" does the job. As Ruby doesn't have named groups in
regexps, I have to add comment lines (?#...) to document the individual
groups. This would clutter the thing even more. Now Ruby has these nice
HERE documents, %r{...} and so on. I would be happy if I could achieve
something like this:

msg_rx = %r{
^\x01?
(?# comment for the line)
([a-zA-Z]{3,4})(?P<msgnum>\\d{3,4})\s
(\\d)\s
(?# comment for the line)
([a-zA-Z]{1,3})\s
(?# comment for the line)
(\\d{1,4})\s
(.*)\r\n*
(.*)\r\n*
\x02
(?# comment for the line)
(?:(.*)=\\s*\r\n)?
(?# comment for the line)
(.*)
\x03.*
(?# comment for the line)
(\\d{2})(\\d{2})(\\d{2})\s
(?# comment for the line)
([a-zA-Z]{3})\s
(?# comment for the line)
(\\d{2})
}

Use extended mode:

msg_rx = %r{
^\x01?
# source id and message number
([a-zA-Z]{3,4}) (\d{3,4}) \s
# priority
(\d) \s
}x
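
Carrying that idea through the whole pattern might look like the sketch below. Several things here are assumptions rather than part of the thread: named groups written as (?<name>...) need Ruby 1.9 or later, \A replaces ^ because Ruby's ^ matches at every line start, /m stands in for Python's re.S, and the sample message is invented purely to exercise the pattern.

```ruby
# Sketch only: assumes Ruby 1.9+ for named groups.
msg_rx = %r{
  \A \x01?                                       # optional SOH byte
  (?<srcid>[a-zA-Z]{3,4}) (?<msgnum>\d{3,4}) \s  # source id, message number
  (?<prio>\d) \s                                 # priority
  (?<department>[a-zA-Z]{1,3}) \s                # department code
  (?<wordcnt>\d{1,4}) \s                         # word count
  (?<optional>.*) \r\n*
  (?<keywords>.*) \r\n*
  \x02                                           # STX byte, start of text
  (?: (?<headline>.*) = \s* \r\n )?              # optional headline
  (?<text>.*)
  \x03 .*                                        # ETX byte, end of text
  (?<day>\d{2}) (?<hour>\d{2}) (?<minute>\d{2}) \s
  (?<mon>[a-zA-Z]{3}) \s
  (?<year>\d{2})
}mx

# Invented sample message, not real agency data.
sample = "\x01dpa0123 4 pl 120 \r\nPolitics/Berlin\r\n" \
         "\x02Headline here =\r\nBody text line one.\r\n" \
         "\x03\r\n061530 Dec 05"

m = msg_rx.match(sample)
puts m[:srcid], m[:msgnum], m[:keywords], m[:headline].strip, m[:mon]
```

With named groups the inline comments are optional, since m[:srcid] already documents what is being extracted. On a Ruby without named groups, the same layout works with plain parentheses and numbered captures.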

Oliver Andrich · Dec 7, 2005

Thanks, William and Gabriele! This is exactly what I have been looking
for. Now this part of my module looks nice, clean and uncluttered.

Best regards,
Oliver
