Question concerning multiline regexps and best practice

Oliver Andrich

Hi,

I am currently porting an application from Python to Ruby as a training
exercise to learn Ruby. Inside this application I am parsing text
files delivered by news agencies. These follow more or less a
specification developed by the IPTC consortium. But back to the
question. :)

At the moment I use a concatenation of strings as input for the
Regexp, but I asked myself whether I could use a HERE document, as it would
make things a lot clearer without all these single and double
quotes around the strings. But sadly, inside a HERE document, the \n at the
end of each line becomes part of the regexp. Is it possible to write a HERE
document or something like it that spans several lines, but where the \n
in the source are skipped afterwards?

Or maybe there is an even better way to do it. I am even thinking about
writing a bunch of methods to parse all the stuff without a regex.

Best regards,
Oliver
 
ako...

Unless I misunderstand the question, a regex option that allows
multiline matches might be used.

konstantin
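
For reference, a minimal sketch of what such an option does in Ruby. The sample text is made up for illustration; the /m flag is Ruby's counterpart to the re.S flag used in Python, and it only changes what "." matches.

```ruby
# Ruby's /m flag lets "." match newline characters as well
# (the behaviour Python calls re.S). Sample text is hypothetical.
text = "headline\r\nbody line one\r\nbody line two"

# Without /m, "." cannot cross the "\n", so there is no match.
without_m = text[/headline(.*)two/, 1]

# With /m, the capture spans the line breaks.
with_m = text[/headline(.*)two/m, 1]

p without_m
p with_m
```

Note that this does not help with newlines inside the pattern source itself, which is a separate problem.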
 
Oliver Andrich

ako... said:
Unless I misunderstand the question, a regex option that allows
multiline matches might be used.

Well, to better describe what I am currently dealing with, here is a
snippet of Python code with the regex.

msg_rx = re.compile(
"^\x01?" +
"(?P<srcid>[a-zA-Z]{3,4})(?P<msgnum>\\d{3,4}) " +
"(?P<prio>\\d) " +
"(?P<department>[a-zA-Z]{1,3}) " +
"(?P<wordcnt>\\d{1,4}) " +
"(?P<optional>.*)\r\n*" +
"(?P<keywords>.*)\r\n*" +
"\x02" +
"(?:(?P<headline>.*)=\\s*\r\n)?" +
"(?P<text>.*)" +
"\x03.*" +
"(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2}) " +
"(?P<mon>[a-zA-Z]{3}) " +
"(?P<year>\\d{2})",
re.S
)

This little "baby" does the job. As Ruby doesn't have named groups in
regexps, I have to add comment lines (?#...) to document the individual
groups. This would clutter the thing even more. Now Ruby has these nice
HERE documents, %r{...} and so on. I would be happy if I could achieve
something like this:

msg_rx = %r{
^\x01?
(?# comment for the line)
([a-zA-Z]{3,4})(\d{3,4})\s
(\d)\s
(?# comment for the line)
([a-zA-Z]{1,3})\s
(?# comment for the line)
(\d{1,4})\s
(.*)\r\n*
(.*)\r\n*
\x02
(?# comment for the line)
(?:(.*)=\s*\r\n)?
(?# comment for the line)
(.*)
\x03.*
(?# comment for the line)
(\d{2})(\d{2})(\d{2})\s
(?# comment for the line)
([a-zA-Z]{3})\s
(?# comment for the line)
(\d{2})
}

This looks a lot cleaner to me, but sadly the "\n" at the end of the
lines are in my way. :) I could strip them, but it would be nicer if
that just "happened".

Hopefully, this makes my question a little clearer.

Best regards, Oliver
 
Edward Faulkner

Oliver said:
But sadly, inside a HERE document, the \n at the end of each line
becomes part of the regexp. Is it possible to write a HERE document or
something like it that spans several lines, but where the \n in the
source are skipped afterwards?

The cleanest solution is to make a regular expression that can work
regardless of the presence of newlines. You probably want multiline
mode. I'd need to see your specific example.

Or you could strip out the newlines with mystring.gsub("\n", "").

regards,
Ed
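
A small sketch of the stripping approach, under the assumption that the pattern is held in an ordinary string first. The variable names are made up for illustration. When stripping newlines this way, gsub is the call that removes every newline; sub stops after the first one.

```ruby
# Hypothetical pattern source spread over three lines.
pattern_source = "(\\d{2})\n(\\d{2})\n(\\d{2})"

# String#sub replaces only the first newline ...
first_only = pattern_source.sub("\n", "")

# ... while String#gsub replaces all of them.
all_gone = pattern_source.gsub("\n", "")

p first_only.include?("\n")
p all_gone.include?("\n")

# Only the fully stripped source matches six consecutive digits.
p Regexp.new(all_gone).match?("051207")
```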

 
ako...

The best I could come up with is to remove the newlines from your
regexp:

(ms_rx = <<END).gsub!(/\n/, '')
line one
line two
END
 
ako...

The best I could come up with:

(var = <<TARGET).gsub!(/\n/, '')
line one
line two
TARGET

puts var
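
To turn this idea into an actual Regexp, something along these lines might work. It is only a sketch: the heredoc tag, the variable names and the sample input are assumptions. One detail worth knowing is that a plain heredoc treats backslashes like a double-quoted string, so the single-quoted form (<<'PATTERN') is used here to let \d and \s reach the regexp engine unchanged.

```ruby
# Single-quoted heredoc: backslash sequences are kept literally.
# The newlines that separate the source lines are stripped with gsub.
source = <<'PATTERN'.gsub("\n", "")
(\d{2})(\d{2})(\d{2})\s
([a-zA-Z]{3})\s
(\d{2})
PATTERN

date_rx = Regexp.new(source)

# Hypothetical trailer in the day/hour/minute month year layout.
m = date_rx.match("071230 Dec 05")
p m.captures
```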
 
gabriele renzi

William James

Oliver said:
This little "baby" does the job. As Ruby doesn't have named groups in
regexps, I have to add comment lines (?#...) to document the individual
groups. This would clutter the thing even more. Now Ruby has these nice
HERE documents, %r{...} and so on. I would be happy if I could achieve
something like this:

msg_rx = %r{
^\x01?
(?# comment for the line)
([a-zA-Z]{3,4})(\d{3,4})\s
(\d)\s
(?# comment for the line)
([a-zA-Z]{1,3})\s
(?# comment for the line)
(\d{1,4})\s
(.*)\r\n*
(.*)\r\n*
\x02
(?# comment for the line)
(?:(.*)=\s*\r\n)?
(?# comment for the line)
(.*)
\x03.*
(?# comment for the line)
(\d{2})(\d{2})(\d{2})\s
(?# comment for the line)
([a-zA-Z]{3})\s
(?# comment for the line)
(\d{2})
}

Use extended mode:

msg_rx = %r{
^\x01?
# comment for the line
([a-zA-Z]{3,4}) (\d{3,4}) \s
(\d) \s
}x
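
A slightly fuller sketch of the same idea, covering just the header line of the message. In extended mode, whitespace and newlines in the pattern are ignored and "#" starts a comment, so a literal space has to be written as \s (or escaped). The sample header and the variable names are made up for illustration and are not taken from the IPTC specification.

```ruby
# Extended mode (/x): layout and comments inside the pattern are ignored.
header_rx = %r{
  ^\x01?               # optional start-of-header control byte
  ([a-zA-Z]{3,4})      # source id
  (\d{3,4}) \s         # message number
  (\d) \s              # priority
  ([a-zA-Z]{1,3}) \s   # department
  (\d{1,4}) \s         # word count
  (.*)                 # optional rest of the header line
}x

# Hypothetical header line for demonstration.
sample = "\x01dpa0123 3 pl 250 optional text"

srcid, msgnum, prio, department, wordcnt, optional =
  header_rx.match(sample).captures

p [srcid, msgnum, prio, department, wordcnt, optional]
```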
 
Oliver Andrich

Thanks, William and Gabriele! This is exactly what I have been looking
for. Now this part of my module looks nice, clean and uncluttered.

Best regards,
Oliver
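
A note for readers on Ruby 1.9 or later: the regexp engine there accepts named groups written as (?<name>...), without the "P" that Python uses. Combined with extended mode, the date trailer from the pattern above could be sketched like this. The sample input is made up.

```ruby
# Named groups plus extended mode; requires Ruby 1.9 or later.
trailer_rx = /
  (?<day>\d{2}) (?<hour>\d{2}) (?<minute>\d{2}) \s
  (?<mon>[a-zA-Z]{3}) \s
  (?<year>\d{2})
/x

# Hypothetical trailer for demonstration.
m = trailer_rx.match("071230 Dec 05")

# Groups can be read by name instead of by position.
puts m[:day]
puts m[:mon]
puts m[:year]
```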
 