Parsing String#dump data?

Andrew S. Townley

Hi,

I was wondering if there was a better/safer way to parse string data
that has been dumped using the String#dump method. At the moment, I've
been using regular expressions to do it, but that doesn't seem to work
with unicode characters, since they get dumped as follows:

irb(main):007:0> s = "€"
=> "€"
irb(main):008:0> s.dump
=> "\"\\342\\202\\254\""

On a lark, I figured that Ruby would be able to get them back, so I
tried this:

irb(main):009:0> x = eval s.dump
=> "€"

Which, of course, works. However, I'm a bit leery of doing this from a
safety perspective, because I really don't have any control over these
strings, and I'd prefer not to allow the execution of arbitrary Ruby
code every time I'm trying to restore strings (I need them serialized as
appropriately escaped quoted literals).

Has anyone else ever needed to do this, and, if so, how did you solve
the problem? I guess I could do another pass on the string looking for
'\\[\n]+' values and try and combine them some way, but I'm not really
sure how to do that either.

Any ideas?

Cheers,

ast
--
Andrew S. Townley <[email protected]>
http://atownley.org
 
Robert Dober

Maybe
eval s.dump if %r{\A".*"*\Z} === s.dump && ! %r{#[{@]} === s.dump # not tested
is safe, but I am not 100% sure.
Cheers
Robert

--
Si tu veux construire un bateau ...
Ne rassemble pas des hommes pour aller chercher du bois, préparer des
outils, répartir les tâches, alléger le travail… mais enseigne aux
gens la nostalgie de l’infini de la mer.

If you want to build a ship, don’t herd people together to collect
wood and don’t assign them tasks and work, but rather teach them to
long for the endless immensity of the sea.
 
Urabe Shyouhei

Andrew said:
Which, of course, works. However, I'm a bit leery of doing this from a
safety perspective, because I really don't have any control over these
strings, and I'd prefer not to allow the execution of arbitrary Ruby
code every time I'm trying to restore strings (I need them serialized as
appropriately escaped quoted literals).

IMHO it is a bad idea to use String#dump when you cannot control those strings.

My recommendation is to use Marshal.dump, which also generates a string.
Adding quotes to those Marshal-generated strings should be easier than safely
evaluating a dumped string.
 
Andrew S. Townley

Thanks for the replies. Actually, as I was doing something else,
another option occurred to me which seems both to a) work properly and
b) be safe(-ish):

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> s = "€"
=> "€"
irb(main):003:0> x = s.dump
=> "\"\\342\\202\\254\""
irb(main):004:0> t = ""
=> ""
irb(main):005:0> t.instance_eval x
=> "€"

Since all I ever want is to have the data back in the string, and string
doesn't have any methods that are likely to cause problems, this might
be a reasonable short-to-medium term solution. It still makes me a bit
uncomfortable though, because I don't really want anything other than
the encoded characters handled.

I can't use Marshal, because I need to have the data available as plain
text (hence the quoted strings) which isn't necessarily guaranteed to be
always processed by Ruby. I chose String#dump because it seemed like it
would always generate a "safe" string that would be parsed using normal
quote literal recognition. I hadn't tested it until recently with lots
of Unicode data, because I simply hadn't gotten there yet. I was just
lucky...

Even the Unicode handling is straightforward enough, and since I posed
the question, I found this blog:
http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
which talks about modifying the JSON parser approach. I might be able
to do that, or, I might need to end up writing my own
serializer/deserializer, since at this stage (over a year), I've a lot
of legacy data lying around that was created with this approach.

I guess, I could write a one-off clean-up utility for the data that I
have now and then use the JSON library just to encode/decode the
strings, but that seems like overkill.

My goals here are interoperability, reuse, ease of adapting to my
existing code (in that order). Until I ran across the site, I hadn't
thought about the JSON approach, but it might make the most sense for
interoperable data. Mind you, I only care about safe string
serialization/deserialization, and I've no use in the application for
the rest of the JSON spec.

Changing the question a little: does anyone know of the best way to
serialize and parse strings containing Unicode and other non-printing
characters? Ideally, I'd like to have something that works like
String#dump except that it used escaped Unicode code point references,
e.g. \uxxxx and \Uxxxxxxxx, and handles all of the "usual suspects" like
\", \\, etc.

Doing some more googling, I also came across this, but I'm not sure what
the status of it is, and I'm not sure that it addresses my issue either.
It seems to be more about processing Unicode rather than serialization
of Unicode to ASCII. (http://snippets.dzone.com/posts/show/4527).
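
A String#dump-like format with \uxxxx and \Uxxxxxxxx escapes, as described above, can be sketched without any eval. This is only a minimal sketch, assuming Ruby 1.9 or later and valid UTF-8 input; the names UnicodeEscape, escape and unescape are hypothetical and do not come from any library.

```ruby
# Hypothetical helper module, for illustration only.
module UnicodeEscape
  TO_ESCAPE   = { "\\" => "\\\\", "\"" => "\\\"", "\n" => "\\n", "\r" => "\\r", "\t" => "\\t" }.freeze
  FROM_ESCAPE = { "\\" => "\\", "\"" => "\"", "n" => "\n", "r" => "\r", "t" => "\t" }.freeze

  # Turn a UTF-8 string into a double-quoted, ASCII-only literal.
  def self.escape(str)
    body = str.each_char.map do |ch|
      cp = ch.ord
      if TO_ESCAPE.key?(ch)
        TO_ESCAPE[ch]
      elsif cp >= 0x20 && cp < 0x7f
        ch                          # printable ASCII passes through
      elsif cp <= 0xffff
        format("\\u%04x", cp)       # other BMP code points and control characters
      else
        format("\\U%08x", cp)       # code points above U+FFFF
      end
    end.join
    "\"#{body}\""
  end

  # Parse such a literal back into a string. Nothing is evaluated as Ruby code.
  def self.unescape(literal)
    body = literal[/\A"(.*)"\z/m, 1] or raise ArgumentError, "not a quoted literal"
    body.gsub(/\\(?:u(\h{4})|U(\h{8})|(.))/m) do
      hex, simple = ($1 || $2), $3
      next [hex.hex].pack("U") if hex
      FROM_ESCAPE.fetch(simple) { raise ArgumentError, "unknown escape \\#{simple}" }
    end
  end
end

literal = UnicodeEscape.escape("price: \u20ac5")
puts literal
puts UnicodeEscape.unescape(literal) == "price: \u20ac5"
```

The parser is deliberately lenient: it does not reject an unescaped quote inside the literal or a trailing lone backslash, so stricter validation would be needed for untrusted input.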

[much time passes...including lunch]

After arsing around for a long time with various stupid stuff, I finally
came up with this. I don't really like it, but it seems to do the job.
Comments welcome:

irb(main):026:0> euro = "€"
=> "€"
irb(main):027:0> x = euro.dump
=> "\"\\342\\202\\254\""
irb(main):028:0> x.gsub(/\\(\d\d\d)/) { [ $1.oct ].pack("c") }[1..-2]
=> "€"

However, this doesn't get me in/out of the "standard" Unicode escapes.
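
The gsub approach above can be extended to cover the other backslash escapes while still avoiding eval. This is a minimal sketch, assuming the input uses the older style of String#dump output shown in this thread (three-digit octal escapes plus the usual backslash escapes) and that it runs on Ruby 2.0 or later for String#b. The names parse_legacy_dump and LEGACY_ESCAPES are hypothetical, and the list of escapes is an assumption about what the old dump format emits.

```ruby
# Assumed set of single-character escapes produced by the old dump format.
LEGACY_ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "f" => "\f", "v" => "\v",
  "a" => "\a", "b" => "\b", "e" => "\e",
  "\"" => "\"", "\\" => "\\", "#" => "#"
}.freeze

# Reverse an octal-style dumped literal byte by byte, without eval.
def parse_legacy_dump(dumped, encoding = Encoding::UTF_8)
  body = dumped[/\A"(.*)"\z/m, 1] or raise ArgumentError, "not a dumped string"
  bytes = body.b.gsub(/\\(?:([0-7]{3})|(.))/m) do
    octal, simple = $1, $2
    next [octal.oct].pack("C") if octal   # \342 becomes the single byte 0xE2
    LEGACY_ESCAPES.fetch(simple) { raise ArgumentError, "unsupported escape \\#{simple}" }
  end
  bytes.force_encoding(encoding)          # reinterpret the bytes, e.g. as UTF-8
end

puts parse_legacy_dump("\"\\342\\202\\254\"") == "\u20ac"
# Interpolation syntax in the input stays plain text; nothing is executed.
puts parse_legacy_dump("\"\#{1 + 1}\"") == "\#{1 + 1}"
```

Because the input is only ever scanned with a regular expression, a hostile string can at worst raise ArgumentError or produce unexpected bytes.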

Thanks in advance for any ideas or suggestions.

Cheers,

ast
--
Andrew S. Townley <[email protected]>
http://atownley.org
 
Urabe Shyouhei

Andrew said:
Thanks for the replies. Actually, as I was doing something else,
another option occurred to me which seems both to a) work properly and
b) be safe(-ish):

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> s = "€"
=> "€"
irb(main):003:0> x = s.dump
=> "\"\\342\\202\\254\""
irb(main):004:0> t = ""
=> ""
irb(main):005:0> t.instance_eval x
=> "€"

irb(main):001:0> t = ""
=> ""
irb(main):002:0> t.instance_eval "`ls`"
=> "tmp.txt\ntmp.rb\n"

Be sure your strings do not include something like `rm -rf` ...

How about Array#pack. It has an ability to escape strings as MIME
quoted-printable:

irb(main):001:0> s = "abcd€fghi"
=> "abcd€fghi"
irb(main):002:0> t = [s].pack("M")
=> "abcd=E2=82=ACfghi=\n"
irb(main):003:0> t.unpack("M")[0].force_encoding("UTF-8")
=> "abcd€fghi"

# that force_encoding thing is required for ruby 1.9.


JSON is part of Ruby's stdlib these days (1.9 and above). Using it might be easier
than you think at first.

irb(main):001:0> require 'json'
=> true
irb(main):002:0> "€".to_json
=> "\"\\u20ac\""
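
A round trip through the JSON library could look like the following. This is a minimal sketch, not a definitive implementation: json_escape and json_unescape are hypothetical helper names, and the string is wrapped in an array because older json versions may reject a bare string as a top-level document. The ascii_only option is passed explicitly, since recent json versions may leave non-ASCII characters unescaped by default.

```ruby
require "json"

# Produce a double-quoted, ASCII-only JSON string literal.
def json_escape(str)
  JSON.generate([str], ascii_only: true)[1..-2]   # strip the surrounding [ and ]
end

# Parse a JSON string literal back into a Ruby string, without eval.
def json_unescape(literal)
  values = JSON.parse("[#{literal}]")
  unless values.size == 1 && values.first.is_a?(String)
    raise ArgumentError, "expected exactly one JSON string"
  end
  values.first
end

literal = json_escape("abcd\u20acfghi")
puts literal
puts json_unescape(literal) == "abcd\u20acfghi"
```

Malformed input raises JSON::ParserError or ArgumentError instead of being executed.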

Generally speaking, you cannot be safe when eval or eval-type methods are used. So
you have to either (1) write your own deserializer without evals, or (2) use an
existing one like JSON. I guess using existing libraries is not a bad idea for
interoperability, so JSON might not be that much overkill. Quoted-printable is
defined in an RFC (RFC 2045), so it might also be a good alternative.
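
For data that only Ruby needs to read back, one eval-free option that postdates this thread is String#undump, available in Ruby 2.5 and later as the inverse of that version's String#dump. A minimal sketch, assuming Ruby 2.5 or later; whether undump accepts the older octal-style output shown earlier is not something this sketch relies on.

```ruby
# Assumes Ruby 2.5 or later, where String#undump exists.
original = "abcd\u20acfghi \"quoted\" \\ \t"
dumped   = original.dump     # ASCII-only, double-quoted literal
restored = dumped.undump     # parsed as data, never evaluated as Ruby code

puts dumped
puts restored == original
```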

If you want \uxxxx-style escapes, the JSON library is the best bet, I think. Another
choice is to use the YAML stdlib, but it generates backslashed escapes, so you need
to convert them anyway.
 
