How do I set the encoding on a Regexp?

Perry Smith

Title pretty much says it all. Here is a small sample program:

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

s = "string"
puts s.encoding
r = Regexp.new(s)
puts r.encoding

Here is the output:

UTF-8
US-ASCII

I was expecting both to be set to UTF-8. There is no force_encoding
method for Regexp.

If I later try to use it on strings of type UTF-8, it can throw an
exception.

How is this supposed to be handled?

Thanks,
Perry
 
Roger Pack

Perry said:
I was expecting both to be set to UTF-8. There is no force_encoding
method for Regexp.

If I later try to use it on strings of type UTF-8, it can throw an
exception.

Do you have an example of this? It might be a bug.

I did notice that

Regexp.new("Café").encoding

keeps it in UTF-8

so maybe it's optimizing it and when it doesn't "have to be" UTF-8 it is
leaving it as ASCII?
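
To make that observation concrete, here is a minimal sketch comparing an ASCII-only pattern with one that contains a non-ASCII character. It assumes the source file is UTF-8 (the magic comment matters on 1.9; later Rubies default to UTF-8), and the variable names are only illustrative.

```ruby
# -*- coding: utf-8 -*-

ascii_re = Regexp.new("string")   # pattern holds only ASCII characters
utf8_re  = Regexp.new("Café")     # pattern holds one non-ASCII character

puts ascii_re.encoding   # US-ASCII
puts utf8_re.encoding    # UTF-8
```

So the encoding reported by the Regexp appears to depend on the characters in the pattern rather than on the tag of the string it was built from.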

-r
 
Perry Smith

Roger said:
Do you have an example of this? It might be a bug.

I did notice that

Regexp.new("Café").encoding

keeps it in UTF-8

so maybe it's optimizing it and when it doesn't "have to be" UTF-8 it is
leaving it as ASCII?

I'm not clear what you mean by an example other than what I put in the
original note.

I think I'm going to open a bug report -- it might not be a bug but I
sure am confused. The "Pick Axe" book describes a third argument but I
can't get that to work either. "ri" for Ruby 1.9.1 does not describe
the third argument at all -- but it does seem to exist at least.

It appears as if, as you pointed out, if the input string happens to be
ASCII, then the regexp encoding is ascii and there doesn't seem to be
anything you can do about it.

I'm testing on 1.9.1 p243.

But, due to another discussion thread, I think I want to be in 8 bit
binary anyway in my case. I'm not 100% positive my input is UTF-8. It's
supposed to be but I can't really trust it.
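
One possible shape for the "8 bit binary" approach is sketched below. It assumes Ruby 2.0 or later for String#b (on 1.9, dup.force_encoding("BINARY") plays the same role), and the sample bytes and pattern are made up for illustration.

```ruby
# Bytes of uncertain origin: 0xFF can never appear in valid UTF-8.
raw = "\xFFname=value".b   # String#b returns a binary (ASCII-8BIT) copy

puts raw.encoding == Encoding::BINARY   # true
puts raw.valid_encoding?                # true: any byte sequence is valid binary

# An ASCII-only Regexp without the fixed flag can be matched against binary data.
pattern = Regexp.new("name=(\\w+)")
match = pattern.match(raw)
puts match[1]   # value
```

The trade-off is that the strings are then sequences of bytes, so character-level operations on non-ASCII text no longer apply.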

Thanks
Perry
 
Roger Pack

Perry said:
I'm not clear what you mean by an example other than what I put in the
original note.


Do you have a small example (like your original) that throws an
exception where you "use it on strings later of type UTF-8" and it
throws an exception?

-r
 
Perry Smith

Roger said:
Do you have a small example (like your original) that throws an
exception where you "use it on strings later of type UTF-8" and it
throws an exception?

No I don't. I *think* that I might have had a string that was not
utf-8. I was fetching strings from a file and just doing a
force_encoding because they were supposed to be utf-8 but maybe they were
not.
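
A sketch of how that situation can produce an exception, under the assumption that the file really held Latin-1 bytes (the sample string and the ISO-8859-1 guess are hypothetical, not taken from Perry's data):

```ruby
# A Latin-1 byte sequence wrongly labelled as UTF-8. force_encoding only
# changes the tag; it neither checks nor converts the bytes.
line = "caf\xE9".b.force_encoding("UTF-8")

puts line.valid_encoding?   # false

error = nil
begin
  line.match(/caf/)
rescue ArgumentError => e
  error = e
  puts "match failed: #{e.message}"
end

# If the real encoding is known (ISO-8859-1 is only an assumption here),
# relabel the bytes and transcode them to UTF-8 before matching.
repaired = line.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts repaired.valid_encoding?   # true
puts(repaired =~ /caf/)         # 0
```

Checking valid_encoding? right after force_encoding is a cheap way to find out whether the input can be trusted.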

I'm not sure. Let me see if I can make an example. My trivial examples
so far don't throw an exception.
 
Brian Candler

Perry said:
I think I'm going to open a bug report -- it might not be a bug but I
sure am confused.

It's not a bug(*), and it sure is confusing. My own attempt to document
Ruby 1.9's encoding rules, which is woefully incomplete but covers about
200 different cases, is at
http://github.com/candlerb/string19/blob/master/string19.rb

What you've observed is described in section 3.3.

Basically, a Regexp which contains only ASCII characters is given an
encoding of US-ASCII regardless of the original string's encoding (this
is different to Strings, which might have an encoding of say UTF-8 but
have the ascii_only? property true if they contain only ASCII
characters).
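
The difference between the two can be sketched as follows, assuming a UTF-8 source file; the sample text is made up:

```ruby
# -*- coding: utf-8 -*-

s = "string"
r = Regexp.new(s)

puts s.encoding      # UTF-8: the String keeps its tag...
puts s.ascii_only?   # true: ...and separately reports that it holds only ASCII
puts r.encoding      # US-ASCII: the Regexp is re-tagged instead

# The US-ASCII tag does not stop the Regexp from matching UTF-8 text.
puts(r =~ "naïve string")   # 6
```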

However there is a hidden "fixed_encoding" property you can set on a
Regexp:
=> true

I say it's a "hidden" property because the flag isn't revealed if you
use inspect or to_s (unlike the //m, //i and //x properties)
=> "(?-mix:string)"
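
The code lines from this message did not survive, so the following is only a sketch of what such a session might look like. It assumes a Ruby build that defines Regexp::FIXEDENCODING (the value 16 discussed later in this thread) and a UTF-8 source file.

```ruby
# -*- coding: utf-8 -*-

plain = Regexp.new("string")
fixed = Regexp.new("string", Regexp::FIXEDENCODING)

puts plain.fixed_encoding?   # false
puts fixed.fixed_encoding?   # true
puts fixed.encoding          # UTF-8 (taken from the source string)

# The flag does not appear in the printed form of the Regexp.
puts fixed.to_s              # (?-mix:string)
```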

HTH,

Brian.

(*) Except in as much as the entire Encoding nonsense in ruby 1.9 is one
enormous bug
 
David Springer

Perry,

In 1.9 there is only one optional parameter.

You can force the encoding of the string parameter (if needed)
AND also pass the options parameter.

Try this:

#!/usr/bin/env ruby

s = "string"
puts s.encoding
r = Regexp.new(s.encode("utf-8"), Regexp::ENC_UTF8)
puts r.encoding

Here is the output:

US-ASCII
UTF-8

-David
 
Perry Smith

Hi Brian and David,

Thanks. I'm doing more experimenting and I'm also looking at the source
code. I need to drag down the latest. I'm looking at 1.9.1 p243 right
now.

Regexp.new has a third optional argument -- it is sorta described in the
Pick Axe book but the code looks wrong. It can be either 'n' or 'xN'
where x can be anything. Perhaps that is gone in the latest code.

But the "fixed encoding" is a key part of the puzzle I was missing.
Also, David, I had not bumped into the ENC_UTF8 constant yet. There are
quite a few constants, and one of them (the 16, also pointed out by David)
is a flag to make the encoding "fixed".

The latest code that David posted answers exactly what my original
question was. Thanks!
 
Brian Candler

Perry said:
But the "fixed encoding" is a key part of the puzzle I was missing.
Also, David, I had not bumped into the ENC_UTF8 constant yet. There are
quite a few constants, and one of them (the 16, also pointed out by David)
is a flag to make the encoding "fixed".

16 is just Regexp::FIXEDENCODING

irb(main):001:0> Regexp::FIXEDENCODING
=> 16

In the 1.9.2 I have here (r24186, 2009-07-18) there is no
Regexp::ENC_UTF8, so it must be relatively new.

irb(main):002:0> Regexp::ENC_UTF8
NameError: uninitialized constant Regexp::ENC_UTF8
from (irb):2
from /usr/local/bin/irb192:12:in `<main>'
irb(main):003:0> Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE, :FIXEDENCODING]

As for the third arg to Regexp.new, I have no idea. Documentation is not
Ruby's strong point at the best of times, but it's nonexistent for the
encoding stuff.
 
David Springer

[Note: parts of this message were removed to make it a legal post.]

My bad.

I was running 1.9.1, which had no FIXEDENCODING.

Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE, :ONCE, :ENC_NONE, :ENC_EUC,
:ENC_SJIS, :ENC_UTF8]

So things have changed since 1.9.1

If you are running 1.9.2 then use FIXEDENCODING and you should be fine.

I THINK that what you are saying with FIXEDENCODING is NOT to revert back
to something like ASCII.
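
A sketch of what the flag changes in practice, assuming a Ruby build that defines Regexp::FIXEDENCODING; the EUC-JP sample bytes are chosen only for illustration:

```ruby
# -*- coding: utf-8 -*-

loose = Regexp.new("a")                          # US-ASCII, not fixed
fixed = Regexp.new("a", Regexp::FIXEDENCODING)   # stays UTF-8

# A valid EUC-JP string that contains a non-ASCII character.
euc = "\xA1\xA2 a".b.force_encoding("EUC-JP")

puts(loose =~ euc)   # 2

error = nil
begin
  fixed =~ euc
rescue Encoding::CompatibilityError => e
  error = e
  puts "fixed regexp refused: #{e.message}"
end
```

The loose Regexp adapts to whatever ASCII-compatible string it is matched against, while the fixed one insists on its own encoding once the string contains non-ASCII characters.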

BTW in 1.9.1
=> 16

 
Bob Hutchison

David said:
My bad.

I was running 1.9.1, which had no FIXEDENCODING.

Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE, :ONCE, :ENC_NONE, :ENC_EUC,
:ENC_SJIS, :ENC_UTF8]

So things have changed since 1.9.1

If you are running 1.9.2 then use FIXEDENCODING and you should be fine.

I THINK that what you are saying with FIXEDENCODING is NOT to revert back
to something like ASCII.

This has been really helpful, but I'm still having difficulties. I'm
running 1.9.1p376 and:

Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE]

But if I use 16 rather than FIXEDENCODING it works as in the examples
in this thread.

Does anyone know what's going on here? I used to have a pretty good
handle on encodings. This Ruby encoding stuff is something I've been
struggling with for 6 months and I think all that I've managed to do
is completely corrupt my understanding of encoding. It's starting to
look like magic. I know that a bunch of things changed between
1.9.1p243 and 1.9.1p376, but, since I think that what I 'know' about
encoding might be completely delusional at this point, I suppose I
don't really know.

Brian your http://github.com/candlerb/string19/blob/master/string19.rb
is something else! I'm laughing with a slightly hysterical edge.

Cheers,
Bob
 
Perry Smith

Bob said:
On 24-Feb-10, at 6:22 PM, David Springer wrote:

Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE]

It might help you to know that your constants are the same as mine. I
don't know how David got his.

Unfortunately, I still have not gotten back to my investigation of this.
Looking at the code in re.c helped me a bit.

Aside from that, I think we are all struggling with this. I'm hoping
that there are a few "bugs" in the code... i.e. Matz has a clear idea of
how things should work but there are just a few mistakes that really
hamper our understanding.

HTH,
Perry
 
Charles Oliver Nutter

One has to laugh or cry. As best I could, I factored out my opinion of
all this into a separate file:
http://github.com/candlerb/string19/raw/47b0cba0a2047eca0612b4e24a540f011cf2cac3/soapbox.rb

This is exactly the situation I worried about when Matz proposed the
"all encodings" view of Ruby 1.9. Even though many applications won't
run into this, any that try to deal with >1 encoding at a time will
have a clusterfuck of a time making sure everything fits together. And
this is to say nothing of the implementation effort required, which
still isn't all there in JRuby (and won't be until 1.6 or later).

I didn't read this whole thread, since there's a lot of "it's a
bug/it's not a bug" exploration, but if there's something we need to
fix in JRuby, please do report it (and try to help fix it, too :)).

- Charlie
 
Brian Candler

Roger said:
re: string1 + string2 + string3 actually working without fear...

One thing that might help would be to set the default encoding, then all
three strings would (might ?) have the same encoding (?)

That depends where the strings came from. If they were returned by a
library function (either Ruby core or 3rd party) you won't know what
encoding they have unless it is documented what the encoding is or how
it is chosen, and it almost never is.
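
To illustrate why the unknown tags matter for string1 + string2, here is a small sketch. The sample strings and the choice of ISO-8859-1 are assumptions for illustration, and a UTF-8 source file is assumed.

```ruby
# -*- coding: utf-8 -*-

utf8   = "café"
latin1 = "caf\xE9".b.force_encoding("ISO-8859-1")   # same word, different bytes
ascii  = "tea".b.force_encoding("ISO-8859-1")       # ASCII-only content

# ASCII-only strings combine with any ASCII-compatible encoding.
puts((utf8 + ascii).encoding)   # UTF-8

# Two non-ASCII strings with different tags do not combine.
puts Encoding.compatible?(utf8, latin1).inspect   # nil

error = nil
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  error = e
  puts "cannot concatenate: #{e.message}"
end

# Transcoding to a common encoding first makes the operation safe.
joined = utf8 + latin1.encode("UTF-8")
puts joined   # cafécafé
```

Whether concatenation works therefore depends on the content of each string as well as its tag, which is why code can pass tests with ASCII data and fail later on real input.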

Equally, if you are writing a library for use by other people, then you
really should not touch global state such as Encoding.default_external.
So you are left with Ruby guessing encodings and forcing them if it
guesses wrongly, e.g.

$ ruby19 -e 'puts %x{cat /bin/sh}.encoding'
UTF-8

Of course, if you're saying that your application handles all strings in
the same encoding, then this whole business of tagging every
*individual* string object with its own encoding is a waste of time and
effort, and is just something which you have to fight against.

But we're flogging a dead horse here. I hate this stuff; other people
seem to love it.
 
Caleb Clausen

Brian said:
But we're flogging a dead horse here. I hate this stuff; other people
seem to love it.

Having wrestled with these issues a little bit myself, I think your
criticisms are cogent. Unlike you, tho, I'd rather not drop the whole
string encoding feature in 1.9. (Any solution to the rather ugly
problem of string encodings is going to have some problems. Ruby's got
a different (and more complicated) approach to it than other
languages... but if the remaining wrinkles can be smoothed out, it
will be a better solution overall.)

I wish someone would take the inconsistencies you've found and
criticisms you've made to heart and find some kind of way to address
them.

One thing that might help is a variant of the Rope class Intransition
was wishing for just recently. There's no reason that the individual
String segments of a Rope couldn't each have different encodings....
this would help with the catenation of Strings with different
encodings, for instance. It gets complicated, tho. How do you do a
Regexp match against a multi-encoded Rope? (It's hard and/or tricky,
but I think can be done.) I've suggested this on ruby-core before, but
no-one wants this in the interpreter itself... probably appropriately.
 
