String.strip with UTF-8

Erik E.

Hi

I can't strip the leading whitespace (or what at least looks like
whitespace) from a Ruby 1.9.2 string


ruby-1.9.2-p0 :002 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p0 :003 > d.entity.strip
=> " United Arab Emirates"
ruby-1.9.2-p0 :004 > d.entity.class
=> String
ruby-1.9.2-p0 :005 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p0 :006 >

It's inside the Rails 3.0.3 console..

Erik

--
Posted via http://www.ruby-forum.com/.
 
David Masover

Hi

I can't strip the leading whitespace (or what at least looks like
whitespace) from a Ruby 1.9.2 string


ruby-1.9.2-p0 :002 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p0 :003 > d.entity.strip
=> " United Arab Emirates"
ruby-1.9.2-p0 :004 > d.entity.class
=> String
ruby-1.9.2-p0 :005 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p0 :006 >

It's inside the Rails 3.0.3 console..

Try this:

d.entity[0].ord

I'm not sure how useful that will be, but you can compare it to that of a
space. It _seems_ to be unicode-aware:

ruby-1.9.2-p136 :020 > '☃'.ord
=> 9731
ruby-1.9.2-p136 :021 > _.to_s 16
=> "2603"
ruby-1.9.2-p136 :022 > "\u2603"
=> "☃"

And for good measure:

ruby-1.9.2-p136 :023 > _.ord
=> 9731

(If you're wondering, that underscore means "The result of the last command I
entered into IRB." It's fantastically useful, though it gets annoying when you
want to repeat commands using up arrow, etc.)

So, if you get something other than:

ruby-1.9.2-p136 :024 > ' '.ord
=> 32

...then it's not a space. At that point, maybe report a bug, but maybe you'll
also be able to work around it with a regex or something.
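
A minimal sketch of the check suggested above, runnable outside the Rails console. The `describe_char` helper and the `sample` string are made up for illustration; they are not from the original session.

```ruby
# Describe a single character by its Unicode code point.
def describe_char(ch)
  format("U+%04X (decimal %d)", ch.ord, ch.ord)
end

# Hypothetical value that starts with something that only looks like a space.
sample = "\u00A0United Arab Emirates"

puts describe_char(sample[0])  # the invisible first character
puts describe_char(" ")        # an ordinary ASCII space, for comparison
```

If the two lines printed differ, the leading character is not an ASCII space, and `strip` should not be expected to remove it.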
 
Peter Vandenabeele

Erik E. wrote in post #974416:
Hi

I can't strip the leading whitespace (or what at least looks like
whitespace) from a Ruby 1.9.2 string


ruby-1.9.2-p0 :002 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p0 :003 > d.entity.strip
=> " United Arab Emirates"
ruby-1.9.2-p0 :004 > d.entity.class
=> String
ruby-1.9.2-p0 :005 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p0 :006 >

It's inside the Rails 3.0.3 console..

Erik

Hi, I made a fresh install with rvm 1.9.2-p0 and rails 3.0.3
and I cannot reproduce your problem. Maybe you could try to
replay what I did and see if you can still reproduce it ?

Also, to examine that first character in detail, what is the
result when you try this:

009:0> d.entity.bytes.to_a[0..5]
=> [32, 85, 110, 105, 116, 101]

I see a "regular" space (character 32 in decimal notation)
as first character.

HTH,

Peter


peterv@ASUS:~/ra/apps/trials$ rvm install 1.9.2-p0
/home/peterv/.rvm/rubies/ruby-1.9.2-p0, this may take a while depending
on your cpu(s)...

ruby-1.9.2-p0 - #fetching
...
Install of ruby-1.9.2-p0 - #complete

peterv@ASUS:~/ra/apps/trials$ rvm use 1.9.2-p0
Using /home/peterv/.rvm/gems/ruby-1.9.2-p0

peterv@ASUS:~/ra/apps/trials$ rvm gemset create rails3
'rails3' gemset created (/home/peterv/.rvm/gems/ruby-1.9.2-p0@rails3).

peterv@ASUS:~/ra/apps/trials$ rvm gemset use rails3
Now using gemset 'rails3'

peterv@ASUS:~/ra/apps/trials$ gem install rails --no-rdoc --no-ri
Successfully installed activesupport-3.0.3
Successfully installed builder-2.1.2
Successfully installed i18n-0.5.0
Successfully installed activemodel-3.0.3
Successfully installed rack-1.2.1
Successfully installed rack-test-0.5.7
Successfully installed rack-mount-0.6.13
Successfully installed tzinfo-0.3.23
Successfully installed abstract-1.0.0
Successfully installed erubis-2.6.6
Successfully installed actionpack-3.0.3
Successfully installed arel-2.0.6
Successfully installed activerecord-3.0.3
Successfully installed activeresource-3.0.3
Successfully installed mime-types-1.16
Successfully installed polyglot-0.3.1
Successfully installed treetop-1.4.9
Successfully installed mail-2.2.14
Successfully installed actionmailer-3.0.3
Successfully installed thor-0.14.6
Successfully installed railties-3.0.3
Successfully installed bundler-1.0.7
Successfully installed rails-3.0.3
23 gems installed

peterv@ASUS:~/ra/apps/trials$ rails new issue_with_strip
create
...
create vendor/plugins/.gitkeep
peterv@ASUS:~/ra/apps/trials$ cd issue_with_strip/
peterv@ASUS:~/ra/apps/trials/issue_with_strip$ bundle install
Fetching source index for http://rubygems.org/
Using rake (0.8.7)
...
Using rails (3.0.3)
Installing sqlite3-ruby (1.3.2) with native extensions
Your bundle is complete! Use `bundle show [gemname]` to see where a
bundled gem is installed.

peterv@ASUS:~/ra/apps/trials/issue_with_strip$ rails g model D entity:string
invoke active_record
create db/migrate/20110112222955_create_ds.rb
create app/models/d.rb
invoke test_unit
create test/unit/d_test.rb
create test/fixtures/ds.yml

peterv@ASUS:~/ra/apps/trials/issue_with_strip$ rake db:migrate
(in /home/peterv/data/back/rails-apps/apps/trials/issue_with_strip)
== CreateDs: migrating =======================================================
-- create_table(:ds)
-> 0.0010s
== CreateDs: migrated (0.0011s) ==============================================

US:~/ra/apps/trials/issue_with_strip$ rails c
Loading development environment (Rails 3.0.3)
001:0> IRB.prompt_mode=:RVM # this is a local patch
=> :RVM
ruby-1.9.2-p0 :002 > d = D.create :entity => " United Arab Emirates"
=> #<D id: 1, entity: " United Arab Emirates", created_at: "2011-01-12
22:31:21", updated_at: "2011-01-12 22:31:21">
ruby-1.9.2-p0 :003 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p0 :004 > d.entity.strip
=> "United Arab Emirates"
ruby-1.9.2-p0 :005 > d.entity.class
=> String
ruby-1.9.2-p0 :006 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p0 :007 > exit

peterv@ASUS:~/ra/apps/trials/issue_with_strip$ rails c
Loading development environment (Rails 3.0.3)
001:0> d = D.find :last
=> #<D id: 1, entity: " United Arab Emirates", created_at: "2011-01-12
22:31:21", updated_at: "2011-01-12 22:31:21">
002:0> d.entity
=> " United Arab Emirates"
003:0> d.entity.strip
=> "United Arab Emirates"

 
Erik E.

Thank you for the quick replies, David & Peter. I was upgrading Ruby to see if
it made a difference, but I can now see that it's not a space, which explains
why it didn't strip:

Loading development environment (Rails 3.0.3)
ruby-1.9.2-p136 :001 > d = Domain.last
=> #<Domain id: 2055, classification: "Internationalized Country Code
Top Level Domain", dns_name: "xn--mgbaam7a8h", idn_name: "امارات.",
entity: " United Arab Emirates", explanation: "imārāt", notes: nil,
related_id: 1795, idn: true, dnssec: false, created_at: "2011-01-12
19:04:54", updated_at: "2011-01-12 19:04:54">
ruby-1.9.2-p136 :002 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p136 :003 > d.entity.class
=> String
ruby-1.9.2-p136 :004 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p136 :005 > d.entity[0].ord
=> 160
ruby-1.9.2-p136 :006 > d.entity.bytes.to_a
=> [194, 160, 85, 110, 105, 116, 101, 100, 32, 65, 114, 97, 98, 32, 69,
109, 105, 114, 97, 116, 101, 115]
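
For what it's worth, the two results above agree with each other: bytes 194 and 160 (0xC2 0xA0) are the UTF-8 encoding of code point 160, U+00A0 NO-BREAK SPACE. A small sketch that checks this, using a hand-built string rather than the actual database record (that substitution is an assumption):

```ruby
# Rebuild the first character from the two leading bytes reported above.
first_char = [194, 160].pack("C*").force_encoding("UTF-8")

p first_char.valid_encoding?  # do the two bytes form one valid UTF-8 character?
p first_char.ord              # code point of that character
p first_char == "\u00A0"      # compare against NO-BREAK SPACE

# String#strip only removes ASCII whitespace and null, so the
# no-break space is left in place.
entity = first_char + "United Arab Emirates"
p entity.strip == entity
```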



 
Jonas Pfenniger (zimbatm)

2011/1/12 Erik E. said:
Thank you for quick reply David & Peter, I was upgrading Ruby to see if
it made a difference, but I can see it's not a space now which explains
why it didn't strip

Yeah, it's the dreaded non-breaking space [1]. Unfortunately, somebody
thought it would be nice to map Alt+Space to this character on some
keymaps (like mine, which is Swiss-French). If you're on a Mac, see my
solution here:
http://0x2a.im/2009/04/16/terminal-unicode-problem-2.html




[1]: https://secure.wikimedia.org/wikipedia/en/wiki/Non-breaking_space
 
Erik E.

Cool, thanks for that! I can just gsub/gsub! it out now that I know what
it is.
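
A sketch of that gsub approach, assuming U+00A0 is the only odd character in the data. The `strip_nbsp` name and the sample strings are hypothetical. It turns each no-break space into an ordinary space and then strips, so interior ones become plain spaces instead of being deleted:

```ruby
# Replace every NO-BREAK SPACE (U+00A0) with an ASCII space, then strip.
def strip_nbsp(str)
  str.gsub("\u00A0", " ").strip
end

p strip_nbsp("\u00A0United Arab Emirates")
p strip_nbsp("United\u00A0Arab Emirates\u00A0")
```

Both calls should print the same cleaned-up string. `gsub!` would do the replacement in place instead, but it returns nil when nothing was replaced, so chaining `.strip` onto it is not safe.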

 
Eric Hodel

Cool, thanks for that! I can just gsub/gsub! it out now that I know what
it is.

That will work if NO-BREAK SPACE is the only space you'll encounter.

s.gsub(/\A[[:space:]]*(.*?)[[:space:]]*\z/) { $1 }

will remove:
Space_Separator | Line_Separator | Paragraph_Separator | 0009 | 000A | 000B | 000C | 000D | 0085

See section 6 of http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

PS: Note that s.gsub(/…(…)…/, '\1') may alter the encoding of the result string.
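
A hedged sketch of a reusable helper along the same lines. The `unicode_strip` name is made up here, and this is a variation on the regex above rather than the same expression: it trims only the two ends with anchored alternatives instead of capturing the middle, so it also copes with strings that contain newlines.

```ruby
# Strip leading and trailing Unicode whitespace, including NO-BREAK SPACE.
# [[:space:]] is Unicode-aware when the string is UTF-8.
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

# Hypothetical input: NBSP and a tab in front, EM SPACE and a newline behind.
padded = "\u00A0\tUnited Arab Emirates\u2003\n"
stripped = unicode_strip(padded)

p stripped
p stripped.encoding
```

For this sample the block form is not needed, and the result keeps the encoding of the receiver.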
 
