purging non ascii chars

  • Thread starter Rajarshi Chakravarty

Rajarshi Chakravarty

Hi,
I read records from a text file and insert them in the DB.
Sometimes the data contains non ascii characters and I want to keep
these out of the DB.
How can I cleanse them and where?
I mean should it be done while reading data or has ActiveRecord got any
feature to do it?
 
Ammar Ali

What do you exactly mean by "non ascii"? Do you mean extended ascii
(aka high ascii), printable ascii, or unicode?

Without knowing details, I would suggest a regular expression like:

text.gsub /[^[:ascii:]]/, ''

Or, if you're using a Ruby older than 1.9 or want cross-version compatibility:

text.gsub /[^\x00-\x7F]/, ''

Note that the class [:ascii:] and the range in the second regular
expression include all valid ASCII characters, which include the
control characters such as \r (0x0D) and \n (0x0A). If you only want the
alphabet, newlines, and punctuation, then you need to exclude the
control characters and try something like:

text.gsub /[^\x20-\x7F\x0D\x0A]/, ''
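
For comparison, here is a small sketch of how the three patterns above behave on the same input. The sample string is made up for illustration, and the POSIX bracket form assumes Ruby 1.9 or later.

```ruby
# Made-up sample: an accented letter, a tab, a bell character, then CRLF.
text = "caf\u00E9\tbar\u0007\r\n"

# 1. POSIX bracket class (Ruby 1.9+): drops only the non-ASCII character.
ascii_only = text.gsub(/[^[:ascii:]]/, '')
# => "caf\tbar\a\r\n"

# 2. Explicit range: same result for this input.
range_only = text.gsub(/[^\x00-\x7F]/, '')

# 3. Printable range plus CR and LF: also drops the tab and the bell.
#    Note that \x7F (DEL) is still kept by this range.
printable = text.gsub(/[^\x20-\x7F\x0D\x0A]/, '')
# => "cafbar\r\n"
```

The third pattern removes tabs as well, so add \x09 to the class if tab-separated records need to survive.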

HTH,
Ammar
 
Caleb Clausen

Hmm, actually it should be gsub! rather than gsub in Ammar's examples, unless
you assign the result: gsub returns a new string and leaves text unchanged.
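
A minimal sketch of the gsub versus gsub! difference, using a made-up sample string:

```ruby
text = "na\u00EFve".dup   # made-up sample; dup so the string is mutable

# gsub returns a cleaned copy and leaves the receiver alone.
copy = text.gsub(/[^\x00-\x7F]/, '')
after_gsub = text.dup     # still contains the non-ASCII character

# gsub! modifies the receiver in place.
text.gsub!(/[^\x00-\x7F]/, '')

# Caveat: gsub! returns nil when nothing was replaced,
# so its return value should not be chained or stored blindly.
result = "plain".dup.gsub!(/[^\x00-\x7F]/, '')
```

Because of the nil return, assigning the result of plain gsub is often the safer choice when the cleaned value is passed on to something else, such as a model attribute.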

Ammar's answer is a good first approximation and may be all you need;
however, it is not universally correct. It's better to find out what
the input's encoding is, and then:
(in 1.8) transcode to UTF-8 or something similar before stripping out the
non-ASCII chars
(in 1.9) set the encoding of the input correctly to make Ammar's
first example work for you
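
One way to combine both suggestions on Ruby 1.9 or later is sketched below: declare the input's real encoding, transcode to UTF-8, then strip. The helper name strip_non_ascii and the sample data are hypothetical, and the source encoding is assumed to be known by the caller.

```ruby
# Hypothetical helper: source_encoding must be supplied by the caller.
def strip_non_ascii(raw, source_encoding)
  utf8 = raw.dup
            .force_encoding(source_encoding)
            .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  # After transcoding, every non-ASCII character is matched as a whole.
  utf8.gsub(/[^\x00-\x7F]/, "")
end

# Made-up sample: raw ISO-8859-1 bytes, as they might come from a file.
latin1 = "caf\xE9 au lait".b
cleaned = strip_non_ascii(latin1, "ISO-8859-1")
# => "caf au lait"
```

The invalid: and undef: options drop byte sequences that cannot be converted instead of raising an error, which may or may not be what you want for an import job.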

This line:
text.gsub! /[^\x00-\x7F]/, ''
will be just fine if the input is known to be UTF-8 or some other
well-behaved encoding. (The EUC family of encodings, for example, is
also well-behaved.) But it will fail and leave some garbage in your
strings if the encoding is Shift_JIS or Big5, because those encodings
reuse bytes below 0x80 as trailing bytes of some multi-byte characters.
(Well-behaved in this case means that the encoding is a superset of
ASCII and encodes non-ASCII characters entirely with bytes which are
not allowed in 7-bit ASCII text, that is, bytes >= 0x80.)
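
The garbage problem can be reproduced with a short sketch. The sample string is made up; the byte-by-byte filter only approximates what a byte-oriented regular expression would do on undeclared Shift_JIS input.

```ruby
# "a", katakana SO (U+30BD), "b". In Shift_JIS, SO is the two bytes 0x83 0x5C.
sjis = "a\u30BDb".encode("Shift_JIS")
raw_bytes = sjis.bytes
# => [0x61, 0x83, 0x5C, 0x62]

# Byte-by-byte stripping drops 0x83 but keeps the trailing 0x5C,
# which is a backslash in ASCII: that is the leftover garbage.
byte_stripped = raw_bytes.reject { |b| b >= 0x80 }.pack("C*")
# => "a\\b"

# Encoding-aware stripping removes the whole character.
clean = sjis.encode("UTF-8").gsub(/[^\x00-\x7F]/, "")
# => "ab"
```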
 
Rajarshi Chakravarty

Thank you Ammar for the code and Caleb for the encoding explanation.
The special char that I wanted to remove was
text.gsub /[^\x20-\x7F\x0D\x0A]/, '' did the job :)
 
Ammar Ali

Thanks for the corrections Caleb. I missed those possible side effects.

Cheers,
Ammar
 
