purging non ascii chars

Rajarshi Chakravarty · Dec 4, 2010

Hi,
I read records from a text file and insert them in the DB.
Sometimes the data contains non ascii characters and I want to keep
these out of the DB.
How can I cleanse them and where?
I mean should it be done while reading data or has ActiveRecord got any
feature to do it?

Ammar Ali · Dec 4, 2010

What exactly do you mean by "non ascii"? Do you mean extended ascii
(aka high ascii), anything outside printable ascii, or unicode?

Without knowing details, I would suggest a regular expression like:

text.gsub /[^[:ascii:]]/, ''

Or, if you're using a Ruby older than 1.9 or want cross-version compatibility:

text.gsub /[^\x00-\x7F]/, ''

Note that the class [:ascii:] and the range in the second regular
expression include all valid ascii characters, which includes the
control characters such as \r (0x0D), \n (0x0A), etc. If you only want the
alphabet, digits, newlines, and punctuation, then you need to exclude the
other control characters and try something like:

text.gsub /[^\x20-\x7F\x0D\x0A]/, ''

HTH,
Ammar
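
To make the difference between the patterns above concrete, here is a minimal sketch run against a made-up UTF-8 sample string (the sample and the variable names are assumptions for illustration, not from the thread). It also shows a stricter variant stopping at \x7E, since \x7F is DEL and is itself a control character; whether DEL should be dropped depends on what you want to keep.

```ruby
# Made-up sample: "naïve", a tab, "café", a DEL character, then CRLF.
text = "na\u00EFve\tcaf\u00E9\u007F\r\n"

# Keep every 7-bit ASCII character, control characters included.
all_ascii = text.gsub(/[^\x00-\x7F]/, '')

# Same idea with the POSIX bracket class (Ruby 1.9 and later).
posix_ascii = text.gsub(/[^[:ascii:]]/, '')

# Keep space through DEL plus CR and LF (the third pattern above).
printable = text.gsub(/[^\x20-\x7F\x0D\x0A]/, '')

# Stricter variant (an assumption about intent): stop at "~" so DEL goes too.
strict = text.gsub(/[^\x20-\x7E\x0D\x0A]/, '')
```

With this sample, the first two results still contain the tab and the DEL, the third drops the tab but keeps the DEL, and the strict variant keeps only the letters and the CRLF.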

Caleb Clausen · Dec 5, 2010

Hmm, actually it should be gsub! rather than gsub in Ammar's examples,
since gsub returns a new string and leaves text itself unchanged.
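
A small sketch of that difference, using a made-up string (variable names are illustrative only). One caveat worth knowing: gsub! returns nil when nothing was replaced, so assigning or chaining on its return value can bite you.

```ruby
text = "r\u00E9sum\u00E9".dup             # "résumé", dup'ed so it is mutable

cleaned = text.gsub(/[^\x00-\x7F]/, '')   # returns a new string
before  = text.dup                        # text is still the original here

text.gsub!(/[^\x00-\x7F]/, '')            # modifies text in place

# gsub! returns nil when there was nothing to replace.
no_change = "plain".dup.gsub!(/[^\x00-\x7F]/, '')
```

So either write `text = text.gsub(...)` or call `text.gsub!(...)` on its own line and ignore the return value.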

Ammar's answer is a good first approximation and may be all you need;
however, it is not universally correct. It's better to find out what
the input's encoding is, and then:
 (in 1.8) transcode to utf8 or something before stripping out the
 non-ascii chars
 (in 1.9) set the encoding of the input correctly to make Ammar's
 first example work for you
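
For the 1.9 route, a rough sketch might look like the following. It assumes the input is ISO-8859-1 (Latin-1), which is only an example; you would substitute whatever encoding your file really uses. The bytes are built in memory here so the snippet is self-contained, and String#b needs Ruby 2.0 or later.

```ruby
# Bytes as they might come out of a Latin-1 file: 0xE9 is "é" in ISO-8859-1.
raw = "caf\xE9 au lait".b

# Tell Ruby what the bytes really are, then transcode to UTF-8.
text = raw.force_encoding('ISO-8859-1').encode('UTF-8')

# With the encoding set correctly, the character-class pattern removes
# whole characters rather than individual bytes.
ascii_only = text.gsub(/[^[:ascii:]]/, '')
```

When reading from a file you can get the same effect at open time with a mode string such as 'r:ISO-8859-1:UTF-8' (external encoding, then internal encoding), so every line you read is already UTF-8.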

This line:
text.gsub! /[^\x00-\x7F]/, ''
will be just fine if the input is known to be utf8 or some other
well-behaved encoding. (The euc family of encodings, for example, are
also well-behaved.) But it will fail and leave some garbage in your
strings if the encoding is sjis or big5. (Well-behaved in this case
means that the encoding is a superset of ascii and encodes non-ascii
characters entirely with bytes which are not allowed in ascii-7 text.
That is, bytes >=0x80.)
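
To illustrate the sjis problem, here is a sketch using the katakana letter "ア", which Shift_JIS encodes as the two bytes 0x83 0x41. The second byte is the same as ascii "A", so a byte-wise strip leaves it behind. The sample data and variable names are made up for the example, and String#b assumes Ruby 2.0 or later.

```ruby
# "ア" in Shift_JIS (0x83 0x41) followed by plain "abc", as raw bytes.
sjis_bytes = "\x83\x41abc".b

# Byte-wise strip: removes 0x83 but keeps the orphaned trail byte 0x41 ("A").
naive = sjis_bytes.gsub(/[^\x00-\x7F]/, '')

# Encoding-aware: label the bytes as Shift_JIS, transcode to UTF-8, then strip.
utf8  = sjis_bytes.dup.force_encoding('Shift_JIS').encode('UTF-8')
clean = utf8.gsub(/[^\x00-\x7F]/, '')
```

Here naive ends up with a stray "A" in front of "abc", while clean contains only "abc".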

Rajarshi Chakravarty · Dec 6, 2010

Thank you Ammar for the code and Caleb for the encoding explanation.
The special char that I wanted to remove was
text.gsub /[^\x20-\x7F\x0D\x0A]/, '' did the job

Ammar Ali · Dec 6, 2010

Thanks for the corrections Caleb. I missed those possible side effects.

Cheers,
Ammar
