Hi,
I read records from a text file and insert them in the DB.
Sometimes the data contains non-ASCII characters and I want to keep
these out of the DB.
How can I cleanse them and where?
I mean should it be done while reading data or has ActiveRecord got any
feature to do it?
What exactly do you mean by "non ascii"? Do you mean extended ascii
(aka high ascii), printable ascii, or unicode?
Without knowing details, I would suggest a regular expression like:
  text.gsub /[^[:ascii:]]/, ''
Or if you're using a ruby older than 1.9 or want cross-version
compatibility:
  text.gsub /[^\x00-\x7F]/, ''
Note that the class [:ascii:] and the range in the second regular
expression include all valid ascii characters, which include the
control characters and \r (0x0D), \n (0x0A), etc. If you only want the
alphabet, newlines, and punctuation, then you need to exclude the
control characters and try something like:
  text.gsub /[^\x20-\x7E\x0D\x0A]/, ''
Hmm, actually it should be gsub! rather than gsub here.
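To make the gsub / gsub! difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the printable-only pattern stops at \x7E so that DEL (0x7F) is dropped along with the other control characters.

```ruby
# Hypothetical sample record: "\u00E9" is e-acute, "\t" is a control character.
line = "caf\u00E9\tau lait\r\n"

# gsub returns a cleaned copy and leaves the receiver untouched.
ascii_only = line.gsub(/[^\x00-\x7F]/, '')

# Printable ASCII plus CR/LF only: the tab goes away as well.
printable = line.gsub(/[^\x20-\x7E\r\n]/, '')

# gsub! edits the string in place. It returns nil when nothing was
# replaced, so do not chain further calls on its return value.
copy = line.dup
copy.gsub!(/[^\x00-\x7F]/, '')
second_pass = copy.gsub!(/[^\x00-\x7F]/, '')
```

If you want the cleaned value back as the result of an expression, gsub is the safer choice; use gsub! when you only care about the side effect.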
Ammar's answer is a good first approximation and may be all you need.
However, it is not universally correct. It's better to find out what
the input's encoding is, and then:
  (in 1.8) transcode to utf8 or something before stripping out the
non-ascii chars
  (in 1.9) set the encoding of the input correctly to make Ammar's
first example work for you
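For the 1.9 route, a rough sketch might look like the following. It assumes the file is ISO-8859-1, which is only an example; substitute whatever encoding your input really uses. The sample bytes and variable names are hypothetical.

```ruby
# Bytes as they might come off disk: "cafe" with e-acute (0xE9) in
# ISO-8859-1. The encoding is an assumption made for this example.
raw = "caf\xE9 au lait".dup

# Tell Ruby what the bytes really are. force_encoding only changes the
# encoding tag; the bytes themselves stay the same.
latin1 = raw.force_encoding(Encoding::ISO_8859_1)

# With the right tag in place, the character-class pattern works.
cleaned = latin1.gsub(/[^[:ascii:]]/, '')

# Transcoding is also possible once the tag is right.
as_utf8 = latin1.encode("UTF-8")

# Alternative without a regexp: transcode to US-ASCII and drop whatever
# has no ASCII equivalent.
dropped = latin1.encode("US-ASCII", :invalid => :replace,
                                    :undef   => :replace,
                                    :replace => "")
```

When reading from a file you can usually skip force_encoding by declaring the encoding at open time, e.g. File.open(path, "r:ISO-8859-1"), where path is your input file.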
This line:
  text.gsub! /[^\x00-\x7F]/, ''
will be just fine if the input is known to be utf8 or some other
well-behaved encoding. (The euc family of encodings, for example, are
also well-behaved.) But it will fail and leave some garbage in your
strings if the encoding is sjis or big5. (Well-behaved in this case
means that the encoding is a superset of ascii and encodes non-ascii
characters entirely with bytes which are not allowed in ascii-7 text.
That is, bytes >= 0x80.)
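Here is a small sketch of the sjis problem. Katakana "a" (U+30A2) is the two bytes 0x83 0x41 in Shift_JIS, and the second byte is also ASCII "A". The surrounding "ab"/"cd" text and the variable names are made up for the example.

```ruby
# "ab", then katakana "a" in Shift_JIS (0x83 0x41), then "cd".
sjis = "ab\x83\x41cd".dup.force_encoding(Encoding::Shift_JIS)

# Byte-wise stripping, which is how 1.8 saw strings: only the lead byte
# 0x83 is dropped, and the trail byte 0x41 survives as a stray "A".
bytewise = sjis.bytes.select { |b| b < 0x80 }.pack("C*")

# Character-wise stripping with the encoding set correctly (1.9 and
# later): the whole two-byte character is removed.
charwise = sjis.gsub(/[^\x00-\x7F]/, '')
```

With utf8 or euc input the byte-wise result would have matched the character-wise one, since every byte of a multibyte character there is >= 0x80.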