How to clean an XML file of non-UTF-8 chars?

Krzysieq

[Note: parts of this message were removed to make it a legal post.]

Hi,

I have a problem. I'm trying to parse, with Ruby, some test results from JMeter that are stored in XML files. Unfortunately, while they should be UTF-8, some of them aren't, probably because some DB data isn't. In any case, this makes other toys break down, like XSLT transformation and anything else that relies on the XML files being UTF-8.

Does anyone know how to get rid of such characters? When opened in an editor like Kate, they show up as a white question mark in a black square. I don't really care much about the data; if it's missing some chars, nobody will care. The point is not to destroy the XML structure, so that the other tools can keep working. Any help will be greatly appreciated.

Cheers,
Chris
 
Rob Biedenharn

If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')
--


You can have bytes in that range inside a well-formed multi-byte UTF-8 sequence; they just can't represent a character on their own. It's just not that simple.
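
A quick demonstration of that point: a perfectly valid UTF-8 character is built from bytes in the 0x80-0xFF range, so blanket substitution of those bytes mangles legitimate text too.

```ruby
e_acute = "é"  # U+00E9, a single character encoded as two bytes in UTF-8
p e_acute.bytes.map { |b| format("%02x", b) }  # => ["c3", "a9"]
```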

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
James Gray

I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they
should be utf-8, some of them aren't. Probably because some db data
isn't. In any case, this makes other toys break down, like xslt
transformation and
anything else that relies on the xml files being utf-8.

Does anyone know, how to get rid of such characters?

If you can figure out the encoding they are actually in, I recommend
using Iconv's transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)

James Edward Gray II
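
(Iconv was removed from Ruby's standard library in 1.9, so on current Rubies the same idea is usually written with String#encode. A minimal sketch, assuming the data really is in the encoding you name as the source; ISO-8859-1 here is just an example:

```ruby
# Modern rough equivalent of Iconv.conv("UTF-8", old_encoding_name, data):
# declare what the bytes actually are, then re-encode them as UTF-8.
data = "Gr\xFC\xDFe".dup.force_encoding("ISO-8859-1")  # "Grüße" in Latin-1 bytes
utf8 = data.encode("UTF-8")
```
)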
 
Krzysieq

Hey,

Thanks for the input. So do you have another suggestion?

Cheers,
Chris

2008/9/17 Rob Biedenharn said:
str.gsub(/[\x80-\xff]/,'?')


You can have bytes in that range as the first byte of a well-formed UTF-8
Byte Sequence. They just can't represent a single byte. It's just not that
simple.

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
Mark Thomas

Hi,

I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.

Look at http://www.botvector.net/2007/11/encoding-problems.html, particularly the "iconvert" method, which attempts conversion to UTF-8 but, where a string cannot be converted (e.g. double-byte chars), replaces the offending chars with "?".

-- Mark.
 
Brian Candler

Rob said:
If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')
--


You can have bytes in that range as the first byte of a well-formed
UTF-8 Byte Sequence. They just can't represent a single byte. It's
just not that simple.

That's why I said "if you really don't care" ... it strips all valid non-ASCII UTF-8 as well as the invalid bytes.

There is a nice table at http://en.wikipedia.org/wiki/UTF-8 which would
let you build something more accurate. Ruby quiz perhaps? :)
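
Taking up that challenge half-seriously: a rough byte-level cleaner built from the ranges in that table. The method name and replacement choice are mine, and it skips the overlong-sequence and surrogate edge cases, so treat it as a sketch:

```ruby
# Replace any byte that is not part of a structurally valid UTF-8
# sequence with '?', using the lead/continuation byte ranges from the
# UTF-8 table (simplified: overlong and surrogate checks are omitted).
def utf8_scrub(str, replacement = "?")
  bytes = str.bytes
  out   = []
  i     = 0
  while i < bytes.length
    len = case bytes[i]
          when 0x00..0x7F then 1    # ASCII
          when 0xC2..0xDF then 2    # 2-byte lead
          when 0xE0..0xEF then 3    # 3-byte lead
          when 0xF0..0xF4 then 4    # 4-byte lead
          end                       # else nil: stray/invalid byte
    tail = bytes[i + 1, (len || 1) - 1] || []
    if len && tail.length == len - 1 && tail.all? { |b| (0x80..0xBF).cover?(b) }
      out.concat(bytes[i, len])     # keep the whole valid sequence
      i += len
    else
      out << replacement.ord        # replace the offending byte
      i += 1
    end
  end
  out.pack("C*").force_encoding("UTF-8")
end
```

Unlike the bare gsub, this keeps well-formed multi-byte characters instead of blanking everything above 0x7F.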
 
Jeremy Hinegardner

If you can figure out the encoding they are actually in, I recommend using
Iconv's transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)

This is the approach we have taken in some of our code; basically we wanted to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used that mode before.

require 'iconv'
require 'stringio'

module UTF8
  module Cleanable
    #
    # Converts the string representation of this class to a UTF-8-clean
    # string. This assumes that #to_s on the object will result in a UTF-8
    # string. All chars that are not valid UTF-8 char sequences will be
    # silently dropped.
    #
    def utf8_clean
      Iconv.open("UTF-8", "UTF-8") do |iconv|
        output  = StringIO.new
        working = self.to_s
        loop do
          begin
            output.print iconv.iconv(working)
            break
          rescue Iconv::IllegalSequence => is
            output.print is.success
            working = is.failed[1..-1]
          end
        end
        return output.string
      end
    end
  end
end

class String
  include UTF8::Cleanable
end

enjoy,

-jeremy
 
Gregory Brown

This is the approach we have taken in some of our code; basically we wanted to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used that mode before.

module UTF8
module Cleanable
#
# Converts the string representation of this class to a utf8 clean
# string. This assumes that #to_s on the object will result in a utf8
# string. All chars that are not valid utf8 char sequences will be
# silently dropped.

To silently drop chars with Iconv, you'd want to do:

Iconv.conv("UTF-8//IGNORE", old_encoding_name, data)

TRANSLIT just works a little harder and tries to convert your
characters into a series of UTF-8 chars if possible.
I'm not sure if it drops chars that can't be transliterated...

-greg
 
James Gray

This is the approach we have taken in some of our code; basically we wanted to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used that mode before.


//TRANSLIT is better than that. It tries to translate the characters. Thus a UTF-8 ellipsis would become three periods if converted to ISO-8859-1 with //TRANSLIT.

You can mimic -c, though: just use //IGNORE instead of //TRANSLIT. You can even do //TRANSLIT//IGNORE, which transliterates what it can and discards the rest.

James Edward Gray II
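
(For readers on Ruby 1.9+, where Iconv is no longer in the stdlib, String#encode can approximate both modes. A sketch, with the replacement strings chosen by me; encode has no real transliteration table, though its :fallback option can supply one:

```ruby
utf8 = "na\u00EFve\u2026"  # "naïve…" with a genuine UTF-8 ellipsis

# //TRANSLIT-ish: substitute a printable stand-in for untranslatable chars
translit = utf8.encode("ISO-8859-1", undef: :replace, replace: "...")

# //IGNORE-ish: silently drop whatever Latin-1 cannot represent
ignored = utf8.encode("ISO-8859-1", undef: :replace, replace: "")
```
)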
 
Krzysieq

Unfortunately, there's no way of telling the original encoding. I would rather go for some method of removing or substituting the chars that don't belong there, but the method first suggested by Brian doesn't seem to work for some reason. Does anyone have another option? I'm investigating the reasons for the failure and will write more when I know something. Thanks for all the help anyway :)

Cheers,
Chris
 
Gregory Brown

Unfortunately, there's no way of telling the original encoding. I would rather go for some method of removing or substituting the chars that don't belong there, but the method first suggested by Brian doesn't seem to work for some reason. Does anyone have another option? I'm investigating the reasons for the failure and will write more when I know something. Thanks for all the help anyway :)

If there is no way of telling the original encoding, the input data
may not have valid unicode in it at all, right?

-greg
 
Mark Thomas

Unfortunately, there's no way of telling the original encoding. I would rather go for some method of removing or substituting the chars that don't belong there, but the method first suggested by Brian doesn't seem to work for some reason. Does anyone have another option?

Try the iconv solutions with Latin-1 (ISO-8859-1) as the From encoding. That's as close as you can get to a one-byte "anything-goes" encoding.

-Mark.
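
(On current Rubies, where Iconv is gone, this trick is a one-liner. It cannot fail because every byte from 0x00 to 0xFF is a defined character in Latin-1; the sample bytes below are my own:

```ruby
dirty = "caf\xE9 ol\xE9".dup  # bytes of unknown origin
clean = dirty.force_encoding("ISO-8859-1").encode("UTF-8")
# clean is valid UTF-8; it reads "café olé" only if the bytes really were Latin-1
```
)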
 
Krzysieq

Ok, I tried all the previous suggestions and none of them worked (the gsub idea, TRANSLIT, IGNORE, or the one from the link posted by Mark Thomas). In fact, the last two don't seem to have done anything, while gsub does too much: it seems to have damaged the XML structure in some way, which seems very strange to me. I don't really care about the data inside, but I need the XML to remain valid.

@Gregory - that's true, it may not. However, the places where I found the funny characters are text nodes inside XML documents, and there aren't that many of them. Sure, one is enough to break the whole thing, but typically there are very few, and it looks more like corrupted database data. I think they store some newspaper articles or pieces of news there. I learned from the team who maintain that database in their app that it should typically all be ISO-8859-1, but for some reason that's not always the case. Hence the corrupted-data theory seems quite likely.

Thanks for any help you can provide :)
Cheers,
Chris

2008/9/18 Mark Thomas said:
Unfortunately, there's no way of telling the original encoding. I would rather go for some method of removing or substituting the chars that don't belong there, but the method first suggested by Brian doesn't seem to work for some reason. Does anyone have another option?

Try the iconv solutions with Latin-1 (ISO-8859-1) as the From encoding. That's as close as you can get to a one-byte "anything-goes" encoding.

-Mark.
 
Krzysieq

Silly question, but what is $KCODE? I'm relatively new to Ruby, so this tells me nothing... And as you might have guessed, no, I haven't set it. What does it do? :)

Cheers,
Chris
 
Mark Thomas

How is the XML file created? If you know in advance which parts of the
XML come from the database, wrap those sections in CDATA blocks and
your XML will remain valid.
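
(A sketch of the CDATA idea with the stdlib's REXML; the element names are mine. One caveat: CDATA only protects against stray markup characters such as < and &, so the bytes inside still have to be valid in the document's declared encoding:

```ruby
require "rexml/document"

doc  = REXML::Document.new("<results/>")
node = doc.root.add_element("article")
node.add(REXML::CData.new("raw <db> text & markup"))  # emitted inside <![CDATA[...]]>
puts doc.to_s
```
)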
 
Gregory Brown

Silly question, but what is $KCODE? I'm relatively new to Ruby, so this tells me nothing... And as you might have guessed, no, I haven't set it. What does it do? :)

It tells Ruby that you are working with UTF-8 ;)

-greg
 
James Gray

Silly question, but what is $KCODE?

It's a global variable that affects how Ruby 1.8 handles characters.
And as You might have guessed, no, I haven't set it.

Does your code run inside of a recent version of Rails? I'm just
asking because it sets $KCODE for you.

James Edward Gray II
 
