UTF8 Validator (xml) ?

Peter Fitzgibbons · Oct 21, 2005

Hello all,

I have a Rails app that outputs XML for consumption by Xerces
(org.apache.xerces.impl.XMLEntityScanner).
First, if you happen to know an active forum for Xerces, please tell.

Second, the error messaging from the scanner tells me that I have an
"Invalid byte 1 of 1-byte UTF-8 sequence."
That's nice, but I have no way to tell _what_ byte is in violation.

SO, if you have any reference to what might qualify as a "UTF8 Validator",
please tell. Clearly Java and Ruby have different definitions of UTF8.

BTW: the file I'm testing with loads back into REXML::Document just fine.

Thanks for your advice!
Peter Fitzgibbons


Dave Burt · Oct 21, 2005

Peter said:
Second, the error messaging from the scanner tells me that I have a
"Invalid byte 1 of 1-byte UTF-8 sequence."
That's nice, but I have no way to tell _what_ byte is in violation.

Sorry, I can't answer any of your other questions, but as this is the Java
end barfing on Ruby (or other) UTF-8 data, the character might be a 0. Java
uses a modified UTF-8 where 0 is encoded in 2 bytes (for compatibility with C
0-terminated strings).

Otherwise, it may be the bytes 0xFE or 0xFF. These are invalid in UTF-8, but
together they are sometimes used as a byte-order mark (in UTF-16).

So, I reckon that an "invalid 1-byte UTF-8 sequence" can only be 0xFE, 0xFF
or 0x00 (but actually that last one is valid UTF-8).
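
To make the byte ranges concrete: besides 0xFE and 0xFF, RFC 3629 also rules out 0xC0, 0xC1 and 0xF5 through 0xFD anywhere in well-formed UTF-8. A rough Ruby sketch for listing where such bytes occur might look like the following. It assumes Ruby 1.9 or later, and never_valid_bytes is a hypothetical helper name, not part of any library. It does not catch stray continuation bytes (0x80..0xBF), which are only wrong out of context.

```ruby
# Bytes that can never appear anywhere in well-formed UTF-8 (RFC 3629):
# 0xC0 and 0xC1 (they could only start overlong forms) and 0xF5..0xFF.
NEVER_VALID = [0xC0, 0xC1] + (0xF5..0xFF).to_a

# Hypothetical helper: returns [offset, byte] pairs for every such byte.
def never_valid_bytes(data)
  hits = []
  data.each_byte.with_index do |byte, offset|
    hits << [offset, byte] if NEVER_VALID.include?(byte)
  end
  hits
end

# Sample data: a NUL (valid in standard UTF-8) followed by 0xFE and 0xFF.
sample = "ab\x00\xFE\xFF".b
never_valid_bytes(sample).each do |offset, byte|
  puts format("offset %d: 0x%02X", offset, byte)
end
```

On the sample, only the 0xFE and 0xFF bytes are reported; the NUL byte is not, since it is valid in standard UTF-8.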

Cheers,
Dave

Simon Strandgaard · Oct 21, 2005

On 10/21/05 said:
Second, the error messaging from the scanner tells me that I have a
"Invalid byte 1 of 1-byte UTF-8 sequence."
That's nice, but I have no way to tell _what_ byte is in violation.

SO, if you have any reference to what might qualify as a "UTF8 Validator",
please tell. Clearly Java and Ruby have different definitions of UTF8.

I have made a utf8 decoder capable of this, example below:

irb(main):003:0> require 'iterator'
=> true
irb(main):004:0> str = "ab\000\200\300"
=> "ab\000\200\300"
irb(main):005:0> byte_iterator = Iterator::Continuation.new(str, :each_byte)
=> #<Iterator::Continuation:0x58f480 @symbol=:each_byte,
@instance="ab\000\200\300",
@return_where=#<Proc:0x00000000@/usr/local/lib/ruby/site_ruby/1.8/iterator.rb:494>,
@value=97, @position=0, @resume_where=#<Continuation:0x58f3f4>>
irb(main):006:0> Iterator::DecodeUTF8.new(byte_iterator).to_a
Iterator::DecodeUTF8::Malformed: unexpected continuation byte. byte-offset=3
from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:740:in `current'
from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:91:in `each'
from (irb):6:in `to_a'
from (irb):6
irb(main):007:0>

You need to install my iterator package.
http://rubyforge.org/frs/?group_id=18
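
For readers on a newer Ruby, a similar byte-offset report can probably be had from the standard library alone. The sketch below assumes Ruby 1.9 or later (String#force_encoding and String#valid_encoding? do not exist in 1.8), and first_invalid_utf8_offset is a hypothetical helper name used only for illustration. It is not how the iterator package works internally.

```ruby
# Hypothetical helper: byte offset of the first malformed UTF-8 sequence
# in data, or nil if the data is well-formed UTF-8.
def first_invalid_utf8_offset(data)
  # Work on a copy so the caller's string keeps its original encoding.
  str = data.dup.force_encoding(Encoding::UTF_8)
  return nil if str.valid_encoding?

  offset = 0
  # each_char yields malformed bytes as their own (invalid) characters,
  # so the first invalid character marks the first bad byte offset.
  str.each_char do |ch|
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

# Same bytes as the irb session above: "a", "b", NUL, 0x80, 0xC0.
sample = "ab\000\200\300"
offset = first_invalid_utf8_offset(sample)
puts "first malformed byte at offset #{offset}" if offset
```

On the sample string this reports offset 3, the stray continuation byte 0x80, which agrees with the byte-offset in the Malformed error above.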
