UTF8 Validator (xml) ?


Peter Fitzgibbons

Hello all,

I have a Rails app that outputs XML for consumption by Xerces
(org.apache.xerces.impl.XMLEntityScanner).
First, if you happen to know of an active forum for Xerces, please tell.

Second, the error message from the scanner tells me that I have an "Invalid
byte 1 of 1-byte UTF-8 sequence."
That's nice, but I have no way to tell _which_ byte is in violation.

So, if you know of anything that might qualify as a "UTF8 Validator",
please tell. Clearly Java and Ruby have different definitions of UTF8.

BTW: the file I'm testing with loads back into REXML::Document just fine.

Thanks for your advice!
Peter Fitzgibbons


Dave Burt

Peter said:
Second, the error messaging from the scanner tells me that I have a "Invalid
byte 1 of 1-byte UTF-8 sequence."
That's nice, but I have no way to tell _what_ byte is in violation.

Sorry, I can't answer any of your other questions, but as this is the Java
end barfing on Ruby (or other) UTF-8 data, the character might be a 0. Some
Java APIs use a modified UTF-8 where 0 is encoded in 2 bytes (0xC0 0x80), so
that the encoded data never contains a zero byte (for compatibility with C
0-terminated strings).

Otherwise, it may be the bytes 0xFE or 0xFF. These are invalid in UTF-8, but
they do turn up as the UTF-16 byte-order mark (FE FF or FF FE).

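So, I reckon the likely suspects for an "invalid 1-byte UTF-8 sequence" are
0xFE, 0xFF or 0x00 (but actually that last one is valid UTF-8). A stray byte
in the range 0x80-0xBF, which is only legal as a continuation byte, would be
invalid on its own as well.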
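
To make the candidates above concrete, here is a minimal Ruby sketch. It assumes Ruby 1.9 or later (for String#force_encoding and String#valid_encoding?), which postdates this thread, and the helper name valid_utf8? is made up for illustration. It simply asks Ruby whether a few short byte sequences are well-formed UTF-8, including a lone continuation byte such as 0xA0, which is what a Latin-1 non-breaking space looks like if it leaks into supposedly UTF-8 output.

```ruby
# Hypothetical helper: true when the given bytes form well-formed UTF-8.
def valid_utf8?(bytes)
  bytes.pack("C*").force_encoding(Encoding::UTF_8).valid_encoding?
end

puts valid_utf8?([0x00])        # NUL as a plain single byte
puts valid_utf8?([0xC0, 0x80])  # NUL in Java's modified UTF-8 (overlong form)
puts valid_utf8?([0xFE])        # one half of a UTF-16 byte-order mark
puts valid_utf8?([0xFF])        # the other half
puts valid_utf8?([0xA0])        # lone continuation byte
```

Only the first call prints true; standard UTF-8 rejects the other four.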

Cheers,
Dave
 

Simon Strandgaard

On 10/21/05, Peter Fitzgibbons said:
Second, the error messaging from the scanner tells me that I have a "Invalid
byte 1 of 1-byte UTF-8 sequence."
That's nice, but I have no way to tell _what_ byte is in violation.

SO, if you have any reference to what might qualify as a "UTF8 Validator",
please tell. Clearly Java and Ruby have different definitions of UTF8.

I have written a UTF-8 decoder that can do this; example below:

irb(main):003:0> require 'iterator'
=> true
irb(main):004:0> str = "ab\000\200\300"
=> "ab\000\200\300"
irb(main):005:0> byte_iterator = Iterator::Continuation.new(str, :each_byte)
=> #<Iterator::Continuation:0x58f480 @symbol=:each_byte,
@instance="ab\000\200\300",
@return_where=#<Proc:0x00000000@/usr/local/lib/ruby/site_ruby/1.8/iterator.rb:494>,
@value=97, @position=0, @resume_where=#<Continuation:0x58f3f4>>
irb(main):006:0> Iterator::DecodeUTF8.new(byte_iterator).to_a
Iterator::DecodeUTF8::Malformed: unexpected continuation byte. byte-offset=3
        from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:740:in `current'
        from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:91:in `each'
        from (irb):6:in `to_a'
        from (irb):6
irb(main):007:0>


You need to install my iterator package.
http://rubyforge.org/frs/?group_id=18
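
For readers without the iterator package, the same kind of report can be sketched in plain Ruby. This is not the package's implementation, just a hand-rolled walk over the bytes that follows the well-formed byte ranges of RFC 3629. The method names first_utf8_error and second_byte_range and the keys of the returned hash are invented here, and the sketch assumes Ruby 2.0 or later.

```ruby
# Allowed range for the second byte of a sequence, given its lead byte.
def second_byte_range(lead)
  case lead
  when 0xE0 then 0xA0..0xBF # excludes overlong 3-byte forms
  when 0xED then 0x80..0x9F # excludes UTF-16 surrogates
  when 0xF0 then 0x90..0xBF # excludes overlong 4-byte forms
  when 0xF4 then 0x80..0x8F # excludes code points above U+10FFFF
  else 0x80..0xBF
  end
end

# Returns nil for well-formed UTF-8, otherwise a hash describing the first
# offending byte. :byte is nil when the input ends in mid-sequence.
def first_utf8_error(data)
  bytes = data.bytes
  i = 0
  while i < bytes.length
    lead = bytes[i]
    length =
      case lead
      when 0x00..0x7F then 1
      when 0xC2..0xDF then 2
      when 0xE0..0xEF then 3
      when 0xF0..0xF4 then 4
      else
        # 0x80..0xBF, 0xC0, 0xC1 and 0xF5..0xFF can never start a sequence.
        return { offset: i, byte: lead, position: 1, length: 1 }
      end
    (1...length).each do |k|
      allowed = k == 1 ? second_byte_range(lead) : (0x80..0xBF)
      b = bytes[i + k]
      unless b && allowed.cover?(b)
        return { offset: i + k, byte: b, position: k + 1, length: length }
      end
    end
    i += length
  end
  nil
end

err = first_utf8_error("ab\000\200\300")
printf("Invalid byte %d of %d-byte UTF-8 sequence: 0x%02X at offset %d\n",
       err[:position], err[:length], err[:byte], err[:offset])
```

On the test string from the irb session this reports byte 0x80 at offset 3, the same offset the decoder above flagged. Running it over the generated XML (read in binary mode) should point at the byte Xerces is complaining about, assuming Xerces and this sketch agree on what counts as malformed.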
 
