???SGML support for Unicode???

krammer

Hello,

I have the following questions that I have not been able to find any
*good* answers for. Your help would be much appreciated! FYI, I am a
Java XML guy and I have no experience with SGML, so my questions will
probably be XML biased.

1) Is it possible to have Unicode text inside an SGML file?

an example would be something like this.......

There is an SGML file with ASCII text for a lot of the data elements,
and one specific element has the actual Unicode text in it. The
Unicode would have a lot of CJK in it too; I don't know if that matters.

2) I read something saying that SGML cannot handle variable-length byte
characters, so does this mean that if Unicode was supported,
only UTF-32 would be supported because it is fixed byte length, as
opposed to UTF-16 and UTF-8?

3) If it is possible, how do you do it? Can you *please* point me to
some web pages that have some examples?

4) If it is not possible, can you *please* give me a list of reasons
why it is not possible or the 20 hoops you have to jump through to do
it.

I am writing up a white paper and need answers to back up whatever I
say.

Once again, thank you for your help, it is much appreciated!

krammer
 
Bjorn Brox

krammer said:
Hello,

I have the following questions that I have not been able to find any
*good* answers for. Your help would be much appreciated! FYI, I am a
Java XML guy and I have no experience with SGML, so my questions will
probably be XML biased.

1) Is it possible to have Unicode text inside an SGML file?

Yes, as UTF-8, provided the SGML declaration allows the byte range
128-255, simply by adding this line: DESCSET 128 128 128

an example would be something like this.......

As an XML guy you know how..
There is an SGML file with ASCII text for a lot of the data elements,
and one specific element has the actual Unicode text in it. The
Unicode would have a lot of CJK in it too; I don't know if that matters.

2) I read something saying that SGML cannot handle variable-length byte
characters, so does this mean that if Unicode was supported,
only UTF-32 would be supported because it is fixed byte length, as
opposed to UTF-16 and UTF-8?

You are mixing bytes and characters...

From a parser's point of view, a UTF-8 encoded character stream is just
a sequence of 8-bit bytes. It is your presentation layer that should
know how to combine these bytes into Unicode characters.
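
To make the byte/character distinction concrete, here is a minimal Python sketch. The sample string is made up for illustration and is not taken from any real SGML file. It shows that ASCII characters stay single bytes below 128, while each CJK character becomes several bytes that all fall in the 128-255 range, which is the range the declaration has to allow.

```python
# Illustration only: "text" is a made-up sample, not data from this thread.
text = "SGML 文書"

# This is what a byte-oriented parser sees: a plain sequence of 8-bit values.
encoded = text.encode("utf-8")

# Print each character next to the bytes that represent it in UTF-8.
for ch in text:
    print(repr(ch), [hex(b) for b in ch.encode("utf-8")])

# Every byte that belongs to a non-ASCII character lies in 128-255.
high_bytes = [b for b in encoded if b >= 128]
assert all(128 <= b <= 255 for b in high_bytes)

# The presentation layer reverses the step and recovers the characters.
decoded = encoded.decode("utf-8")
assert decoded == text
```

The two CJK characters here take three bytes each, so the markup characters (all ASCII) can never be confused with a byte inside a multi-byte sequence.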

However, you cannot represent Unicode characters above 255 as numeric
character references such as &#1234; without modifying your SGML parser
to pass them through as UTF-8.
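
If modifying the parser is not an option, one possible workaround (my assumption, not something the reply above proposes) is to pre-process the file and replace such references with the UTF-8 bytes of the character before the parser sees them. The helper name `expand_high_refs` below is hypothetical and not part of any SGML toolkit; it is a rough sketch that works on raw bytes and does not know about comments or CDATA sections, so it would rewrite references there too.

```python
import re

# Matches decimal numeric character references such as &#1234;
_REF = re.compile(rb"&#([0-9]+);")


def expand_high_refs(data: bytes) -> bytes:
    """Replace references above 255 with the character's UTF-8 bytes.

    Hypothetical helper for illustration; references up to 255 are left
    untouched so the SGML parser can resolve them itself.
    """
    def repl(match):
        code = int(match.group(1))
        # Leave low, out-of-range, and surrogate values as they are.
        if code <= 255 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code).encode("utf-8")

    return _REF.sub(repl, data)


sample = b"<p>A&#65; and &#1234;</p>"
result = expand_high_refs(sample)
print(result)
```

Code point 1234 is U+04D2, which UTF-8 encodes as the two bytes 0xD3 0x92, both inside the 128-255 range discussed earlier.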
 
