PDF Writer UTF-8 Support

Brian Schröder

Hello,

I'm having a hard time getting PDF Writer to output my UTF-8 encoded
text correctly. Has anybody around here got some tips for me?

thanks a lot,

Brian
 
Austin Ziegler

> I'm having a hard time getting PDF Writer to output my UTF-8 encoded
> text correctly. Has anybody around here got some tips for me?

Unfortunately, PDF::Writer needs "help" understanding UTF-8 input, and
I have been focussing on a number of basic feature changes before
making this "easy", as it also makes a difference to how each font is
handled.

I am hoping to have PDF::Writer 1.0 out -- with documentation on how
to do this at all -- in the next two weeks or so. I apologise for the
inconvenience.

-austin
 
Brian Schröder

> Unfortunately, PDF::Writer needs "help" understanding UTF-8 input, and
> I have been focussing on a number of basic feature changes before
> making this "easy", as it also makes a difference to how each font is
> handled.
>
> I am hoping to have PDF::Writer 1.0 out -- with documentation on how
> to do this at all -- in the next two weeks or so. I apologise for the
> inconvenience.
>
> -austin

Thanks for your reply, Austin,

Is there any possibility to output UTF-8 encoded text right now? I
need no fancy fonts or formatting, just some plain text output at
specific x-y coordinates.

best regards and thanks for the great library,

brian
 
Austin Ziegler

> Thanks for your reply, Austin,
>
> Is there any possibility to output UTF-8 encoded text right now?
> I need no fancy fonts or formatting, just some plain text output at
> specific x-y coordinates.
>
> best regards and thanks for the great library,

Yes -- but you have to wade through the font encoding mapping
information for PDF documents right now, and you have to be using a
Unicode-capable font. From the PDF 1.6 Reference:

Font management is primarily concerned with producing the
correct appearance of text—that is, the shape and placement of
glyphs. However, it is sometimes necessary for a PDF application
to extract the meaning of the text, represented in some standard
information encoding such as Unicode. In some cases, this
information can be deduced from the encoding used to represent
the text in the PDF file. Otherwise, the PDF producer
application should specify the mapping explicitly by including a
special object, the ToUnicode CMap.

I have not added support for the /ToUnicode CMap in PDF::Writer, but
it may be possible. However:

Certain strings contain information that is intended to be
human-readable, such as text annotations, bookmark names,
article names, document information, and so forth. Such strings
are referred to as text strings. Text strings are encoded in
either PDFDocEncoding or Unicode character encoding.
PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
documented in Appendix D. Unicode is described in the Unicode
Standard by the Unicode Consortium (see the Bibliography).

For text strings encoded in Unicode, the first two bytes must be
254 followed by 255. These two bytes represent the Unicode byte
order marker, U+FEFF, indicating that the string is encoded in
the UTF-16BE (big-endian) encoding scheme specified in the
Unicode standard. (This mechanism precludes beginning a string
using PDFDocEncoding with the two characters thorn ydieresis,
which is unlikely to be a meaningful beginning of a word or
phrase). Note: Applications that process PDF files containing
Unicode text strings should be prepared to handle supplementary
characters; that is, characters requiring more than two bytes to
represent.

An escape sequence may appear anywhere in a Unicode text string
to indicate the language in which subsequent text is written,
which is useful when the language cannot be determined from the
character codes used in the text. The escape sequence consists
of the following elements, in order:

1. The Unicode value U+001B (that is, the byte sequence 0
followed by 27)
2. A 2-character ISO 639 language code—for example, en for
English or ja for Japanese
3. (Optional) A 2-character ISO 3166 country code—for example,
US for the United States or JP for Japan
4. The Unicode value U+001B

The complete list of codes defined by ISO 639 and ISO 3166 can
be obtained from the International Organization for
Standardization (see the Bibliography).

So you can't specify UTF-8, but you can specify UTF-16BE if you
provide the 0xFEFF BOM.
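
As a rough sketch of what that conversion could look like: the helper names below are hypothetical and are not part of PDF::Writer, and the code assumes a Ruby with String#encode and String#b (2.0 or later). It only builds the bytes of a "text string" as described in the quoted passage (document information, bookmark names and so on); it does not make page content drawn with a font Unicode-aware.

```ruby
# Hypothetical helper, not a PDF::Writer API: turn a UTF-8 Ruby string into
# the bytes of a PDF text string, i.e. UTF-16BE preceded by the U+FEFF
# byte order mark (bytes 254 and 255).
def pdf_text_string(utf8)
  bom = "\xFE\xFF".b                  # binary copy of the two BOM bytes
  bom + utf8.encode("UTF-16BE").b     # big-endian code units as raw bytes
end

# Hypothetical check: a text string that starts with the BOM is to be read
# as UTF-16BE; anything else is treated as PDFDocEncoding.
def unicode_text_string?(bytes)
  bytes.b.start_with?("\xFE\xFF".b)
end

title = pdf_text_string("Schröder")
puts title.bytes.first(4).inspect     # prints [254, 255, 0, 83]
puts unicode_text_string?(title)      # prints true
```

Characters outside the Basic Multilingual Plane come out as surrogate pairs (four bytes each), which is the "supplementary characters" case the reference warns about.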

-austin
 
