PDF Writer UTF-8 Support

Brian Schröder · Mar 30, 2005

Hello,

I'm having a hard time getting PDF Writer to output my UTF-8 encoded
text correctly. Has anybody around here got some tips for me?

thanks a lot,

Brian

Austin Ziegler · Mar 31, 2005

Unfortunately, PDF::Writer needs "help" understanding UTF-8 input, and
I have been focussing on a number of basic feature changes before
making this "easy", as it also makes a difference to how each font is
handled.

I am hoping to have PDF::Writer 1.0 out -- with documentation on how
to do this at all -- in the next two weeks or so. I apologise for the
inconvenience.

-austin

Brian Schröder · Mar 31, 2005

Thanks for your reply, austin,

Is there any possibility to output UTF-8 encoded text right now? I
need no fancy fonts or formatting, just some plain text output at
specific x-y coordinates.

best regards and thanks for the great library,

brian

Austin Ziegler · Mar 31, 2005

Yes -- but you have to wade through the font encoding mapping
information for PDF documents right now, and you have to be using a
Unicode-capable font. From the PDF 1.6 Reference:

    Font management is primarily concerned with producing the
    correct appearance of text—that is, the shape and placement of
    glyphs. However, it is sometimes necessary for a PDF application
    to extract the meaning of the text, represented in some standard
    information encoding such as Unicode. In some cases, this
    information can be deduced from the encoding used to represent
    the text in the PDF file. Otherwise, the PDF producer
    application should specify the mapping explicitly by including a
    special object, the ToUnicode CMap.

I have not added support for the /ToUnicode CMap in PDF::Writer, but
it may be possible. However:

    Certain strings contain information that is intended to be
    human-readable, such as text annotations, bookmark names,
    article names, document information, and so forth. Such strings
    are referred to as text strings. Text strings are encoded in
    either PDFDocEncoding or Unicode character encoding.
    PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
    documented in Appendix D. Unicode is described in the Unicode
    Standard by the Unicode Consortium (see the Bibliography).

    For text strings encoded in Unicode, the first two bytes must be
    254 followed by 255. These two bytes represent the Unicode byte
    order marker, U+FEFF, indicating that the string is encoded in
    the UTF-16BE (big-endian) encoding scheme specified in the
    Unicode standard. (This mechanism precludes beginning a string
    using PDFDocEncoding with the two characters thorn ydieresis,
    which is unlikely to be a meaningful beginning of a word or
    phrase). Note: Applications that process PDF files containing
    Unicode text strings should be prepared to handle supplementary
    characters; that is, characters requiring more than two bytes to
    represent.

    An escape sequence may appear anywhere in a Unicode text string
    to indicate the language in which subsequent text is written,
    which is useful when the language cannot be determined from the
    character codes used in the text. The escape sequence consists
    of the following elements, in order:

    1. The Unicode value U+001B (that is, the byte sequence 0
       followed by 27)
    2. A 2-character ISO 639 language code—for example, en for
       English or ja for Japanese
    3. (Optional) A 2-character ISO 3166 country code—for example,
       US for the United States or JP for Japan
    4. The Unicode value U+001B

    The complete list of codes defined by ISO 639 and ISO 3166 can
    be obtained from the International Organization for
    Standardization (see the Bibliography).

So you can't specify UTF-8, but you can specify UTF-16BE if you
provide the 0xFEFF BOM.
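
To make the byte layout concrete, here is a minimal sketch of the
conversion in plain Ruby. It assumes a modern Ruby (1.9 or later) with
String#encode, which postdates this thread, and a valid UTF-8 input
string. The helper name pdf_unicode_text_string is made up for
illustration and is not part of PDF::Writer; the sketch only builds the
bytes and says nothing about how to hand them to the library.

```ruby
# Hypothetical helper, not a PDF::Writer API: build the bytes of a PDF
# "text string" in the Unicode form described in the PDF Reference,
# i.e. the two bytes 254 and 255 followed by UTF-16BE data.
def pdf_unicode_text_string(utf8_text)
  bom  = "\xFE\xFF".b                     # byte order marker U+FEFF as raw bytes
  body = utf8_text.encode("UTF-16BE").b   # transcode, then treat as raw bytes
  bom + body
end

bytes = pdf_unicode_text_string("Schröder")
puts bytes.unpack("C*").first(4).inspect  # prints [254, 255, 0, 83]
puts bytes.bytesize                       # prints 18 (2 for the BOM + 8 characters * 2)

# A supplementary character (here U+1D11E) becomes a surrogate pair,
# so it occupies four bytes rather than two.
clef = pdf_unicode_text_string("\u{1D11E}")
puts clef.unpack("n*").map { |unit| format("%04X", unit) }.join(" ")  # prints FEFF D834 DD1E
```

Keep in mind that the quoted passage is about text strings such as
bookmark names and document information. Text drawn on the page still
goes through the font's encoding.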

-austin
