Conversion of non-ASCII characters with XSLT?

klaus

Hello everyone,

I need to replace non-ASCII characters in strings (names) that are
used as file names. Only European characters need to be considered,
so no Asian characters etc.

My current approach is to use the translate function with a list of
characters that should be replaced by their "simple form" (for example:
é --> e):

<xsl:variable name="f_name" select="translate(file_name,' äÄöÖüÜßáÁàÀâÂéÉèÈêÊíÍìÌîÎóÓòÒôÔúÚùÙûÛ','_aAoOuUsaAaAaAeEeEeEiIiIiIoOoOoOuUuUuU')"/>

problem 1:

This list is not complete! Of course, I could consult references such as
http://de.selfhtml.org/html/referenz/zeichen.htm, but I am not sure
they cover everything!

problem 2:

If both Müller and Muller occur as file names, they can no longer be
distinguished after the replacement.
Thus, I need a more general approach that somehow keeps the non-ASCII
characters "intact". Some browsers such as Mozilla seem to replace
special characters with a short hex code (e.g. ö --> %F6), which would
be a nice solution.

But I have no idea how to do this.

I imagine I am not the only one with this kind of problem. Can anybody
tell me how they solved it?

Best regards and thanks in advance for any hint,
Klaus
 
Joseph Kesselman

XML supports full Unicode. All those characters can be represented. If
you want to convert them to other characters, it's up to you to define
and implement the process for doing so.
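
As one illustration of defining such a process, here is a minimal sketch that avoids maintaining a hand-written character list. It assumes an XSLT 2.0 processor that supports the optional NFD normalization form (NFC is the only form a processor must support). The function name my:strip-accents, its namespace URI, and the sample string are made up for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:filenames"
    exclude-result-prefixes="xs my">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Hypothetical helper: reduce a name to its unaccented base letters. -->
  <xsl:function name="my:strip-accents" as="xs:string">
    <xsl:param name="name" as="xs:string"/>
    <!-- NFD splits e.g. "ü" into "u" + U+0308; \p{M} then matches the mark. -->
    <xsl:sequence select="replace(normalize-unicode($name, 'NFD'), '\p{M}', '')"/>
  </xsl:function>

  <!-- Named entry point so the sketch runs without an input document. -->
  <xsl:template name="main">
    <xsl:value-of select="my:strip-accents('Müller Éléonore')"/>
  </xsl:template>
</xsl:stylesheet>
```

Started from the named template main, this should print "Muller Eleonore". Letters without a canonical decomposition (ß, ø and similar) pass through unchanged and would still need translate, and Müller and Muller still collapse to the same name.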
 
Richard Tobin

klaus said:
I need to replace non-ascii characters in strings (of names) that
function as file names (only European characters are considered --> no
Asian characters etc.).

If it's available in the XSLT implementation you have, the extension
function str:encode-uri from the EXSLT library may be useful.
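
A minimal sketch of how that might look in XSLT 1.0, assuming the processor implements the EXSLT strings module. The input shape (a files element containing file_name children) is an assumption made for this example. The function-available() check makes the stylesheet fail with a clear message when the extension is missing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    exclude-result-prefixes="str">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Assumed input shape: <files><file_name>Müller</file_name>...</files> -->
  <xsl:template match="/files">
    <xsl:for-each select="file_name">
      <xsl:choose>
        <xsl:when test="function-available('str:encode-uri')">
          <!-- Second argument true(): also escape reserved characters such as "/" -->
          <xsl:value-of select="str:encode-uri(., true())"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:message terminate="yes">str:encode-uri is not available in this processor</xsl:message>
        </xsl:otherwise>
      </xsl:choose>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Without the optional third argument the function is specified to encode as UTF-8, so ü comes out as two escaped bytes (%C3%BC) rather than the single %FC that ISO-8859-1 would give. Either way, Müller and Muller remain distinct.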

-- Richard
 
Andreas Prilop

klaus said:
If there is Müller and Muller as file names they can't be
distinguished.
Thus, I would need a more general approach, that somehow keeps the non-
ascii characters "intact". Some browsers such as Mozilla seem to
replace special characters with some shortened hex-code (e.g. ö -->
%F6), which would be a nice solution.

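%F6 is an example of "percent-encoding" as defined in RFC 3986.
A URL may contain only a limited set of ASCII characters; everything
else must be written as percent-encoded bytes. Note that what gets
encoded is bytes, not characters, so there is no way to include
"o with diaeresis" directly in a URL. Which bytes represent it depends
on the character encoding: %F6 in ISO-8859-1, %C3%B6 in UTF-8.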

One possibility is to use only the encoding UTF-8.
Here are some files to play with:
http://www.unics.uni-hannover.de/nhtcapri/%/
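
For the UTF-8 route, here is a minimal sketch that assumes an XSLT 2.0 processor, where the standard function encode-for-uri() always percent-encodes the UTF-8 bytes of a string. The template name main and the two sample names are chosen only for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Named entry point so the sketch runs without an input document. -->
  <xsl:template name="main">
    <!-- Iterate over two sample strings; "." is the current string. -->
    <xsl:for-each select="('Müller', 'Muller')">
      <xsl:value-of select="concat(., ' -> ', encode-for-uri(.), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

This should print "Müller -> M%C3%BCller" and "Muller -> Muller", so the two names stay distinct and the result contains only ASCII characters.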
 
