ascii char 26

bob

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

I had to write this function to deal with this:

public static String convertToAscii(String html) {
    html = html.replaceAll("\u2019", "'");
    html = html.replaceAll("\u201D", "\"");
    html = html.replaceAll("\u201C", "\"");

    byte[] b = null;
    try {
        b = html.getBytes("US-ASCII");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    // hyphen replace
    for (int ctr = 0; ctr < b.length; ctr++)
        if (b[ctr] == 26)
            b[ctr] = 45;

    html = new String(b);
    return html;
}
 
Arne Vajhøj

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

ASCII code 26 is not in general replaced with hyphen.

If you are asking why some code may do it, then in
some contexts (usually on Windows platform) ASCII code
26 indicates EOF.

Arne
 
Joshua Cranmer

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

The US-ASCII encoder only properly encodes characters in the range of
0-127, i.e., the characters that are present in ASCII. Any other
character is replaced with some sort of substitution character; in this
case, it looks like the charset has chosen to use ^Z as the "I don't
know what this character is" character (I would have guessed '?'
instead, but I suppose they decided to go with the less-commonly used
variant).

My guess is your input is using one of the characters like the minus
sign, em dash, or perhaps an en dash instead (there may be others),
which are visually close in appearance to a hyphen but do not share the
same Unicode codepoint.
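
To make the substitution visible rather than relying on whatever default a given platform picks, here is a minimal sketch that configures the US-ASCII encoder explicitly. The class and method names (AsciiEncodeDemo, encodeAscii, isPureAscii) are made up for illustration; the java.nio.charset calls are standard.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class AsciiEncodeDemo {

    /** Encodes text as US-ASCII, writing the given byte for each character ASCII cannot represent. */
    static byte[] encodeAscii(String text, byte replacement) {
        CharsetEncoder encoder = Charset.forName("US-ASCII").newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { replacement });
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected while both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if every character of text is representable in US-ASCII. */
    static boolean isPureAscii(String text) {
        return Charset.forName("US-ASCII").newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "a\u2014b"; // 'a', EM DASH, 'b'
        System.out.println(isPureAscii(text));
        for (byte b : encodeAscii(text, (byte) '?')) {
            System.out.println(b);
        }
    }
}
```

Checking isPureAscii before encoding tells you whether anything would be substituted at all, which is usually a better signal than hunting for a particular replacement byte afterwards.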
 
Roedy Green

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
html = html.replaceAll("\u201C", "\"");

\u0026 is replaced by an ampersand at compile time, as if you had
typed one into the source code.

I presume you are talking about

26 0x1a ^Z SUB, substitute

\u001a is not useful. It gets replaced by a ^z character, as if you
had typed it into the source text, possibly creating a syntax error.
If you want this char you probably want (char)0x001a

This is true for ascii, UTF and UTF-8. If you see a -, it might just
be some font's attempt to render a SUB char.

You can use &#x241a; in HTML or \u241a in Java to render a tiny SUB
glyph to represent the char.
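
A small sketch of the cast form, assuming you just want to test for the SUB character in a string. SubCharDemo and containsSub are hypothetical names used only for this example.

```java
class SubCharDemo {
    // ASCII 26 (SUB, Ctrl-Z), written as a cast so no Unicode escape
    // is translated into the source text before compilation.
    static final char SUB = (char) 0x1a;

    /** Returns true if the string contains the SUB control character. */
    static boolean containsSub(String s) {
        return s.indexOf(SUB) >= 0;
    }

    public static void main(String[] args) {
        System.out.println((int) SUB);
        System.out.println(containsSub("a" + SUB + "b"));
        System.out.println(containsSub("a-b"));
    }
}
```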

see
http://mindprod.com/jgloss/ascii.html
http://mindprod.com/jgloss/unicode.html
http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/literal.html
--
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is,
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)
 
Eric Sosman

The US-ASCII encoder only properly encodes characters in the range of
0-127, i.e., the characters that are present in ASCII. Any other
character is replaced with some sort of substitution character; in this
case, it looks like the charset has chosen to use ^Z as the "I don't
know what this character is" character (I would have guessed '?'
instead, but I suppose they decided to go with the less-commonly used
variant).

It makes more sense when you think of 26 not as ^Z, but as SUB.
 
Bent C Dalager

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

Unicode has multiple different hyphens and hyphen-like characters.

The traditional ASCII hyphen is the Unicode "hyphen-minus" which
encodes to 0x2d in utf-8.

http://www.fileformat.info/info/unicode/char/2d/index.htm suggests the
following additional hyphen-like characters that you may actually be
working with in your string, and that will probably be mapped to 26 in
your case:

hyphen U+2010
non-breaking hyphen U+2011
figure dash U+2012
en dash U+2013
minus sign U+2212
roman uncia sign U+10191

If hyphens are of particular interest to you it may be a better
approach to replace non-ASCII-supported hyphens from the above list
with "hyphen-minus", before you transcode to ASCII.
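
The replace-before-transcoding idea could be sketched roughly like this. HyphenFolder and foldHyphens are made-up names, and the table only holds the code points listed above, so treat it as a starting point rather than a complete mapping. It walks code points instead of chars because U+10191 lies outside the Basic Multilingual Plane and occupies two chars in a Java String.

```java
class HyphenFolder {
    // Hyphen-like code points taken from the list above; extend as needed.
    private static final int[] HYPHEN_LIKE = {
        0x2010, 0x2011, 0x2012, 0x2013, 0x2212, 0x10191
    };

    private static boolean isHyphenLike(int codePoint) {
        for (int candidate : HYPHEN_LIKE) {
            if (candidate == codePoint) {
                return true;
            }
        }
        return false;
    }

    /** Replaces each hyphen-like code point with ASCII hyphen-minus (0x2d). */
    static String foldHyphens(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isHyphenLike(cp) ? '-' : cp);
            i += Character.charCount(cp); // 2 for supplementary code points
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // EN DASH between the numbers becomes a plain hyphen-minus.
        System.out.println(foldHyphens("pages 10\u201320")); // prints "pages 10-20"
    }
}
```

Calling foldHyphens on the string before getBytes("US-ASCII") means the encoder never sees those characters, so no substitution byte appears for them.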

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible, that you should be using instead. I don't
know if such a function is easily available.

Cheers,
Bent D
 
Joshua Cranmer

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible, that you should be using instead. I don't
know if such a function is easily available.

This generally falls under the umbrella of Unicode normalization, which
can resolve, e.g., Å (U+212B), the Angstrom sign, and Å (U+00C5), the Swedish letter, to the
same representation (may require compatibility normalization). You can
do this in Java using the java.text.Normalizer class.
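
A minimal sketch of what that looks like with java.text.Normalizer (Java 6 and later). NormalizeDemo, nfc and nfkc are names invented for the example.

```java
import java.text.Normalizer;

class NormalizeDemo {
    /** Canonical composition (NFC). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility composition (NFKC). */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String angstromSign = "\u212B"; // ANGSTROM SIGN
        String letterARing = "\u00C5";  // LATIN CAPITAL LETTER A WITH RING ABOVE

        // Canonical normalization already maps the Angstrom sign to the letter.
        System.out.println(nfc(angstromSign).equals(letterARing)); // prints "true"

        // Compatibility normalization expands the "fi" ligature U+FB01.
        System.out.println(nfkc("\uFB01")); // prints "fi"

        // The em dash has no decomposition, so it comes back unchanged.
        System.out.println(nfkc("\u2014").equals("\u2014")); // prints "true"
    }
}
```

Note the last case: normalization on its own does not turn en or em dashes into hyphen-minus, so it complements an explicit replacement table rather than replacing it.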
 
Retahiv Oopsiscame

Unicode has multiple different hyphens and hyphen-like characters.

The traditional ASCII hyphen is the Unicode "hyphen-minus" which
encodes to 0x2d in utf-8.

http://www.fileformat.info/info/unicode/char/2d/index.htm suggests the
following additional hyphen-like characters that you may actually be
working with in your string, and that will probably be mapped to 26 in
your case:

hyphen U+2010
non-breaking hyphen U+2011
figure dash U+2012
en dash U+2013
minus sign U+2212
roman uncia sign U+10191

Wow, what a mess!

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible,

Indeed.
 
bob

You're right. I messed up, and it was the em dash. It turned into 26
after going thru 'b = html.getBytes("US-ASCII");'

Here's the new code:

public static String convertToAscii(String html) {
    html = html.replaceAll("\u2019", "'");
    html = html.replaceAll("\u201D", "\"");
    html = html.replaceAll("\u201C", "\"");

    // mdash
    html = html.replaceAll("\u2014", "-");

    byte[] b = null;
    try {
        b = html.getBytes("US-ASCII");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return html;
}

Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
work.
 
Joshua Cranmer

You're right. I messed up, and it was the em dash. It turned into 26
after going thru 'b = html.getBytes("US-ASCII");'

Here's the new code:

Hardcoding a list of tables is generally not a good thing; in
particular, I don't think it's going to solve your problems. I have seen
sites that use the Unicode ff and fi ligatures instead of relying on
fonts to automatically pick up on that as well.

If I may ask, why do you need to convert the string to US-ASCII as
opposed to UTF-8? That is going to cause major issues for the ~90% of
the world that doesn't speak English as their main language.

Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
work.

It shouldn't be that hard to find other Java Unicode normalization
libraries out there.
 
Roedy Green

Wow, what a mess!

See http://mindprod.com/jgloss/unicode.html. It has a table showing
all those dashes rendered.
They don't all look the same. Further, Unicode does not specify what
the glyphs look like, just each code's logical function. A font
designer is free to make all those different dashes visually distinct.
 
