Read utf-8 char one by one

moonhkt · Jan 27, 2010

Hi All

How do I read UTF-8 characters one by one?

The code below does not work.

import java.nio.charset.Charset;
import java.io.*;
import java.lang.String;

public class read_utf_char {
    public static void main(String[] args) {
        File aFile = new File("utf8_test.text");
        try {
            String str = "";
            char[] ch = new char[];
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(aFile), "UTF8"));
            while ( in.read(ch) != -1 )
            {
                System.out.print(ch);
            }
        } catch (UnsupportedEncodingException e) {
        } catch (IOException e) {
        }
    }
}

Mayeul · Jan 27, 2010

moonhkt said:
Hi All

How do I read UTF-8 characters one by one?

The code below does not work.

As far as I know, it works if your utf-8 stream contains only BMP
characters (characters with code point 0xFFFF or below.)

But it is indeed incorrect in the general case where you can't assume
characters are all in the BMP. This is a known Java limitation.

In the general case, you just don't read Unicode characters one by one
from a stream. Either you convert the stream to a String first and then
use a clever combination of String.codePointAt() and
Character.charCount() (read the JavaDoc), or you read looking for your
delimiters while storing whatever is *not* a delimiter in a char buffer,
untouched, instead of writing it out directly. For instance,
BufferedReader implements reading line by line. I suppose other
implementations allow reading with a different delimiter.
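
A minimal sketch of the first option, assuming the whole text has already been decoded into a String (Java 7 or later). The class name CodePointWalk and the helper characters() are made up for illustration; only String.codePointAt(), Character.charCount() and Character.toChars() come from the standard API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CodePointWalk {

    // Returns one String per Unicode code point, so a surrogate pair stays together.
    static List<String> characters(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);                   // full code point starting at index i
            out.add(new String(Character.toChars(cp)));  // one or two chars
            i += Character.charCount(cp);                // advance by 1 (BMP) or 2 (supplementary)
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for file content: UTF-8 octets decoded into a String first.
        byte[] utf8 = "\u00fcA\uD834\uDD1E".getBytes(StandardCharsets.UTF_8);
        String text = new String(utf8, StandardCharsets.UTF_8);
        for (String ch : characters(text)) {
            System.out.printf("U+%04X uses %d char(s)%n", ch.codePointAt(0), ch.length());
        }
    }
}
```

The loop never looks at a lone half of a surrogate pair, which is the case a plain char-by-char read gets wrong.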

Lothar Kimmeringer · Jan 27, 2010

moonhkt said:
The code below does not work.
[...]

char[] ch = new char[];

Because it doesn't compile.

What exactly doesn't work? Do you get wrong output, or do you
get an exception (which you ignore in the source you provided)? A
bit more information would really help to be able to answer
more than "something will be wrong in your code".

Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
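
For reference, a sketch of how the posted reader might look once it compiles, assuming Java 7 or later. The class name ReadUtf8Chars and the helper decode() are illustrative, not from the original post. It gives the array a size, uses the count returned by read(char[]), and reports exceptions instead of swallowing them.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReadUtf8Chars {

    // Decodes a UTF-8 byte stream into a String through a fixed-size char buffer.
    static String decode(InputStream bytes) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] ch = new char[1024];  // an array creation needs a size
        try (Reader in = new BufferedReader(
                new InputStreamReader(bytes, StandardCharsets.UTF_8))) {
            int n;
            while ((n = in.read(ch)) != -1) {
                out.append(ch, 0, n);  // only the first n chars of the buffer are valid
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // File name taken from the original post.
        try (InputStream in = new FileInputStream("utf8_test.text")) {
            System.out.print(decode(in));
        } catch (IOException e) {
            e.printStackTrace();  // report the problem instead of ignoring it
        }
    }
}
```

Whether the printed text looks right still depends on the console, which is a separate question from decoding the file.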

moonhkt · Jan 27, 2010

Thanks. I found the example below, but I cannot get the UTF-8 char code from it.

class CodePointAtstring
{
    public static void main(String[] args)
    {
        // Declaration of String
        String a = "\u00fc" + "\u34d7" + "Welcome to Rose india";
        // Displays the actual String declared above
        System.out.println("GIVEN STRING IS=" + a);
        // Returns the character (Unicode code point) at the specified index.
        System.out.println("Unicode code point at position 0 IN THE STRING IS=" + a.codePointAt(0));
        System.out.println("Unicode code point at position 1 IN THE STRING IS=" + a.codePointAt(1));
        System.out.println("Unicode code point at position 2 IN THE STRING IS=" + a.codePointAt(2));
        System.out.println("Unicode code point at position 3 IN THE STRING IS=" + a.codePointAt(3));
        System.out.println("Unicode code point at position 6 IN THE STRING IS=" + a.codePointAt(6));
    }
}

Output
java CodePointAtstring
GIVEN STRING IS=³?Welcome to Rose india
Unicode code point at position 0 IN THE STRING IS=252
Unicode code point at position 1 IN THE STRING IS=13527
Unicode code point at position 2 IN THE STRING IS=87
Unicode code point at position 3 IN THE STRING IS=101
Unicode code point at position 6 IN THE STRING IS=111

RedGrittyBrick · Jan 27, 2010

moonhkt said:
Thanks. I found the example below, but I cannot get the UTF-8 char code from it.

What do you mean by "UTF-8 char code"? Strictly speaking there is no
such thing. You might mean "Unicode code-point" or "sequence of octets
in UTF8-encoding"

moonhkt said:
[...]
Unicode code point at position 0 IN THE STRING IS=252
Unicode code point at position 1 IN THE STRING IS=13527
[...]

That seems completely reasonable to me because 252 = 0x00fc and 13527 =
0x34d7.

Nothing in your program has anything to do with UTF-8 encoding.
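
To make the distinction concrete, here is a small sketch that prints both the Unicode code point and the sequence of octets in UTF-8 encoding for each character. The class name CodePointVsUtf8 and the helper utf8Hex() are made up for this illustration; String.getBytes(Charset) is the standard call doing the encoding.

```java
import java.nio.charset.StandardCharsets;

class CodePointVsUtf8 {

    // Formats the UTF-8 octets of one code point as lowercase hex, separated by spaces.
    static String utf8Hex(int codePoint) {
        byte[] octets = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));  // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u00fc\u34d7W";
        for (int i = 0; i < a.length(); ) {
            int cp = a.codePointAt(i);
            System.out.printf("U+%04X (decimal %d) -> UTF-8 octets %s%n", cp, cp, utf8Hex(cp));
            i += Character.charCount(cp);
        }
    }
}
```

For the first two characters this gives 252 with octets c3 bc and 13527 with octets e3 93 97: the code point is a property of the character, the octets only exist once an encoding is applied.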

moonhkt · Jan 28, 2010

Hi All
I want to output the characters in the string one by one.
Right now, codePointAt just prints the code point values.

Lew · Jan 28, 2010

Please, do not top-post.

moonhkt said:
I want to output the characters in the string one by one.
Right now, codePointAt just prints the code point values.

'codePointAt()' doesn't print anything. How are you actually printing it?

'codePointAt()' returns an int, not a character.
<http://java.sun.com/javase/6/docs/api/java/lang/String.html#codePointAt(int)>

Most methods that output an int show the int value, not the equivalent
character. If you want to display an int as a character, you have to use a
method that will do that. I don't know offhand of a method in the standard
API that does that, but perusal of the Javadocs might reveal one, otherwise
you'll have to code one yourself or find a third-party library that already
has such.
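
For what it's worth, the standard API does appear to cover this: Character.toChars(int) and StringBuilder.appendCodePoint(int), both present since Java 5, turn an int code point back into text. A minimal sketch follows; the class name CodePointToText and the helper toText() are hypothetical.

```java
class CodePointToText {

    // Converts one code point into a String of one or two chars.
    static String toText(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String a = "\u00fc\u34d7Welcome to Rose india";

        int cp = a.codePointAt(0);
        System.out.println(cp);          // prints the int value
        System.out.println(toText(cp));  // prints the character, console permitting

        // Alternative: let StringBuilder do the conversion.
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(a.codePointAt(1));
        System.out.println(sb);
    }
}
```

Both routes also handle code points above 0xFFFF, where a simple (char) cast would silently truncate the value.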

Roedy Green · Jan 28, 2010

RedGrittyBrick said:
What do you mean by "UTF-8 char code"? Strictly speaking there is no
such thing. You might mean "Unicode code-point" or "sequence of octets
in UTF8-encoding"

The point of an encoding is that it hides the details of how 16-bit chars
are inserted into an 8-bit stream. All you are interested in is the 16-bit
Java char value, or perhaps the Java code point value if the text also
contains characters that need more than 16 bits.

RedGrittyBrick · Jan 28, 2010

moonhkt said:
I want to output the characters in the string one by one.
Right now, codePointAt just prints the code point values.

Why not use String's length() and charAt() methods?

I assume you can disregard characters outside Unicode's Base
Multilingual Plane (BMP) - if not, I think you'll have to check for
surrogate pairs. Characters outside the BMP are too big for a char.

-------------------------------------8<-----------------------------------
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class UnicodeChars {
    public static void main(String[] args)
            throws UnsupportedEncodingException {

        // I want console output in UTF-8
        PrintStream sysout = new PrintStream(System.out, true, "UTF-8");

        // \u00fc is LATIN SMALL LETTER U WITH DIAERESIS;
        // \u34d7 is a character in CJK Unified Ideographs Extension A.
        // \uD834\uDD1E is the surrogate pair for character U+1D11E.
        // U+1D11E is MUSICAL SYMBOL G CLEF;
        String a = "\u00fc\u34d7Welcome to Rose India \uD834\uDD1E.";

        int n = a.length();
        sysout.println("GIVEN STRING IS=" + a);
        sysout.printf("Length of string is %d%n", n);
        sysout.printf("CodePoints in string is %d%n",
                a.codePointCount(0, n));
        for (int i = 0; i < n; i++) {
            sysout.printf("Character[%d] is %s%n", i, a.charAt(i));
        }
    }
}
-------------------------------------8<-----------------------------------
GIVEN STRING IS=ü㓗Welcome to Rose India 𝄞.
Length of string is 27
CodePoints in string is 26
Character[0] is ü
Character[1] is 㓗
Character[2] is W
Character[3] is e
Character[4] is l
Character[5] is c
Character[6] is o
Character[7] is m
Character[8] is e
Character[9] is
Character[10] is t
Character[11] is o
Character[12] is
Character[13] is R
Character[14] is o
Character[15] is s
Character[16] is e
Character[17] is
Character[18] is I
Character[19] is n
Character[20] is d
Character[21] is i
Character[22] is a
Character[23] is
Character[24] is ?
Character[25] is ?
Character[26] is .
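
The two "?" lines at positions 24 and 25 are the halves of the surrogate pair printed separately. If characters outside the BMP matter, the surrogate check mentioned above could look roughly like this. It is only a sketch (Java 7 or later); the class name SurrogateAwareLoop and the helper split() are invented for the example, while Character.isHighSurrogate() and Character.isLowSurrogate() are standard.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateAwareLoop {

    // Walks the string with charAt() but keeps a high/low surrogate pair in one piece.
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (pair) {
                out.add(s.substring(i, i + 2));  // both halves together
                i++;                             // skip the low surrogate
            } else {
                out.add(String.valueOf(c));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "\u00fc\u34d7Welcome to Rose India \uD834\uDD1E.";
        List<String> parts = split(a);
        System.out.printf("chars=%d, characters=%d%n", a.length(), parts.size());
        for (int i = 0; i < parts.size(); i++) {
            System.out.printf("Character[%d] is %s%n", i, parts.get(i));
        }
    }
}
```

With the sample string this yields 26 entries for 27 chars, matching the codePointCount() value in the output above.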

moonhkt · Jan 28, 2010

Yes, this is what I want.
But my output is not the same as yours. You are correct.

Run in JCreator version 4.5:
--------------------Configuration: <Default>--------------------
GIVEN STRING IS=ç¾¹?î¢elcome to Rose India ??.
Length of string is 27
CodePoints in string is 26
Character[0] is ç¾¹
Character[1] is ??
Character[2] is W
Character[3] is e
Character[4] is l
Character[5] is c
Character[6] is o
Character[7] is m
Character[8] is e
Character[9] is
Character[10] is t
Character[11] is o
Character[12] is
Character[13] is R
Character[14] is o
Character[15] is s
Character[16] is e
Character[17] is
Character[18] is I
Character[19] is n
Character[20] is d
Character[21] is i
Character[22] is a
Character[23] is
Character[24] is ?
Character[25] is ?
Character[26] is .

Process completed.

RedGrittyBrick · Jan 28, 2010

PLEASE DON'T TOP-POST, PLEASE PUT YOUR REPLY AT THE BOTTOM, BELOW ANY
QUOTED TEXT. THANKS!

moonhkt said:
Yes, this is what I want.
But my output is not the same as yours. You are correct.

Run in JCreator version 4.5

I am using Eclipse. To display UTF-8 encoded Unicode characters written
to the console, I had to configure Eclipse. Perhaps you need to
configure JCreator so that you can display Unicode characters?

GIVEN STRING IS=ç¾¹?î¢elcome to Rose India ??.
Length of string is 27
CodePoints in string is 26
Character[0] is ç¾¹
Character[1] is ??
Character[2] is W
Character[3] is e [...]
Character[23] is
Character[24] is ?
Character[25] is ?
Character[26] is .

You used Google Groups to post. It seems Google Groups uses
quoted-printable to encode non-ASCII characters,
e.g. =3D=E7=BE=B9?=EE=A2=ADelcome ...
I find it hard to fathom how that sequence of octets was derived.
AFAIK \u00fc\u34d7 should encode to octets c3 bc e3 93 97.
Perhaps Google Groups is hampering communications. As you seem to be a
user of Mozilla Firebird, have you tried using Mozilla Thunderbird to
read this newsgroup directly from your ISP's NNTP service?

I suspect your remaining problems are due to the configuration of
JCreator or your operating system.

Lew · Jan 28, 2010

RedGrittyBrick said:
PLEASE DON'T TOP-POST, PLEASE PUT YOUR REPLY AT THE BOTTOM, BELOW ANY
QUOTED TEXT. THANKS!

Actually, it's better to post inline, with comments interspersed with
quoted material.

RedGrittyBrick · Jan 29, 2010

Lew said:
Actually, it's better to post inline, with comments interspersed with
quoted material.

One step at a time!

Lothar Kimmeringer · Jan 30, 2010

moonhkt said:
I want output the Character in the string one by one.

If by "output" you mean printing it out on the console,
you have to make sure that the console is actually capable
of printing Unicode characters.

The ? in the second position indicates that it isn't, so
there is no way to print it out that way. Judging by how the first
character is rendered, the console most likely runs with
CP850, which is commonly used with DOS boxes in Europe.

Regards, Lothar
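
The CP850 guess can be checked with a small experiment. The sketch below assumes, hypothetically, that the JVM wrote single-byte Latin-1 style output and that the console interpreted those bytes as CP850 (charset name IBM850 in Java, which is an optional charset and may be missing from some runtimes). The class name ConsoleCodePageDemo and the helper misread() are made up for the example.

```java
import java.nio.charset.Charset;

class ConsoleCodePageDemo {

    // Encodes text with one charset and decodes the same bytes with another,
    // imitating a program and a console that disagree about the encoding.
    static String misread(String text, String writtenAs, String shownAs) {
        byte[] octets = text.getBytes(Charset.forName(writtenAs));
        return new String(octets, Charset.forName(shownAs));
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("IBM850")) {
            System.out.println("IBM850 is not available in this runtime");
            return;
        }
        // Assumption: the program writes ISO-8859-1, the console shows CP850.
        String shown = misread("\u00fc\u34d7Welcome", "ISO-8859-1", "IBM850");
        for (int i = 0; i < 2; i++) {
            System.out.printf("position %d shows U+%04X%n", i, (int) shown.charAt(i));
        }
    }
}
```

Under these assumptions position 0 shows U+00B3 (superscript three) and position 1 shows U+003F (a question mark, because the CJK character has no Latin-1 encoding), which is the same pattern as the first two characters in the console output posted earlier in the thread.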
