scanning UTF-8 characters

  • Thread starter Kamal R. Prasad

Kamal R. Prasad

Hello,

I am using a lexer (a lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that in a UTF-8 character, the first non-ASCII byte
will be > 0x7f. If I look for that in yytext, will that suffice? Is
there some standard function that one can use to operate on the input
stream? I want my code to be locale-agnostic.

thanks
-kamal
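
As a concrete illustration of the check the question describes, here is
a minimal C sketch (not from the original post; the helper name is
invented): it reports whether a matched token contains any byte with
the high bit set, which is what you would apply to lex's yytext/yyleng.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal sketch: report whether a token contains any byte with the
 * high bit set, i.e. any byte belonging to a non-ASCII UTF-8
 * sequence (lead bytes and continuation bytes are all >= 0x80).
 * Pass lex's yytext and yyleng.  Helper name is invented. */
static bool token_has_non_ascii(const char *text, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((unsigned char)text[i] > 0x7F)
            return true;
    return false;
}
```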
 

Jack Klein

Hello,

I am using a lexer (a lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that in a UTF-8 character, the first non-ASCII byte
will be > 0x7f. If I look for that in yytext, will that suffice? Is
there some standard function that one can use to operate on the input
stream? I want my code to be locale-agnostic.

thanks
-kamal

Neither lex nor UTF-8 is defined by the C language. Information on
UTF-8 can be obtained from http://www.unicode.org. Questions about
lex can be asked in
 

Micah Cowan

Kamal R. Prasad said:
Hello,

I am using a lexer (a lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that in a UTF-8 character, the first non-ASCII byte
will be > 0x7f. If I look for that in yytext, will that suffice? Is
there some standard function that one can use to operate on the input
stream? I want my code to be locale-agnostic.

Not really topical here in clc and clcm, I'm afraid. I've redirected
to comp.unix.programmer, where I believe you'll find more people able
to answer your question.

The /first/ byte of a multi-byte sequence will actually be 0xC2 or
greater (0xC0 and 0xC1 never occur in valid UTF-8). But, yeah, you
should test for the high bit: /all/ of the bytes of a multi-byte
character, the lead byte and its continuation bytes alike, are greater
than 0x7f. The lead byte also encodes how many bytes there are, total,
for this character.
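
To illustrate the point about the lead byte encoding the sequence
length, here is a small C sketch (mine, not from the thread; the
function name is made up):

```c
#include <stddef.h>

/* Sketch: classify a UTF-8 lead byte.  Returns the total number of
 * bytes in the sequence that starts with b, or 0 if b is a
 * continuation byte (0x80..0xBF) or an invalid lead byte
 * (0xC0, 0xC1, 0xF5..0xFF). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* plain ASCII */
    if (b < 0xC2) return 0;   /* continuation byte or overlong lead */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF5) return 4;   /* 11110xxx, code points up to U+10FFFF */
    return 0;                 /* beyond the Unicode range */
}
```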

As to how this fits in with lex, I'm not really qualified to say
much. Is it sufficient to look for the high bit? It depends on what
you intend to do after you've found one. And to be locale-agnostic,
you'll probably need something to convert the locale's encoding into
UTF-8 before scanning.
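
On a POSIX system, one way to do that conversion is iconv. A hedged
sketch (error handling kept minimal, function name invented), assuming
setlocale(LC_ALL, "") has already been called at startup:

```c
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Sketch: convert a string from the current locale's codeset to UTF-8
 * using POSIX iconv.  Assumes setlocale(LC_ALL, "") was called at
 * program startup.  A production filter would loop on E2BIG and
 * report errors instead of silently failing. */
static int locale_to_utf8(const char *in, char *out, size_t outsize)
{
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;   /* iconv's prototype is not const-clean */
    size_t srcleft = strlen(in);
    char *dst = out;
    size_t dstleft = outsize - 1;

    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *dst = '\0';
    return 0;
}
```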
 

Yang Jiao

I don't know whether a lexer (in my case, flex) can do anything to
identify UTF-8 characters; I'm afraid you will have to handle that in
your own code.
 

Jasen Betts

["Followup-To:" header set to comp.lang.c.moderated.]
I am using a lexer (a lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that in a UTF-8 character, the first non-ASCII byte
will be > 0x7f. If I look for that in yytext, will that suffice?

In most cases. There is also, IIRC, a mechanism called windowing in
some encodings that can substitute other symbols into the 0x00 to
0x7f range.
Is there some standard function that one can use to operate on the
input stream? I want my code to be locale-agnostic.

If you treat bytes above 0x7f as if they were ordinary letters and make
no assumptions about word length or display width, you should be fairly
safe.

If you're hoping to identify digits and punctuation in other scripts
(Chinese, Sinhala, Sanskrit, Klingon, etc.), you'll need to convert
your UTF-8 stream to Unicode code points and pass those to the lexer.
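
A sketch of that conversion (mine, not from the thread): decode one
UTF-8 sequence into a Unicode code point without touching the locale.
It rejects overlong forms, surrogates, and out-of-range values.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: decode one UTF-8 sequence starting at s[0] into a Unicode
 * code point.  Returns the number of bytes consumed, or 0 on an
 * invalid sequence. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    if (len == 0) return 0;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }   /* ASCII */

    size_t need;
    uint32_t c;
    if      (s[0] >= 0xC2 && s[0] <= 0xDF) { need = 2; c = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { need = 3; c = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { need = 4; c = s[0] & 0x07; }
    else return 0;                 /* stray continuation or invalid lead */

    if (len < need) return 0;
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* reject overlong encodings, surrogates, and values past U+10FFFF */
    if ((need == 2 && c < 0x80) || (need == 3 && c < 0x800) ||
        (need == 4 && c < 0x10000) ||
        (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;

    *cp = c;
    return need;
}
```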


Bye.
Jasen
 

Douglas A. Gwyn

Kamal R. Prasad said:
I am using a lexer (a lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that in a UTF-8 character, the first non-ASCII byte
will be > 0x7f. If I look for that in yytext, will that suffice? Is
there some standard function that one can use to operate on the input
stream? I want my code to be locale-agnostic.

You need to check that your version of "lex" supports wide characters,
which most do not. Otherwise you have to lex every possible character
into a token, which is almost certainly not what you want to do.

In most situations it is easier to hand-code a lexer than to use "lex",
and this is a case where that is even more likely to be true.

Convert the UTF-8 to Unicode code points (up to 31 bits in the original
UCS design, at most U+10FFFF in Unicode proper) and handle characters
solely as "wide" characters throughout.
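
For the "handle everything as wide characters" approach, the standard C
route is mbrtowc(). A hedged sketch; it assumes a UTF-8 locale is
available and that wchar_t values are Unicode code points (true on
glibc, not guaranteed everywhere):

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Sketch: walk a UTF-8 string with the standard mbrtowc(), printing
 * each decoded code point.  Assumes the "en_US.UTF-8" locale exists. */
int main(void)
{
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
        return 1;                        /* locale not available */

    const char *s = "caf\xC3\xA9";       /* "café" encoded as UTF-8 */
    mbstate_t st;
    memset(&st, 0, sizeof st);

    const char *p = s;
    size_t left = strlen(s);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                       /* invalid or truncated sequence */
        if (n == 0)
            n = 1;                       /* decoded an embedded NUL */
        printf("U+%04lX\n", (unsigned long)wc);
        p += n;
        left -= n;
    }
    return 0;
}
```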
 
