printf and UTF-8 in linux

sas · Sep 18, 2009

I have a problem printing cyrillic text to stdout in Linux. I know
that it has to be UTF-8. I'm trying to read a symbol, guess that it is
cyrillic encoded as CP1251, and if so output it as cyrillic in UTF-8,
but my code so far doesn't work.

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int
main()
{
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }

    while (!feof(stdin)) {
        wchar_t c = fgetc(stdin);

        // 'А'-'Я'
        if (c >= 0xc0 && c <= 0xdf)
        {
            c -= 0xc0;
            c += 0x410;
        }

        // 'а'-'я'
        if (c >= 0xe0 && c <= 0xff)
        {
            c -= 0xe0;
            c += 0x430;
        }

        printf("%lc", c);
    }

    return 0;
}

Morris Keesan · Sep 18, 2009

I don't know enough about Cyrillic character sets, and international
character sets in C in general, to be able to help with the question
you're asking, but this:

while (!feof(stdin)) {
wchar_t c = fgetc(stdin);

doesn't do what you expect. feof(file) only returns true if the
end-of-file indicator has been set for the file, and that indicator
only gets set after you've tried to read one char past the end of
the file. So you'll always try to process c one extra time, when
its value is equal to (wchar_t)EOF.

A common idiom is
    int c;
    while ((c = fgetc(stdin)) != EOF)
    {
        ...

Ben Bacarisse · Sep 18, 2009

sas said:
I have a problem printing cyrillic text to stdout in Linux. I know
that it has to be UTF-8. I'm trying to read a symbol, guess that it is
cyrillic encoded as CP1251, and if so output it as cyrillic in UTF-8,
but my code so far doesn't work.

The problem is quite likely to be with the input. If the locale is
set for UTF-8 output, how are you going to enter a CP1251 character?

The program actually works fine (despite one oddity) if the input is
as expected. To test it I had to generate a CP1251 file first and use
redirection to get the program to read it.

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int
main()
{
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }

    while (!feof(stdin)) {
        wchar_t c = fgetc(stdin);

fgetc returns an int so this is a little odd. CP1251 is a single-byte
character set so you don't need a wchar_t to hold it.

Also it is better to test for EOF after trying to read a character or
you will process the EOF:

    int ch;
    while ((ch = fgetc(stdin)) != EOF) { ... }

is the usual pattern.

Now you do need a wchar_t to hold the wide character:

wchar_t c = ch;

sas · Sep 19, 2009

The problem is quite likely to be with the input. If the locale is
set for UTF-8 output, how are you going to enter a CP1251 character?

That is my problem! I have no idea how locales work, or even which
locale I should use, can you give me some information?

The program actually works fine (despite one oddity) if the input is
as expected. To test it I had to generate a CP1251 file first and use
redirection to get the program to read it.

Yes, it works fine in the console, showing the correct cyrillic
letters, but mplayer (the player I use in linux) still shows garbled
text. I want to convert *.srt files that I have in CP1251 to something
that's usable under Linux. Does mplayer use a different locale?

Ben Bacarisse · Sep 19, 2009

sas said:
The problem is quite likely to be with the input. If the locale is
set for UTF-8 output, how are you going to enter a CP1251
character?

[some typos corrected]

That is my problem! I have no idea how locales work, or even which
locale I should use, can you give me some information?

Someone (who actually knows) could write a book on that. The locale
setting determines various things about your C program. C itself says
very little about exactly what happens, so most of it is off topic
here. However, you did the C bits pretty much correctly.

Yes, it works fine in the console, showing the correct cyrillic
letters, but mplayer (the player I use in linux) still shows garbled
text. I want to convert *.srt files that I have in CP1251 to something
that's usable under Linux. Does mplayer use a different locale?

That's nothing to do with C. I think you need to ask what encodings
mplayer understands but I can't suggest the best place for that.

Why are you writing this? My first thought would have been that
someone must have written this already. See man iconv.

sas · Sep 20, 2009

Why are you writing this? My first thought would have been that
someone must have written this already. See man iconv.

I didn't know about this program, thanks. I googled about displaying
cyrillic subtitles in Linux, and when I couldn't find anything that
works, thought it would be faster to try and make a small program
myself. But iconv works great, so I don't need that anymore. Thanks
for letting me know about this program.

Nobody · Sep 20, 2009

I didn't know about this program, thanks. I googled about displaying
cyrillic subtitles in Linux, and when I couldn't find anything that
works, thought it would be faster to try and make a small program
myself. But iconv works great, so I don't need that anymore. Thanks
for letting me know about this program.

iconv is primarily a library function, although many implementations also
provide a program by that name (on Linux, both the function and the
program are provided by GNU libc).

An alternative is the ANSI C functions mbstowcs() and wcstombs() ("wcs"
stands for "wide character string", mbs for "multi-byte string"). These
convert to and from the encoding of the current locale (as set by
setlocale(LC_CTYPE, ...)). This can make them easier to use than iconv()
(if you just need to convert between Unicode and the locale's encoding) or
harder (if you need to convert between arbitrary encodings).
