how to read a Unicode file

S

starffly

I want to read a xml file in Unicode, UTF-8 or a native encoding
into a wchar_t type string, so i write a routine as follows, however,
sometimes a Unicode file including Chinese character cannot be read
completely. and I cannot tell where its root located, so NEED your
help, GIVE me a hand please.
THX.
static Status LoadXMLFile2String(const char *filename, wchar_t *text){
FILE *f;
if(!(f = fopen(filename, "r"))){
__printDebugA("Input file %s cannot be opened.", filename);
return ERROR;
}
char *encoding;
//transform routine: other --> unicode --> other
const unsigned char UTF_8_HEAD[3] = {239, 187, 191};
const unsigned char UNICODE_HEAD[2] = {255, 254};
const unsigned char UNICODE_BIGENDIAN_HEAD[2] = {254, 255};
unsigned char head[3];
fread(head, 1, 3, f);
if(!memcmp(head, UNICODE_HEAD, 2)){
encoding = "UNICODE";
}
else if(!memcmp(head, UNICODE_BIGENDIAN_HEAD, 2)){
encoding = "UNICODE_BIGENDIAN";
}
else if(!memcmp(head, UTF_8_HEAD, 3)){
encoding = "UTF_8";
}
else{
encoding = "ANSI";
}
char *str = (char *) malloc((MAXXMLFILESIZE + 1) * sizeof(char));
int i = 0;
if(!strcmp(encoding, "ANSI")){
str[0] = head[0];
str[1] = head[1];
str[2] = head[2];
i = 3;
}
else if(!strcmp(encoding, "UNICODE") || !strcmp(encoding,
"UNICODE_BIGENDIAN")){
str[0] = head[2];
i = 1;
}
while(!feof(f)){
if(i >= MAXXMLFILESIZE){
db_error(L"The file is too large.");
return ERROR;
}
str = fgetc(f);
i++;
}
str = '\0';
if(!strcmp(encoding, "UNICODE")){
for(int j = 0; j < i - 1; j++){
if(j % 2){
text[j/2] += ((unsigned char) str[j]) << 8;
}
else{
text[j/2] = (unsigned char) str[j];
}
}
text[j/2] = 0;
//db_debug(L"%d", wcslen(text));
}
else if(!strcmp(encoding, "UNICODE_BIGENDIAN")){
for(int j = 0; j < i; j++){
if(j % 2){
text[j/2] = (text[j/2] << 8) + (unsigned char) str[j];
}
else{
text[j/2] = (unsigned char) str[j];
}
}
text[j/2] = 0;
}
else if(!strcmp(encoding, "UTF_8")){
UTF2Unicode(str, text);
}
else if(!strcmp(encoding, "ANSI")){
setlocale(LC_CTYPE, "");
mbstowcs(text, str, MAXXMLFILESIZE + 1);
}
else{
assert(FALSE);
}
free(str);
fclose(f);
return OK;
}
 
?

=?iso-8859-1?q?Kirit_S=E6lensminde?=

I want to read a xml file in Unicode, UTF-8 or a native encoding
into a wchar_t type string, so i write a routine as follows, however,
sometimes a Unicode file including Chinese character cannot be read
completely. and I cannot tell where its root located, so NEED your
help, GIVE me a hand please.
THX.
[code sniped]

This code is horrible on so many levels. Mostly I suspect because it is
in C rather than C++.

You will have something much easier to work with if you reformulate
this in C++ and apply some more useful abstractions to it.

As for your error, you are only checking a few encodings and assuming
that there is a BOM to tell you which to use. You need to check the XML
prolog. It may be that the Chinese file is using a different encoding.


K
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,814
Latest member
SpicetreeDigital

Latest Threads

Top