Multiple Language Website


GS

Hi there. I hope this is the right place for what should be a simple
question.

I have a website that is in English and now in Arabic. I am creating the
Arabic language content now, and am having a few problems getting the
content to display properly.

When I edit the files with the Arabic characters on my Windows box, in say
Notepad, the Arabic gets stripped unless I save it as a Unicode document
(ANSI strips the Arabic and converts the chars into question marks). Now,
when I upload the Unicode document to my webserver, instead of parsing the
document normally, it is just displaying the actual contents of the file,
literally (it is a PHP page, so you see the <??> and other actual code being
displayed). Any idea what I am doing wrong? I am not sure what the problem
might be (i.e. file format, ftp transfer mode, web-server config, etc) so I
thought I would start here.

I am using the meta tag:
<meta http-equiv="Content-Type" content="text/html;charset=windows-1252">

Should I be using:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> ?

Will this cure the code display issue?

Thank you for any help you can offer,

GS
 

Jukka K. Korpela

GS said:
When I edit the files with the Arabic characters on my Windows box,
in say Notepad, the Arabic gets stripped unless I save it as a
Unicode document

Why do you use Notepad? There are nice multilingual editors available,
with much better features.
(ANSI strips the Arabic and converts the chars
into question marks).

No, the American National Standards Institute does not strip anything.
But Microsoft software, which falsely calls a Microsoft proprietary
encoding "ANSI", does something like that, since that encoding has no
codes for any Arabic letters.
Now, when I upload the Unicode document to
my webserver, instead of parsing the document normally, it is just
displaying the actual contents of the file, literally (it is a PHP
page, so you see the <??> and other actual code being displayed).

If you want real help, post a real URL. It will not tell everything,
especially when PHP is involved, but it is a start. Also please specify
the browser(s) you used for testing.
Any idea what I am doing wrong? I am not sure what the problem
might be (i.e. file format, ftp transfer mode, web-server config,
etc) so I thought I would start here.

Well, we cannot even know what the FTP transfer mode was. Surely it
should have been binary.
I am using the meta tag:
<meta http-equiv="Content-Type"
content="text/html;charset=windows-1252">

This may matter, or it may not, depending on the actual HTTP headers.
It is certainly wrong, anyway, if the encoding is UTF-8 and not
windows-1252. _Why_ do you use it?
Should I be using:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
?

Will this cure the code display issue?

You mean you did not test that before posting?

Of course, testing would not prove much. But if your document is, in
fact, UTF-8 encoded, as it sounds, then surely it should not contain a
meta tag that says otherwise. On the other hand, a meta tag is neither
necessary nor sufficient - it will be overridden by actual HTTP
headers, if they specify the encoding.
 

GS

Jukka K. Korpela said:
Why do you use Notepad? There are nice multilingual editors available,
with much better features.

Simply because I only had access to a locked-down machine that I was unable
to install a better editor on. Any suggestions?
No, the American National Standards Institute does not strip anything.
But Microsoft software, which falsely calls a Microsoft proprietary
encoding "ANSI", does something like that, since that encoding has no
codes for any Arabic letters.

My apologies, I meant Microsoft ANSI then.
If you want real help, post a real URL. It will not tell anything,
especially when PHP is involved, but it is a start. Also please specify
the browser(s) you used for testing.

Browsers: IE 6.x, Firefox 1.0.3

Don't have a URL right now, as I took down the test page due to the code
being shown.
Well, we cannot even know what the FTP transfer mode was. Surely it
should have been binary.

FTP mode was indeed binary, sorry for not mentioning it. As I did mention, I
am just starting to try to figure this out. I imagined someone here had at
one time had this exact problem and would know exactly what was going on.
This may matter, or it may not, depending on the actual HTTP headers.
It is certainly wrong, anyway, if the encoding is UTF-8 and not
windows-1252. _Why_ do you use it?

I use windows-1252 because I have seen in other places that it should be
used to alert browsers to incoming text that may have many different
character variations, including right-to-left. Looking at many different
Arabic websites, they seem to make use of this meta tag as well.
You mean you did not test that before posting?

I did, but it made no difference at the time, and I was not sure whether it
was needed. This should have been broken out into a second question. I
should have asked:

If I want to display English and Arabic on the same page, which meta tag
will be more appropriate, and does this meta tag override what the webserver
sends for a header (which you answered below, thank you)?

Currently, my Apache webserver is sending
Content-Type: text/html; charset=iso-8859-1. Is this an appropriate header
for displaying Arabic, etc.?
 

Andreas Prilop


Toby Inkster

GS said:
Should I be using:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> ?

Perhaps.

Will this cure the code display issue?

No.

I'm guessing that you have a file naming or server configuration issue.
 

N Cook

GS said:
When I upload the Unicode document to my webserver, instead of parsing
the document normally, it is just displaying the actual contents of the
file, literally (it is a PHP page, so you see the <??> and other actual
code being displayed). Any idea what I am doing wrong?

Probably related to the problem I had and have now solved.

Foreign Unicode script in a file corrupted the Google cached version of
an otherwise English page.

I downloaded the hex editor XVI32 from
http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
That allowed me to remove the two bytes ÿþ / hex FF,FE / decimal 255,254 /
y with diaeresis and thorn, which clog up the front of the file and which
you cannot see, let alone edit out, in Word or Notepad.

Apparently this is prepended to denote that the file contains Unicode:
the BOM (Byte Order Mark), also Zero Width Non-Breaking Space (ZWNBSP).
Google's cache interprets this as inter-character spaces throughout the
cached version, with consequent loss of HTML action. The preview pane on
Google is also corrupted because of the spaces mangling the HTML. I'm
surprised there is nothing on Google's FAQ pages about this. Putting "ÿþ"
and "h t m l" into Google produced 206,000 hits. Randomly selecting 5x10
of those showed 44 were mangled, so perhaps about 180,000 such affected
files.

With the hex editor, also "Replace All" the inter-character 00 bytes to
blank/empty, which also reduces the file size by half.

Then it is a matter of converting the foreign code characters like hex
code [ 05D2 ] to decimal code [ & # 1 4 9 0 (no spaces) ], which Google's
cache seems to like, and browsers too. For smallish amounts of text: in
Word, convert all end-of-line ^p to * to concatenate, then break up into
lines of about 100 characters. Submit each line in turn to Google (much
more than 100 characters is rejected by Google) and it returns the search
string in &#....; form; highlight and copy it back. In Word, convert *
back to ^p, saving as non-Unicode text in a non-Unicode HTML file, and
compare the result, viewed in a browser, with a .png, .gif, or .jpg image
of the script as a check. Then add it to the English file.

For a load of foreign text, use the block routine in XVI32 and copy the
hex to Word as a .txt file after removing FF,FE and converting all the 00
bytes to 0D0A, and any spaces/punctuation to 2020 or whatever, as 4
characters. That gives a file of lines of 4 characters after converting
0D0A to ^p. Then make a macro for converting adjacent quad alphanumeric
characters to decimal numeric. Finally change ^p to ;&# and tidy up the
punctuation etc.

I used this Yale file as a model; part of it reads correctly as foreign
script in a browser, is cached by Google correctly, and uses a bare
minimum of HTML (e.g. not even a LANG designation):
http://pclt.cis.yale.edu/pclt/encoding/

So with hindsight: just save the foreign text as a Unicode file and
convert it to decimal form before adding it to the full English file;
then you can continue to save as ANSI and retain correct caching of the
HTML on Google.

For anyone else with this problem, but where you have no foreign text in
your file and accidentally saved it as Unicode: without a hex editor you
will not see the ÿþ or the double zeros that Google sees. Suggestion:
rename your file from XYZ.htm to XYZ_old.htm, view it in Internet
Explorer, click View / Source, "Select All" the text, copy it to Notepad,
and name the file XYZ.htm, saving as ANSI and not Unicode. If you want to
check the file, download the XVI32 hex editor (link above) - it's only
about 500 KB, so it only takes a couple of minutes - and compare the two
versions of your file.
 

Jukka K. Korpela

GS said:
Simply because I only had access to a locked-down machine that I
was unable to install a better editor on. Any suggestions?

I think you should try and find a computer that you have some control
over, if you wish to create Arabic Web pages seriously, or any Web
pages seriously. Ultimately it's a matter of your convenience only, but
still.
Don't have a URL right now, as I took down the test page due to the
code being shown.

Umm... the URL would have let us see what the server really sends.
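
By the way, if you have PHP at hand somewhere, you can look at the
headers yourself. A minimal sketch, assuming PHP 5 with allow_url_fopen
enabled; the URL is a placeholder:

<?php
// Fetch and print the response headers for a page, to see what
// Content-Type (and charset parameter) the server actually sends.
$headers = get_headers('http://www.example.com/test.php');
foreach ($headers as $h) {
    echo $h, "\n"; // look for the Content-Type: line
}
?>

Any tool that shows the raw HTTP headers would do equally well.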
I use windows-1252 because I have seen in other places where this
should be used to alert the browsers of incoming text that may have
many different character variations, including right-to-left.

Pardon? Where? Windows-1252 means Windows Latin 1, which has no Arabic
letters, so either you misunderstood something, or those sites do
something that overrides this error.
If I want to display English and Arabic on the same page, which
meta tag will be more appropriate,

This is a whole new question. As a rule, don't mix languages. There are
millions of people who know English but no Arabic, or vice versa. Why
would you throw a foreign language at them? There are some excuses,
most notably a link to an Arabic version of the page in the English
version, or vice versa.

Mixing English and Arabic isn't really much of a problem at the
encoding level, since any encoding that lets you use Arabic letters
lets you use English letters as well. It would be more difficult if you
wanted to combine French and Arabic, for example.
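
If you do mix the two on one page, the markup side is straightforward; a
minimal sketch (the lang and dir attributes are standard HTML, the text
is just an example):

<p>Some English text.</p>
<p lang="ar" dir="rtl">&#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;</p>

The lang attribute labels the language, and dir="rtl" tells the browser
to lay the passage out right to left.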

Forget meta tags, at least for now. Select an encoding, and specify it
in HTTP headers. It could be UTF-8, or it could be ISO-8859-6, for
example. Other things being equal, use UTF-8.
Currently, my Apache webserver is sending
Content-Type: text/html; charset=iso-8859-1. Is this an appropriate
header for displaying Arabic, etc.?

No, because the ISO-8859-1 repertoire is a subset of the windows-1252
(or "Microsoft ANSI") repertoire and thus does not contain any Arabic
letters. The server should be configured to send e.g.
Content-Type: text/html; charset=utf-8
if your files are UTF-8 encoded. If you cannot do that, check if you
can make the server send _no_ charset parameter in that header; _then_
you can effectively specify the encoding in a meta tag. If you cannot
do even that, i.e. the server persistently claims that everything is
ISO-8859-1, then your only option (apart from getting a better server)
for writing Arabic pages is to write all Arabic characters using
character references, like &#1575;. It's possible, but awkward, at
least if you have no nice tool that lets you write normal Arabic and
then converts it to a format with character references.
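
In Apache, that charset parameter is typically governed by the
AddDefaultCharset directive (in httpd.conf or .htaccess):
AddDefaultCharset utf-8 to send UTF-8, or AddDefaultCharset Off to send
no charset at all. And if you do end up stuck with character references,
the conversion is easy to automate. A minimal sketch, assuming PHP with
the mbstring extension; the function name is made up:

<?php
// Turn every non-ASCII character in a UTF-8 string into a decimal
// character reference (&#....;), so the page survives even a server
// that insists on labelling everything ISO-8859-1.
function arabic_to_refs($text)
{
    // Map all code points from 0x80 upwards to numeric references.
    return mb_encode_numericentity($text,
        array(0x80, 0x10FFFF, 0, 0x10FFFF), 'UTF-8');
}

echo arabic_to_refs('مرحبا'); // prints &#1605;&#1585;&#1581;&#1576;&#1575;
?>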
 

Toby Inkster

Jukka said:
The server should be configured to send e.g.
Content-Type: text/html; charset=utf-8
if your files are UTF-8 encoded. If you cannot do that, check if you
can make the server send _no_ charset parameter in that header

The OP has already stated that he's using PHP. In which case, sending an
appropriate header is as simple as putting this in an include file (say
"headers.php"):

<?php
// Send an explicit charset with the Content-Type header.
// Very old browsers (e.g. NCSA Mosaic) reportedly choke on the charset
// parameter, so they get a bare Content-Type instead.
$ua = $_SERVER['HTTP_USER_AGENT'];

if (preg_match('/^Mosaic/', $ua)) {
    header("Content-Type: text/html");
} else {
    header("Content-Type: text/html; charset=utf-8");
}
?>

and then including it at the top of every file like this:

<?php require_once "headers.php"; ?>
<!DOCTYPE ....
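
One caveat: header() only works if no output has been sent yet, so save
headers.php without a BOM and without any whitespace before the <?php
tag; otherwise PHP will complain that headers were already sent and the
charset will never reach the browser.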
 

Dan

N said:
Apparently this is prepended to denote that the file contains Unicode:
the BOM (Byte Order Mark), also Zero Width Non-Breaking Space (ZWNBSP).
Google's cache interprets this as inter-character spaces throughout the
cached version, with consequent loss of HTML action.

Sounds like the page was encoded in a 16-bit encoding such as UTF-16LE
(where every character takes two bytes) rather than a variable-size
encoding where the characters in the US-ASCII range take only one byte.
Perhaps the server wasn't sending proper headers to indicate this
encoding.
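
For example, the letter "A" is the single byte 41 in US-ASCII and UTF-8,
but the byte pair 41 00 in UTF-16LE. Those interleaved 00 bytes are
exactly the "double zeros" you were deleting with the hex editor, and
they are also why software that expects ASCII-compatible bytes (a PHP
parser scanning for a literal "<?php", say) stops recognizing its
markers.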
I'm surprised there is nothing on Google's FAQ pages about this. Putting
"ÿþ" and "h t m l" into Google produced 206,000 hits.

I looked at one of the sites reachable by this, and the server was
sending the proper header of UTF-16LE, but the HTML document had a
bogus meta tag incorrectly claiming the encoding was iso-8859-1. By
the standards, browsers will ignore the meta tag when there's an actual
HTTP header, but perhaps it confuses search engines.
Then it is a matter of converting the foreign code characters like hex
code [ 05D2 ] to decimal code [ & # 1 4 9 0 (no spaces) ], which Google's
cache seems to like, and browsers too.

Actually, you should include a semicolon at the end of numeric
character references.
For smallish amounts of text: in Word, convert all end-of-line ^p to *
to concatenate, then break up into lines of about 100 characters. Submit
each line in turn to Google (much more than 100 characters is rejected
by Google) and it returns the search string in &#....; form; highlight
and copy it back. In Word, convert * back to ^p, saving as non-Unicode
text in a non-Unicode HTML file, and compare the result, viewed in a
browser, with a .png, .gif, or .jpg image of the script as a check. Then
add it to the English file.

That sounds like a really clumsy way of doing it compared to using a
decent editor that lets you choose which character encoding to save as.
And I wouldn't let MS Word touch, in any way, a document I intend to
place on the Web; that program (and anything else from Microsoft) is bad
news for standards compliance.
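
If there is a scripting language to hand, the whole cleanup can be done
in one pass instead. A minimal sketch in PHP, assuming the mbstring
extension; the file names are placeholders:

<?php
// Convert a "Unicode" (UTF-16) file, as saved by Notepad, to BOM-less
// UTF-8. File names are examples only.
$raw = file_get_contents('page_old.htm');

if (substr($raw, 0, 2) === "\xFF\xFE") {
    // FF FE = UTF-16 little-endian BOM (the "ÿþ" seen in a hex editor)
    $text = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
} elseif (substr($raw, 0, 2) === "\xFE\xFF") {
    // FE FF = UTF-16 big-endian BOM
    $text = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
} else {
    $text = $raw; // no BOM: leave the bytes alone
}

file_put_contents('page.htm', $text);
?>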
Suggestion: rename your file from XYZ.htm to XYZ_old.htm

I prefer the extension .html myself, not the dumbed-down three-letter
version designed to be compatible with 10-year-obsolete operating
systems that can't handle longer filenames.
View it in Internet Explorer and click View / Source,

Or, you can use a *decent* browser instead. I use Mozilla.
 

N Cook


This was the 'reply' (human/bot?) I got back from emailing Google help

______

Thank you for your note.

Thank you for your reply. We're happy to hear that this problem has been
resolved. If we can assist you in the future, please don't hesitate to
write.

Regards,
The Google Team

Regards,
The Google Team
 

Alan J. Flavell

Probably related to the problem I had and have now solved.

let's see...
I downloaded the hex editor XVI32 from
http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
That allowed me to remove the two bytes ÿþ / hex FF,FE / decimal 255,254 /
y with diaeresis and thorn, which clog up the front of the file,

Hang on. That's not "two characters", that's a two-byte sequence which you
can read about (well, from what follows maybe you already did) at the
Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#22

FF,FE designates the data as being in UTF-16LE format, so if the data is
not in fact in UTF-16LE format then something has gone unpleasantly wrong
with it before you got it, and I'd recommend finding out what went wrong,
because messing around with corrupt data after the event is not a very
robust way to engineer anything.

Btw, I would not recommend serving out data in any utf-16 format to the
web even if you did have it... (if you need a Unicode format for the web
then utf-8 is the recommended choice).
Apparently this is prepended to denote that the file contains Unicode:
the BOM (Byte Order Mark), also Zero Width Non-Breaking Space (ZWNBSP).

Well, it means one or the other, yes, although it certainly doesn't mean
ZWNBSP if the coding is iso-8859-1 or windows-1252 or utf-8. I think the
FAQ tries to clear this up a bit.

In short it seems your server was saying one thing and the data was trying
to say something else, and some recipients (such as Google) were
understandably confused.


metacomment: my normal news server does not take alt.* groups, so I rarely
participate here. I'd be happy to continue any interesting features of
the discussion in relevant comp.* groups such as
comp.infosystems.www.authoring.html

But I think you've had enough input from A.Prilop and J.Korpela to take
things forward already.
 
