persian languages charset, and what DOCTYPE?

Simon

Hi,

I was asked to have a look at a page that apparently does not display
the Persian language.
The two obvious problems are that the page does not have a doctype or a Charest.

But if I add a Charest and/or DOCTYPE (any of them) to the page, then the
whole page changes (the width changes).

http://journalhome.com/razavi

I have tried it with FF and IE and they both look ok without DOCTYPE and
Charest.

So what defaults are being used? Because whatever I add, it does not display
properly.

Also the user claims that we don't support "persian languages". But I can
see everything fine, (I don't understand what it says, but it 'looks' ok).

So, what Charest/DOCTYPE should I add without breaking the current display?
And why would "persian language" not work for some users?

Simon
 
Alan J. Flavell

I was asked to have a look at a page that apparently does not display
Persian language. ....
http://journalhome.com/razavi

Most of it looks plausible to me, although I don't read Persian
(Farsi).
The two obvious problems are that the page does not have a doctype or a
Charest.

Absence of DOCTYPE is no reason for a browser to fail to display,
although it'll mean the page is rendered in quirks mode by browsers
which do that sort of thing.

Well, a glance at the source indicates that it's been extruded by some
MS Office tool, so I wouldn't expect much.
I have tried it with FF and IE and they both look ok without DOCTYPE
and Charest.

You seem to be consistent in mis-typing that MIME attribute name :-}
So what defaults are being used?

Most of the content appears to have been included as &#number;
references instead of actual coded characters; so specifying any
character encoding (charset=) which includes us-ascii would be
sufficient to get that rendered correctly.
Because whatever I add it does not display properly.

You're not giving very much of a clue as to what kind of "improperly"
you/they are seeing.

There aren't many actual coded characters in the document, which makes
it hard to do diagnostics on that aspect. I can't find an encoding
which is consistent with them all.

My guess is that it's not all in the same encoding, and, as such, is
hopelessly broken. My hunch is that it's in a mixture of Windows-1256
and utf-8, but as I can't actually read Farsi, I could be wrong.
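A minimal Python sketch of why a mixture of encodings defeats any single charset label. The sample bytes are an assumption made up for illustration (the same word stored once as UTF-8 and once as Windows-1256), not data taken from the page, and `try_decodings` is a hypothetical helper name.

```python
def try_decodings(data, encodings):
    """Return {encoding: decoded text, or None if strict decoding fails}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)  # strict error handling by default
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Hypothetical mixed document: the same three-letter word stored once as
# UTF-8 bytes and once as Windows-1256 bytes.
word = "\u062a\u0633\u062a"
mixed = word.encode("utf-8") + b" " + word.encode("cp1256")

report = try_decodings(mixed, ["utf-8", "cp1256"])
for enc, text in report.items():
    print(enc, "->", "fails" if text is None else repr(text))
```

Strict UTF-8 decoding rejects the Windows-1256 half, while a single-byte encoding tends to accept everything and silently produce wrong characters for the other half. Neither label recovers the intended text in both places.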
Also the user claims that we don't support "persian languages".

Pardon? Who's "we", and why should that be a limitation on alt.html?

As for DOCTYPE, there isn't one that fits the kind of garbage that
gets extruded by MS. Whichever of the W3C DOCTYPEs you use, you're
going to get handfuls of validation errors against it. If their
software doesn't supply one, I'd recommend leaving it that way - well,
what I would *really* recommend is changing to some software that's
capable of generating valid HTML, but presumably that isn't an option
for you.
 
Simon

Absence of DOCTYPE is no reason for a browser to fail to display,
although it'll mean the page is rendered in quirks mode by browsers
which do that sort of thing.

Well, a glance at the source indicates that it's been extruded by some
MS Office tool, so I wouldn't expect much.

I guess so.
You seem to be consistent in mis-typing that MIME attribute name :-}

Sorry, I didn't check my spell checker.
Most of the content appears to have been included as &#number;
references instead of actual coded characters; so specifying any
character encoding (charset=) which includes us-ascii would be
sufficient to get that rendered correctly.


You're not giving very much of a clue as to what kind of "improperly"
you/they are seeing.

I am not certain what else to say, really. If I add any doctype, the width of
the document changes (with a horizontal scrollbar).
If I add any charset, the same happens.
If I have neither charset nor doctype, the display is as you see it.
There aren't many actual coded characters in the document, which makes
it hard to do diagnostics on that aspect. I can't find an encoding
which is consistent with them all.

My guess is that it's not all in the same encoding, and, as such, is
hopelessly broken. My hunch is that it's in a mixture of Windows-1256
and utf-8, but as I can't actually read Farsi, I could be wrong.

So, at best I could use "Windows-1256" and that might work. I would have to
ask the user to try, as it is their template.
Pardon? Who's "we", and why should that be a limitation on alt.html?

We, http://www.journalhome.com, as the host; nothing to do with alt.html. I
am only asking for help here.
I am just surprised that it displays the code on some machines (by the looks
of it, yours and mine) and does not work on other machines.
I am guessing that the user's browser understands the &#; but the machine does
not have the fonts to actually display them.
As for DOCTYPE, there isn't one that fits the kind of garbage that
gets extruded by MS. Whichever of the W3C DOCTYPEs you use, you're
going to get handfuls of validation errors against it. If their
software doesn't supply one, I'd recommend leaving it that way - well,
what I would *really* recommend is changing to some software that's
capable of generating valid HTML, but presumably that isn't an option
for you.

A bit strange that both browsers seem to display OK without a DOCTYPE; what
do they use?

Thanks

Simon
 
Jukka K. Korpela

Simon said:
I am not certain what else to say, really. If I add any doctype, the width of
the document changes (with a horizontal scrollbar).

Really _any_ doctype? Anyway, adding a doctype that throws some browsers
into non-quirks (or "standard") mode may surely change something in the
layout.
If I add any charset the same happens.

That sounds rather odd. _Any_ charset? Anyway, there's the _content_
problem that some of the content is apparently distorted, since it's
data in some strange and unspecified encoding. This should have higher
priority in the repair list.
So, at best I could use "Windows-1256" and that might work. I would have to
ask the user to try, as it is their template.

What exactly are you working on? Trying to fix the page, or to help
someone view it despite its being broken? In the latter case, you need
to know the language used on the page and try different encodings to
see whether one of them looks right. In the former case, the information
producer should be requested to specify the encoding or to convert the
data to &#number; format.

If you are the host, then it is your responsibility to inform authors
about the way(s) to make your server send the correct Content-Type
information, with a charset parameter as specified by the author. As the
second best approach, send no charset information (as now) and allow
authors to use .htaccess or similar technique.
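As a rough illustration of the charset parameter mentioned above, here is a small Python sketch that reads it out of a Content-Type value using the standard library. The header values are invented examples, and `charset_from_content_type` is a hypothetical helper name; nothing here reflects what the journalhome.com server actually sends.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Invented example values: one with a charset parameter, one without.
with_charset = charset_from_content_type("text/html; charset=Windows-1256")
without_charset = charset_from_content_type("text/html")
print(with_charset, without_charset)
```

When the parameter is absent, as in the second call, the client is left to fall back on other hints or its own default, which is the "send no charset information" case described above.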

It is _not_ your responsibility as a service provider to find out the
encoding of a document or even to help authors to decide on the encoding
they'll use - assuming, of course, that you have not promised such a
service. It might be a good idea to offer some general guidance, as a
courtesy, but surely you need to know about such matters well before being
able to help others.
I am just surprised that it displays the code on some machines (by the looks
of it, yours and mine) and does not work on other machines.

Which "it" displays which "code" in which sense?
I am guessing that the user's browser understands the &#; but the machine does
not have the fonts to actually display them.

That's quite possible, but how does that relate to the other problems
you have mentioned? It's a user-side problem, and authors may wish to
consider such problems at a general level when making their own decisions.
A bit strange that both browsers seem to display OK without a DOCTYPE; what
do they use?

Browsers don't use DOCTYPEs for anything but misguided guesses on
whether they should display the page in an intentionally broken manner
(i.e., DOCTYPE sniffing).

As a service provider, you don't need to worry about DOCTYPEs (except of
course on your own pages). They are to be provided by authors. You just
need to take care so that your server software does not add any extra
stuff at the start of the document, as some "free" providers do, thereby
messing up DOCTYPE detection. It seems that this is not a problem in
your case.
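A host could check for the situation described above (extra material injected ahead of the author's DOCTYPE) with something like the following Python sketch. It is a deliberate simplification: it only tests whether a DOCTYPE is the first real construct in the document, allowing for a byte order mark, whitespace and comments. Actual browser sniffing rules are more detailed and also depend on which DOCTYPE is present. The function name and sample strings are hypothetical.

```python
import re

# A byte order mark, whitespace and comments are tolerated before the DOCTYPE.
_DOCTYPE_RE = re.compile(
    r"^\ufeff?(?:\s|<!--.*?-->)*<!DOCTYPE\s+html\b",
    re.IGNORECASE | re.DOTALL,
)


def starts_with_doctype(document):
    """Return True if a DOCTYPE is the first real construct in the text."""
    return _DOCTYPE_RE.match(document) is not None


# Hypothetical samples: an untouched page, a page with no DOCTYPE at all,
# and a page where the server has prepended a script.
clean = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<html></html>'
bare = "<html><head></head><body></body></html>"
injected = "<script>banner()</script>\n<!DOCTYPE html>\n<html></html>"

for name, doc in [("clean", clean), ("bare", bare), ("injected", injected)]:
    print(name, starts_with_doctype(doc))
```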
 
Harlan Messinger

Simon said:
Hi,

I was asked to have a look at a page that apparently does not display
Persian language.
The two obvious problems are that the page does not have a doctype or a Charest.

DOCTYPE has nothing to do with character representation. If the document
is served with a correct HTTP content type header, then a content-type
META tag is irrelevant.

Your page looks mostly fine in my Firefox, which thinks that your page
is encoded as Windows 1252, which lacks Arabic/Persian support
altogether. But it doesn't matter what encoding is claimed, as long as
ASCII is a subset of it, because the characters are encoded as numeric
character references. The only flaws are a number of question marks that
were obviously meant to be something else, and the appearance in two
places of "ØªØ³Øª2", once after the date at the top, and once as the
first item in the list of Recent Posts. The first one appears in the page
source as "ØªØ³Øª2" and the second appears as
"&#216;&#170;&#216;&#179;&#216;&#170;2", the character entity
representation of the same thing.
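To illustrate why the claimed encoding hardly matters for text written as numeric character references, here is a minimal Python sketch. The sample bytes and the list of encodings are assumptions chosen for illustration, not taken from the page itself.

```python
import html

# Hypothetical sample: three Arabic-script letters written as numeric
# character references, followed by a plain ASCII digit.
raw = b"&#1578;&#1587;&#1578;2"

# Encodings that all contain ASCII as a subset.
encodings = ["ascii", "iso-8859-1", "cp1252", "cp1256", "utf-8"]

# Decode the same bytes under each encoding, then resolve the references.
results = {enc: html.unescape(raw.decode(enc)) for enc in encodings}

for enc, text in results.items():
    print(enc, [f"U+{ord(ch):04X}" for ch in text])
```

Because the bytes themselves are pure ASCII, every encoding in the list decodes them identically, and the references resolve to the same code points afterwards.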
 
Alan J. Flavell

else, and the appearance in two places of "ØªØ³Øª2", once after the
date at the top, and once as the first item in list of Recent Posts.
The first one appears in the page source as "ØªØ³Øª2"

Yes, I'd spotted that, and noted that if interpreted as utf-8 it turns
out as Arabic-script characters, which made it seem as if that part
had been inserted into it incorrectly.
and the second appears as
"&#216;&#170;&#216;&#179;&#216;&#170;2", the character entity
representation of the same thing.

Blimey, so it does! I hadn't spotted that at first look. So it's
worse than just broken!!

Furthermore, I now see loads of hrefs like these:

http://journalhome.com/razavi/21877/تست2.html

*Shudder*

For what it's worth - coming back to the ØªØ³Øª2 which we saw, if I
convert[1] that from utf-8 to us-ascii encoding then the result reads:

&#1578;&#1587;&#1578;2

which can be decoded e.g with my trusty decoding ring (;-) at
http://ppewww.ph.gla.ac.uk/~flavell/unicode/unidata06.html


At this kind of third-hand remove from the original complainant, and
with me only understanding the theory of the character representation,
without being able to read Farsi - nor have the slightest inclination
to tangle with the mess that comes out of MS's attempts to extrude
something resembling HTML, I'm afraid I can't go much further than to
say that these pages seem to be dreadfully broken; it's a wonder that
anything comes out as intended.

good luck (you-all will need it!)

[1] by "convert" I mean, in Seamonkey (nee Mozilla), manually set
View> Encoding to utf-8, then File> Edit Page, then in Composer,
"Save and change character encoding". Unfortunately it doesn't
offer us-ascii as an option, but any 8-bit encoding which doesn't
cover Arabic would suffice for this purpose - e.g Armenian, Thai,
whatever you like. (Perhaps we should ask the Mozilla folks to
support saving in us-ascii explicitly?).
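The manual conversion in footnote [1] can be approximated in a few lines of Python. This is only a sketch: the sample bytes are an assumption (the UTF-8 form of three Arabic-script letters and a digit), and Python's `xmlcharrefreplace` error handler is used as a stand-in for whatever Composer does internally when it saves characters the target encoding cannot hold.

```python
# Bytes as they might sit in the page: UTF-8 for three Arabic-script
# letters followed by an ASCII digit (hypothetical sample).
page_bytes = b"\xd8\xaa\xd8\xb3\xd8\xaa2"

# Step 1: interpret the bytes as UTF-8 (the View > Encoding step).
text = page_bytes.decode("utf-8")

# Step 2: save in an encoding that cannot represent Arabic script;
# unencodable characters become numeric character references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(ascii_bytes.decode("ascii"))
```

The decimal numbers in the resulting references are Unicode code points, so they can be looked up in a code chart for the Arabic block.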
 
Neredbojias

I am not certain what else to say, really. If I add any doctype, the
width of the document changes (with a horizontal scrollbar).
If I add any charset, the same happens.
If I have neither charset nor doctype, the display is as you see it.

One issue is that the lack of a doctype puts browsers in "quirks mode".
The markup of the page is definitely not "strict" markup and there are
errors as well. Your main problem is archaic and invalid html.
 
Alan J. Flavell

Yes, I'd spotted that,

I've just noticed that Goo-groups has completely garbled this part of
the thread, as displayed in its normal view, by re-interpreting the
above string of seven Latin-1 characters as posted and seen by us,
into a string of four utf-8 characters - as we would assume had been
intended (but not actually achieved) by the original web page. Thus
completely obscuring the problem which we were discussing!!! Grrrrrr.

Curiously, if the usenet postings are viewed in goo-groups by using
their "Show original" option, instead of their default thread display,
then they come out in the way that you (and I) posted, i.e exhibiting
the problem that we were discussing. Their unsolicited
re-interpretation of the character encoding in their thread view,
*IGNORING* the explicit specification of charset=iso-8859-1 which
appears in both of our posting headers, gives us yet another reason to
advocate that anyone seriously considering goo-groups as a usenet
interface would be better advised to Get a Real News Reader.

Here are the characters again, but this time interleaved with spaces.
Let's see how goo-groups will garble this: "Ø ª Ø ³ Ø ª 2".

sigh.
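The re-interpretation described above, seven Latin-1 characters turning into four characters once their bytes are read as utf-8, can be reproduced with a short Python sketch. The string is the one quoted in the post with the spaces removed; the variable names are made up for illustration.

```python
# The seven Latin-1 characters from the post, without the spaces.
as_posted = "\u00d8\u00aa\u00d8\u00b3\u00d8\u00aa2"

# Recover the underlying bytes, then reinterpret them as UTF-8.
reinterpreted = as_posted.encode("iso-8859-1").decode("utf-8")

print(len(as_posted), len(reinterpreted))
print([f"U+{ord(ch):04X}" for ch in reinterpreted])
```

The same bytes are involved in both readings; only the encoding label applied to them differs, which is why ignoring an explicit charset=iso-8859-1 changes what the reader sees.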
 
