JD
I frequently receive website copy in the form of Word documents. If I
copy and paste the content directly from Word into my text editor, I
often find that my web pages fail to validate due to "non SGML character
number n" errors.
I decided to write a little tool in C that reads in the copy and
substitutes character entity references for any characters that will
cause the above error. However, I'm confused about what to include in
this program and what to leave out. For example, even though there's an
entity reference for the copyright symbol, I've found I can put this
symbol directly in the source and the page still validates. In that
case, why use the entity reference at all?
Is there a definitive list somewhere of which characters need to be
encoded and which do not?
I use the HTML 4.01 Strict doctype, and my documents are ISO-8859-1
encoded according to 'Page Info' in Firefox 3.
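
For reference, here is a rough sketch of the kind of filter described
above. It rests on two assumptions that are not confirmed in this post:
that the text pasted from Word arrives as Windows-1252 bytes, and that the
offending characters are the bytes 0x80 to 0x9F, which HTML 4.01 treats as
unused (non-SGML) positions. Bytes 0xA0 to 0xFF, including the copyright
symbol at 0xA9, are ordinary ISO-8859-1 characters and are passed through
unchanged. The names cp1252_high, encode_byte and encode_string are
made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
   0 marks a byte that Windows-1252 leaves unassigned. */
static const unsigned cp1252_high[32] = {
    8364,    0, 8218,  402, 8222, 8230, 8224, 8225,
     710, 8240,  352, 8249,  338,    0,  381,    0,
       0, 8216, 8217, 8220, 8221, 8226, 8211, 8212,
     732, 8482,  353, 8250,  339,    0,  382,  376
};

/* Writes the HTML form of one byte into buf (at least 8 chars) and
   returns the number of chars written, not counting the terminator.
   Assumption: unassigned bytes in 0x80..0x9F are silently dropped. */
int encode_byte(unsigned char c, char *buf)
{
    if (c >= 0x80 && c <= 0x9F) {
        unsigned cp = cp1252_high[c - 0x80];
        if (cp == 0) {
            buf[0] = '\0';
            return 0;
        }
        /* Longest result is "&#8482;": 7 chars plus the terminator. */
        return sprintf(buf, "&#%u;", cp);
    }
    buf[0] = (char)c;
    buf[1] = '\0';
    return 1;
}

/* Converts the NUL-terminated string src into dst (capacity cap).
   Returns 0 on success, -1 if dst is too small. */
int encode_string(const char *src, char *dst, size_t cap)
{
    size_t used = 0;
    char piece[8];

    if (cap == 0)
        return -1;
    for (; *src != '\0'; src++) {
        int n = encode_byte((unsigned char)*src, piece);
        if (used + (size_t)n + 1 > cap)
            return -1;
        memcpy(dst + used, piece, (size_t)n);
        used += (size_t)n;
    }
    dst[used] = '\0';
    return 0;
}
```

To use this as a command-line filter, one could read stdin a byte at a
time with getchar(), pass each byte to encode_byte, and write the result
with fputs. Numeric references are used rather than named entities so the
table stays small. This sketch does not escape &, < or >.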