printing character ' and " in asp using vbscript

Paul Randall

Thanks for posting one of your problematic characters. I think one of the problems is that a number of distinct concepts, such as charset, font, and locale, are being blurred together.

At first you mentioned having problems with the single and double quote characters, and defined them as apostrophe character ' and double quote ". Recently you posted the ‘ character, which I assume is what you meant by an apostrophe character. It certainly looks a lot like what I would call a single quote, but if you put my single quote ('), and your single quote together, you can see that they are different: (‘'). Well, maybe you can see the difference, and maybe you can't. It all depends on what font the characters are being displayed in.

I think in general, if a font does not contain a glyph for a character, then it displays a square or rectangular box for that character. I think most fonts contain glyphs for all characters in the range Chr(32) to Chr(127). Many fonts contain glyphs for characters in the range Chr(128) to Chr(255) too. Many fonts also include glyphs for some characters in the range ChrW(256) to ChrW(65535), which are Unicode characters. My knowledge of Unicode is limited, so some of my terminology may not be technically correct, and I would appreciate being corrected.

Copy the code below into a .vbs file and run it. You will get two message boxes. The first message box will contain two lines:

Hello *'΄‘* Unicode
΄‘

The first line contains a mixture of what might be considered Unicode and non-Unicode characters. The three characters between the asterisks (*) might all be considered single quotes, but only the first one is Chr(39), the character I consider a single quote. The second one is ChrW(900), and the third one is your single quote, ChrW(8216).

The second line displays what is left of the first line after removing all characters whose AscW value is 255 or less.

I included the ChrW(900) character because it illustrates how differently certain characters may be handled.

The second message box contains info about the two Unicode characters:

1 ΄ 63 ? 900 ΄
2 ‘ 145 ‘ 8216 ‘

The six columns contain the following:
1) i (position within the string)
2) Mid(s, i, 1) the character at position i.
3) Asc(Mid(s, i, 1)) ANSI value of the character if the current code page has one; otherwise 63, the value of a question mark.
4) Chr(Asc(Mid(s, i, 1))) Character associated with the reported Asc value.
5) AscW(Mid(s, i, 1)) Unicode value of the character.
6) ChrW(AscW(Mid(s, i, 1))) Character associated with the reported AscW value.

The Asc function almost always returns an 8-bit value, and AscW returns a 16-bit value. For certain Locales, Asc returns the same 16-bit value as AscW. See the scripting help file for info on the GetLocale and SetLocale functions. The thing to note is that depending on Locale, for some Unicode characters, the Asc function returns 63, a value that corresponds to a question mark, and for others it returns a value under 256 that displays the same character as is displayed by the Unicode character. So ChrW(900) maps to a question mark but ChrW(8216) maps to Chr(145). I don't have any examples that would produce the inverted question mark you talked about in your early posts.

Your posts talk about a number of code pages and charsets, like 65001 and utf-8 and iso-8859-1. Code page 65001 is simply the Windows code page number for UTF-8, so those two names refer to the same encoding. UTF-8 is a variable-length encoding that uses one to four bytes to represent a character, and it can represent every Unicode character. Charset iso-8859-1 can only handle 256 8-bit characters.

I think you should build a little standalone VBScript that displays many of your problematic characters in something like the six columns I did above, and post the result. Perhaps we can figure out a way to fix the problem after you show us what the problem is. It might help if you tell us your Locale number too. Control-C can be used to copy the text from a message box.

Option Explicit
Dim i, s, sMsg
' ChrW(8216) is the curly single quote; building it with ChrW avoids
' depending on how the .vbs file itself is encoded.
s = "Hello *'" & ChrW(900) & ChrW(8216) & "* Unicode"
MsgBox s & vbCrLf & sKeepOnlyUnicode(s)
s = sKeepOnlyUnicode(s)

For i = 1 To Len(s)
    sMsg = sMsg & i & vbTab & Mid(s, i, 1) & vbTab & _
        Asc(Mid(s, i, 1)) & vbTab & Chr(Asc(Mid(s, i, 1))) & vbTab & _
        AscW(Mid(s, i, 1)) & vbTab & ChrW(AscW(Mid(s, i, 1))) & vbCrLf
Next 'i
MsgBox sMsg

Function sKeepOnlyUnicode(sAnyString)
    'Returns sAnyString with only Unicode [actually, all
    ' characters outside the range ChrW(0) to
    ' ChrW(255)] being kept. VBScript strings are made
    ' up of 16-bit characters so they can handle a
    ' lot of Unicode stuff.
    With New RegExp
        .Global = True
        .Pattern = "[\u0000-\u00FF]"
        sKeepOnlyUnicode = .Replace(sAnyString, "")
    End With
End Function 'sKeepOnlyUnicode(sAnyString)


-Paul Randall
i changed the codepage to 65001 and charset to utf-8, then the question mark ? showing earlier has changed to the rectangle as shown below.
‘
the database field also shows the same character stored in it.
please help.

My guess is that they are not " " but are ‘ “ ” typically cut'n'pasted in from Microsoft Word.

These are still in the Windows-1252 range of characters but are not strictly in the iso-8859-1 set.

Don't use http-equiv meta tags; use real headers instead.

IOW ditch the meta tags and include this:-

<%Response.CharSet = "Windows-1252"%>

I'm not hopeful because you are probably using IE and IE will treat ISO-8859-1 as Windows-1252 anyway.

Always use Server.HtmlEncode on values retrieved from the Database. Stop mucking about with any other approach.

If that doesn't work, view the HTML source from the browser. What is the server actually sending?

Another alternative is stop using Windows-1252.

Save your pages as UTF-8, change the codepage at the top of the page to 65001, and include Response.CharSet = "UTF-8" in your page.
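
For reference, the top of a page set up that way might look roughly like the sketch below. It assumes the .asp file itself has been saved as UTF-8; the Response.CodePage line is an extra I have added on the assumption that the server is IIS 5.1 or later, and it can be dropped if the @ directive alone does the job.

```asp
<%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%>
<%
' The @ directive above must be the first line of the file.
' It tells ASP how to read the literal text in this page.

' Assumption: IIS 5.1 or later, where Response.CodePage is available.
' It controls how strings are converted to bytes when written out.
Response.CodePage = 65001

' Sends a real "Content-Type: text/html; charset=UTF-8" header,
' which replaces the http-equiv meta tag.
Response.CharSet = "UTF-8"
%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD>
```

The point of setting all three together is that the file encoding, the code page ASP uses, and the charset announced to the browser then agree with one another.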

BTW, Have you looked at the field content directly using the DB management tool?


--
Anthony Jones - MVP ASP/ASP.NET
i am attaching the sample code. actually i am printing from a field in access database. the text entered in the database contains single quotes and double quotes. when i try to print them using response.write, the double quotes are getting replaced with question marks. i have tried the method of

DataPrep = Replace(DataPrep, """", "&quot;")

still problem remains.

i also tried
response.write(server.htmlencode(myrs(3))) ' where myrs is adodb recordset

still the problem remains

i am also attaching the header lines from my asp page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<%@LANGUAGE="VBSCRIPT" CODEPAGE="1252"%>


<HTML><HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta http-equiv="Content-Language" content="en-us" />

the problem is still not solved

please help
 
Daniel Crichton

From your posts, I'd say that wordcleaner isn't doing what you expect -
every quote you've posted as a paste from your text is a curly open quote
which is what I'd expect a copy and paste direct from Word to include.

Dan

S wrote on Mon, 24 Mar 2008 21:40:13 +0530:
you have guessed it right, i am copying the text from ms word but am
cleaning wordhtml using wordcleaner 3.
further, i checked using
Response.CharSet = "UTF-8"
in this case the ? characters appears on every newline including the
places where it was appearing earlier.
when i use <%Response.CharSet = "Windows-1252"%>
still the problem of question marks remain. but it appears only as was
appearing earlier (in place of " and not on every new line)
i checked the view source- the server is sending ? character itself to
the browser.
when i checked the database field, it is showing an invalid character
in the shape of a rectangle stored where i want the double quote "
printed.
please help.

Thanks for the vote of confidence Bob but it baffles me. ;)
Since " is within the lower ascii range 0-127 the only encoding that
could screw this up would be UTF-16. But if the browser thought it
was getting say Windows-1252 and yet the server was encoding to
UTF-16 (or vice versa)
the content would be completely garbled.
I suspect that what the OP thinks is happening and what actually is
are very different. Like Dan says I think we would need to see some
actual code to make sense of this.
 
S N

i have checked the access database. when i open the access table, there also i am finding the rectangular block wherever i expect an apostrophe.

also i have started using server.htmlencode for retrieving values from the database. but it displays the line-break and paragraph tags (<BR> and <p>) stored in the text field as such. meaning instead of treating them as markup for new lines it is displaying them as they are, ie as "<BR>" and "<p>". in this way the paragraph boundaries have gone.

please help me with the above two problems.

 
Anthony Jones

If Access is showing the wrong character, that indicates the data is corrupt.

If the field contains HTML (which it appears it does if it has <br> and <p>
elements that you expect to be honored) then you should not be using
Server.HTMLEncode. It has to be assumed that a field containing HTML is
already HTML encoded.

This is a long thread, I can't remember if you indicated how the data
arrived in the DB in the first place.
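
If the field still holds the real characters (that is, the data is not already corrupt), one possible way to keep the markup working without Server.HTMLEncode is to write every non-ASCII character as a numeric character reference, which does not depend on the page's charset. The sketch below is only an illustration; EncodeNonAscii is a made-up helper name, not a built-in.

```vbscript
Option Explicit

' Hypothetical helper: replaces every character above ChrW(127) with a
' numeric character reference such as &#8216; and leaves tags untouched.
' Limitation: characters outside the 16-bit range (surrogate pairs)
' are not handled correctly by this sketch.
Function EncodeNonAscii(sHtml)
    Dim i, ch, code, sOut
    sOut = ""
    For i = 1 To Len(sHtml)
        ch = Mid(sHtml, i, 1)
        code = AscW(ch)
        ' AscW returns a signed 16-bit value, so values above 32767
        ' come back negative and need to be shifted up.
        If code < 0 Then code = code + 65536
        If code > 127 Then
            sOut = sOut & "&#" & code & ";"
        Else
            sOut = sOut & ch
        End If
    Next
    EncodeNonAscii = sOut
End Function

' Standalone usage from a .vbs file; in an ASP page use Response.Write.
WScript.Echo EncodeNonAscii("<p>" & ChrW(8220) & "quoted" & ChrW(8221) & "</p>")
' Displays: <p>&#8220;quoted&#8221;</p>
```

Because the angle brackets are below 128 they pass through unchanged, so <br> and <p> keep working while the curly quotes travel as plain ASCII.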
 
S N

i had copied the data from a word file and using control-c i had pasted it
into a richtext textbox in my asp form.

also please tell me how to ensure that regardless of whether the data is
html encoded or not, my server.htmlencode should work alright.
 
Mike Brind [MVP]

I haven't read the whole thread, but pasting directly from Word into a rich
text box is asking for trouble. I recommend pasting from Word into Notepad,
then taking the result and pasting it into the Rich text box. Word
sometimes does odd things with what should be "double quotes". And it
retains a whole load of Word-specific formatting - often over-riding your
carefully crafted css.
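
The quote characters can also be normalized on the server before the text is saved. This is a minimal sketch, assuming the form data reaches the script with the curly quotes intact; StraightenQuotes is a hypothetical name, and it covers only the four common Word quote characters.

```vbscript
Option Explicit

' Hypothetical helper: converts Word's curly quotes to plain ASCII quotes.
Function StraightenQuotes(sText)
    Dim s
    s = sText
    s = Replace(s, ChrW(8216), "'")    ' left single quotation mark
    s = Replace(s, ChrW(8217), "'")    ' right single quotation mark / apostrophe
    s = Replace(s, ChrW(8220), """")   ' left double quotation mark
    s = Replace(s, ChrW(8221), """")   ' right double quotation mark
    StraightenQuotes = s
End Function

' Standalone usage from a .vbs file:
WScript.Echo StraightenQuotes(ChrW(8220) & "It" & ChrW(8217) & "s fine" & ChrW(8221))
' Displays: "It's fine"
```

In an ASP page the same function would be applied to the posted value before it is written to the database.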
 
S N

also please tell me how to ensure that regardless of whether the data is
html encoded or not, my server.htmlencode should work alright.
 
Mike Brind [MVP]

As Anthony said, if you are entering html code into the database with the
idea that this takes effect when you pull it back to a web page, you do not
want to server.htmlencode it. Since you are using a Rich Text Editor, I am
assuming that this will apply html tags to the text on entry, and you want
them to act on the output.

What you really want to do is to make sure no javascript or clientside
vbscript gets injected. One way to do this is just to reject any input that
contains the string "<script>" in it during your server-side validation.
 
S N

you have guessed it very correctly that i am entering html code into the database (like table tags <td> <tr> in particular) with the
idea that this takes effect when it is pulled back to a web page, and hence i dont want to server.htmlencode it.

can you suggest a server side validation script which does as indicated below by you. else can you suggest an alternate method of achieving the above (ensuring the table tags get translated into tables on the client side).

please help.
 
Mike Brind [MVP]

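'input is the posted content from the Rich Text Editor

'vbTextCompare makes the search case-insensitive, and looking for
'"<script" without the closing bracket also catches tags with attributes
If InStr(1, input, "<script", vbTextCompare) > 0 Then
    'reject it
Else
    'process it
End If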
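
A somewhat broader version of the same check can be sketched with a regular expression. ContainsScript is a hypothetical name, and the pattern is only an assumption about what is worth rejecting: script tags, javascript: URLs, and inline event handlers such as onclick=. It errs on the side of rejecting input and is not a complete defence against script injection.

```vbscript
Option Explicit

' Hypothetical helper: returns True if the HTML looks like it carries script.
Function ContainsScript(sHtml)
    With New RegExp
        .IgnoreCase = True
        .Global = False
        ' <script ...>, javascript: URLs, or on...= attributes
        .Pattern = "<\s*script\b|javascript\s*:|\bon\w+\s*="
        ContainsScript = .Test(sHtml)
    End With
End Function

' Standalone usage from a .vbs file:
WScript.Echo ContainsScript("<table><tr><td>ok</td></tr></table>")    ' False
WScript.Echo ContainsScript("<SCRIPT type='text/javascript'>x</SCRIPT>") ' True
WScript.Echo ContainsScript("<td onclick=""doIt()"">x</td>")          ' True
```

Plain table markup passes, so the rich text editor's <td> and <tr> tags still reach the page as real tables.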
 
