Scraped content via WebRequest: Fixing mis-rendered characters systemically?

Ken Fine

I have a portion of a web page that I am scraping via .NET's WebRequest
object. The code and page URL are below. Some characters are mis-rendered
when the string representing the page portion is returned: these are various
entity characters that do not translate correctly into renderable HTML. Can
someone suggest a systematic way, built into the .NET Framework's Text
classes, to fix this so the content renders correctly on a web page?

Thanks,
-KF

using System;
using System.IO;
using System.Net;

public partial class UweekHome : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        litHTMLfromScrapedPage.Text = GetHtmlPage("http://uweek.org");
    }

    public String GetHtmlPage(string strURL)
    {
        // the html retrieved from the page
        String strResult;
        WebResponse objResponse;
        WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
        objResponse = objRequest.GetResponse();

        using (StreamReader sr =
            new StreamReader(objResponse.GetResponseStream()))
        {
            strResult = sr.ReadToEnd();
            // locate the story list that follows the <slstart> marker
            int pos1 = strResult.IndexOf("<slstart>", 0);
            int pos2 = strResult.IndexOf("<storylist>", pos1);
            int pos3 = strResult.IndexOf("</storylist>", pos2);
            // 11 is the length of "<storylist>", so this keeps only
            // the text between the opening and closing tags
            strResult = strResult.Substring(pos2 + 11, pos3 - pos2 - 11);
        }

        return strResult;
    }
}
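The marker-based extraction in GetHtmlPage can also be sketched as a small standalone helper. This is only an illustration: the class name HtmlSlice and the method ExtractBetween are hypothetical names, not part of the posted code, and the helper returns null when a marker is missing rather than letting Substring throw.

```csharp
using System;

public static class HtmlSlice
{
    // Returns the text between the first startMarker and the next endMarker,
    // or null when either marker cannot be found.
    public static string ExtractBetween(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;

        // Skip past the opening marker itself.
        start += startMarker.Length;

        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;

        // The length is the distance between the two positions.
        return html.Substring(start, end - start);
    }
}
```

Called as HtmlSlice.ExtractBetween(strResult, "<storylist>", "</storylist>"), it avoids hard-coding the marker length in the Substring arithmetic.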
 
Guest

Some characters are being mis-rendered
when the string representing the page portion is returned: these are various
entity characters that do not translate correctly into renderable HTML.

Can you give an example?
 
Ken Fine

Em dashes, en dashes, curly quotes, and the like. They come through as question marks:

Didja hear the one about the economist who became a stand-up comic? Yoram
Bauman is an instructor for the Program on the Environment, but most Tuesday
nights find him at the Comedy Underground, cracking wise about ? no joke ?
economics.

When the Husky men?s basketball team heads to Greece Aug. 27 for a series of
exhibition games, they?ll be traveling with Socrates. That?s because in
their off-court time, they?ll take part in a classics class that focuses on
the man who is often called the father of western philosophy.
 
Guest

Hi Ken,

Try setting the encoding explicitly when you create the StreamReader:

Encoding enc = Encoding.GetEncoding(1252);
using (StreamReader sr =
    new StreamReader(objResponse.GetResponseStream(), enc)) {
.....

Hope it helps.
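To see why the encoding matters, here is a minimal sketch that decodes the same bytes two ways. The class DecodeDemo and its methods are hypothetical names used only for illustration, and the byte values assume the server sends Windows-1252 text, as suggested in this thread.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    static DecodeDemo()
    {
        // Needed on .NET Core / .NET 5+, where code page 1252 is not
        // registered by default. On the classic .NET Framework, 1252 is
        // available out of the box and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes the bytes as Windows-1252 (code page 1252).
    public static string DecodeAs1252(byte[] raw)
    {
        return Encoding.GetEncoding(1252).GetString(raw);
    }

    // Decodes the bytes as UTF-8, which is what StreamReader
    // does when no encoding is passed to its constructor.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }
}
```

For the bytes 6D 65 6E 92 73 (the word "men", a curly apostrophe in Windows-1252, then "s"), DecodeAs1252 yields the right single quotation mark U+2019, while DecodeAsUtf8 yields the replacement character U+FFFD because 0x92 on its own is not valid UTF-8. That replacement character is what typically ends up displayed as a question mark.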
 
Walter Wang [MSFT]

Hi Ken,

I agree with Alexey here that you need to specify the encoding used to read
the response, since StreamReader uses UTF-8 by default.

However, hard-coding 1252 will only work for pages that are actually served
in Windows-1252 (the Western European code page). A more reliable way is to
get the correct encoding from the HttpWebResponse. Please refer to the
following discussion thread for more information:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=58840&SiteID=1
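As a rough sketch of that idea (not taken from the linked thread), the charset reported by HttpWebResponse.CharacterSet can be mapped to an Encoding, with a caller-supplied fallback. ResponseReader, ResolveEncoding, and ReadBody are hypothetical names introduced for this example.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ResponseReader
{
    // Maps a charset name (for example the value of HttpWebResponse.CharacterSet)
    // to an Encoding, using the fallback when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charset)) return fallback;
        try
        {
            // Some servers wrap the charset value in quotes.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return fallback; // unrecognized charset name
        }
    }

    // Downloads the body and decodes it with the charset the server reported.
    public static string ReadBody(string url, Encoding fallback)
    {
        WebRequest request = WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(
            stream, ResolveEncoding(response.CharacterSet, fallback)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

One caveat: CharacterSet reflects only the Content-Type response header. If the server omits the charset there, or declares it only in a meta tag inside the page, the reported value may not match the bytes actually sent, so passing a sensible fallback (or an explicit encoding for a known site) is still worthwhile.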

Regards,
Walter Wang
Microsoft Online Community Support

 
Guest

The page he is trying to get is encoded in Windows-1252.
 
