HttpWebResponse.GetResponseStream returns incomplete stream

ThePants

Hi, given the following code, I've been successful in grabbing pages
for parsing, but for a certain page template (containing a particular
piece of code) the stream always ends right after that code. If you try
this with just about any type of URL (including URLs from the same site
without that piece of code) it works fine, but with URLs containing the
piece of code, the stream is returned only up to that point.

Dim sURL As String
' Works (along with 1000's of other sites/templates/servers):
sURL = "http://www.msnbc.msn.com/id/14191819"
' Doesn't work:
sURL = "http://www.time.com/time/business/article/0,8599,1226309,00.html"

Dim oSR As StreamReader = getPageContent(sURL)
' If you do oSR.ReadToEnd here, you'll see the page broken at the
' wrong place

Private Function getPageContent(ByVal URL As String) As StreamReader
    Dim oResponse As HttpWebResponse = Nothing
    Dim oSR As StreamReader = Nothing
    Dim oRequest As HttpWebRequest
    Try
        oRequest = WebRequest.Create(URL)
        oResponse = CType(oRequest.GetResponse, HttpWebResponse)
        oSR = New StreamReader(oResponse.GetResponseStream())
    Catch ex As Exception

    End Try
    Return oSR
End Function

The stream for the time.com pages ends *every time* right after:
<strong>SUBSCRIBE TO TIME MAGAZINE FOR JUST $1.99</strong></a>

.... and the number of characters varies depending on the story, but
each time the "Subscribe" link is there, the response stream dies right
after it. If you view the source of those pages, you'll see a single
blank character, and then an html comment ( <!--cm_searchtext end-->).

So I'm stuck. Is it possible that the single character between the </a>
and the comment is breaking the stream? Could it be the server thinking
(correctly) that I'm parsing it and choosing that as the location each
time to cut me off? I've played with several properties of
HttpWebRequest, including spoofing a UserAgent, setting KeepAlive to
true, SendChunked, and ProtocolVersion, but nothing I do seems to keep
this from happening.

Any help would be appreciated.
Thanks!
STA
 
Joerg Jooss

That's a nasty one. At the point where the text is being truncated, there
is a NULL (0x00) character in the page. It's actually the Encoding object
that breaks here, not the response stream. Unfortunately, specifying a
DecoderFallback doesn't work -- seems to be a bug. As a workaround, buffer
the entire response in a MemoryStream, remove all NULL characters, and
decode the buffer with an Encoding instance.

Cheers,
 
ThePants

Thanks very much for the reply, Joerg. This did the trick! Thank you
very very much for your help.
 
Joerg Jooss

Thus wrote (e-mail address removed),
hi...

could you show me how you did this?

OK, assuming you have a byte array "bytes" containing the entire response,
all you need to do is:

using (MemoryStream buffer = new MemoryStream(bytes.Length)) {
    // Copy every byte except NULL (0x00) into the buffer
    foreach (byte b in bytes) {
        if (b != 0x0) {
            buffer.WriteByte(b);
        }
    }
    bytes = buffer.ToArray();
}

// Assuming UTF-8 encoding here...
string response = Encoding.UTF8.GetString(bytes);
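
The snippet above hard-codes UTF-8. As a minimal sketch of one way to avoid that assumption, the helper below maps a charset name to an Encoding and falls back to UTF-8 when the name is missing or unknown. CharsetHelper and Resolve are hypothetical names introduced here for illustration, and the choice of UTF-8 as the fallback is an assumption, not something the thread prescribes.

```csharp
using System;
using System.Text;

static class CharsetHelper
{
    // Hypothetical helper (not part of the .NET Framework): turns a charset
    // name, such as the value of HttpWebResponse.CharacterSet, into an
    // Encoding. Falls back to UTF-8 (an assumption) if the name is empty
    // or not recognized.
    public static Encoding Resolve(string charset)
    {
        if (string.IsNullOrEmpty(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding rejects unknown names with ArgumentException.
            return Encoding.UTF8;
        }
    }
}
```

With a response object at hand, the last line of the snippet above could then become something like: string response = CharsetHelper.Resolve(httpResponse.CharacterSet).GetString(bytes); where httpResponse stands for your HttpWebResponse.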

Cheers,
 
jake.oh

Hi,
I have no words to show how much I appreciate your help.
But I couldn't figure out how to capture the stream (from WebRequest)
in a byte array.
Could you help me out with this too?

Best regards ^^
 
Joerg Jooss

That's System.IO 101 ;-)

Here's a method that sends a HttpWebRequest and copies its response to an
arbitrary Stream object. If you pass a MemoryStream as "outStream", you'll
get what you want.

private void SendRequest(HttpWebRequest request, Stream outStream) {
    Debug.Assert(outStream.CanWrite);

    using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
    using (Stream responseStream = response.GetResponseStream()) {
        byte[] buffer = new byte[0x1000];
        int bytes;
        while ((bytes = responseStream.Read(buffer, 0, buffer.Length)) > 0) {
            outStream.Write(buffer, 0, bytes);
        }
    }
}
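
To tie this back to the NULL workaround from earlier in the thread, here is a minimal sketch that reads any Stream in chunks, drops 0x00 bytes on the way, and decodes what is left. NullSafeReader and ReadText are hypothetical names made up for this example. Note that dropping zero bytes is only safe for single-byte encodings and UTF-8; in UTF-16 or UTF-32 zero bytes are a normal part of the data.

```csharp
using System.IO;
using System.Text;

static class NullSafeReader
{
    // Hypothetical helper: reads "input" to the end, skips every NULL (0x00)
    // byte, and decodes the remaining bytes with the given encoding.
    // Assumes a single-byte encoding or UTF-8.
    public static string ReadText(Stream input, Encoding encoding)
    {
        using (MemoryStream cleaned = new MemoryStream())
        {
            byte[] buffer = new byte[0x1000];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] != 0)
                    {
                        cleaned.WriteByte(buffer[i]);
                    }
                }
            }
            return encoding.GetString(cleaned.ToArray());
        }
    }
}
```

For a web request you would pass the stream returned by response.GetResponseStream() as "input"; for a quick check without a network connection, a MemoryStream wrapped around a byte array that contains a 0x00 works just as well.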

Cheers,
 
ThePants

Here's my Function in vb.net. Probably not terribly efficient, but I
needed to copy the string back to a memorystream as output. Thanks
again to Joerg for the suggestion.

Private Function getPageContent(ByVal URL As String) As MemoryStream
    Dim oResponse As HttpWebResponse = Nothing
    Dim oSB As New StringBuilder
    Dim oRequest As HttpWebRequest
    Try
        oRequest = WebRequest.Create(URL)
        oRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
        oResponse = CType(oRequest.GetResponse, HttpWebResponse)
        Dim oStreamResponse As Stream = oResponse.GetResponseStream()
        Dim oStreamRead As New StreamReader(oStreamResponse)
        Dim readBuff(256) As [Char]
        Dim nCount As Integer = oStreamRead.Read(readBuff, 0, 256)
        While nCount > 0
            ' Strip NULL characters from each chunk before appending it
            Dim outputData As New [String](readBuff, 0, nCount)
            outputData = Replace(outputData, vbNullChar, "")
            oSB.Append(outputData)
            nCount = oStreamRead.Read(readBuff, 0, 256)
        End While
        oStreamRead.Close()
        oStreamResponse.Close()
    Catch ex As Exception
        ' Errors are swallowed; the caller gets whatever was read so far
    End Try
    ' Copy the cleaned text back into a MemoryStream as output
    Dim oWorkStream As New MemoryStream
    Dim oEnc As Encoding = Encoding.GetEncoding(1252)
    Dim oSW1 As New StreamWriter(oWorkStream, oEnc)
    oSW1.Write(oSB.ToString)
    oSW1.Flush()
    oWorkStream.Position = 0
    Return oWorkStream
End Function
 
jake.oh

Hi Joerg,

Once again, thank you for your time and attention. I wish I could buy
you a beer someday.
Here is the problem.

I tried the code, but it still gives the same result.
Try this particular URL:
"http://www.altavista.com/web/results?itag=ody&q=tire&kgs=1&kls=0"

If you open this URL in IE, you will probably get the result with the
side sponsor section (right side of the page, under the sponsored
match).

But if you run this from .NET and display the stream (after
processing it), you will only see the result without the side
sponsored match section.

I don't know if I am explaining this well.
Can you see the problem here?

What I need is to display the whole page, including every sponsored
link.

Best regards,

jake

 
Joerg Jooss

Guess what, I don't even get that sidebar in IE... but at least I seem to
get the exact same content with HttpWebRequest.

Usually, when web applications or web sites behave strangely while being
accessed through your own client application, that is caused by HTTP headers
the site uses to personalize content which are missing in your request --
such as User-Agent to identify the browser, or Accept-Language to identify
your locale. To be on the safe side, you should consider sending these headers:

// request is a HttpWebRequest
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)";
request.Accept = "*/*";
request.Headers["Accept-Language"] = "en-us";

This way, you're pretending to be a US IE 6 SP1 that likes any content --
exactly what the real IE sends.

Cheers,
 
