L
little_mm
Hi All
Perhaps someone knows the answer to this problem. I open a connection
to a URL and read lines one at a time from the URL using a
InputStreamReader and a BufferedReader:
// Open connection to URL
URLConnection conn =
(URLConnection)pageURL.openConnection();
conn.setReadTimeout(timeout);
conn.setConnectTimeout(timeout);
conn.setUseCaches(false);
InputStream pageStream = conn.getInputStream();
BufferedReader reader = new BufferedReader(new
InputStreamReader(pageStream));
String line;
StringBuffer pageBuffer = new StringBuffer();
while ((line = reader.readLine()) != null)
{
System.out.println(line);
pageBuffer.append(line);
}
return pageBuffer.toString();
However, the actual text I get back from the URL is different from that
saved out of a browser from the same URL. Particularly, the browser
saves £ characters, whereas the lines read in Java are missing
these characters altogether. Also, some of the characters have actually
been deleted in the Java lines. I have tried using different character
encodings in the second argument of the InputStreamReader, this has
virtually no effect, except using UTF-16 which returns a large number
of "?" characters in the stream. The content type header of the page
says it is ISO-8859-1, but this character encoding string with the
InputStreamReader changes nothing in the Java code: the £ symbol is
still missing.
In the browser, if I change the character encoding to "UTF-8" then the
£ symbol is still properly displayed in the browser. In other words,
it looks like I am receiving different data from the server depending
upon whether I use the browser or the code. I'm not sure if it has
anything to do with the encoding, but I'm just guessing.
Thanks,
Nubs.
Perhaps someone knows the answer to this problem. I open a connection
to a URL and read lines one at a time from the URL using a
InputStreamReader and a BufferedReader:
// Open connection to URL
URLConnection conn =
(URLConnection)pageURL.openConnection();
conn.setReadTimeout(timeout);
conn.setConnectTimeout(timeout);
conn.setUseCaches(false);
InputStream pageStream = conn.getInputStream();
BufferedReader reader = new BufferedReader(new
InputStreamReader(pageStream));
String line;
StringBuffer pageBuffer = new StringBuffer();
while ((line = reader.readLine()) != null)
{
System.out.println(line);
pageBuffer.append(line);
}
return pageBuffer.toString();
However, the actual text I get back from the URL is different from that
saved out of a browser from the same URL. Particularly, the browser
saves £ characters, whereas the lines read in Java are missing
these characters altogether. Also, some of the characters have actually
been deleted in the Java lines. I have tried using different character
encodings in the second argument of the InputStreamReader, this has
virtually no effect, except using UTF-16 which returns a large number
of "?" characters in the stream. The content type header of the page
says it is ISO-8859-1, but this character encoding string with the
InputStreamReader changes nothing in the Java code: the £ symbol is
still missing.
In the browser, if I change the character encoding to "UTF-8" then the
£ symbol is still properly displayed in the browser. In other words,
it looks like I am receiving different data from the server depending
upon whether I use the browser or the code. I'm not sure if it has
anything to do with the encoding, but I'm just guessing.
Thanks,
Nubs.