How to slurp/get the content of a URI?

Stefan Ram

I wonder what the best/canonical/Javaish way to get/slurp
(i.e., read the whole content into a CharSequence) a URI is.
In Perl, there is:

use LWP::Simple; $content = get( "http://example.com/" );

Say, one wanted to implement LWP::Simple::get in Java.

What is the best way to do so?

I currently do this as follows (omitting some details, like
exceptions, encodings, and close()-operations):

Connect via an HttpURLConnection object:

final java.net.URL url = new java.net.URL( uri.toString() );

final java.net.HttpURLConnection httpURLConnection
    = ( java.net.HttpURLConnection )url.openConnection();

httpURLConnection.connect();

Then, filling a StringBuilder from it:

final java.io.InputStreamReader inputStreamReader
    = new java.io.InputStreamReader
      ( httpURLConnection.getInputStream(), "UTF-8" );

final java.io.BufferedReader bufferedReader
    = new java.io.BufferedReader( inputStreamReader );

final java.lang.StringBuilder stringBuilder = new java.lang.StringBuilder();

java.lang.String line;
while(( line = bufferedReader.readLine() )!= null )
{ stringBuilder.append( line ); stringBuilder.append( '\n' ); }

Is this the best/usual/canonical/Javaish way to do it,
or should I use anything else?
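
For comparison, here is a minimal sketch of the same idea with the close() handling included. It assumes Java 7 or later (for try-with-resources), and the class and method names (Slurp, slurp, get) are made up for illustration. The charset is still chosen by the caller, so this does not settle the encoding question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Slurp {
    /** Reads the whole stream into a String using the given charset, then closes it. */
    static String slurp(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[8192];
            int n;
            // Copy in blocks instead of line by line, so line ends are preserved as sent.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /** Hypothetical counterpart of LWP::Simple::get; the charset is the caller's assumption. */
    static String get(String uri, Charset charset) throws IOException {
        URLConnection connection = new URL(uri).openConnection();
        return slurp(connection.getInputStream(), charset);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java Slurp <url>");
            return;
        }
        System.out.println(get(args[0], StandardCharsets.UTF_8));
    }
}
```

Unlike the readLine() loop, this does not append a '\n' that the server never sent.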
 
Stefan Ram

new java.io.InputStreamReader
( httpURLConnection.getInputStream(), "UTF-8" );

A more specific question:

Shouldn't I use the document encoding instead of »UTF-8«?

But I will only know this after I have read the response!
(Or, at least part of it.)

So, should I adopt a two-pass read:
Open with US-ASCII to get the document encoding,
then open again with the document encoding?
 
Mark Space

Stefan said:
A more specific question:

Shouldn't I use the document encoding instead of »UTF-8«?

The default for HTTP is "8859_1" (that's the Java charset name).
There's a special protocol for negotiating a different charset, which
you won't support because your get is too primitive.

The server will either send you 8859-1 if it can, or it'll close the
connection, I think.
 
Mark Space

Mark said:
The default for HTTP is "8859_1" (that's the Java charset name). There's
a special protocol for negotiating a different charset, which you won't
support because your get is too primitive.

The server will either send you 8859-1 if it can, or it'll close the
connection, I think.

P.S. the openStream() method for URL seems to open the type of
connection you need directly.

BufferedReader bin = null;

URL url = new URL( arg[0] );
bin = new BufferedReader(
new InputStreamReader( url.openStream() ));


I think. Better check that. It's fewer lines though.
 
Arne Vajhøj

Mark said:
The default for HTTP is "8859_1" (that's the Java charset name). There's
a special protocol for negotiating a different charset, which you won't
support because your get is too primitive.

The server will either send you 8859-1 if it can, or it'll close the
connection, I think.

What?

HttpURLConnection and its InputStream fetch bytes from the
server. No negotiations possible.

When the client needs to interpret the bytes, it needs to
decide on an encoding.

The code snippet above creates an InputStreamReader expecting
UTF-8 encoding.

If that is known to be the encoding, then it is fine. If the encoding
is unknown, it should be based on HTTP header and HTML META tag info.

There is no default ISO-8859-1 in either HTTP or Java. HTTP is
always explicit and the Java default is system specific.

Arne
 
Mark Space

Arne said:
HttpURLConnection and its InputStream fetch bytes from the
server. No negotiations possible.

I think that's what I'm saying. Although I'm no longer sure that
HttpURLConnection doesn't fully support HTTP character sets. It might.

There is no default ISO-8859-1 in either HTTP or Java. HTTP is
always explicit and the Java default is system specific.

For a socket, yes, there is no default encoding. For HTTP, I think that
is not true. 8859-1 is the default if nothing is specified, and it is
legal to leave out the charset encoding -- in both the GET and the response.

I think, anyway. I could be all wrong about that.

Stefan has a valid question: If the content type isn't specified until
you read the header, and you don't know the content type, how do you
know what to open the stream as? The answer I think is that it's
defined to be 8859-1 by default.

Let me see if I can dig something up...

Content Negotiation for HTTP:
<http://en.wikipedia.org/wiki/Content_negotiation>

Some info on "Missing Charset" in the RFC:
<http://tools.ietf.org/html/rfc2616>
Search for 8859.


Back to Java: Also, URLConnection() looks like it will allow one to read
things like the content type and mime type before getting a Java
InputStream to the content:

URLConnection c = url.openConnection();
String mimeType = c.getContentType();
System.out.println( mimeType );

And similarly for getContentEncoding();

I gotta run. I hope I didn't booger things up too badly replying to
Stefan. Apologies if I did.
 
Arne Vajhøj

Mark said:
For a socket, yes, there is no default encoding. For HTTP, I think that
is not true. 8859-1 is the default if nothing is specified, and it is
legal to leave out the charset encoding -- in both the GET and the
response.

Let me see if I can dig something up...

Content Negotiation for HTTP:
<http://en.wikipedia.org/wiki/Content_negotiation>

Some info on "Missing Charset" in the RFC:
<http://tools.ietf.org/html/rfc2616>
Search for 8859.

You are right. If nothing is specified it means ISO-8859-1. Which
is rather bad since the world is moving from ISO-8859-1 to UTF-8.

Mark said:
Stefan has a valid question: If the content type isn't specified until
you read the header, and you don't know the content type, how do you
know what to open the stream as? The answer I think is that it's
defined to be 8859-1 by default.

Back to Java: Also, URLConnection() looks like it will allow one to read
things like the content type and mime type before getting a Java
InputStream to the content:

URLConnection c = url.openConnection();
String mimeType = c.getContentType();
System.out.println( mimeType );

And similarly for getContentEncoding();

Encoding in HTTP header is easy, because the headers are US-ASCII, so
the client can read the headers and determine the encoding before
reading the body.

Encoding in HTML META tag is not so nice.

Arne
 
Mark Space

Arne said:
Encoding in HTTP header is easy, because the headers are US-ASCII, so
the client can read the headers and determine the encoding before
reading the body.

Encoding in HTML META tag is not so nice.

Yes, HTML != HTTP. Sorry if the original question was about HTML
instead of HTTP, I may be out in left field here.
 
Mark Space

Stefan said:
Shouldn't I use the document encoding instead of »UTF-8«?

But I will only know this after I have read the response!
(Or, at least part of it.)

So I'm no expert, and I hope I'm not wasting your time by blathering,
but the question is interesting to me so I did a bit of work on it.
Here's what I have so far.


static void method4() throws MalformedURLException, IOException {
    String TEST_URL = "http://cnn.com";
    URL url = new URL(TEST_URL);
    URLConnection c = url.openConnection();
    String type = c.getContentType();
    System.out.println("Mime type: " + type );
    if( type == null || type.contains("text") )
    {
        String enc = c.getContentEncoding();
        System.out.println( "Encoding: " + enc );
        if( enc == null )
        {
            enc = "ISO-8859-1";
        }
        // I have no idea if http encoding strings will work here
        InputStreamReader inr = new InputStreamReader(
            c.getInputStream(), enc );
        List<CharBuffer> result = new ArrayList<CharBuffer>();
        int byteCount = 0;
        for( ;; )
        {
            int read;
            CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
            if( ( read = inr.read( cb )) != -1 )
            {
                byteCount += read;
                result.add( cb );
            }
            else
            {
                break;
            }
        }
        System.out.println( "Read: " + byteCount );
    }
    else // binary
    {
        System.out.println("binary...");
    }
}

Some other thoughts:

1. If the URL string depends on user input, you may have to use
URLEncoder if the user input goes in the parameter part of the URL.

2. Don't forget that other protocols besides HTTP exist. The Java API
also supports FTP and JAR I believe. You might get one of those instead
of HTTP. You may wish to check the protocol expressly if you don't set
it yourself.

3. Both mime type and the character encoding may be null. The defaults
are "text" and ISO-8859-1 respectively, but there are also "guess"
methods in the URLConnection object.

4. If you don't have text, you might have an image. It might be nice to
return an Image in that case. I didn't get that far though.

5. I can't find any expandable buffers for Java. StringBuilder or
StringWriter seem like a good idea. I made my own by stuffing
CharBuffers into a List. The idea is to avoid testing each character
for an end-of-line, which readLine() must do. Hopefully the CharBuffer
is faster.

6. You could also read the data raw (ByteBuffer) and decide what to do
with it later. This might be more in the spirit of a "slurp" operation.

7. I looked for a way to get a channel from the URLConnection and didn't
find one. I think this is a defect in the Java API, myself. Using
direct buffers might be a big performance win here. You'll need a raw
socket for that I guess.
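
As a rough illustration of points 5 and 6, java.io.ByteArrayOutputStream can serve as the expandable buffer for raw bytes, with decoding left for later. This is only a sketch; the class name RawSlurp and the 8 KB block size are arbitrary choices, not anything taken from the posts above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class RawSlurp {
    /** Copies the stream into a growing byte buffer. The caller closes the stream. */
    static byte[] readAllBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);   // the buffer grows as needed
        }
        return out.toByteArray();
    }
}
```

Once the charset is known, the text is just new String(bytes, charset), and the same bytes can be decoded a second time if the first guess turns out to be wrong.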
 
Tom Anderson

My understanding is that the server may, in pretty much any situation,
send whatever charset it likes, as long as it declares it in the
content-type header.

Mark said:
P.S. the openStream() method for URL seems to open the type of connection
you need directly.

BufferedReader bin = null;

URL url = new URL( arg[0] );
bin = new BufferedReader(
new InputStreamReader( url.openStream() ));

I think. Better check that.

You're absolutely right.

A slightly more correct approach (which might have been expounded
downthread already) would be to use a URLConnection, get the content-type,
parse it to identify a charset, and then use that to configure the
InputStreamReader correctly.

Sadly, and shockingly, there doesn't seem to be anything to parse
content-type headers in the standard library. There is a
javax.mail.internet.ContentType in J2EE, though, and it's not too hard to
write yourself.
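
A hand-written parser along those lines might look like the sketch below. The names ContentTypeCharset and charsetOf are invented for this example, and splitting on ';' is a simplification that would mishandle a quoted parameter value containing a semicolon.

```java
import java.nio.charset.Charset;

final class ContentTypeCharset {
    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=UTF-8". Returns the fallback if the header is
     * null, has no charset parameter, or names a charset Java does not know.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {    // parts[0] is the media type
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0 || !param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);   // strip quotes
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) {
                return fallback;    // illegal or unsupported charset name
            }
        }
        return fallback;
    }
}
```

With a URLConnection c, the call would be something like charsetOf(c.getContentType(), StandardCharsets.ISO_8859_1), and the result can be passed straight to the InputStreamReader constructor.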

There's also an intriguing getContent() method that sounds like it should
be even closer to what Stefan wanted - it downloads the bytes, then uses
the content-type to convert them into an object. However, it's not
entirely clear exactly what kind of object you're supposed to get, which
makes it more or less useless. In practice, getting HTML text gives you an
InputStream, and getting an image gives you a
java.awt.image.ImageProducer. That's not enormously useful here.

tom
 
Stefan Ram

Tom Anderson said:
Sometimes it takes a madman like Iggy Pop before you can SEE

I am wondering whether I should attend his concert
with the Stooges at the end of the next month.

Regarding the charset parameter of MIME types, there is:

java.awt.datatransfer.MimeTypeParameterList

But it is not a public class. So much for reuse.
 
Mark Space

Stefan said:
In spite of its name, getContentEncoding() does /not/
designate the content character encoding.

Yup, I shoulda read the docs better. I'll correct my example, thanks.
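
To make the distinction concrete: getContentEncoding() returns the value of the Content-Encoding header, which names a transformation of the body such as "gzip", while the charset travels as a parameter of Content-Type. A small sketch of what the header is actually good for (the class and method names are hypothetical, and only "gzip" is handled):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

final class ContentEncodingDemo {
    /**
     * Undoes the Content-Encoding of a response body. The argument is what
     * URLConnection.getContentEncoding() returns: null or a coding such as
     * "gzip", never a charset name.
     */
    static InputStream decodeBody(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;    // identity, or a coding this sketch does not handle
    }
}
```

The stream returned here still delivers bytes; picking the charset for the Reader is a separate step.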
 
Arne Vajhøj

Mark said:
So I'm no expert, and I hope I'm not wasting your time by blathering,
but the question is interesting to me so I did a bit of work on it.
Here's what I have so far.

You also need to handle the META HTTP-EQUIV way of specifying charset.

My suggestion for code:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpDownloadCharset {
    private static Pattern encpat =
        Pattern.compile("charset=([A-Za-z0-9-]+)", Pattern.CASE_INSENSITIVE);
    private static String parseContentType(String contenttype) {
        // getContentType() returns null when the header is missing
        if(contenttype == null) {
            return "ISO-8859-1";
        }
        Matcher m = encpat.matcher(contenttype);
        if(m.find()) {
            return m.group(1);
        } else {
            return "ISO-8859-1";
        }
    }
    private static Pattern metaencpat =
        Pattern.compile("<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
                        Pattern.CASE_INSENSITIVE);
    private static String parseMetaContentType(String html, String defenc) {
        Matcher m = metaencpat.matcher(html);
        if(m.find()) {
            return parseContentType(m.group(1));
        } else {
            return defenc;
        }
    }
    // used when no Content-Length is sent; longer bodies are truncated
    private static final int DEFAULT_BUFSIZ = 1000000;
    public static String download(String urlstr) throws IOException {
        URL url = new URL(urlstr);
        HttpURLConnection con = (HttpURLConnection)url.openConnection();
        con.connect();
        if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
            // first guess: charset from the HTTP Content-Type header
            String enc = parseContentType(con.getContentType());
            int bufsiz = con.getContentLength();
            if(bufsiz < 0) {
                bufsiz = DEFAULT_BUFSIZ;
            }
            byte[] buf = new byte[bufsiz];
            InputStream is = con.getInputStream();
            int ix = 0;
            int n;
            while((n = is.read(buf, ix, buf.length - ix)) > 0) {
                ix += n;
            }
            is.close();
            con.disconnect();
            // decode only the ix bytes actually read, not the whole buffer
            String temp = new String(buf, 0, ix, "US-ASCII");
            // second guess: a META tag in the document overrides the header
            enc = parseMetaContentType(temp, enc);
            return new String(buf, 0, ix, enc);
        } else {
            // fetch the message before disconnecting
            String msg = con.getResponseMessage();
            con.disconnect();
            throw new IllegalArgumentException("URL " + urlstr + " returned " + msg);
        }
    }
}

Arne
 
