Couple java.net questions

Twisted

I'm encountering a couple of bogosities here, both of which probably
stem from not handling some corner case involving HTTP and URLs.

The first of those is that some browsing has turned up URLs in links
that look like http://www.foo.com/bar/baz?x=mumble&y=frotz#xyz#uvw

after the conversion with URLDecoder. I don't think the part with the
#s is meant to be a fragment reference, not when there are sometimes
two or more as in the sample. Perhaps these URLs are meant to be passed
to the server without URLDecoder decoding to de-% them? (Currently,
when making URLs from links, I extract the "http://..." string, pass it
through URLDecoder.decode(linkString, "UTF-8"), and then pass the
result to the URL constructor. Is this wrong?)
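
For reference, the relevant bit currently looks roughly like this (a
minimal sketch, not the actual Sphaera source; the names are made up):

    import java.io.UnsupportedEncodingException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLDecoder;

    class LinkToUrlSketch {
        // Current flow: un-%-escape first, then build the URL from the
        // decoded text. (This is the step I suspect is wrong.)
        static URL fromHref(String linkString)
                throws UnsupportedEncodingException, MalformedURLException {
            String decoded = URLDecoder.decode(linkString, "UTF-8");
            return new URL(decoded);
        }
    }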

Secondly, I'm occasionally getting missing images; attempting to
display them by pasting the link into Sphaera's address bar generates a
bogus "Web page" full of hash, apparently the binary data of an image
file being treated as if it were "text/html". It looks like the remote
servers are sometimes getting the content-type wrong, or not setting it
at all, which is resulting in this behavior.

Should I include code to try to guess missing content-types? There's a
ready-made method to guess it from the file extension, but that may be
problematic -- I've seen links like
http://foo.bar.com/cgi-bin?get=quux.jpg that return a Web page with an
ad banner at the top or navigation links or some such, quux.jpg in the
center, and a copyright notice at the bottom, and similar cases. If I
assume that every link ending in .jpg with no server-supplied
content-type header is an image, those will render incorrectly. As
things stand, it assumes that every link with no server-supplied
content-type header is HTML, and sometimes actual JPEGs render
incorrectly.

It doesn't seem there's any way to be sure, short of actually reading
the file the way it's currently done, detecting that its content-type
is bogus (maybe by noticing a lot of chars with the high bit set?), and
then reinterpreting the thing using guessContentType ... which seems
rather awkward. Then again, I *could* just make it detect questionable
"Web pages" with lots of high-ASCII and complain to the user that the
server they went to is broken. >;-> Unfortunately that might cause
problems with international pages, or something of the sort. Is there
at minimum a safer way to detect binary files masquerading as text?
Maybe counting null chars up to a threshold? Binaries are usually full
of NUL and other low-ASCII control chars other than \n, \r, and \t
(the only three that seem common in real text files), as well as
high-ASCII.
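
For concreteness, here's the sort of check I have in mind (a rough,
untested sketch). I also notice java.net.URLConnection's
guessContentTypeFromStream sniffs the leading magic bytes rather than
the extension, which might sidestep the cgi-bin?get=quux.jpg problem:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.net.URLConnection;

    class TypeSniffSketch {
        // Heuristic: call the data binary if more than a few percent of
        // the first block is NUL or a control char other than \n, \r, \t.
        static boolean looksBinary(byte[] buf, int len) {
            int suspect = 0;
            for (int i = 0; i < len; i++) {
                int b = buf[i] & 0xFF;
                if (b == 0 || (b < 0x20 && b != '\n' && b != '\r' && b != '\t'))
                    suspect++;
            }
            return len > 0 && suspect * 100 / len > 5; // threshold is a guess
        }

        // Second opinion from the standard library: sniffs magic numbers
        // for common formats (GIF, PNG, JPEG, XML, HTML, ...) instead of
        // the file extension.
        static String sniffType(byte[] buf, int len) throws IOException {
            return URLConnection.guessContentTypeFromStream(
                    new ByteArrayInputStream(buf, 0, len));
        }
    }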
 
Daniel Pitts

Twisted said:
[...] Currently, when making URLs from links, I extract the
"http://..." string, pass it through URLDecoder.decode(linkString,
"UTF-8"), and then pass the result to the URL constructor. Is this
wrong?

I don't know for certain, but I think that URL decoding before passing
to the URL constructor is not the proper sequence. You should probably
just pass the URL in unmodified. If you get a MalformedURLException,
then the URL isn't valid anyway.
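
Something like this, I'd think (just a sketch):

    import java.net.MalformedURLException;
    import java.net.URL;

    class PassThroughSketch {
        // Hand the href to the constructor untouched and let it validate.
        static URL fromHref(String linkString) {
            try {
                return new URL(linkString);
            } catch (MalformedURLException e) {
                return null; // caller treats null as "not a usable link"
            }
        }
    }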
 
EJP

Daniel said:
I don't know for certain, but I think that URL decoding before passing
to the URL constructor is not the proper sequence.

Exactly, in fact it's 180 degrees back to front. A human-readable URL
with spaces etc. should be *URLEncoded* before passing to the
constructor, and it can be URLDecoded when you want a more
human-readable version.
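
For example (a rough sketch; the search URL is made up). Note that
URLEncoder is meant for form data, so encode the individual pieces
rather than the whole URL string, or the "://" and "/" separators get
escaped too:

    import java.io.UnsupportedEncodingException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLEncoder;

    class EncodePiecesSketch {
        // Encode just the human-readable query value; "a b" becomes "a+b".
        static URL searchFor(String query)
                throws UnsupportedEncodingException, MalformedURLException {
            String q = URLEncoder.encode(query, "UTF-8");
            return new URL("http://www.foo.com/search?q=" + q);
        }
    }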
 
Twisted

EJP said:
Exactly, in fact it's 180 degrees back to front. A human-readable URL
with spaces etc. should be *URLEncoded* before passing to the
constructor, and it can be URLDecoded when you want a more
human-readable version.

Hrm, OK. Ones found in HTML code don't need either, I take it.

Fixed. And some of the exceptions/oddities I was seeing are gone,
thanks.

Here are a few that remain.

Sphaera is still throwing exceptions out of ImageIO.read() -- I've had
IllegalArgumentExceptions saying "empty region!" and recently an
IndexOutOfBoundsException, all come flying up out of deep inside
library code, where the only user-code action was to invoke
ImageIO.read() on a retrieved image stored in a cache file. The image
may be b0rked, but in that case wouldn't an IOException be preferable?
As a rule, RuntimeExceptions should be used where buggy code, rather
than bad externally-supplied data, is the cause (SecurityException
being the other notable unchecked case).
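
For now I'm considering a defensive wrapper along these lines (a
sketch; it just translates the runtime exceptions into the IOException
I'd have expected):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    class SafeImageRead {
        // Translate the RuntimeExceptions that leak out of ImageIO.read()
        // on corrupt data into the IOException we'd expect.
        static BufferedImage read(File cacheFile) throws IOException {
            try {
                return ImageIO.read(cacheFile); // may be null for unknown formats
            } catch (RuntimeException e) {
                throw new IOException("broken image data: " + cacheFile, e);
            }
        }
    }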

I'm getting NPEs in PriorityQueue.poll() with objects that have a
well-behaved comparator accessing only final fields in the receiver
and the other object. The comparator is inconsistent with equals, but
that apparently isn't supposed to make this particular type of queue
misbehave. The tracebacks sometimes lead to the first field access of
the argument to my comparator, and sometimes to the library code that
calls the comparator, but no deeper. The evidence suggests that nulls
are creeping into the queue, but I'm damn certain it isn't my code
that's putting them there. The null seems to be removed when the
exception is thrown, as subsequent polls of the queue behave themselves
by and large -- until the next time it hurls chunks, often dozens of
polls later.
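
One thing I'm going to double-check is thread safety -- PriorityQueue
is documented as not thread-safe, and unsynchronized access from my
fetcher threads could corrupt the heap in just this way. If that turns
out to be it, the drop-in fix would look like this (sketch;
URLFetchTask is a made-up stand-in for my element type):

    import java.util.concurrent.PriorityBlockingQueue;

    class QueueSketch {
        // Thread-safe replacement for a shared PriorityQueue.
        static final PriorityBlockingQueue<URLFetchTask> pending =
                new PriorityBlockingQueue<URLFetchTask>();

        static class URLFetchTask implements Comparable<URLFetchTask> {
            final int priority; // comparator touches final fields only
            URLFetchTask(int priority) { this.priority = priority; }
            public int compareTo(URLFetchTask other) {
                return this.priority - other.priority;
            }
        }
    }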

I still can't reliably separate images, other binaries, and text.

Lastly, I'm seeing some wonky threading behavior. The associated
inline-image fetching threads linger for a while after a tab is closed
-- each thread spends most of its time in Thread.sleep(), in snatches
of a few hundred msec between polls to see if there's a fresh URL to
retrieve. (There are four per tab, each grabbing one file at a time, to
avoid overloading servers by snarfing every image at once on a page
with dozens -- standard practice with Web browser code, as I understand
it, being to retrieve up to four at a time and block on the others.)
Closing a tab sends each of the four an interrupt() to pop them out of
their slumber; the exception handler leads to exiting the run() method
gracefully. Logically, hitting the X should result in four dead threads
in very short order, but sometimes they linger for as long as a whole
minute, as I've noticed when there's an attached debugger providing
status readouts on such things.
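
Each fetcher's run loop is shaped roughly like this (simplified
sketch, not the real code; nextUrl() and fetch() stand in for the real
work). Could the lingering be an interrupt that arrives mid-fetch,
where it only sets the flag, so nothing throws until fetch() returns
and the next sleep() runs?

    class FetcherSketch implements Runnable {
        public void run() {
            try {
                while (true) {
                    String url = nextUrl();      // poll the shared queue
                    if (url != null) fetch(url); // blocking network I/O
                    Thread.sleep(300);           // snooze a few hundred msec
                }
            } catch (InterruptedException e) {
                // tab closed: fall through and let run() exit gracefully
            }
        }
        private String nextUrl() { return null; } // stand-in
        private void fetch(String url) { }        // stand-in
    }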
 
