Twisted

I'm encountering a couple of bogosities here, both of which probably
stem from not handling some corner case involving HTTP and URLs.

The first of those is that some browsing has turned up URLs in links
that look like http://www.foo.com/bar/baz?x=mumble&y=frotz#xyz#uvw
after the conversion with URLDecoder. I don't think the part with the
#s is meant to be a ref, not when there are sometimes two or more of
them, as in the sample. Perhaps these URLs are meant to be passed to
the server
without URLDecoder decoding to de-% them? (Currently when making URLs
from links I extract the "http://..." string, pass it through
URLDecoder.decode(linkString, "UTF-8"), and then pass the result to the
URL constructor. Is this wrong?)
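
For concreteness, a minimal sketch of the flow just described -- the
linkString value is made up, and the %23 escapes are only my guess at
how a double-# URL could come out of the decoding step:

import java.net.URL;
import java.net.URLDecoder;

public class LinkDecodeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical href pulled out of a page; %23 is a
        // percent-encoded '#'.
        String linkString =
            "http://www.foo.com/bar/baz?x=mumble&y=frotz%23xyz%23uvw";

        // Current approach: decode the whole string, then build the URL.
        // Any %23 in the query becomes a literal '#', so the result
        // looks like it has a multi-part ref.
        String decoded = URLDecoder.decode(linkString, "UTF-8");
        URL decodedUrl = new URL(decoded);

        // Alternative for comparison: hand the raw string to the URL
        // constructor untouched, leaving the %-escapes for the server.
        URL rawUrl = new URL(linkString);

        System.out.println(decodedUrl); // ...?x=mumble&y=frotz#xyz#uvw
        System.out.println(rawUrl);     // ...?x=mumble&y=frotz%23xyz%23uvw
    }
}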
Secondly, I'm occasionally getting missing images; attempting to
display them by pasting the link into Sphaera's address bar generates a
bogus "Web page" full of hash, apparently the binary data of an image
file being treated as if it were "text/html". It looks like the remote
servers are sometimes getting the content-type wrong, or not setting it
at all, which is resulting in this behavior.
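
For reference, this is roughly how that header looks from the client
side; contentType comes back null when the server sends nothing at all,
and whatever bogus value it chose otherwise (the URL is a placeholder):

import java.net.URL;
import java.net.URLConnection;

public class HeaderPeek {
    public static void main(String[] args) throws Exception {
        // Placeholder; substitute one of the problem image links.
        URL imageUrl = new URL("http://www.example.com/some/image.jpg");
        URLConnection conn = imageUrl.openConnection();

        // Null if the server sent no Content-Type header at all;
        // otherwise whatever the server claimed, right or wrong.
        String contentType = conn.getContentType();
        System.out.println("Content-Type: " + contentType);
    }
}
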
Should I include code to try to guess missing content-types? There's a
ready-made method to guess it from file extension, but it may be
problematic -- I've seen links like
http://foo.bar.com/cgi-bin?get=quux.jpg that return a Web page with an
ad banner or navigation links or some such at the top, quux.jpg in the
center, and a copyright notice at the bottom, and similar cases. If I
assume that every link ending in .jpg with no server-supplied
content-type header is an image, these will render incorrectly. As
things stand, it assumes that every link with no server-supplied
content-type header is HTML, and sometimes actual JPEGs render
incorrectly. It doesn't seem there's any way to be sure, short of
actually reading the file the way it's currently done, detecting that
its content-type is bogus (maybe by noticing a lot of chars with the
high bit set?), and then reinterpreting the thing using
guessContentType ...
which seems rather awkward. Then again, I *could* just make it detect
questionable "Web pages" with lots of high-ASCII and complain to the
user that the server they went to is broken. >;-> Unfortunately that
might cause problems with international pages, or something of the
sort. Is there at least a safer way to detect binary files
masquerading as text? Maybe counting NUL chars up to a threshold?
Binaries are usually full of NUL and other low-ASCII control chars
besides \n, \r, and \t (the only three that seem to be common in real
text files), as well as high-ASCII bytes.
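
Along those lines, here's a rough sketch of that heuristic -- count NUL
and other control chars (other than \n, \r, and \t) in the first chunk
of the body, and only fall back to the JDK's guessers when the sample
looks binary. The method name, the 1024-byte sample size, and the 5%
threshold are all made up for illustration:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

public class ContentSniffer {
    // Hypothetical helper: guess a content type when the server's header
    // is missing or suspect, by sampling the start of the response body.
    public static String sniffContentType(InputStream in, String fileName)
            throws IOException {
        // Wrap so the sampled bytes can be re-read afterwards; a real
        // caller would need to keep reading from buf, not from in.
        BufferedInputStream buf = new BufferedInputStream(in);
        buf.mark(4096);

        byte[] sample = new byte[1024];
        int n = buf.read(sample, 0, sample.length);

        int suspicious = 0;
        for (int i = 0; i < n; i++) {
            int b = sample[i] & 0xFF;
            // NUL and other low-ASCII control chars, except the three
            // that show up routinely in real text files.
            if (b < 0x20 && b != '\n' && b != '\r' && b != '\t') {
                suspicious++;
            }
        }

        buf.reset();

        // Arbitrary threshold: more than ~5% control chars in the sample
        // probably means binary data rather than text.
        if (n > 0 && suspicious * 20 > n) {
            String guess = URLConnection.guessContentTypeFromStream(buf);
            if (guess == null && fileName != null) {
                guess = URLConnection.guessContentTypeFromName(fileName);
            }
            return guess != null ? guess : "application/octet-stream";
        }

        // Keep the current default for anything that looks like text.
        return "text/html";
    }
}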