Need more info about problem resolving entity reference

David Karr

I have a Cygwin Perl script that makes numerous REST API calls to a local service, parses the results from those, and makes other calls with that data. It also runs some of these calls in multiple threads, using LWP::UserAgent.

It mostly works, but I sometimes get errors like this:

-----------------------
caught error:
500 Can't connect to www.w3.org:80 (Operation now in progress) http://www.w3.org/TR/html4/strict.dtd
Handler couldn't resolve external entity at line 1, column 90, byte 92
error in processing external entity reference at line 1, column 90, byte 92:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
=========================================================================================^
<html>
<head>
at /usr/lib/perl5/vendor_perl/5.14/i686-cygwin-threads-64int/XML/Parser.pm line 187 thread 2
 
David Karr

This error comes from XML::Parser. I assume you are invoking that directly, to parse the REST response? What's happening is that XML::Parser sees a DOCTYPE declaration like

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

and, like a good little SGML-derived XML parser, tries to fetch the DTD (using LWP) so it can validate the rest of the file. For some reason, when it tries to connect to www.w3.org to download the DTD file, the connection is failing with EINPROGRESS. Since LWP isn't expecting that error code, it throws an error.



So, what's the real problem? Well, first, that's an HTML doctype. You can't, in general, parse HTML with an XML parser, so are you sure you're getting the responses you expect? REST services are usually pretty good about getting their Content-types right, so you ought to be able to check for an XML Content-type before passing the data to XML::Parser.
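A minimal sketch of that Content-type check, assuming you have the header value as a string. The helper name is_xml_content_type is made up for illustration, and which media types should count as XML is an assumption you may want to adjust for your service.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a Content-Type header value names an XML
# media type (text/xml, application/xml, or any type ending in "+xml").
sub is_xml_content_type {
    my ($content_type) = @_;
    return 0 unless defined $content_type;

    # Drop parameters such as "; charset=utf-8" and normalise case.
    my ($media_type) = split /;/, $content_type;
    return 0 unless defined $media_type;
    $media_type =~ s/^\s+|\s+$//g;
    $media_type = lc $media_type;

    return 1 if $media_type =~ m{^(?:text|application)/xml$};
    return 1 if $media_type =~ m{\+xml$};
    return 0;
}
```

With LWP::UserAgent you would pass it something like $response->header('Content-Type') and only hand the body to the parser when it returns true.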

I'm completely certain that in these anomalous cases, I'm definitely not getting the response I expect. The problem with this error message is that it gives me absolutely no clue where in the script this is happening. I'm guessing that our back-end server gets confused in some cases, but it's hard to diagnose when I don't know what URL was being attempted, or where in the script it was done.
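One way to get the URL and call site into the message is to wrap the parse in an eval and rethrow with extra context. This is only a sketch: parse_with_context is a hypothetical name, and $parse stands for whatever code ref actually does the parsing in the script.

```perl
use strict;
use warnings;

# Hypothetical wrapper: run a parsing callback on a response body and, if
# it dies, rethrow with the URL and the caller's file and line prepended.
sub parse_with_context {
    my ($url, $parse, $body) = @_;
    my (undef, $file, $line) = caller;

    my $result;
    my $ok = eval { $result = $parse->($body); 1 };
    unless ($ok) {
        my $err = $@ || 'unknown error';
        die "parse failed for $url (called at $file line $line): $err";
    }
    return $result;
}
```

Called as parse_with_context($url, sub { $parser->parse($_[0]) }, $body), the original XML::Parser message is kept, with the URL and caller location in front of it.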
Second, you really don't want to keep fetching the DTDs like that. Does the XML you're actually trying to parse use external DTDs? If not, then you want to pass the NoLWP option to XML::Parser, so that it doesn't even try to fetch DTDs from the network. In the case of a public DTD like HTML the attempt to load it as a local file will fail, of course, but the parsing wasn't going to succeed anyway, because it wasn't XML.
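Roughly, passing NoLWP looks like the sketch below. The Tree style and the parse_xml_string helper are assumptions for illustration; the script may well use a different style or handlers.

```perl
use strict;
use warnings;
use XML::Parser;

# NoLWP => 1 asks XML::Parser to use its file-based external entity
# handler instead of the LWP-based one, so it does not go to the network.
my $xml_parser = XML::Parser->new(Style => 'Tree', NoLWP => 1);

# Hypothetical helper: parse a string and return the tree, or undef
# (with a warning) if the parser dies.
sub parse_xml_string {
    my ($xml) = @_;
    my $tree;
    my $ok = eval { $tree = $xml_parser->parse($xml); 1 };
    warn "XML parse failed: $@" unless $ok;
    return $tree;
}
```

Note that, per the XML::Parser documentation, NoLWP has no effect if you set the ExternEnt or ExternEntFin handlers yourself.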

That "NoLWP" option sounds useful, but it's somewhat moot here.
However, I'm slightly confused here, because the XML::Parser documentation seems to say it doesn't parse external DTDs by default. It's possible I'm misunderstanding; I don't think I've used XML::Parser myself. Are you passing ParseParamEnt, and if so, why?

I don't know what "ParseParamEnt" is, so I imagine I'm not.
Third, you probably don't want to be using XML::Parser at all. As you can see, it's old and rather cronky, and while it's extremely solid code it also takes a rather SGMLish approach to parsing XML. Most of the time, with modern XML use, DTDs are not used, and instead the XML just needs to be well-formed and properly namespaced. For this sort of thing (small documents) I would use XML::LibXML (which, incidentally, also includes a reasonable HTML parser); if a streaming model is more appropriate, either because your documents may be ridiculously large or simply because your program is structured that way, I would use one of the SAX modules.
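For small documents, an XML::LibXML version might look something like this. It is a sketch only: first_text is a hypothetical helper, and turning off network access and external DTD loading is my assumption about what you'd want here.

```perl
use strict;
use warnings;
use XML::LibXML;

# Parser configured to refuse network access and skip external DTDs.
my $libxml = XML::LibXML->new(no_network => 1, load_ext_dtd => 0);

# Hypothetical helper: parse a string and return the string value of an
# XPath expression, or undef if the input is not well-formed XML.
sub first_text {
    my ($xml, $xpath) = @_;
    my $doc = eval { $libxml->load_xml(string => $xml) };
    return unless $doc;
    return $doc->findvalue($xpath);
}
```

For example, first_text($body, '/result/id') would pull one value out of a response, with the element names here being placeholders.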

The funny thing about searching in CPAN is that there are no packages (I'm guessing) that say "do not use this, use something better". I'll take a look at XML::LibXML to see what it does for me.
Finally, fourth, I have no idea where that EINPROGRESS is coming from. That error is supposed to be returned if a socket is connected while in non-blocking mode, and the connection cannot be completed without blocking; it's basically the equivalent of EAGAIN for connect(). This means it shouldn't be possible to get that error without having asked for it by setting nonblocking mode on the socket, which LWP does not (normally) do.



Are you doing something peculiar which might cause this to happen? Alternatively, it's possible this is some sort of Cygwin peculiarity, which unfortunately may be difficult to track down; if you can isolate the conditions where the error occurs it would be useful. (For instance, does it tend to occur when the network goes down? When the network is overloaded? When the DNS doesn't respond promptly?)

The script runs for perhaps 30-40 minutes, basically walking the entire data model of a REST API. It sends hundreds of requests to the (load-balanced) service, some from multiple threads. This kind of error happens several times during the run of the script, which means that the vast majority of requests work well enough. I ended up putting a hack into my "sendGet" sub that just checks for "DOCTYPE HTML" in the output and simply tries again, with a reasonable limit of retries. Almost all of the calls that detect this once or twice eventually get good data.
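The retry hack is roughly the following. This is a simplified sketch rather than the actual "sendGet" code: get_with_retry, the $fetch code ref (anything that takes a URL and returns the body as a string) and the default of 3 tries are all stand-ins.

```perl
use strict;
use warnings;

# Sketch of the retry idea. $fetch is a code ref that takes a URL and
# returns the response body; $max_tries defaults to an assumed 3.
sub get_with_retry {
    my ($fetch, $url, $max_tries) = @_;
    $max_tries ||= 3;

    for my $attempt (1 .. $max_tries) {
        my $body = $fetch->($url);

        # An HTML doctype means we got an error page back, not XML.
        return $body if defined $body && $body !~ /<!DOCTYPE\s+HTML/i;

        warn "attempt $attempt of $max_tries for $url did not return XML\n";
    }
    die "gave up on $url after $max_tries attempts\n";
}
```

Logging the URL on each failed attempt also helps with working out which requests the back end is confused by.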
 
