Would a lack of line breaks in a document cause parsing problems?

charliefortune

I am fetching some product feeds with PHP like this

$merch = substr($key,1);
$feed = file_get_contents($_POST['data_url'.$merch]);
$fp = fopen("./feeds/feed".$merch.".txt","w+");
fwrite ($fp,$feed);
fclose ($fp);

and then parsing them with PHP's native parsing functions. This is
successful for most of the feeds, but a couple of them claim to be
empty when I know they are not. On inspection, the thing these failing
feeds have in common is that they contain no line breaks at all.
Should this cause a problem?
 
Asger Jørgensen

Hi Charlie

charliefortune said:
I am fetching some product feeds with PHP like this

$merch = substr($key,1);
$feed = file_get_contents($_POST['data_url'.$merch]);
$fp = fopen("./feeds/feed".$merch.".txt","w+");
fwrite ($fp,$feed);
fclose ($fp);

and then parsing them with PHP's native parsing functions. This is
successful for most of the feeds, but a couple of them claim to be
empty when I know they are not. On inspection, the thing these failing
feeds have in common is that they contain no line breaks at all.
Should this cause a problem?

Just to be sure, you are talking about XML files?

Whitespace in an XML file is discarded, so it shouldn't matter, but
hey, I'm just a newbie, so maybe someone else knows more.

But you should post a little sample of the XML file that causes the
error. That would make it easier to give advice.

Kind regards
Asger
 
Richard Tobin

Asger Jørgensen said:
Whitespace in an XML file is discarded

This is not true. Whitespace in markup (e.g. between attributes) is
unimportant, but XML parsers must return all whitespace in text
content, even the possibly-unimportant whitespace between elements in
element-only content.

XML doesn't require any line breaks in a document. I wouldn't be
surprised if some cheap-and-cheerful XML processors have problems with
huge documents all on one line, though.

-- Richard
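
Both points can be checked with a small sketch against PHP's bundled XML parser functions. The helper name collectText and the two sample documents are made up for illustration; they are not taken from the feed in question.

```php
<?php
// Hypothetical helper: concatenate every chunk of character data the parser reports.
function collectText(string $xml): string
{
    $text = '';
    $parser = xml_parser_create();
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data; // character data may arrive in several chunks
    });
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $text;
}

$indented = "<a>\n  <b>x</b>\n</a>"; // line breaks and indentation between elements
$oneLine  = "<a><b>x</b></a>";       // same structure on a single line

var_dump(collectText($indented)); // the whitespace between elements is reported too
var_dump(collectText($oneLine));  // parses without error; no line breaks are needed
```

The handlers in a real application usually have to decide for themselves whether whitespace-only text between elements matters.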
 
Martin Honnen

charliefortune said:
I am fetching some product feeds with PHP like this

$merch = substr($key,1);
$feed = file_get_contents($_POST['data_url'.$merch]);
$fp = fopen("./feeds/feed".$merch.".txt","w+");
fwrite ($fp,$feed);
fclose ($fp);

and then parsing them with PHP's native parsing functions. This is
successful for most of the feeds, but a couple of them claim to be
empty when I know they are not. On inspection, the thing these failing
feeds have in common is that they contain no line breaks at all.
Should this cause a problem?

The code you show above is not XML or feed specific, you are simply
using PHP file IO to fetch contents and write it to a file.
If you have trouble parsing such a file later with "PHP's native
parsing" functions then show us that code and show us a sample of a feed
where it does not work as you want it to work.
 
charliefortune

Richard Tobin said:
This is not true. Whitespace in markup (e.g. between attributes) is
unimportant, but XML parsers must return all whitespace in text
content, even the possibly-unimportant whitespace between elements in
element-only content.

XML doesn't require any line breaks in a document. I wouldn't be
surprised if some cheap-and-cheerful XML processors have problems with
huge documents all on one line, though.

-- Richard

Thank you for your help.

Here is one of the ones that doesn't work

http://fcshirtshop.com/feeds/feed1058.txt.

My parsing code is this...

$furl = "http://fcshirtshop.com/feeds/feed1058.txt";
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

$fp = fopen($furl, "r") or die("Error reading XML data.");
// Feed the parser 4096 bytes at a time; the third argument flags the final chunk.
while ($data = fread($fp, 4096)) {
    xml_parse($xml_parser, $data, feof($fp))
        or die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
}
fclose($fp);
xml_parser_free($xml_parser);

The characterData, startElement and endElement handlers follow, and I
know these function correctly because they work for lots of other
feeds I use this for. I am stuck and the only thing I have been able
to isolate is the fact that it is on a single line.
 
Richard Tobin

charliefortune said:
Here is one of the ones that doesn't work

http://fcshirtshop.com/feeds/feed1058.txt.

That appears to be a perfectly legal XML document.

charliefortune said:
My parsing code is this...

I'm not familiar with the system you're using, so I won't comment on that.

charliefortune said:
I am stuck and the only thing I have been able
to isolate is the fact that it is on a single line.

Have you tried splitting the file up into multiple lines (e.g.
by putting line breaks in the start tags) and seeing if it
works then?

-- Richard
 
Joseph Kesselman

No, a lack of line breaks should not cause problems for any compliant
XML parser. Excess line breaks in the wrong places might.

Since I don't use PHP, I can't speak to whether it has any limitations.
Contact its authors or support group?
 
Asger Jørgensen

Hi Richard

Richard Tobin said:
This is not true. Whitespace in markup (e.g. between attributes) is
unimportant, but XML parsers must return all whitespace in text
content, even the possibly-unimportant whitespace between elements in
element-only content.

You are absolutely right, whitespace can be important, but since the
question was about line breaks, I wasn't that far off..;-)

Thanks for pointing it out.

Kind regards
Asger
 
Asger Jørgensen

Hi Charlie

charliefortune said:
Here is one of the ones that doesn't work

I can't find anything wrong with the file either,
but I can't find any indication that the encoding should be UTF-8 either,
unless that's the accepted default (XML newbie, you know).
At least the file doesn't say anything, and there is no BOM.
The file must be a difficult one, though: IE was working on it for
15 minutes and was only halfway through when I closed it.
It didn't report any errors, though.

One trick you could try, though.

You could do a search and replace on the file before you parse it:

Look for: "<!" and replace it with "\n<!"
(\n being a newline)

There is a <![CDATA in almost every element,
so that would give you enough line breaks for your
parser to cope with it, if it is the missing
line breaks that cause the trouble.
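
In PHP that replacement could look roughly like the sketch below. The function name and the sample string are invented for illustration, and the trick is meant as a diagnostic only: the inserted newline lands inside the element and becomes part of its text content.

```php
<?php
// Hypothetical diagnostic helper: start a new line before every "<!"
// (CDATA sections, but also comments and a DOCTYPE if the feed has one).
function breakBeforeMarkupDeclarations(string $feed): string
{
    return str_replace('<!', "\n<!", $feed);
}

// Made-up one-line sample in the style described above.
$sample = '<items><item><![CDATA[Shirt]]></item><item><![CDATA[Scarf]]></item></items>';
$broken = breakBeforeMarkupDeclarations($sample);

echo $broken, "\n";
echo substr_count($broken, "\n"), " line breaks added\n";
```

If the parser copes with the rewritten string but not with the original, a line-length limit becomes the prime suspect.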

Kind regards
Asger
 
Richard Tobin

Asger Jørgensen said:
I can't find anything wrong with the file either,
but I can't find any indication that the encoding should be UTF-8

It only contains ASCII characters, so as long as you don't use EBCDIC...

-- Richard
 
Joe Kesselman

Asger said:
but I can't find any indication that the encoding should be UTF-8 either,
unless that's the accepted default (XML newbie, you know)

Most XML parsers will autodetect UTF-16, and will fall back to UTF-8. The
XML spec describes how to do that.

Asger said:
There is a <![CDATA in almost every element

Haven't looked, but I'd bet failure to balance those properly is what's
causing the problems. CDATA Sections are almost always bad practice, for
that reason among many others.
 
Joe Kesselman

Took a quick look. That is one gawdawful long line (about 4MB, .5MB of
which is wasted on the CDATA Section delimiters).

I suspect you've overloaded some buffering limit in your parser. A
decent parser shouldn't have any trouble with it. A sloppy parser may be
reading stuff line-by-line and making unwarranted assumptions about the
longest expected line length, or may have limits on the maximum size of
its in-memory data structures.

Contact your parser's authors and ask them, or switch parsers and see if
the problem persists. There's nothing obviously wrong with the file, and
if there was it'd be the parser's responsibility to tell you that rather
than to give up and pretend the document was empty.
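
One way to "switch parsers" without leaving PHP is XMLReader, the pull parser bundled with PHP. The sketch below is only a cross-check under that assumption: the function name countElements and the sample document are made up, and a real run would point the reader at the saved feed instead of a string.

```php
<?php
// Hypothetical cross-check: count element start tags with XMLReader.
function countElements(string $xml): int
{
    $reader = new XMLReader();
    if (!$reader->XML($xml)) {
        throw new RuntimeException('could not set up XMLReader');
    }
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $count++;
        }
    }
    $reader->close();
    return $count;
}

// A document with no line breaks at all.
echo countElements('<a><b>x</b><c/></a>'), "\n";
```

For a feed on disk, XMLReader::open() takes a path or URL in place of the string. If this count is non-zero where the original handlers see nothing, the problem lies in the first parser or in the code around it, not in the document.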
 
Asger Jørgensen

Hi Joe

Joe Kesselman said:
Most XML parsers will autodetect UTF-16, and will fall back to UTF-8. The
XML spec describes how to do that.

Yea, I've seen that done in some source code, that is not the easiest thing
to do, but from a look in the PHP manual it says nothing about autodetecting
the encoding, and it states that PHP defaults to ISO-8859-1.
But I guess nothing will happen if you try to decode
US-ASCII as UTF-8, except for the waste of time.

Asger Jørgensen said:
There is a <![CDATA in almost every element

Joe Kesselman said:
Haven't looked, but I'd bet failure to balance those properly is what's
causing the problems. CDATA Sections are almost always bad practice, for
that reason among many others.

Well, those CDATA sections are quite simple, in this file they are simply
used whenever the element contains text.
Which by the way got me thinking about my own hobby parser, I just throw
those CDATA sections away, I guess that's not the best way <g>

Kind regards
Asger
 
Andy Dingley

Richard Tobin said:
That appears to be a perfectly legal XML document.

Although the file has a ".txt" extension and the web server (possibly
as a result) serves it as text/plain.
Is the OP sure that it's actually being recognised and parsed as an
XML document?

There's also a lot of CDATA sections in there, in contexts where it's
far from necessary to use them. Could they be what's confusing the
parser?
 
Richard Tobin

Andy Dingley said:
There's also a lot of CDATA sections in there, in contexts where it's
far from necessary to use them. Could they be what's confusing the
parser?

Matching up brackets in huge strings is supposed to be just the sort
of thing computers are good at! I think a line-length limit is the
most likely explanation.

-- Richard
 
Joseph Kesselman

Asger said:
Yea, I've seen that done in some source code, that is not the easiest thing
to do

XML actually makes it pretty easy by requiring that the XML Declaration,
when present, come first in the file, with nothing before it except the
byte order mark. If
folks actually follow that rule, examining the first few bytes of the
file is generally enough to figure out byte order and character size
with or without the BOM. That in turn is usually enough to let you read
the XML Declaration and see whether it says anything more about
encodings. Again, the XML spec actually includes a note describing this.
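
A rough PHP sketch of that first-bytes check follows, loosely after the autodetection note in the XML 1.0 spec. The function name sniffXmlEncoding is hypothetical, and the sketch deliberately leaves out the UCS-4 and EBCDIC cases that the spec's note also covers.

```php
<?php
// Hypothetical helper: guess the encoding from the first bytes of a document.
// Simplified; UCS-4 and EBCDIC are not handled here.
function sniffXmlEncoding(string $bytes): string
{
    $head = substr($bytes, 0, 4);
    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) return 'UTF-8';    // UTF-8 BOM
    if (strncmp($head, "\xFE\xFF", 2) === 0)     return 'UTF-16BE'; // big-endian BOM
    if (strncmp($head, "\xFF\xFE", 2) === 0)     return 'UTF-16LE'; // little-endian BOM
    if ($head === "\x00<\x00?")                  return 'UTF-16BE'; // "<?" without BOM
    if ($head === "<\x00?\x00")                  return 'UTF-16LE'; // "<?" without BOM
    // ASCII-compatible start: the declaration itself is readable, so look for encoding="...".
    $pattern = '/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    if (preg_match($pattern, $bytes, $m)) {
        return $m[1];
    }
    return 'UTF-8'; // no BOM and no encoding declaration: UTF-8 is the default
}

echo sniffXmlEncoding('<?xml version="1.0" encoding="ISO-8859-1"?><a/>'), "\n";
echo sniffXmlEncoding('<a/>'), "\n";
```

A feed with no BOM and no declaration, like the one discussed here, ends up in the last branch.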

Asger said:
Well, those CDATA sections are quite simple, in this file they are simply
used whenever the element contains text.

Which is gawdawful sloppy file generation, but sometimes that's out of
our control...

Asger said:
Which by the way got me thinking about my own hobby parser, I just throw
those CDATA sections away, I guess that's not the best way <g>

Nope. If they're there, you have to be prepared to deal with the fact
that they may be escaping things that would otherwise disrupt XML
parsing, such as the < character.
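
As a small illustration with PHP's bundled parser functions, a CDATA section is handed to the character data handler as ordinary text, markup characters included. The helper name characterDataOf and the sample element are made up for this sketch.

```php
<?php
// Hypothetical helper: return all character data the parser reports for a document.
function characterDataOf(string $xml): string
{
    $text = '';
    $parser = xml_parser_create();
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data; // CDATA content arrives here like any other text
    });
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $text;
}

// The "<" and "&" would be errors outside a CDATA section unless escaped.
var_dump(characterDataOf('<d><![CDATA[size < 42 & up]]></d>'));
```

A parser that throws CDATA sections away would lose that text entirely, which in a feed like this one means losing nearly all the content.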
 
Joseph Kesselman

Richard said:
Matching up brackets in huge strings is supposed to be just the sort
of thing computers are good at!

But humans aren't, and in my experience CDATA Sections tend to get used
because humans are creating the file or are expected to edit it.
 
Richard Tobin

Richard Tobin said:
Matching up brackets in huge strings is supposed to be just the sort
of thing computers are good at!

Joseph Kesselman said:
But humans aren't, and in my experience CDATA Sections tend to get used
because humans are creating the file or are expected to edit it.

Yes, but the CDATA sections in this particular file are correct, so
there is no reason why they should cause problems for a parser.
Having lots of correct CDATA sections is not a plausible reason for a
parser to fail.

-- Richard
 
Joseph Kesselman

Richard Tobin said:
Having lots of correct CDATA sections is not a plausible reason for a
parser to fail.

Having taken a quick look, I see no plausible reason for a parser to
fail on that file, period. So we're stuck with looking for non-plausible
reasons... or saying "check your code, check with the parser's authors;
either you did something wrong or they did or both."
 
Peter Flynn

Joe said:
Took a quick look. That is one gawdawful long line (about 4MB, .5MB of
which is wasted on the CDATA Section delimiters).

I suspect you've overloaded some buffering limit in your parser.

Test this by turning all space characters into newlines, e.g.

get http://fcshirtshop.com/feeds/feed1058.txt | tr '\040' '\012' | parse

or whatever syntax your system uses.

///Peter
 
