Would a lack of line breaks in a document cause parsing problems?

charliefortune · Sep 26, 2007

I am fetching some product feeds with PHP like this:

$merch = substr($key,1);
$feed = file_get_contents($_POST['data_url'.$merch]);
$fp = fopen("./feeds/feed".$merch.".txt","w+");
fwrite ($fp,$feed);
fclose ($fp);

and then parsing them with PHP's native parsing functions. This is
successful for most of the feeds, but a couple of them claim to be
empty when I know they are not. On inspection, the thing these failing
feeds have in common is that they contain no line breaks at all.
Should this cause a problem?
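One thing the fetch code above does not do is check whether the download or the write actually succeeded, so a failed fetch could end up stored as an empty feed. A minimal sketch of a checked version follows; the function name saveFeed and its error messages are hypothetical, not part of the original code.

```php
<?php
// Hypothetical helper: copies a feed from $source (URL or path) to $dest and
// refuses to store a failed or empty download.
function saveFeed(string $source, string $dest): int
{
    // file_get_contents() returns false on failure; @ hides the warning
    // because the failure is reported through the exception instead.
    $feed = @file_get_contents($source);
    if ($feed === false) {
        throw new RuntimeException("Could not read feed from $source");
    }
    if (trim($feed) === '') {
        throw new RuntimeException("Feed from $source is empty");
    }

    // file_put_contents() returns the number of bytes written, or false.
    $written = file_put_contents($dest, $feed);
    if ($written === false || $written !== strlen($feed)) {
        throw new RuntimeException("Could not write feed to $dest");
    }
    return $written;
}
```

If this throws for the feeds that "claim to be empty", the problem is in the fetch rather than in the parser.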

Asger Jørgensen · Sep 26, 2007

Hi Charlie

charliefortune said:
On inspection, the thing these failing feeds have in common is that they
contain no line breaks at all. Should this cause a problem?

Just to be sure, you are talking about XML files?

Whitespace in an XML file is discarded, so it shouldn't matter, but
hey, I'm just a newbie, so maybe someone else knows more.

But you should post a small sample of the XML file that causes the
error. That would make it easier to give advice.

Kind regards
Asger

Richard Tobin · Sep 26, 2007

Asger Jørgensen said:
Whitespace in an XML file is discarded

This is not true. Whitespace in markup (e.g. between attributes) is
unimportant, but XML parsers must return all whitespace in text
content, even the possibly-unimportant whitespace between elements in
element-only content.

XML doesn't require any line breaks in a document. I wouldn't be
surprised if some cheap-and-cheerful XML processors have problems with
huge documents all on one line, though.

-- Richard
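The point about whitespace can be checked with PHP's own SAX-style functions. The sketch below (the helper name collectText is made up for illustration) gathers everything the character data handler receives; the whitespace between elements should show up in the result, and the same document without any line breaks should parse just as well.

```php
<?php
// Collects all character data the parser reports for $xml.
// The handler may be called several times, so the pieces are concatenated.
function collectText(string $xml): string
{
    $text = '';
    $parser = xml_parser_create();
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$text): void {
            $text .= $data;
        }
    );
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            (string) xml_error_string(xml_get_error_code($parser))
        );
    }
    return $text;
}

// Same content, with and without line breaks between the elements.
var_dump(collectText("<a>\n  <b>x</b>\n</a>"));
var_dump(collectText("<a><b>x</b></a>"));
```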

Martin Honnen · Sep 26, 2007

charliefortune said:
I am fetching some product feeds with PHP like this

$merch = substr($key,1);
$feed = file_get_contents($_POST['data_url'.$merch]);
$fp = fopen("./feeds/feed".$merch.".txt","w+");
fwrite ($fp,$feed);
fclose ($fp);

and then parsing them with PHP's native parsing functions. This is
successful for most of the feeds, but a couple of them claim to be
empty when I know they are not. On inspection, the thing these failing
feeds have in common is that they contain no line breaks at all.
Should this cause a problem?

The code you show above is not XML or feed specific, you are simply
using PHP file IO to fetch contents and write it to a file.
If you have trouble parsing such a file later with "PHP's native
parsing" functions then show us that code and show us a sample of a feed
where it does not work as you want it to work.

charliefortune · Sep 26, 2007

Richard Tobin said:
XML doesn't require any line breaks in a document. I wouldn't be
surprised if some cheap-and-cheerful XML processors have problems with
huge documents all on one line, though.

Thank you for your help.

Here is one of the ones that doesn't work

http://fcshirtshop.com/feeds/feed1058.txt.

My parsing code is this...

$furl = "http://fcshirtshop.com/feeds/feed1058.txt";
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

$fp = fopen($furl, "r") or die("Error reading XML data.");
while ($data = fread($fp, 4096)) {
    xml_parse($xml_parser, $data, feof($fp))
        or die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
}
fclose($fp);
xml_parser_free($xml_parser);

The characterData, startElement and endElement handlers follow, and I
know these function correctly because they work for lots of other
feeds I use this for. I am stuck and the only thing I have been able
to isolate is the fact that it is on a single line.
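Two details of the loop above may be worth ruling out, independent of line length: `while ($data = fread(...))` stops early if a chunk happens to be the string "0", and if the last read does not hit end-of-file the parser may never be called with the final flag set, so an incomplete document would go unreported. The sketch below is one way to tighten this up; parseFeedFile is a hypothetical name, and it simply counts start tags so that an "empty" result becomes visible.

```php
<?php
// Hypothetical wrapper around the same xml_* calls. Returns the number of
// start tags seen and throws with position details on any parse error.
function parseFeedFile(string $path, int $chunkSize = 4096): int
{
    $elements = 0;
    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, string $name, array $attrs) use (&$elements): void {
            $elements++;
        },
        function ($parser, string $name): void {
            // End tags are not needed for counting.
        }
    );

    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Loop on feof() so the final call always carries the "final" flag,
        // even when that last call has no data left to pass.
        while (!feof($fp)) {
            $data = fread($fp, $chunkSize);
            if ($data === false) {
                throw new RuntimeException("Read error on $path");
            }
            if (!xml_parse($parser, $data, feof($fp))) {
                throw new RuntimeException(sprintf(
                    'XML error: %s at line %d, column %d, byte %d',
                    xml_error_string(xml_get_error_code($parser)),
                    xml_get_current_line_number($parser),
                    xml_get_current_column_number($parser),
                    xml_get_current_byte_index($parser)
                ));
            }
        }
    } finally {
        fclose($fp);
    }
    return $elements;
}
```

For a file that sits on a single line, the byte index is likely to be more useful than the line number when locating an error.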

Richard Tobin · Sep 26, 2007

charliefortune said:
Here is one of the ones that doesn't work

http://fcshirtshop.com/feeds/feed1058.txt.

That appears to be a perfectly legal XML document.

My parsing code is this...

I'm not familiar with the system you're using so I won't comment on that.

I am stuck and the only thing I have been able
to isolate is the fact that it is on a single line.

Have you tried splitting the file up into multiple lines (e.g.
by putting line breaks in the start tags) and seeing if it
works then?

-- Richard

Joseph Kesselman · Sep 26, 2007

Not for any compliant XML parser. Excess line breaks in the wrong places
might.

Since I don't use PHP, I can't speak to whether it has any limitations.
Contact its authors or support group?

Asger Jørgensen · Sep 26, 2007

Hi Richard

Richard Tobin said:
This is not true. Whitespace in markup (e.g. between attributes) is
unimportant, but XML parsers must return all whitespace in text
content, even the possibly-unimportant whitespace between elements in
element-only content.

You are absolutely right, whitespace can be important, but since the
question was about line breaks, I wasn't that far off..;-)

Thanks for pointing it out.

Kind regards
Asger

Asger Jørgensen · Sep 26, 2007

Hi Charlie

charliefortune said:
Here is one of the ones that doesn't work

http://fcshirtshop.com/feeds/feed1058.txt.

I can't find anything wrong with the file either, but I can't find any
indication that the encoding should be UTF-8, unless that's the accepted
default (XML newbie, you know). At least the file doesn't declare an
encoding, and there is no BOM.
The file must be a difficult one though: IE worked on it for 15 minutes
and was only halfway through when I closed it. It didn't report any
errors, though.

There is one trick you could try.

You could do a search and replace on the file before you parse it:

Look for: "<!" and replace it with "\n<!"
\n being a newline

There is a <![CDATA in almost every element, so that would give you
enough line breaks for your parser to cope with it, if it is the missing
line breaks that are causing the trouble.

Kind regards
Asger
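In PHP the search and replace described above could be sketched as follows. The function name is hypothetical, and this is meant as a diagnostic only: it adds a newline to the text content in front of every CDATA section, and it would also alter any literal "<!" that appears inside a CDATA section or comment.

```php
<?php
// Diagnostic only: insert a newline before every "<!" so that no single
// line stays megabytes long. Not a content-preserving transformation.
function breakBeforeMarkupDeclarations(string $xml): string
{
    return str_replace('<!', "\n<!", $xml);
}

echo breakBeforeMarkupDeclarations('<a><![CDATA[x]]></a>'), "\n";
```

If the feed parses after this change and not before, line length is a likely culprit.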

Richard Tobin · Sep 26, 2007

Asger Jørgensen said:
I can't find anything wrong with the file either,
but I can't find any indication that the encoding should be UTF-8

It only contains ASCII characters, so as long as you don't use EBCDIC...

-- Richard

Joe Kesselman · Sep 27, 2007

Asger said:
but I can't find any indication that the encoding should be UTF-8,
unless that's the accepted default (XML newbie, you know)

Most XML parsers will autodetect UTF-16, and will fall back to UTF-8. The
XML spec describes how to do that.

There is a <![CDATA in almost every element

Haven't looked, but I'd bet failure to balance those properly is what's
causing the problems. CDATA Sections are almost always bad practice, for
that reason among many others.

Joe Kesselman · Sep 27, 2007

Took a quick look. That is one gawdawful long line (about 4MB, .5MB of
which is wasted on the CDATA Section delimiters).

I suspect you've overloaded some buffering limit in your parser. A
decent parser shouldn't have any trouble with it. A sloppy parser may be
reading stuff line-by-line and making unwarranted assumptions about the
longest expected line length, or may have limits on the maximum size of
its in-memory data structures.

Contact your parser's authors and ask them, or switch parsers and see if
the problem persists. There's nothing obviously wrong with the file, and
if there was it'd be the parser's responsibility to tell you that rather
than to give up and pretend the document was empty.
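Within PHP, one way to "switch parsers" is to run the same bytes through the DOM extension instead of the SAX-style xml_* functions. The sketch below assumes the feed is already in a string; countElementsWithDom is a made-up name. Both APIs are often backed by libxml2, so this is a cross-check of the calling code more than a fully independent second opinion.

```php
<?php
// Parses $xml with DOMDocument and returns the number of elements found.
// Throws with libxml's first error message if the document is rejected.
function countElementsWithDom(string $xml): int
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // LIBXML_PARSEHUGE relaxes libxml2's built-in limits for large inputs.
    $ok = $doc->loadXML($xml, LIBXML_PARSEHUGE);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($ok === false) {
        $message = $errors ? trim($errors[0]->message) : 'unknown error';
        throw new RuntimeException("DOM parse failed: $message");
    }
    // '*' matches every element in the document.
    return $doc->getElementsByTagName('*')->length;
}
```

A non-zero count here, combined with an "empty" result from the handler-based code, would point at the handlers or the read loop rather than at the feed.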

Asger Jørgensen · Sep 27, 2007

Hi Joe

Joe Kesselman said:
Most XML parsers will autodetect UTF-16, and will fall back to UTF8. The
XML spec describes how to do that.

Yes, I've seen that done in some source code, and it is not the easiest
thing to do. From a look at the PHP manual, though, it says nothing about
autodetecting the encoding, and it states that PHP defaults to ISO-8859-1.
But I guess nothing will happen if you try to decode US-ASCII as UTF-8,
except for the waste of time.

Joe Kesselman said:
There is a <![CDATA in almost every element

Haven't looked, but I'd bet failure to balance those properly is what's
causing the problems. CDATA Sections are almost always bad practice, for
that reason among many others.

Well, those CDATA sections are quite simple; in this file they are simply
used whenever the element contains text.
Which, by the way, got me thinking about my own hobby parser: I just throw
those CDATA sections away. I guess that's not the best way <g>

Kind regards
Asger

Andy Dingley · Sep 27, 2007

Richard Tobin said:
That appears to be a perfectly legal XML document.

Although the file has a ".txt" extension and the web server (possibly
as a result) serves it as text/plain.
Is the OP sure that it's actually being recognised and parsed as an
XML document?

There's also a lot of CDATA sections in there, in contexts where it's
far from necessary to use them. Could they be what's confusing the
parser?

Richard Tobin · Sep 27, 2007

Andy Dingley said:
There's also a lot of CDATA sections in there, in contexts where it's
far from necessary to use them. Could they be what's confusing the
parser?

Matching up brackets in huge strings is supposed to be just the sort
of thing computers are good at! I think a line-length limit is the
most likely explanation.

-- Richard

Joseph Kesselman · Sep 27, 2007

Asger said:
Yea, I've seen that done in some source code, that is not the easiest thing
to do

XML actually makes it pretty easy by requiring that the XML Declaration,
if there is one, come first in the file, with nothing before it except
the byte order mark. If folks actually follow that rule, examining the
first few bytes of the file is generally enough to figure out byte order
and character size with or without the BOM. That in turn is usually
enough to let you read the XML Declaration and see whether it says
anything more about encodings. Again, the XML spec actually includes a
note describing this.
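A simplified sketch of that first-few-bytes check in PHP might look like this. It covers only the UTF-8 and UTF-16 cases plus the encoding named in the declaration; UCS-4 and EBCDIC are left out, and the function name sniffXmlEncoding is hypothetical.

```php
<?php
// Guesses the encoding of an XML document from its first bytes.
// Simplified: handles BOMs, BOM-less UTF-16, and the encoding
// pseudo-attribute of an ASCII-compatible XML Declaration.
function sniffXmlEncoding(string $head): string
{
    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';                       // UTF-8 byte order mark
    }
    if (strncmp($head, "\xFE\xFF", 2) === 0
        || strncmp($head, "\x00<\x00?", 4) === 0) {
        return 'UTF-16BE';                    // BOM, or "<?" as 16-bit big-endian
    }
    if (strncmp($head, "\xFF\xFE", 2) === 0
        || strncmp($head, "<\x00?\x00", 4) === 0) {
        return 'UTF-16LE';                    // BOM, or "<?" as 16-bit little-endian
    }
    // ASCII-compatible start: use the declared encoding if there is one.
    $pattern = '/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    if (preg_match($pattern, $head, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';                           // default when nothing says otherwise
}
```

For a file like the one discussed here, with no BOM and no declared encoding, this falls through to the UTF-8 default.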

Well, those CDATA sections are quite simple; in this file they are simply
used whenever the element contains text.

Which is gawdawful sloppy file generation, but sometimes that's out of
our control...

Which, by the way, got me thinking about my own hobby parser: I just throw
those CDATA sections away. I guess that's not the best way <g>

Nope. If they're there, you have to be prepared to deal with the fact
that they may be escaping things that would otherwise disrupt XML
parsing, such as the < character.
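To illustrate with PHP's xml_* functions: the content of a CDATA section is handed to the character data handler as ordinary text, so a parser that drops CDATA sections would lose that text altogether. The helper name below is made up for this sketch.

```php
<?php
// Returns the concatenated character data of $xml, including whatever
// was wrapped in CDATA sections.
function textIncludingCdata(string $xml): string
{
    $text = '';
    $parser = xml_parser_create();
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$text): void {
            $text .= $data;
        }
    );
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            (string) xml_error_string(xml_get_error_code($parser))
        );
    }
    return $text;
}

// The "<" and "&" would be markup errors outside the CDATA section.
echo textIncludingCdata('<p><![CDATA[a < b & c]]></p>'), "\n";
```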

Joseph Kesselman · Sep 27, 2007

Richard said:
Matching up brackets in huge strings is supposed to be just the sort
of thing computers are good at!

But humans aren't, and in my experience CDATA Sections tend to get used
because humans are creating the file or are expected to edit it.

Richard Tobin · Sep 27, 2007

Joseph Kesselman said:
Richard said:
Matching up brackets in huge strings is supposed to be just the sort
of thing computers are good at!

But humans aren't, and in my experience CDATA Sections tend to get used
because humans are creating the file or are expected to edit it.

Yes, but the CDATA sections in this particular file are correct, so
there is no reason why they should cause problems for a parser.
Having lots of correct CDATA sections is not a plausible reason for a
parser to fail.

-- Richard

Joseph Kesselman · Sep 27, 2007

Richard Tobin said:
Having lots of correct CDATA sections is not a plausible reason for a
parser to fail.

Having taken a quick look, I see no plausible reason for a parser to
fail on that file, period. So we're stuck with looking for non-plausible
reasons... or saying "check your code, check with the parser's authors;
either you did something wrong or they did or both."

Peter Flynn · Sep 27, 2007

Joe said:
Took a quick look. That is one gawdawful long line (about 4MB, .5MB of
which is wasted on the CDATA Section delimiters).

I suspect you've overloaded some buffering limit in your parser.

Test this by turning all space characters into newlines, e.g.

get http://fcshirtshop.com/feeds/feed1058.txt | tr '\040' '\012' | parse

or whatever syntax your system uses.

///Peter
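For anyone without tr at hand, the same experiment can be sketched in PHP. Both function names are hypothetical. As with the shell pipeline, this changes spaces inside text content too, so it is only a test of whether line length matters, not a fix.

```php
<?php
// PHP counterpart of: tr '\040' '\012'
function spacesToNewlines(string $xml): string
{
    return strtr($xml, ' ', "\n");
}

// Counts start tags with the SAX-style xml_* functions.
function countStartTags(string $xml): int
{
    $count = 0;
    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, string $name, array $attrs) use (&$count): void {
            $count++;
        },
        function ($parser, string $name): void {
            // Nothing to do on end tags.
        }
    );
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            (string) xml_error_string(xml_get_error_code($parser))
        );
    }
    return $count;
}
```

Comparing `countStartTags($feed)` with `countStartTags(spacesToNewlines($feed))` on a failing feed should show whether the parser only copes once the long line is broken up.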
