bfrederi
I am using lxml iterparse and running into a very obscure error. When
I run iterparse on a file, it will occasionally return an element that
has element.text == None when the element clearly has text in it.
I copied and pasted the problem XML into a Python string, used StringIO
to create a file-like object out of it, and ran a test using iterparse
with expected output, and it ran perfectly fine. So it only happens
when I try to run iterparse on the actual file.
So then I tried opening the file, reading the data, turning that data
into a file-like object using StringIO, then running iterparse on it,
and the same problem (element.text == None) occurred.
I even tried this:
f = codecs.open(abbyy_filename, 'r', encoding='utf-8')
file_data = f.read()
file_like_object = StringIO.StringIO(file_data)
for event, element in iterparse(file_like_object, events=("start", "end")):
And I got this Traceback:
Traceback (most recent call last):
File "abbyyParser/parseAbbyy.py", line 391, in <module>
extension=options.extension,
File "abbyyParser/parseAbbyy.py", line 103, in __init__
self.generate_output_files()
File "abbyyParser/parseAbbyy.py", line 164, in generate_output_files
AbbyyDocParse(abby_filename, self.extension, self.output_types)
File "abbyyParser/parseAbbyy.py", line 239, in __init__
self.parse_doc(abbyy_filename)
File "abbyyParser/parseAbbyy.py", line 281, in parse_doc
for event, element in iterparse(file_like_object, events=("start",
"end")):
File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__
(src/lxml/lxml.etree.c:86333)
TypeError: reading file objects must return plain strings
If I do this:
file_data = f.read().encode("utf-8")
iterparse will run on it, but I still get element.text with a value
of None when I should not.
My XML file does have diacritics in it, but I've put the proper
encoding at the head of the XML file (<?xml version="1.0"
encoding="UTF-8"?>). I've also tried using ElementTree's iterparse, and
I get the same problem, even more often, with the same files. Any idea what
the problem might be?
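
For anyone trying to reproduce this, here is a minimal sketch of the pattern described above. It assumes Python 3 and uses the standard library's xml.etree.ElementTree.iterparse rather than lxml. The sample document and the collect_text helper are hypothetical, not taken from the real parser. It illustrates one possible cause, not a confirmed diagnosis: iterparse only guarantees that element.text is populated on the "end" event, so reading it on "start" can give None on a large file even though the same XML parses fine from a short string. It also feeds the parser bytes (io.BytesIO) instead of decoded text, which sidesteps the "must return plain strings" TypeError.

```python
import io
import xml.etree.ElementTree as ET


def collect_text(source):
    """Map each tag to its text, reading text only on "end" events.

    On a "start" event the parser has only seen the opening tag, so
    element.text is not guaranteed to be filled in yet.
    """
    texts = {}
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            # The whole element has been parsed, so .text is complete here.
            texts[element.tag] = element.text
    return texts


# Hypothetical sample with a diacritic, encoded to bytes so that the parser
# applies the encoding named in the XML declaration itself.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<page><line>caf\u00e9</line></page>"
).encode("utf-8")

result = collect_text(io.BytesIO(sample))
```

With a real file, opening it in binary mode (open(abbyy_filename, "rb")) and passing that object to iterparse should play the same role as the BytesIO wrapper here.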