Adam Funk
I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules. I get the input this way:
input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
big_json = json.load(input_source)
input_source.close()
Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary), and I save the
document:
xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()
I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:
PCDATA invalid Char value 7
PCDATA invalid Char value 31
I guess I need to process each piece of PCDATA to clean out the
control characters before creating the text node:
text = doc.createTextNode(j)
root.appendChild(text)
What's the best way to do that, bearing in mind that there can be
multibyte characters in the strings? I found some suggestions on the
WWW involving filter with string.printable, which AFAICT isn't
unicode-friendly --- is there a unicode.printable or something like
that?
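One possible approach, sketched here rather than taken from the post: the errors come from characters that fall outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF), so a regular expression built from those ranges can remove them before createTextNode is called. The sketch assumes Python 3, where str is already Unicode, so non-ASCII text is handled per code point and passes through untouched. The names strip_invalid_xml_chars and _INVALID_XML_CHARS are hypothetical, made up for illustration.

```python
import re
from xml.dom.minidom import Document

# Matches any code point NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    '[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml_chars(text, replacement=''):
    """Return text with characters that XML 1.0 forbids replaced.

    Hypothetical helper: by default the offending characters are dropped;
    pass e.g. replacement='\ufffd' to mark where they were instead.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


# Usage, mirroring the text-node step from the post.
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)

j = 'bell\x07 and unit separator\x1f, plus caf\u00e9'
text = doc.createTextNode(strip_invalid_xml_chars(j))
root.appendChild(text)
```

Filtering against the XML ranges directly, instead of against a notion of "printable", keeps tabs and newlines (which XML allows) and drops only what the parser would reject.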