stripping fields from xml file into a csv

H

Hal Styli

Hello,

Can someone please help.
I have a sed solution to the problems below but would like to rewrite
in python...

I need to strip out some data from a quirky xml file into a csv:

from something like this

< ..... cust="dick" .... product="eggs" ... quantity="12" .... >
< .... cust="tom" .... product="milk" ... quantity="2" ...>
< .... cust="harry" .... product="bread" ... quantity="1" ...>
< .... cust="tom" .... product="eggs" ... quantity="6" ...>
< ..... cust="dick" .... product="eggs" ... quantity="6" .... >

to this

dick,eggs,12
tom,milk,2
harry,bread,1
tom,eggs,6
dick,eggs,6

I am new to python and xml and it would be great to see some slick
ways of achieving the above by using python's XML capabilities to
parse the original file or python's regex to achive what I did using
sed.

Thanks for any constructive help given.

Hal
 
J

John Bokma

Hal Styli said:
Hello,

Can someone please help.
I have a sed solution to the problems below but would like to rewrite
in python...

I need to strip out some data from a quirky xml file into a csv:

from something like this

< ..... cust="dick" .... product="eggs" ... quantity="12" .... >
< .... cust="tom" .... product="milk" ... quantity="2" ...>
< .... cust="harry" .... product="bread" ... quantity="1" ...>
< .... cust="tom" .... product="eggs" ... quantity="6" ...>
< ..... cust="dick" .... product="eggs" ... quantity="6" .... >

to this

dick,eggs,12
tom,milk,2
harry,bread,1
tom,eggs,6
dick,eggs,6

I am new to python and xml and it would be great to see some slick
ways of achieving the above by using python's XML capabilities to
parse the original file or python's regex to achive what I did using
sed.

It's not clear how your XML actually looks, but (especially) if those
are all attributes of one element I probably would just use xml.sax

I strongly suggest to not use regex to parse XML.
 
Z

Zaphod

Hello,

Can someone please help.
I have a sed solution to the problems below but would like to rewrite in
python...

I need to strip out some data from a quirky xml file into a csv:

from something like this

< ..... cust="dick" .... product="eggs" ... quantity="12" .... > < ....
cust="tom" .... product="milk" ... quantity="2" ...> < .... cust="harry"
.... product="bread" ... quantity="1" ...> < .... cust="tom" ....
product="eggs" ... quantity="6" ...> < ..... cust="dick" ....
product="eggs" ... quantity="6" .... >

to this

dick,eggs,12
tom,milk,2
harry,bread,1
tom,eggs,6
dick,eggs,6

I am new to python and xml and it would be great to see some slick ways
of achieving the above by using python's XML capabilities to parse the
original file or python's regex to achive what I did using sed.

Thanks for any constructive help given.

Hal

Start here:

http://pyxml.sourceforge.net/topics/
 
H

Hai Vu

Hello,

Can someone please help.
I have a sed solution to the problems below but would like to rewrite
in python...

I need to strip out some data from a quirky xml file into a csv:

from something like this

< ..... cust="dick" .... product="eggs" ... quantity="12" .... >
< .... cust="tom" .... product="milk" ... quantity="2" ...>
< .... cust="harry" .... product="bread" ... quantity="1" ...>
< .... cust="tom" .... product="eggs" ... quantity="6" ...>
< ..... cust="dick" .... product="eggs" ... quantity="6" .... >

to this

dick,eggs,12
tom,milk,2
harry,bread,1
tom,eggs,6
dick,eggs,6

I am new to python and xml and it would be great to see some slick
ways of achieving the above by using python's XML capabilities to
parse the original file or python's regex to achive what I did using
sed.

Thanks for any constructive help given.

Hal

Here is a sample XML file (I named it data.xml):
--------------------------
<orders>
<order customer="john" product="eggs" quantity="12" />
<order customer="cindy" product="bread" quantity="1" />
<order customer="larry" product="tea bags" quantity="100" />
<order customer="john" product="butter" quantity="1" />
<order product="chicken" quantity="2" customer="derek" />
</orders>
--------------------------

Code:
--------------------------
import csv
import xml.sax

# Handle the XML file with the following structure:
# <orders>
# <order attributes... /> ...
# </orders>
class OrdersHandler(xml.sax.handler.ContentHandler):
def __init__(self, csvfile):
# Open a csv file for output
self.csvWriter = csv.writer(open(csvfile, 'w'))

def startElement(self, name, attributes):
# Only process the <order ... > element
if name == 'order':
# Construct a sorted list of attribute names in order to
# guarantee rows are written in the same order. We assume
# the XML elements contain the same attributes
attributeNames = attributes.getNames()
attributeNames.sort()

# Construct a row and write it to the csv file
row = []
for name in attributeNames:
row.append(attributes.getValue(name))
self.csvWriter.writerow(row)

def endDocument(self):
# Destroy the csv writer object to close the file
self.csvWriter = None

# Main
datafile = 'data.xml'
csvfile = 'data.csv'
ordersHandler = OrdersHandler(csvfile)
xml.sax.parse(datafile, ordersHandler)
--------------------------

To solve your problem, it is easier to use SAX than DOM. Basically,
use SAX to scan the XML file, if you encounter the element you like
(in this case <order ...>) then you process its attributes. In this
case, you sort the attributes, then write to a csv file.

--------------------------

References:

SAX Parser:
http://docs.python.org/library/xml.sax.html

SAX Content Handler:
http://docs.python.org/library/xml.sax.handler.html

Attributes Object:
http://docs.python.org/library/xml.sax.reader.html#attributes-objects
 
S

Stefan Behnel

Hal Styli, 27.02.2010 21:50:
I have a sed solution to the problems below but would like to rewrite
in python...

Note that sed (or any other line based or text based tool) is not a
sensible way to handle XML. If you want to read XML, use an XML parser.
They are designed to do exactly what you want in a standard compliant way,
and they can deal with all sorts of XML formatting and encoding, for example.

I need to strip out some data from a quirky xml file into a csv:

from something like this

< ..... cust="dick" .... product="eggs" ... quantity="12" .... >
< .... cust="tom" .... product="milk" ... quantity="2" ...>
< .... cust="harry" .... product="bread" ... quantity="1" ...>
< .... cust="tom" .... product="eggs" ... quantity="6" ...>
< ..... cust="dick" .... product="eggs" ... quantity="6" .... >

As others have noted, this doesn't tell much about your XML. A more
complete example would be helpful.

to this

dick,eggs,12
tom,milk,2
harry,bread,1
tom,eggs,6
dick,eggs,6

I am new to python and xml and it would be great to see some slick
ways of achieving the above by using python's XML capabilities to
parse the original file or python's regex to achive what I did using
sed.

It's funny how often people still think that SAX is a good way to solve XML
problems. Here's an untested solution that uses xml.etree.ElementTree:

from xml.etree import ElementTree as ET

csv_field_order = ['cust', 'product', 'quantity']

clean_up_used_elements = None
for event, element in ET.iterparse("thefile.xml", events=['start']):
# you may want to select a specific element.tag here

# format and print the CSV line to the standard output
print(','.join(element.attrib.get(title, '')
for title in csv_field_order))

# safe some memory (in case the XML file is very large)
if clean_up_used_elements is None:
# this assigns the clear() method of the root (first) element
clean_up_used_elements = element.clear
clean_up_used_elements()

You can strip everything dealing with 'clean_up_used_elements' (basically
the last section) if your XML file is small enough to fit into memory (a
couple of MB is usually fine).

Stefan
 
S

Stefan Behnel

Roland Mueller, 28.02.2010 13:01:
The stylesheet is test.xsl and the insput data test.xml. The following
Python code the applies the stylesheet on the input data and puts the output
into foo.

Python code:
#!/usr/bin/python
import sys
import libxml2
import libxslt

styledoc = libxml2.parseFile("test.xsl")
style = libxslt.parseStylesheetDoc(styledoc)
doc = libxml2.parseFile("test.xml")
result = style.applyStylesheet(doc, None)
style.saveResultToFilename("foo", result, 0)

BR,
Roland

*Example run in Linux:*
roland@komputer:~/Desktop/XML/XSLT$ ./xslt_test.py

Note that the shorthand for the above is

$ xsltproc test.xsl test.xml > foo

Stefan
 
H

Hai Vu

Hal Styli, 27.02.2010 21:50:
I have a sed solution to the problems below but would like to rewrite
in python...

Note that sed (or any other line based or text based tool) is not a
sensible way to handle XML. If you want to read XML, use an XML parser.
They are designed to do exactly what you want in a standard compliant way,
and they can deal with all sorts of XML formatting and encoding, for example.
I need to strip out some data from a quirky xml file into a csv:
from something like this
< ..... cust="dick" .... product="eggs" ... quantity="12" .... >
< .... cust="tom" .... product="milk" ... quantity="2" ...>
< .... cust="harry" .... product="bread" ... quantity="1" ...>
< .... cust="tom" .... product="eggs" ... quantity="6" ...>
< ..... cust="dick" .... product="eggs" ... quantity="6" .... >

As others have noted, this doesn't tell much about your XML. A more
complete example would be helpful.
to this

I am new to python and xml and it would be great to see some slick
ways of achieving the above by using python's XML capabilities to
parse the original file or python's regex to achive what I did using
sed.

It's funny how often people still think that SAX is a good way to solve XML
problems. Here's an untested solution that uses xml.etree.ElementTree:

    from xml.etree import ElementTree as ET

    csv_field_order = ['cust', 'product', 'quantity']

    clean_up_used_elements = None
    for event, element in ET.iterparse("thefile.xml", events=['start']):
        # you may want to select a specific element.tag here

        # format and print the CSV line to the standard output
        print(','.join(element.attrib.get(title, '')
                       for title in csv_field_order))

        # safe some memory (in case the XML file is very large)
        if clean_up_used_elements is None:
            # this assigns the clear() method of the root (first) element
            clean_up_used_elements = element.clear
        clean_up_used_elements()

You can strip everything dealing with 'clean_up_used_elements' (basically
the last section) if your XML file is small enough to fit into memory (a
couple of MB is usually fine).

Stefan

This solution is so beautiful and elegant. Thank you. Now I am off to
learn ElementTree.

By the way, Stefan, I am using Python 2.6. Do you know the differences
between ElementTree and cElementTree?
 
H

Hal Styli

Hai Vu, 28.02.2010 17:41:


Use cElementTree, it's implemented in C and a lot faster and more memory
friendly.

http://effbot.org/zone/celementtree.htm#benchmarkshttp://codespeak.net/lxml/performance.html

Stefan

Thanks for the responses so far, most enlightening.

Stefan, I was happy to see such concise code.
Your python worked with only very minor modifications.

Hai's test xml data *without* the first and last line is close enough
to the data I am using:

<order customer="john" product="eggs" quantity="12" />
<order customer="cindy" product="bread" quantity="1" />
<order customer="larry" product="tea bags" quantity="100" />
<order customer="john" product="butter" quantity="1" />
<order product="chicken" quantity="2" customer="derek" />

.... quirky.

I get a large file given to me in this format. I believe it is
created by something like:
grep 'customer=' *.xml, where there are a large number of xml files.

I had to edit the data to include the first and last lines, <orders>
and </orders>,
to get the python code to work. It's not an arduous task(!), but can
you recommend a way to get it to work without
manually editing the data?

One other thing, what's the Roland Mueller post above about (I'm
viewing htis in google groups)? What would the test.xsl file look
like?

Thanks again

Hal.
 
S

Stefan Behnel

Hal Styli, 01.03.2010 00:15:
Stefan, I was happy to see such concise code.
Your python worked with only very minor modifications.

Hai's test xml data *without* the first and last line is close enough
to the data I am using:

<order customer="john" product="eggs" quantity="12" />
<order customer="cindy" product="bread" quantity="1" />
<order customer="larry" product="tea bags" quantity="100" />
<order customer="john" product="butter" quantity="1" />
<order product="chicken" quantity="2" customer="derek" />

... quirky.

I get a large file given to me in this format. I believe it is
created by something like:
grep 'customer=' *.xml, where there are a large number of xml files.

Try to get this fixed at the source. Exporting non-XML that looks like XML
is not a good idea in general, and it means that everyone who wants to read
the data has to adapt, instead of fixing the source once and for all.

I had to edit the data to include the first and last lines, <orders>
and </orders>,
to get the python code to work. It's not an arduous task(!), but can
you recommend a way to get it to work without
manually editing the data?

Iff this cannot be fixed at the source, you can write a file-like wrapper
around a file that simply returns the boundary tags before and after
reading from the file itself. All you need is a .read(n) method, see the
documentation of the file type.

One other thing, what's the Roland Mueller post above about (I'm
viewing htis in google groups)? What would the test.xsl file look
like?

This is the XSLT script he posted:

============================
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
version="1.0">

<!-- text output because we want to have an CSV file -->
<xsl:eek:utput method="text"/>

<!-- remove all whitespace coming with input XML -->
<xsl:strip-space elements="*"/>

<!-- matches any <order> element and extracts the customer,product&quantity
attributes -->
<xsl:template match="order">
<xsl:value-of select="@customer"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="@product"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="@quantity"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>
============================

Stefan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,176
Messages
2,570,950
Members
47,500
Latest member
ArianneJsb

Latest Threads

Top