Extracting data from an XML file

Mag Gam

Hi All,
I am new to XML, and trying to extract some data from a file.

The file looks like this:
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
<TAPE>
<CATALOG>

I am trying to get
Artist: Bob Dylan
Company: Columbia
CD Price: 10.90
Tape Price: 6.99


What is the best method to do this? Is there a tool or utility you can
recommend for Windows?
 
Joe Kesselman

What is the best method to do this?

Lots of tutorials exist on the web. My standard recommended starting
point: http://www.ibm.com/xml

(I'd probably hardcode it using DOM or SAX. But it might be easier for a
novice to write an XSLT stylesheet. There are other tools which might be
easier again, but they're less well standardized and I hesitate to
recommend that a novice invest in learning them.)
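
For illustration, here is a minimal sketch of the SAX route. Python is my own choice here (the thread does not name a language), and the sketch uses only the standard-library xml.sax module. The sample is a corrected copy of the posted data with the closing tags fixed, and the names CatalogHandler and extract are made up for this example.

```python
import xml.sax

# Corrected copy of the posted sample (closing tags fixed).
SAMPLE = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
</TAPE>
</CATALOG>"""


class CatalogHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the four requested lines."""

    def __init__(self):
        super().__init__()
        self.path = []   # stack of currently open element names
        self.text = []   # character data seen since the last start tag
        self.lines = []  # collected output lines

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        parent = self.path[-2] if len(self.path) > 1 else None
        if parent == "CD" and name == "ARTIST":
            self.lines.append("Artist: " + value)
        elif parent == "CD" and name == "COMPANY":
            self.lines.append("Company: " + value)
        elif parent == "CD" and name == "PRICE":
            self.lines.append("CD Price: " + value)
        elif parent == "TAPE" and name == "PRICE":
            self.lines.append("Tape Price: " + value)
        self.path.pop()
        self.text = []


def extract(xml_text):
    """Parse xml_text with SAX and return the report lines."""
    handler = CatalogHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


if __name__ == "__main__":
    print("\n".join(extract(SAMPLE)))
```

The handler keeps a stack of open elements so that a PRICE inside CD can be told apart from a PRICE inside TAPE; artist and company are taken from the CD entry only, which assumes both entries describe the same recording.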
 
roy axenov

<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
<TAPE>
<CATALOG>

This is not well-formed and therefore not XML. If that's
your real data, XML tools are quite unlikely to help you.
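
A quick way to see the problem, sketched in Python (my choice of language, not something the original poster mentioned): a conforming parser refuses the data as posted, and accepts it once the last two tags are written as closing tags. The strings below are shortened stand-ins for the posted sample, and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for the posted data: BROKEN repeats the opening
# tags at the end, FIXED uses proper closing tags.
BROKEN = "<CATALOG>\n<TAPE>\n<PRICE>6.99</PRICE>\n<TAPE>\n<CATALOG>"
FIXED = "<CATALOG>\n<TAPE>\n<PRICE>6.99</PRICE>\n</TAPE>\n</CATALOG>"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("as posted:", is_well_formed(BROKEN))
    print("after fix:", is_well_formed(FIXED))
```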

Assuming it's just another case of 'oh, for some reason I
just typed that in instead of using copy-paste'...
I am trying to get
Artist: Bob Dylan
Company: Columbia
CD Price: 10.90
Tape Price: 6.99

Another day, another grouping problem...
What is the best method to do this? Is there a tool or
utility you can recommend for Windows?

Define 'best'. Define 'utility'. I don't believe there's a
DWIM-type tool that would automagically, well, do what you
mean at a click of a button. Therefore, it's a programming
problem. You could use a DOM or SAX parser in your language
of choice, as Joseph proposed. Or you could use XSLT. Or
maybe XQuery or xmlgawk. In case it's XSLT/XQuery, I
believe there are many GUI tools that might make working
with the code easier for you; I'm not sure if there are any
good open source ones, though. If you'd be happy with
Unix-style small tools, there are a number of open source
XSLT processors, including Saxon (it's written in Java, so
it shouldn't be a problem running it on a Windows box),
xsltproc and Xalan (if there are no native ports, Cygwin or
MinGW will probably save the day). In short, you should
determine what you want, then google for it. Come back with
specific questions.

Here's a transformation that does more or less what you
want with your sample data (after it's been fixed, of
course):

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:key name="id" match="CD|TAPE"
use="concat(TITLE,ARTIST,COMPANY)"/>
<xsl:key name="first" match="CD|TAPE"
use=
"
generate-id()=
generate-id
(
key('id',concat(TITLE,ARTIST,COMPANY))[1]
)
"/>
<xsl:output method="text"/>
<xsl:template match="@*|node()"/>
<xsl:template match="/">
<xsl:apply-templates select="key('first',true())"/>
</xsl:template>
<xsl:template match="CD|TAPE">
<xsl:text>
</xsl:text>
<xsl:apply-templates/>
<xsl:apply-templates
select="key('id',concat(TITLE,ARTIST,COMPANY))"
mode="prices"/>
</xsl:template>
<xsl:template match="TITLE">
<xsl:text>Title: </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="ARTIST">
<xsl:text>Artist: </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="COMPANY">
<xsl:text>Company: </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="@*|node()" mode="prices"/>
<xsl:template match="CD|TAPE" mode="prices">
<xsl:apply-templates mode="prices"/>
</xsl:template>
<xsl:template match="CD/PRICE" mode="prices">
<xsl:text>CD Price: </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="TAPE/PRICE" mode="prices">
<xsl:text>Tape Price: </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
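
For comparison, the same grouping idea can be sketched outside XSLT. The following is a rough Python equivalent, not a translation of the stylesheet: it groups CD and TAPE entries on the same (TITLE, ARTIST, COMPANY) key and lists each price under its group. Python and the function names group_catalog and report are assumptions made for this example; the sample is the posted data with the closing tags fixed.

```python
import xml.etree.ElementTree as ET

# Corrected copy of the posted sample (closing tags fixed).
SAMPLE = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
</TAPE>
</CATALOG>"""


def group_catalog(xml_text):
    """Map (title, artist, company) to a list of (medium, price) pairs."""
    root = ET.fromstring(xml_text)
    groups = {}  # dicts keep insertion order, so groups stay in document order
    for item in root:
        if item.tag not in ("CD", "TAPE"):
            continue
        key = (
            item.findtext("TITLE"),
            item.findtext("ARTIST"),
            item.findtext("COMPANY"),
        )
        groups.setdefault(key, []).append((item.tag, item.findtext("PRICE")))
    return groups


def report(xml_text):
    """Build a text report with one block per group."""
    lines = []
    for (title, artist, company), prices in group_catalog(xml_text).items():
        lines.append(f"Title: {title}")
        lines.append(f"Artist: {artist}")
        lines.append(f"Company: {company}")
        for medium, price in prices:
            label = "CD" if medium == "CD" else "Tape"
            lines.append(f"{label} Price: {price}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(SAMPLE))
```

The dictionary keyed on the tuple plays the role of the 'id' key in the stylesheet; the first entry to create a group corresponds to what the 'first' key selects.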
 
Jürgen Kahrs

Mag said:
Hi All,
I am new to XML, and trying to extract some data from a file.

The file looks like this:
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
<TAPE>
<CATALOG>

The last two lines are not correct (closing tags should begin with /).
I am trying to get
Artist: Bob Dylan
Company: Columbia
CD Price: 10.90
Tape Price: 6.99


What is the best method to do this? Is there a tool or utility you can
recommend for Windows?

One of the many tools that can solve the problem is XMLgawk:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/


The following script solves your problem.

@load xml
XMLCHARDATA { data = $0 }
XMLENDELEM == "ARTIST" && index(XMLPATH, "CD") { print "Artist:", data}
XMLENDELEM == "COMPANY" && index(XMLPATH, "CD") { print "Company:", data}
XMLENDELEM == "PRICE" && index(XMLPATH, "CD") { print "CD Price:", data}
XMLENDELEM == "PRICE" && index(XMLPATH, "TAPE") { print "Tape Price:", data}

Invoke the script like this and it will produce the
following output:

xgawk -f catalog.awk catalog.xml
Artist: Bob Dylan
Company: Columbia
CD Price: 10.90
Tape Price: 6.99
 
Mag Gam

Thanks everyone!
I am very new to XML and still learning the ropes.

Roy:
I have yet to try your XSL solution, but I will. I know the XML I posted
was not well-formed; I only typed it in as an example.
Let's assume this is my new .xml file: http://msdn2.microsoft.com/en-us/library/ms762271.aspx
(with some slight modifications, such as giving the first book 2 authors)

<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<author>II Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>

How would I get 'Book Title' and 'Book Author'?

TIA
 
git

Hi All,
I am new to XML, and trying to extract some data from a file.

The file looks like this:
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
<TAPE>
<CATALOG>

I am trying to get
Artist: Bob Dylan
Company: Columbia
CD Price: 10.90
Tape Price: 6.99


What is the best method to do this? Is there a tool or utility you can
recommend for Windows?

On Windows, for someone who just wants to get on with the job rather than
learn XSLT or XPath, I would recommend coding it all in JScript (or
VBScript). Use the MSXML parser that comes with Windows and walk over
the DOM to find the data you want.
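
To give an idea of what walking the DOM looks like, here is a small sketch. It is written in Python with the standard-library xml.dom.minidom module rather than JScript with MSXML, which is an assumption made so the example runs anywhere; the childNodes and nodeType members it relies on are part of the W3C DOM interface that both implement. The sample is a trimmed, corrected copy of the posted data, and child_text and walk are hypothetical helper names.

```python
from xml.dom import minidom

# Trimmed, corrected copy of the posted sample.
SAMPLE = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
</TAPE>
</CATALOG>"""


def child_text(parent, tag):
    """Return the text of the first direct child element named tag, or ''."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag:
            return "".join(
                c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE
            ).strip()
    return ""


def walk(xml_text):
    """Walk the children of the root element and collect the report lines."""
    doc = minidom.parseString(xml_text)
    lines = []
    for node in doc.documentElement.childNodes:
        if node.nodeType != node.ELEMENT_NODE:
            continue  # skip the whitespace text nodes between elements
        if node.tagName == "CD":
            lines.append("Artist: " + child_text(node, "ARTIST"))
            lines.append("Company: " + child_text(node, "COMPANY"))
            lines.append("CD Price: " + child_text(node, "PRICE"))
        elif node.tagName == "TAPE":
            lines.append("Tape Price: " + child_text(node, "PRICE"))
    return lines


if __name__ == "__main__":
    print("\n".join(walk(SAMPLE)))
```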

I am working on examples of this technique on my blog/site:

http://nerds-central.blogspot.com/2007/01/creating-xml-viewer-with-jscript-exsead.html

http://nerds-central.blogspot.com/2007/01/nerds-central-gets-ajax-atom-feed.html
(I promise that I will write the follow-up to that second article real
soon! And I am working on VBScript examples as well.)

Feel free to join the Nerds-Central email group to ask more questions if
you like the method:
http://tech.groups.yahoo.com/group/nerds-central/

Cheers

AJ
 
Jürgen Kahrs

Mag said:
How would I get 'Book Title' and 'Book Author'?

Use this XMLgawk script:

@load xml
XMLCHARDATA { data = $0 }
XMLENDELEM == "author" { author = data }
XMLENDELEM == "title" { title = data }
XMLENDELEM == "book" { print author, title}


And you will get the following output from the XML
data that you posted:

xgawk -f catalog2.awk catalog2.xml

II Gambardella, Matthew XML Developer's Guide
Ralls, Kim Midnight Rain
Corets, Eva Maeve Ascendant
Corets, Eva Oberon's Legacy
Corets, Eva The Sundered Grail
Randall, Cynthia Lover Birds
Thurman, Paula Splish Splash
Knorr, Stefan Creepy Crawlies
Kress, Peter Paradox Lost
O'Brien, Tim Microsoft .NET: The Programming Bible
O'Brien, Tim MSXML3: A Comprehensive Guide
Galos, Mike Visual Studio 7: A Comprehensive Guide
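
Note that in the output above the first book shows only its second author, because each author element overwrites the previous one. If every author is wanted, one option is to collect them per book. A minimal sketch in Python using the standard-library xml.etree.ElementTree module (the language and the name titles_and_authors are assumptions for this example; the sample is cut down to the first two books of the posted file):

```python
import xml.etree.ElementTree as ET

# First two books of the posted file, trimmed to the relevant elements.
SAMPLE = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<author>II Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
</book>
</catalog>"""


def titles_and_authors(xml_text):
    """Return a list of (title, [authors]) pairs, one per book element."""
    root = ET.fromstring(xml_text)
    result = []
    for book in root.iter("book"):
        authors = [a.text for a in book.findall("author")]
        result.append((book.findtext("title"), authors))
    return result


if __name__ == "__main__":
    for title, authors in titles_and_authors(SAMPLE):
        print("Book Title:", title)
        for author in authors:
            print("Book Author:", author)
```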
 
