Newbie ? -- SGML metadata extraction

ProvoWallis

Hi,

I'm trying to write a script that will extract the value of an
attribute from an element using the attribute value of another element
as the basis for extraction.

For example, I have a pre-defined list of main sections. I want to
extract the id attribute of each form element and build a dictionary of
graphic ID and section number pairs, but only for the sections in my
pre-defined list; id values from any section not on the list should be
excluded. I.e., I want to know the id values for the forms that appear
in sections 1 and 3 but not in 2.

Boiled down my SGML looks something like this:

<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">

This is what I have come up with on my own so far. My problem is that I
can't seem to pick up the value of the id attribute.

Any advice appreciated.

Greg

###

import os, re, csv

root = raw_input("Enter the path where the program should run: ")
fname = raw_input("Enter name of the CSV file containing the section numbers: ")
sgmlname = raw_input("Enter name of the SGML file to search: ")
print

given, ext = os.path.splitext(fname)
root_name = os.path.join(root, fname)
n = given + '.new'
outputName = os.path.join(root, n)

reader = csv.reader(open(root_name, 'r'), delimiter=',')

sections = []

for row in reader:
    sections.append(row[0])


inputFile = open(os.path.join(root, sgmlname), 'r')

illoList = {}

while 1:
    lines = inputFile.readlines()
    if not lines:
        break
    for line in lines:

        main = re.search(r'(?i)(?m)(?s)<main-section no=\"(\w+)\"', line)
        id = re.search(r'(?i)id=\"(.*?tif)\"', line)
        if main is not None and main.group(1) in sections:

            if id is not None:

                illoList[id.group(1)] = main.group(1)
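
One likely reason the loop above never records an id: both patterns are tested against the same line, but in the sample data a main-section tag and a form tag never share a line, so the two matches are never both present. A minimal sketch of one way around that, written for Python 3 and assuming one tag per line as in the sample, is to remember the most recent section number while scanning. The names collect_ids, SECTION_RE and FORM_RE are made up for illustration.

```python
import re

# Patterns assume the attribute layout shown in the sample data.
SECTION_RE = re.compile(r'<main-section\s+no="(\w+)"', re.IGNORECASE)
FORM_RE = re.compile(r'<form\s+id="([^"]*\.tif)"', re.IGNORECASE)


def collect_ids(lines, wanted):
    """Map each form id to its section number, for wanted sections only."""
    current = None  # section number of the most recent <main-section>
    result = {}
    for line in lines:
        section = SECTION_RE.search(line)
        if section:
            current = section.group(1)
        form = FORM_RE.search(line)
        if form and current in wanted:
            result[form.group(1)] = current
    return result


sample = """<main-section no="1">
<form id="graphic_1.tif">
<form id="graphic_2.tif">
<main-section no="2">
<form id="graphic_3.tif">
<main-section no="3">
<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
""".splitlines()

# Section numbers are compared as strings, as they would be when read from CSV.
print(collect_ids(sample, {"1", "3"}))
```

With real input, the lines argument could be the open SGML file and wanted the list read from the CSV file.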
 
Adonis

ProvoWallis wrote:

<snip>

From what I gather, here is a quick solution. There are probably better
ones on the way, but I think this accomplishes the idea.

Some helpful links:
http://docs.python.org/lib/module-sgmllib.html
http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/module-htmllib.html

---

from HTMLParser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

class ParseForms(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # attrs argument is a list of tuples [(attribute, value)]
            # converted it to a dictionary to access attribute easier
            print "form id: %s" % dict(attrs).get('id')

if __name__ == "__main__":
    parser = ParseForms()
    parser.feed(data)
 
ProvoWallis

Thanks. One more question, though.

I'm not sure how to limit the scope of my search so that I'm just
extracting the id attribute from the sections that I want. I.e., I want
the id attributes from the forms in sections 1 and 3 but not from 2.

Maybe I'm missing something.
 
Adonis

ProvoWallis wrote:

<snip>

If the data has closing tags, this is easily achieved using a DOM or SAX
parser, but here is a slightly modified version of my earlier example.
It is very ugly but simple.

Hope this helps.

Adonis

---

from HTMLParser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

class ParseForms(HTMLParser):

    _section = None
    _secDict = dict()

    def getSection(self, key):
        return self._secDict.get(str(key))

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            if not self._secDict.has_key(self._section):
                self._secDict[self._section] = [dict(attrs).get('id')]
            else:
                self._secDict[self._section].append(dict(attrs).get('id'))

        if tag == "main-section":
            self._section = dict(attrs).get('no')

if __name__ == "__main__":
    parser = ParseForms()
    parser.feed(data)
    print parser.getSection(1)
    print parser.getSection(3)
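
For anyone on Python 3, where HTMLParser lives in the html.parser module, the same idea can be sketched roughly as below. This is only an illustration and not the code posted above: the class name SectionForms and its wanted, current and ids attributes are invented for the example. It keeps its state on the instance, because a class-level dictionary such as _secDict is shared by every parser object, and it skips forms outside the wanted sections while parsing.

```python
from html.parser import HTMLParser

data = """<main-section no="1">
<form id="graphic_1.tif">
<form id="graphic_2.tif">
<main-section no="2">
<form id="graphic_3.tif">
<main-section no="3">
<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""


class SectionForms(HTMLParser):
    """Collect form ids, keyed to the section they appear in."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # section numbers to keep, as strings
        self.current = None        # most recent main-section number
        self.ids = {}              # form id -> section number

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) tuples
        if tag == "main-section":
            self.current = attrs.get("no")
        elif tag == "form" and self.current in self.wanted:
            form_id = attrs.get("id")
            if form_id is not None:
                self.ids[form_id] = self.current


if __name__ == "__main__":
    parser = SectionForms(["1", "3"])
    parser.feed(data)
    for form_id, section in parser.ids.items():
        print(section, form_id)
```

The result is the dictionary of graphic ID and section number pairs asked for in the original post, so it could be written straight out to a file.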
 
ProvoWallis

Thanks very much for your help. It's greatly appreciated.

It took a couple of tries to see what was happening but I've figured
it out.

Greg
 
