Newbie Text Processing Question

gshepherd281

Hi,

I'm a total newbie to Python so any and all advice is greatly
appreciated.

I'm trying to use regular expressions to process text in an SGML file
but only in one section.

So the input would look like this:

<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content


<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content


and the output like this:

<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<biblio>
<para>content
</biblio>

<sec-main no="2.01"><title>content
<biblio>
<para>content
</biblio>

<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content


But no matter what I try I end up changing the entire file rather than
just one part.

Here's what I've come up with so far but I can't think of anything
else.

***

import os, re
setpath = raw_input("Enter the path where the program should run: ")
print

for root, dirs, files in os.walk(setpath):
    fname = files
    for fname in files:
        inputFile = file(os.path.join(root,fname), 'r')
        line = inputFile.read()
        inputFile.close()


        chpart_pattern = re.compile(r'<ch-part no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)

        while 1:
            if chpart_pattern.search(line):
                line = re.sub(r"<sec-main no=(\"[0-9]*.[0-9]*\")><title>(.*)", r"<sec-main no=\1><title>\2\n<biblio>", line)
                outputFile = file(os.path.join(root,fname), 'w')
                outputFile.write(line)
                outputFile.close()
                break

            if chpart_pattern.search(line) is None:
                print 'none'
                break

Thanks,

Greg
 
James Stroud

You can edit a file in place, but that does not apply to what you are doing.
As soon as you insert the first "<biblio>", you've shifted everything
downstream by those 8 bytes. Since the bytes map to physically located blocks on
a physical drive, you will have to rewrite those blocks. If it is a big file,
you can do something conceptually similar to piping, where the original file
is read in line by line and a new file is created:

afile = open("somefile.xml")
newfile = open("somenewfile.xml", "w")
for aline in afile:
    if tests_positive(aline):
        newfile.write(make_the_prelude(aline))
        newfile.write(aline)
        newfile.write(make_the_afterlude(aline))
    else:
        newfile.write(aline)
afile.close()
newfile.close()
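
As a rough illustration of that line-by-line approach applied to the sample in this thread, here is a minimal sketch in Python 3 (the thread's own code is Python 2). The names `add_biblio` and `CH_PART` are hypothetical, and the sketch assumes each `<para>` fits on a single line, as in the sample. It remembers whether the most recent `<ch-part>` title starts with RESEARCH and wraps paragraphs only while that is true.

```python
import re

# Assumption: a <ch-part> tag always starts its own line, as in the sample.
CH_PART = re.compile(r'<ch-part\s+no="[A-Z]{1,4}"><title>(.*)', re.IGNORECASE)


def add_biblio(lines):
    """Yield lines, wrapping <para> lines in <biblio> only inside the RESEARCH part."""
    in_research = False
    for line in lines:
        match = CH_PART.match(line)
        if match:
            # Entering a new part: decide whether it is the one to modify.
            in_research = match.group(1).upper().startswith("RESEARCH")
        if in_research and line.startswith("<para>"):
            yield "<biblio>\n"
            yield line if line.endswith("\n") else line + "\n"
            yield "</biblio>\n"
        else:
            yield line


sample = """<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content

<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content
"""

result = "".join(add_biblio(sample.splitlines(keepends=True)))
print(result)
```

With real files, something like `newfile.writelines(add_biblio(afile))` could stand in for the loop above, since a file object yields its lines one at a time.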

James

That's how Python works. You read in the whole file, edit it, and write it
back out. As far as I know there's no way to edit a file "in place" which
I'm assuming is what you're asking?

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
Mike Meyer

I'm a total newbie to Python so any and all advice is greatly
appreciated.

Well, I've got some for you.

I'm trying to use regular expressions to process text in an SGML file
but only in one section.

This is generally a bad idea. SGML family languages aren't easy to
parse - even the ones that were designed to be easy to parse - and
generally require very complex regular expressions to get right. It may
be that your SGML data can be parsed by the re you use, but there
are almost certainly valid SGML documents that your parser will not
properly parse.

In general, it's better to use a parser for the language in question.

So the input would look like this:

<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content


<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content


This is funny-looking SGML. Are the end tags really optional for
all the tag types?

But no matter what I try I end up changing the entire file rather than
just one part.

Others have explained why you can't do that, so I'll skip it.

Here's what I've come up with so far but I can't think of anything
else.

***

import os, re
setpath = raw_input("Enter the path where the program should run: ")
print

for root, dirs, files in os.walk(setpath):
    fname = files
    for fname in files:
        inputFile = file(os.path.join(root,fname), 'r')
        line = inputFile.read()
        inputFile.close()


        chpart_pattern = re.compile(r'<ch-part no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)

This makes a number of assumptions that are invalid for SGML in
general, but may be valid for your sample text: how attributes are
quoted, the lack of line breaks (which can be added without changing
the content), and the format of the "no" attribute.

        while 1:
            if chpart_pattern.search(line):
                line = re.sub(r"<sec-main no=(\"[0-9]*.[0-9]*\")><title>(.*)", r"<sec-main no=\1><title>\2\n<biblio>", line)

Ditto.
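
To make those assumptions concrete, here is a small sketch (Python 3, whereas the thread's code is Python 2) that runs the two patterns from the original post against a few variations of the same tag. The variant strings and the names `variants`, `matches` and `loose_match` are made up for illustration. The last check shows a separate issue: the unescaped "." in the sec-main pattern accepts any character, not just a decimal point.

```python
import re

# The two patterns from the original post, with the wrapped lines joined.
chpart_pattern = re.compile(
    r'<ch-part no="[A-Z]{1,4}"><title>(RESEARCH)', re.IGNORECASE)
sec_pattern = re.compile(r'<sec-main no=("[0-9]*.[0-9]*")><title>(.*)')

# Hypothetical variations on the same start tag.
variants = {
    "as posted": '<ch-part no="I"><title>RESEARCH GUIDE',
    "single quotes": "<ch-part no='I'><title>RESEARCH GUIDE",
    "line break in tag": '<ch-part\nno="I"><title>RESEARCH GUIDE',
    "extra whitespace": '<ch-part  no="I" ><title>RESEARCH GUIDE',
    "arabic number": '<ch-part no="1"><title>RESEARCH GUIDE',
}

matches = {name: bool(chpart_pattern.search(text))
           for name, text in variants.items()}
for name, matched in matches.items():
    print("%-18s %s" % (name, "matches" if matched else "missed"))

# The "." is not escaped, so a malformed number slips through as well.
loose_match = bool(sec_pattern.search('<sec-main no="1x01"><title>content'))
print("1x01 accepted:", loose_match)
```

Only the "as posted" form matches the chapter pattern; every other variation is silently skipped, which is the failure mode to watch for on real data.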

Here's an sgmllib solution that does what you do above, except
that it writes to standard out:

#!/usr/bin/env python

import sys
from sgmllib import SGMLParser

datain = """
<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content


<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content
"""

class Parser(SGMLParser):

    def __init__(self):
        # install the handlers with funny names
        setattr(self, "start_ch-part", self.handle_ch_part)

        # And start with chapter 0
        self.ch_num = 0

        SGMLParser.__init__(self)

    def format_attributes(self, attributes):
        return ['%s="%s"' % pair for pair in attributes]

    def unknown_starttag(self, tag, attributes):
        taglist = self.format_attributes(attributes)
        taglist.insert(0, tag)
        sys.stdout.write('<%s>' % ' '.join(taglist))

    def handle_data(self, data):
        sys.stdout.write(data)

    def handle_ch_part(self, attributes):
        """This should be called start_ch-part, but, well, you know."""

        self.unknown_starttag('ch-part', attributes)
        for name, value in attributes:
            if name == 'no':
                self.ch_num = value

    def start_para(self, attributes):
        if self.ch_num == 'I':
            sys.stdout.write('<biblio>\n')
        self.unknown_starttag('para', attributes)


parser = Parser()
parser.feed(datain)
parser.close()


sgmllib isn't a very good SGML parser - it was written to support
htmllib, and really only handles that subset of SGML well. In
particular, it doesn't really understand DTDs, so it can't handle the
missing end tags in your example. You may be able to work around that.

If you can coerce this to XML, then the xml tools in the standard
library will work well. For HTML, I like BeautifulSoup, but that's
mostly because it deals with all the crud on the net that is passed
off as HTML. For SGML - well, I don't have a good answer. Last time I
had to deal with real SGML, I used a C parser that spat out a parse
tree that could be parsed properly.
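
For the XML route, a minimal sketch with `xml.etree.ElementTree` (Python 3) might look like the following. The input is a hypothetical, already-coerced version of the sample: the end tags and the `<doc>` wrapper are assumptions, since the original data has neither. The loop wraps each `<para>` in a `<biblio>` element, but only inside the part whose title starts with RESEARCH.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML form of the sample; end tags and <doc> root are assumed.
xml_in = """<doc>
<ch-part no="I"><title>RESEARCH GUIDE</title>
<sec-main no="1.01"><title>content</title>
<para>content</para>
</sec-main>
<sec-main no="2.01"><title>content</title>
<para>content</para>
</sec-main>
</ch-part>
<ch-part no="II"><title>FORMS</title>
<sec-main no="3.01"><title>content</title>
<para>content</para>
</sec-main>
</ch-part>
</doc>"""

root = ET.fromstring(xml_in)
for part in root.findall("ch-part"):
    title = part.findtext("title", default="")
    if not title.upper().startswith("RESEARCH"):
        continue  # leave every other part untouched
    for sec in part.findall("sec-main"):
        # Iterate over a copy, because the children are modified in the loop.
        for index, child in enumerate(list(sec)):
            if child.tag == "para":
                biblio = ET.Element("biblio")
                sec.remove(child)
                biblio.append(child)
                sec.insert(index, biblio)

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Because the tree knows where each part begins and ends, no pattern has to guess at the section boundaries.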

<mike
 
Fredrik Lundh

Gregory said:
That's how Python works. You read in the whole file, edit it, and write it
back out.

that's how file systems work. if file systems generally supported insert
operations, Python would of course support that feature.
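
A small sketch of what that means in practice, in Python 3 with a throwaway temporary file (the file contents are made up for illustration): seeking into the middle of a file and writing replaces the bytes that are already there rather than pushing them along.

```python
import os
import tempfile

# Create a scratch file so the example does not touch real data.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(b"<title>content\n<para>content\n")

    # Open for update and write at the start of the second line.
    with open(path, "r+b") as f:
        f.seek(len(b"<title>content\n"))
        f.write(b"<biblio>")

    with open(path, "rb") as f:
        after = f.read()
finally:
    os.remove(path)

# The 8 bytes of "<biblio>" replaced "<para>co"; nothing was shifted.
print(after)
```

The file is the same length as before and the start of the `<para>` line is gone, which is why an insertion means rewriting everything after the insertion point.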

</F>
 
Dennis Lee Bieber

that's how file systems work. if file systems generally supported insert
operations, Python would of course support that feature.

My college system's default for editor files was "keyed"... Each
line was independent, and the key was the line number (including a
decimal part for inserted lines).

1.000 first line
1.500 inserted line
2.000 last line

The machine had three "native" file formats... consecutive (what most
would consider a regular binary/stream [written from start to end]
file), keyed (ISAM type -- also used by the FORTRAN runtime for "random"
access by record number), and random (fixed size contiguous disk
allocation, with NO structure assumed -- all access was by offset from
start of file allocation).

Of course, that strange system also maintained separate read/write
pointers on files, so one could open "update" mode -- where one had to
read a record before writing (over) the record. No seeks needed.
"Scratch" required write before read. But the I/O did not have to be in
lockstep, you could read three records, write one, then read the fourth,
write the second...