Parsing MS Word Document?

MrBill

I would like to be able to open, read, and extract data from a report that
is produced in MS Word. The doc seems to contain embedded spreadsheets. I
would like to extract some of the data from the spreadsheets and feed it
into another application. I've been reading a little bit about OLE and MS
Word and sure would like to find a module that hides some of this so-called
innovation from me.

Thanks,
Bill
 
John J. Lee

:) Yeah, isn't all that baroque complexity wonderful?

1. Alex Martelli's suggestion on this list: use RTF. Word can import
and export to it. You can automate that from VB or Python in the
usual COM ways (see 3.). I don't know whether you'll get useful
RTF out of embedded Excel sheets, though.

2. Use OpenOffice via PyUNO.

3. As you already know, use the MS Office object models, with Python
for Windows extensions (or ctypes, if you're brave). Perhaps ADO
is what you're looking for? IIRC, ADO isn't too complicated and
can treat Excel sheets as data sources just as it does for
relational databases.
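
A rough sketch of the ADO idea in option 3, in case it helps. It assumes the embedded sheet has first been saved out as a standalone .xls workbook (ADO will not reach inside the Word file itself), that the Jet OLE DB provider is installed, and that the Python for Windows extensions are available. The helper names and the default sheet name "Sheet1" are made up for illustration.

```python
def excel_connection_string(xls_path, header=True):
    """Build an ADO connection string for an Excel workbook.

    Assumes the Jet 4.0 provider; HDR controls whether the first row
    is treated as column names.
    """
    hdr = "Yes" if header else "No"
    return (
        "Provider=Microsoft.Jet.OLEDB.4.0;"
        f"Data Source={xls_path};"
        f'Extended Properties="Excel 8.0;HDR={hdr}"'
    )


def read_sheet(xls_path, sheet="Sheet1"):
    """Return every row of one worksheet as a list of tuples (Windows only)."""
    import win32com.client  # Python for Windows extensions (pywin32)

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open(excel_connection_string(xls_path))
    try:
        rs = win32com.client.Dispatch("ADODB.Recordset")
        # A worksheet is addressed like a table, with a trailing "$".
        rs.Open(f"SELECT * FROM [{sheet}$]", conn)
        rows = []
        while not rs.EOF:
            count = rs.Fields.Count
            rows.append(tuple(rs.Fields.Item(i).Value for i in range(count)))
            rs.MoveNext()
        rs.Close()
        return rows
    finally:
        conn.Close()
```

Something like read_sheet(r"C:\data\report.xls") would then hand back plain Python tuples to feed into the other application.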

For simpler Word docs (no embedded stuff), there are other tools out
there, but they'd be no use in this case.

A useful tip for 3. is to record a VB macro in Word, then edit it to
something sane. You can keep it in VB, or do the relatively trivial
edits required to convert it to Python. Here's an example on
automating RTF generation:

http://www.google.com/groups?q=auth...ie=UTF-8&[email protected]&rnum=1
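
For what it's worth, a recorded "save as RTF" macro translated to Python might look roughly like the sketch below. It is not the code from the linked post. It assumes Word and the Python for Windows extensions are installed; the function names are hypothetical, and 6 is the numeric value of Word's wdFormatRTF constant.

```python
import os

WD_FORMAT_RTF = 6  # numeric value of Word's wdFormatRTF constant


def rtf_path_for(doc_path):
    """Return doc_path with its extension replaced by .rtf."""
    root, _ext = os.path.splitext(doc_path)
    return root + ".rtf"


def export_rtf(doc_path):
    """Open doc_path in Word via COM and save an RTF copy next to it."""
    import win32com.client  # Python for Windows extensions (pywin32)

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word resolves relative paths against its own working
        # directory, so pass absolute paths.
        full_path = os.path.abspath(doc_path)
        doc = word.Documents.Open(full_path)
        out_path = rtf_path_for(full_path)
        doc.SaveAs(out_path, FileFormat=WD_FORMAT_RTF)
        doc.Close(False)  # close without saving further changes
        return out_path
    finally:
        word.Quit()
```

Whether the embedded Excel sheets come through as usable RTF is still something to check on a real document.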


John
 
MrBill

Thanks John,
This should get me started.
Bill
 
Dave Kuhlman

Here is another strategy:

1. Load the document into MS Word. Save the document as HTML.

2. Run the `links` Web browser on the file with the -dump option.
This will convert the HTML into plain text. Example:

links -dump mydoc.html > mydoc.txt

3. Use Python to extract information from the resulting plain text
file.
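
Steps 2 and 3 could be sketched in Python roughly as follows. The function names are made up, and the parser assumes the dump draws tables with "|" between cells and "+---+" border lines; look at the real output for your document before relying on that.

```python
import subprocess


def dump_html(html_path):
    """Run `links -dump` on html_path and return the rendered plain text."""
    result = subprocess.run(
        ["links", "-dump", html_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def table_rows(text):
    """Extract table rows from dumped text as lists of cell strings.

    Assumed layout: data rows start and end with "|"; border lines
    (starting with "+") and ordinary prose are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    return rows
```

Typical use would be table_rows(dump_html("mydoc.html")), after which each row is a plain list of strings.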

Another suggestion: the Web browser `links` formats tables
differently from, and perhaps better than, `lynx`, but you might
try `lynx` too.

Dave
 
