Converting .doc to .txt in Linux

patrick.waldo

Hi Everyone,

I had previously asked a similar question,
http://groups.google.com/group/comp...59?lnk=gst&q=convert+doc+txt#9dc901da63d8d059

but at that point I was using Windows and now I am using Linux.
Basically, I have some .doc files that I need to convert into txt
files encoded in utf-8. However, win32com.client doesn't work in
Linux.

It's been giving me quite a headache all day. Any ideas would be
greatly appreciated.

Best,
Patrick

#Windows Code:
import glob, os, shutil
from win32com.client import Dispatch

input_dir = '/home/pwaldo2/work/workbench/current_documents/'
input_glob = os.path.join(input_dir, '*.doc')
outpath = '/home/pwaldo2/work/workbench/current_documents/TXT/'

# Start Word once and re-save each document in place as text
# (file format 7 is Word's encoded/Unicode text format).
WordApp = Dispatch("Word.Application")
WordApp.Visible = 1
for doc in glob.glob(input_glob):
    WordApp.Documents.Open(doc)
    WordApp.ActiveDocument.SaveAs(doc, 7)
    WordApp.ActiveDocument.Close()
WordApp.Quit()

# Copy each converted file into the output directory with a .txt extension.
# glob.glob() returns full paths, so only the base name is joined to outpath.
for doc in glob.glob(input_glob):
    txt_name = os.path.splitext(os.path.basename(doc))[0] + '.txt'
    shutil.copy(doc, os.path.join(outpath, txt_name))
 
Chris Rebert

I'd recommend using one of the Word->txt converters for Linux and just
running it in a shell script:
* wvWare: http://wvware.sourceforge.net/
* Antiword: http://www.winfield.demon.nl/

No compelling reason to use Python in this instance. Right tool for
the right job and all that.

- Chris
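
For anyone who would still rather drive the converter from Python than from a shell script, here is a minimal sketch of the same idea. It assumes Antiword is installed and on the PATH, and that its `-m UTF-8.txt` option selects the UTF-8 character mapping; the function names `build_jobs` and `convert_all` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_jobs(input_dir, out_dir):
    """Pair each .doc file in input_dir with an antiword command and a .txt target."""
    jobs = []
    for doc in sorted(Path(input_dir).glob("*.doc")):
        target = Path(out_dir) / (doc.stem + ".txt")
        # Assumption: "-m UTF-8.txt" makes antiword emit UTF-8 text on stdout.
        command = ["antiword", "-m", "UTF-8.txt", str(doc)]
        jobs.append((command, target))
    return jobs


def convert_all(input_dir, out_dir):
    """Run antiword on every .doc file and save its stdout as a .txt file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command, target in build_jobs(input_dir, out_dir):
        # check=True raises CalledProcessError if antiword rejects a file.
        result = subprocess.run(command, capture_output=True, check=True)
        target.write_bytes(result.stdout)
```

Calling `convert_all(input_dir, outpath)` with the directories from the original post would write one .txt file per .doc file. Building the commands separately from running them makes it easy to print the list first and check it before anything is converted.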

Tommy Nordgren

You can do it manually with OpenOffice <http://www.openoffice.org/>,
a free office suite.
 
Carl Banks

On Debian there is a package called "unoconv"--written in Python--that
can do the conversions from the command line. It requires a running
instance of Open Office. However, the doc-to-txt conversion of Open
Office isn't that good. (It wasn't as good as Word's formatted text
converter, last time I used it.)
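
A rough sketch of batch conversion through unoconv, for illustration only. It assumes unoconv is installed and can reach an Open Office instance, that `-f txt` selects the plain-text output format, and that `-o` accepts an output directory; the helper names `unoconv_command` and `convert_with_unoconv` are hypothetical.

```python
import subprocess
from pathlib import Path


def unoconv_command(doc, out_dir):
    """Build the unoconv call that converts one document to plain text."""
    # Assumption: "-f txt" picks the text format, "-o" names the output directory.
    return ["unoconv", "-f", "txt", "-o", str(out_dir), str(doc)]


def convert_with_unoconv(input_dir, out_dir):
    """Convert every .doc file in input_dir; return the files that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for doc in sorted(Path(input_dir).glob("*.doc")):
        result = subprocess.run(unoconv_command(doc, out_dir))
        if result.returncode != 0:
            failed.append(doc)
    return failed
```

The output encoding depends on how the Open Office text filter is configured, so it is worth checking one converted file before relying on the result being utf-8.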


Carl Banks
 
