character-filtering and Word (& company)

Charles Hartman

I'm working on text-handling programs that want plain-text files as
input. It's fine to tell users to feed the programs with plain-text
only, but not all users know what this means, even after you explain
it, or they forget. So it would be nice to be able to handle gracefully
the stuff that MS Word (or any word-processor) puts into a file.
Inserting a 0-127 filter is easy but not very friendly. Typically, the
w.p. file loads OK (into a wx.StyledTextCtrl, a.k.a. Scintilla editing
pane) and is mostly readable. Just a few characters will be wrong:
"smart" quotation marks and the like.

Is there some well-known way to filter or translate this w.p. garbage?
I don't know whether encodings are relevant; I don't know what encoding
an MSW file uses. I don't see how to use s.translate() because I don't
know how to predict what the incoming format will be.

Any hints welcome.

Charles Hartman
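
For the s.translate() part of the question, here is a minimal sketch in Python 3, assuming the text has already been decoded to a Unicode string (so the incoming byte format no longer matters at this step). The table contents and the names SMART_TO_ASCII and asciify_punctuation are hypothetical, chosen for illustration; extend the table to whatever characters actually show up.

```python
# Hypothetical mapping from common "smart" punctuation to plain ASCII.
# This is an illustrative subset, not a complete list.
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(SMART_TO_ASCII)


def asciify_punctuation(text: str) -> str:
    """Replace the punctuation listed above; leave everything else alone."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(asciify_punctuation("\u201cHi\u201d \u2014 it\u2019s fine\u2026"))
```

Unlike a 0-127 filter, this keeps accented letters and other legitimate non-ASCII text intact and only rewrites the characters named in the table.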
 
Mike Meyer

Charles Hartman said:
I'm working on text-handling programs that want plain-text files as
input. It's fine to tell users to feed the programs with plain-text
only, but not all users know what this means, even after you explain
it, or they forget. So it would be nice to be able to handle
gracefully the stuff that MS Word (or any word-processor) puts into a
file. Inserting a 0-127 filter is easy but not very
friendly. Typically, the w.p. file loads OK (into a wx.StyledTextCtrl,
a.k.a. Scintilla editing pane) and is mostly readable. Just a few
characters will be wrong: "smart" quotation marks and the like.

Is there some well-known way to filter or translate this w.p. garbage?
I don't know whether encodings are relevant;

Bingo. You need to figure out the encoding before you can do
intelligent translation of the non-ASCII characters in the text.

I don't know what encoding an MSW file uses.

Different WPs will use different encodings. Especially when you start
working in a cross-platform environment.

I don't know that there is a good solution to this problem. It
certainly hasn't been solved on the web.

<mike
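
To make the encoding point concrete, here is a rough sketch in Python 3 of one way to guess: try a short list of candidate encodings in order and keep the first one that decodes without error. The candidate list, its order, and the name decode_guess are assumptions for illustration, not something from this thread, and a clean decode does not prove the guess was right.

```python
# Assumed candidates: UTF-8 first (strict, so it fails fast on other data),
# then Windows-1252, then Mac OS Roman.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "mac_roman")


def decode_guess(raw: bytes, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_name) for the first candidate that decodes."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    # Bytes 0x93/0x94 are the curly double quotes in Windows-1252.
    text, used = decode_guess(b"\x93Hi\x94")
    print(used, repr(text))
```

The weakness is exactly the one described above: Mac OS Roman stores its curly double quotes at 0xD2 and 0xD3, and those bytes also decode without error as cp1252 (as accented capital letters), so a Mac file can be silently misread. Letting the user override the guess is probably worth the trouble.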
 
John Machin

This may help: http://wvware.sourceforge.net/

[not a recommendation, I've never used it]
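
Before handing a file to a converter, it may help to detect that it is not plain text at all. The sketch below (Python 3) checks the first bytes of the file against a few well-known signatures. The function name sniff_format and the returned labels are hypothetical, and it deliberately does not call wvWare, whose command-line details are not covered in this thread.

```python
# File signatures ("magic numbers") for common word-processor formats.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 container, legacy .doc
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP container (.docx, .odt)
RTF_MAGIC = b"{\\rtf"                            # Rich Text Format header


def sniff_format(head: bytes) -> str:
    """Classify the leading bytes of a file with a coarse label."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(RTF_MAGIC):
        return "rtf"
    # NUL bytes are a crude hint of binary data. Caveat: UTF-16 text
    # also contains NUL bytes and would be labelled "binary" here.
    if b"\x00" in head:
        return "binary"
    return "text"


def sniff_file(path: str, size: int = 512) -> str:
    """Read the first `size` bytes of `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(size))
```

Anything that does not come back as "text" could be rejected with a friendly message, or passed to an external converter if one is installed.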
 
Cameron Laird

As Mike Meyer wrote, there is *no* standardization. wvWare is
indeed useful. Before you go further, though, I want to emphasize
to you what a challenge this is. While it sounds simple to
users to collect their writings through a Web interface, this
turns out to present difficulties that go on and on. Anything
you can do to structure the problem helps.

One minor variation that can help is to expose TEXTAREAs or
equivalent, and ask users to cut-and-paste their content into
them. In some situations, that's surprisingly effective.
 
