How to open a Word document in C

Asma

Dear Sir,

I am trying to find a way to open a Word document in C and read the
text of the document into a variable (Turbo C on DOS 6.0).

Can anyone please tell me which C libraries can be used to perform
this task?

Thank you so much,
Asma
 
Quentarez

If you are referring to Microsoft Word, I suggest you open one in a text
editor such as Notepad. You will see that a Microsoft Word document is not
plain text.

You can, of course, use the standard library functions such as fopen,
fgets, fread, etc. to read the contents of the file, but to translate it
from its format into something else, you will need to understand the
format.
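
To see the same thing from C rather than from Notepad, a minimal sketch along these lines reads the file in binary mode with fopen and fread and counts the bytes that would not appear in ordinary ASCII text. The name count_binary_bytes and the 512-byte buffer are illustrative choices, not part of any library.

```c
#include <stdio.h>

/* Counts the bytes in a file and how many of them are not plain ASCII
   text (printable characters, tab, CR, LF).
   Returns 0 on success, -1 if the file cannot be opened or read. */
int count_binary_bytes(const char *path, long *total, long *binary)
{
    unsigned char buf[512];
    size_t n, i;
    FILE *fp = fopen(path, "rb");   /* "rb": no newline or Ctrl-Z translation */

    if (fp == NULL)
        return -1;

    *total = 0;
    *binary = 0;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (i = 0; i < n; i++) {
            unsigned char c = buf[i];
            if (c != '\t' && c != '\n' && c != '\r' && (c < 32 || c > 126))
                ++*binary;
        }
        *total += (long)n;
    }
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return 0;
}
```

Called on a Word file, the binary count should come out well above zero, which is the point being made above: the bytes are readable with the standard library, but they are not text until the format is decoded.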

http://www.wotsit.org is a site that has information on many file formats.
You may be able to find information on the Word format on that site.
 
Martin Ambuhl

Asma said:
Dear Sir,

I am trying to find a way to open a Word document using C language and

fopen()

read the text of word doc into a variable.

fread()

(Turbo C on Dos 6.0).

Can anyone please tell me which libraries in C can be used to perform
this task.

The standard library contains fopen() and fread(). These functions are
declared in <stdio.h>.
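
As a rough illustration of those two calls, the sketch below loads a whole file into one variable, a heap buffer that grows as needed. The name read_file and the 4096-byte starting size are assumptions made for the example. On a 16-bit compiler such as Turbo C, size_t and the memory model limit how large the buffer can get, so treat this as standard C rather than something tuned for DOS.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads an entire file into a malloc'd buffer.
   On success returns the buffer and stores the byte count in *len;
   the caller must free() the buffer. Returns NULL on any failure. */
unsigned char *read_file(const char *path, size_t *len)
{
    unsigned char *buf = NULL, *tmp;
    size_t used = 0, cap = 0, n;
    FILE *fp = fopen(path, "rb");   /* binary mode: bytes arrive unmodified */

    if (fp == NULL)
        return NULL;

    for (;;) {
        if (used == cap) {          /* buffer full (or not yet allocated) */
            cap = cap ? cap * 2 : 4096;
            tmp = realloc(buf, cap);
            if (tmp == NULL) {
                free(buf);
                fclose(fp);
                return NULL;
            }
            buf = tmp;
        }
        n = fread(buf + used, 1, cap - used, fp);
        used += n;
        if (n == 0)                 /* end of file or read error */
            break;
    }
    if (ferror(fp)) {
        free(buf);
        fclose(fp);
        return NULL;
    }
    fclose(fp);
    *len = used;
    return buf;
}
```

The result is the raw contents of the file, not its text. The buffer is not a C string: it may contain zero bytes and is not terminated, so the length in *len has to travel with it.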
 
Ross A. Finlayson

Hi,

Since about Word 6.0, the Microsoft Word document has been stored as an
OLE (Object Linking and Embedding) Structured Storage compound
document. Structured Storage holds what are called streams, which are
basically files in a small file system inside the one file on disk.
Compare HTML with images: normally that is a bunch of separate files,
unless you use the data: URL scheme (Masinter) to encode the images
directly within the HTML file. In a Word document, all the media
content is stored in that one file.

One of those streams contains the standardized document properties, as
they appear on the document properties tab in Explorer. The main stream
holds the Word file proper. There is text in the Word file, but not the
way there is in WordPerfect, RTF, or HTML, where the attributes of the
text are inline with the text: here the attributes are stored
separately from the text data. In Word 5.0 files, you can just chop off
the "binary" stuff and the text remains. In the structured storage,
that text data is not guaranteed to be contiguous.
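
One cheap check before attempting anything more ambitious: compound files begin with the fixed 8-byte signature D0 CF 11 E0 A1 B1 1A E1, so a few lines of standard C can tell whether a given .doc is in the structured storage container at all. This is only a sketch, and is_compound_file is a made-up name. A match says nothing about which application wrote the file, since other OLE documents share the same container.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file starts with the compound-file signature,
   0 if it does not (or is shorter than 8 bytes),
   -1 if the file cannot be opened. */
int is_compound_file(const char *path)
{
    static const unsigned char magic[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    unsigned char head[8];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;

    n = fread(head, 1, sizeof head, fp);
    fclose(fp);

    return n == sizeof head && memcmp(head, magic, sizeof magic) == 0;
}
```

A file that fails this test may be an older, pre-structured-storage Word format, RTF saved with a .doc extension, or something else entirely, and each of those needs different handling.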

You might want to look at Quikview and the Quikview file parser API,
for programming that in C, and go ask in a newsgroup about Microsoft
Word.

Recently, Microsoft changed its policies, and now you can actually
request the Office 2003 file format(s) from them. You can get some
older versions of the specification on the Internet, e.g. Word 6 and
Word 8.

The Word compound document can contain a lot of things, for example
PostScript and TIFF, embedded and linked OLE objects, Office Drawing
items, forms and mail merge information, and all the other stuff that
has to go in there, obviously.

So, look at the Quikview file parser API: there is a DLL you can load
and call through its entry point to extract text from .doc files. I may
be mistaken about that, or your computer may not support it.

If you'd like a full-fledged portable C language implementation of a
Word doc parser, and are willing to pay some money and wait for it,
please let me know.

Excuse me, this is a newsgroup for discussing computer programming
using the C programming language, and programming issues related to
particular systems or applications are generally considered off-topic.
The previous poster is correct.

Thank you,

Ross F.
 
Asma

Dear Mr Ross (and all),

Thank you so much for the explanation. Your post has helped me a lot. I
will look into the option (the Quikview file parser API) that you
suggested. My question was about handling the OOP- and OLE-related
calls needed to open a Word doc from C, a language with no OOP/OLE
support, and I had no idea how to go ahead with the task of programming
it in C. I have no plans to pay any money yet, but if I write a parser
myself, I will surely send it to you for FREE. You surely saved me from
reinventing the wheel.

Thanks also to the others for fopen and the other functions they
recommended. My question was whether someone has already written a
library in C for the OOP/OLE functions related to Microsoft Word,
and I think Mr Ross does have some good insight into it.

Thanks to all for replying.

Regards,
Asma
 
Ross A. Finlayson

Hi Asma, Martin, Jason, Randy, Quentarez, etcetera,

Thank you for your kind, respectful words. This is definitely
off-topic for comp.lang.c.

I have not used those Word automation classes from C. Perhaps you
could generate IDL files that match the interfaces of those objects,
compile the IDL into C/C++ headers, include them with CINTERFACE and
COBJMACROS defined, and then use them with the COM functions from C.
Perhaps the SDK has a predefined C interface; I really don't know.

I have no experience with using C with COM objects, thus what seems a
simple suggestion might in reality be fraught with uncertainty.

If you must deal with the COM objects, then Don Box's "Essential COM"
is a good read (start with chapter seven), as is "Essential IDL." I
read it several years ago and had a more difficult time understanding
it then; reading it now, it makes a lot of sense.

If you plan to actually parse the files themselves, good luck. In
examining a few versions' specifications, it seems that there are some
serious differences in them for the basic file structures'
identifiers. As well, some of the styles are not well-specified. All
I have so far is some implementation of the structured storage and
transliteration of the specification elements into a bunch of XML
files. I guess OpenOffice has a Word file reader; it, AbiWord, and
most other open source implementations are based upon Caolan
McNamara's wordview, or wv, which is GPL.

From here this is totally off-topic for comp.lang.c.

I think open source software is good, but people think that software is
free, and programming is really expensive. I used to think coding some
big thing would only take months, and I'm still not done. You can copy
software without capital inputs. Something like Linux or the Apache
web server is a click away, each written in C, and while they have no
immediate monetary cost, they're the result of hundreds of person-years
of labor, not to mention that they save thousands of years of labor
about every second and introduce new labor-using opportunities. It's
easy to think that programming is easy until you do it for a while;
then it seems easy; continue, and it seems difficult, ad nauseam.
Programming is easy; learning to program elegantly and productively is
not easy; programming inelegantly is not easy; learning to program
general purpose software takes years and years.

The people worst off from the ready availability of software to do
things for free are the small software developers. When there's a free
alternative so many people take it when otherwise they could budget and
afford to pay for the tool that does the job, that small companies,
while being all over the place in the hundreds and thousands, almost
all face huge competition from free software. Many unfortunately abuse
and breach the GPL; one time I used a GPL getlongopts, but that is not
necessary. They can generally compete with the larger companies on
certain types of software, because scale conflicts with deadweight, but
once a generally accepted free solution for a task is widely available,
the market for that tool basically has a big hole shot in it. All the
commercial software developers bleed, the big ones just have more
blood.

So, to stay ahead as a small software development company, and don't
look to me because I'm not successful, it seems that innovation is a
must, because somebody will eventually release a free replacement of
various levels of utility, performance, conformance, reliability,
robustness, and overall quality. Some outfits focus on systems like
Windows that are generally non-free, there is a lot of free software
wrapped for COM for 495 for the Corp VB P/A. Other places try to get
their API out, and then have dual commercial licenses for extensions or
something. It used to be about shrinkwrap vs. shareware. In hindsight
a huge variety of successful tools are simple and obvious, almost any
programming task is obvious to the expert practitioner, and in a key
way, prior art.

To stay ahead as a software services company, a lot of them turn to
free software.

That's about horizontal components, not vertical components like niche
business type software, which is generally fiercely competitive, yet
there is almost no free software involved, or there are a lot of free
horizontal components used, where much of it is custom software or
rooted in custom software, which is expensive.

When I say expensive, I mean not free, I don't mean exorbitant. When I
mean exorbitant I say really expensive.

I use free software; I use gcc. I also use commercial compilers, and
it's good that I paid some money for a compiler, because the people
willing to pay for software are on systems thus supported. As a serious
developer, you'll tend to get a lot more out of developer tools than
you put in.

Buy American software.

Now, I tell you this, and then think to myself: if I am to encourage
you to buy American software, why am I giving you free advice? Error,
error, ....

Probably because you wouldn't pay for it. I enjoy sharing what little
information this is with you, but for somebody else it's their economic
advantage to know and withhold similar information, eg, the location of
the "on" switch, or a proprietary file format's specification, although
that might have to do with the public good, and not just particular
elements'.

There are lots of ways to beat a dead horse. Please excuse the
non-topical digression.

Good luck,

Ross Finlayson
 
CBFalconer

.... snip 140 odd lines of junk ...
There are lots of ways to beat a dead horse. Please excuse the
non-topical digression.

No. There is a reason newsgroups have a general topic and theme.
It is not to encourage long interminable ranting on disconnected
subjects. Find a group that has a suitable subject.
 
