Re: How to open a word document in C

From: Ross A. Finlayson (raf_at_tiki-lounge.com)
Date: 03/07/05


Date: 7 Mar 2005 09:09:16 -0800

Asma wrote:
> Dear Sir,
>
> I am trying to find a way to open a Word document using C language
> and read the text of word doc into a variable.
> (Turbo C on Dos 6.0).
>
> Can anyone please tell me which libraries in C can be used to perform
> this task.
>
> Thanks you so much
> Asma

Hi,

A Microsoft Word document, since around Word 6.0, is stored as an
Object Linking and Embedding (OLE) Structured Storage compound
document. OLE Structured Storage has what are called streams in it,
which are basically file system structures within the one file.
Compare HTML: a page with a bunch of images is a bunch of separate
files, unless you use the Masinter data: URL to encode the images
directly within the HTML file. With Word, all the media contents of
the document are stored in that one file.

One of those streams contains the standardized document properties, as
they appear on the document properties tab in Explorer. The main
stream holds the Word document itself. There is text in that stream,
but it is not laid out the way WordPerfect, RTF or HTML is, where the
attributes of the text are inline with the text. Instead, the
attributes are stored first and the text data follows. In Word 5.0
files, you can just chop off the "binary" stuff and the text remains.
In structured storage, that text data is not guaranteed to be
contiguous.
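
To make that distinction concrete, here is a rough sketch in plain C.
It assumes two things that are not from any Word specification: the
function names are made up for illustration, and the text extraction
is only the crude "chop off the binary stuff" heuristic (keep runs of
printable ASCII), which suits the older flat files at best. The first
function checks for the 8-byte signature that begins an OLE2 compound
file, so a program can at least tell which kind of file it has.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* The 8-byte signature found at the start of an OLE2 compound file. */
static const unsigned char OLE2_MAGIC[8] = {
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
};

/* Hypothetical helper: return 1 if buf starts with the compound-file
 * signature, 0 otherwise. */
int looks_like_compound_file(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= sizeof OLE2_MAGIC
        && memcmp(buf, OLE2_MAGIC, sizeof OLE2_MAGIC) == 0;
}

/* Hypothetical helper: copy every run of at least min_run printable
 * ASCII characters (or tabs) from buf into out, one run per line.
 * Stops early if out is full. Returns the number of characters
 * written, not counting the terminating '\0'. This is a heuristic,
 * not a parser: it knows nothing about streams or formatting. */
size_t extract_text_runs(const unsigned char *buf, size_t len,
                         size_t min_run, char *out, size_t out_size)
{
    size_t i = 0, written = 0;

    assert(buf != NULL && out != NULL && out_size > 0);
    while (i < len) {
        size_t start = i;

        /* Advance over one run of printable bytes. */
        while (i < len && (isprint(buf[i]) || buf[i] == '\t'))
            i++;

        if (i > start && i - start >= min_run) {
            size_t n = i - start;

            /* Need room for the run, a newline, and the final '\0'. */
            if (written + n + 1 >= out_size)
                break;
            memcpy(out + written, buf + start, n);
            written += n;
            out[written++] = '\n';
        }
        if (i == start)
            i++;    /* skip a single non-text byte */
    }
    out[written] = '\0';
    return written;
}
```

Read the file into a buffer with fopen(name, "rb") and fread, then
call looks_like_compound_file on it. If that returns 1, the text may
be scattered across the file, and the run extractor will give you
fragments mixed with junk rather than the document.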

You might want to look at Quick View and the Quick View file parser
API for programming that in C, and ask in a newsgroup about Microsoft
Word.

Recently, Microsoft changed its policies, and you can now actually
request the Office 2003 file format(s) from them. You can get some
older versions of the specification on the Internet, e.g. Word 6 and
Word 8.

The Word compound document can contain a lot of things, for example
PostScript and TIFF, embedded and linked OLE objects, Office Drawing
items, forms and mail merge information, and all the other stuff that
has to go in there, obviously.

So, look at the Quick View file parser API: there is a DLL you can
load, and you call its entry point to extract text from .doc files. I
may be mistaken about that, or your computer may not support it.

If you'd like a full-fledged portable C language implementation of a
Word doc parser, and are willing to pay some money and wait for it,
please let me know.

Excuse me, this is a newsgroup for discussing computer programming
using the C programming language, and programming issues related to
particular systems or applications are generally considered off-topic.
The previous poster is correct.

Thank you,

Ross F.

--
"It's the smallest infinitesimal, Russell,
there are smaller infinitesimals."

