Re: Reading/Parse Word '97 documents with Tcl



JohanBG.Johansson@xxxxxxxxx wrote:

The same thing has been done with Word 2.0 without extensions,
modules or components.

It can extract metadata from Word 2.0 documents with a heuristic
algorithm that parses the document with no real knowledge of the
format. I am working towards it also being able to do the same thing
with Word '97.

AFAIK the Word 2.0 format is the same as RTF, which is a pure text format similar to HTML/XML and can be read easily. The .doc format from Word 95 on is binary and very complicated; consider the problems that OpenOffice still has with reading Word files. There are some projects, however, that can extract text from Word .doc files:

Antiword http://www.winfield.demon.nl/
KOffice
OpenOffice
Abiword
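
If you first want to know which of the two kinds of file you have in front of you, looking at the leading bytes is usually enough. Below is a minimal sketch, not a tested tool. It relies on two facts: RTF files begin with the text {\rtf, and Word 97 .doc files are stored in an OLE2 compound file, which begins with the bytes D0 CF 11 E0 A1 B1 1A E1. The proc name sniffDocFormat and its return values (rtf, ole2, unknown) are made up for this example.

```tcl
# Hypothetical helper: classify a file by its leading bytes.
# Returns rtf, ole2 or unknown (names invented for this sketch).
proc sniffDocFormat {path} {
    set f [open $path r]
    fconfigure $f -translation binary
    set head [read $f 8]
    close $f

    # Hex-encode the header so it can be compared as ordinary text.
    set hex ""
    binary scan $head H* hex

    # 7b 5c 72 74 66 is the ASCII text that opens every RTF file.
    if {[string match 7b5c727466* $hex]} {
        return rtf
    }
    # d0 cf 11 e0 a1 b1 1a e1 is the OLE2 compound file signature.
    if {[string match d0cf11e0a1b11ae1* $hex]} {
        return ole2
    }
    return unknown
}
```

A result of ole2 only says that the container is a compound file; other Office formats use the same container, so it does not prove the file is a Word document.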

In principle you could write a parser in pure Tcl, but it would be slow and a very heavy task. For pointers, look up the format descriptions at
http://www.wotsit.org/
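
Short of a real parser, the heuristic route can be sketched in a few lines of pure Tcl: read the file as bytes and keep every run of printable ASCII above some minimum length, much like the Unix strings tool. This is only an illustration of the idea and not the algorithm mentioned in the quoted post. The proc names readBinary and printableRuns and the default minimum length of 6 are assumptions made for this example. Text that Word stores as 16-bit characters would be missed by this version.

```tcl
# Hypothetical helper: read a whole file as raw bytes.
proc readBinary {path} {
    set f [open $path r]
    fconfigure $f -translation binary
    set data [read $f]
    close $f
    return $data
}

# Hypothetical helper: return all runs of printable ASCII
# (space through tilde) that are at least minLen characters long.
proc printableRuns {data {minLen 6}} {
    set pattern [format {[ -~]{%d,}} $minLen]
    return [regexp -all -inline -- $pattern $data]
}

# Example with an in-memory byte string instead of a real file.
set sample "\001\002Hello, Word\000\003zz\004Title: test\377"
foreach run [printableRuns $sample] {
    puts $run
}
```

With a real file the call would be printableRuns [readBinary $path]. Expect a lot of noise (font names, style names, field codes) mixed in with the body text, so the result needs further filtering before it is useful as metadata.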


Christian