Re: Convert MS-Word to plain text



In article <g09frv$9i$1@xxxxxxxxxxxxxxxx>,
Ben Bullock <benkasminbullock@xxxxxxxxx> wrote:

"Jurgen Exner" <jurgenex@xxxxxxxxxxx> wrote in message
news:n7t924pig28pcl1kgcpubip5m73chum672@xxxxxxxxxx
backpack <curtyoung@xxxxxxxxx> wrote:
Are there any perl modules that will allow you to convert MS-Word docs
to plain text?

Unlike the earlier formats, the DOCX format is an open XML format, and
information about it is available on the Microsoft website. I don't know
whether anyone has already written a parser for it, but at least it
should be possible now.
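
To make the "should be possible" part concrete: a Word 2007 .docx file is a ZIP archive, and the main body text is stored in the member word/document.xml. What follows is only a rough sketch under stated assumptions, not a real parser. The sub names docx_xml_to_text and docx_file_to_text are made up for illustration, Archive::Zip is a CPAN module that has to be installed separately, and the regex pass ignores tables, headers, footnotes, field codes and empty self-closing paragraphs.

```perl
#! perl
use warnings;
use strict;

# Turn the XML text of word/document.xml into plain text.
# Assumption: a regex pass is good enough for a demonstration; a real
# converter should use a proper XML parser.
sub docx_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{<w:tab\s*/>}{\t}g;     # attribute-less tab elements in runs
    $xml =~ s{<w:br\s*/>}{\n}g;      # plain line breaks
    $xml =~ s{</w:p>}{\n}g;          # end of a paragraph
    $xml =~ s{<[^>]+>}{}g;           # drop every remaining tag
    # Decode the five predefined XML entities in a single pass.
    my %ent = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s{&(lt|gt|quot|apos|amp);}{$ent{$1}}g;
    return $xml;
}

# Read word/document.xml out of a .docx file and convert it.
sub docx_file_to_text {
    my ($path) = @_;
    require Archive::Zip;            # loaded only when a file is converted
    my $zip = Archive::Zip->new();
    $zip->read($path) == Archive::Zip::AZ_OK()
        or die "Cannot read $path as a ZIP archive\n";
    my $xml = $zip->contents('word/document.xml');
    defined $xml or die "No word/document.xml in $path\n";
    return docx_xml_to_text($xml);
}

# Usage: perl docx2txt.pl file.docx > file.txt   (script name is hypothetical)
print docx_file_to_text($ARGV[0]) if @ARGV;
```

The XML handling is kept in its own sub so it can be tried on a string without any .docx file or CPAN module at hand. The output is the raw UTF-8 bytes from the archive.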

The docx format only applies to Word 2007. Microsoft have also made their
binary formats for other versions of Word public:

http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

But deciphering these is only necessary if you want to read the files on a
non-Windows system or you don't have a working copy of Microsoft Word. If
you have Microsoft Word available, the most sensible thing to do is to use
Win32::OLE. Here is a quick program to save a Word file as text:
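
#! perl
use warnings;
use strict;
use Win32::OLE;
# Import Word's constants (such as wdFormatText) from its type library.
use Win32::OLE::Const 'Microsoft.Word';
my $dir = "C:/Documents and Settings/bkb/My Documents/scripts/tests/";
my $doc = $dir."test2.doc";
my $txt = $dir."test.txt";
# Start Word; the second argument names the method ('Quit') that is
# called when the $word object is destroyed.
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Word problem: ",Win32::OLE->LastError();
my $document = $word->Documents->Open($doc)
    or die "Word problem: ",Win32::OLE->LastError();
# Save a plain-text copy of the document.
$document->SaveAs($txt, wdFormatText);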


I wonder how the result compares to the output of the (non-Perl, I
think) program "antiword"?

Any experiences?

David

