Re: Convert MS-Word to plain text
- From: dkcombs@xxxxxxxxx (David Combs)
- Date: Fri, 30 May 2008 01:52:29 +0000 (UTC)
In article <g09frv$9i$1@xxxxxxxxxxxxxxxx>,
Ben Bullock <benkasminbullock@xxxxxxxxx> wrote:
"Jurgen Exner" <jurgenex@xxxxxxxxxxx> wrote in message
news:n7t924pig28pcl1kgcpubip5m73chum672@xxxxxxxxxx
backpack <curtyoung@xxxxxxxxx> wrote:
Are there any perl modules that will allow you to convert MS-Word docs
to plain text?
Unlike earlier formats, the DOCX format is an open XML format, and
information about it is available on the Microsoft website. I don't know
whether anyone has already written a parser for it, but at least it
should be possible now.
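To illustrate why a parser is feasible: a .docx file is a zip archive whose body text lives in the member word/document.xml, with text runs in <w:t> elements and paragraphs in <w:p> elements. The following is only a rough sketch under those assumptions, not a complete parser. The helper name docx_xml_to_text and the script name are made up for illustration, and the regex approach ignores tables, footnotes, headers, numeric character references and deleted text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: reduce the XML of word/document.xml to plain text.
# Regex-based on purpose; a sketch, not a full WordprocessingML parser.
sub docx_xml_to_text {
    my ($xml) = @_;
    my $text = '';
    # <w:t> holds a run of text, </w:p> ends a paragraph, and a bare
    # <w:tab/> is a tab character (tab-stop definitions carry attributes).
    while ($xml =~ m{<w:t(?:\s[^>]*)?>([^<]*)</w:t>|(</w:p>)|(<w:tab/>)}g) {
        if    (defined $1) { $text .= $1 }
        elsif (defined $2) { $text .= "\n" }
        else               { $text .= "\t" }
    }
    # Decode the five predefined XML entities, &amp; last so that
    # "&amp;lt;" comes out as "&lt;" and not as "<".
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'");
    $text =~ s/&(lt|gt|quot|apos);/$entity{$1}/g;
    $text =~ s/&amp;/&/g;
    return $text;
}

# Usage: perl docx2txt.pl file.docx   (the script name is made up)
if (@ARGV) {
    my $file = shift @ARGV;
    my $xml;
    # A .docx file is a zip archive; the body is the member word/document.xml.
    unzip($file => \$xml, Name => 'word/document.xml')
        or die "Cannot read $file: $UnzipError\n";
    print docx_xml_to_text($xml);
}
```

IO::Uncompress::Unzip ships with recent Perl releases, so this needs nothing from CPAN. For anything beyond a quick text dump, a real XML parser would be the safer choice.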
The docx format only applies to Word 2007. Microsoft have also made their
binary formats for other versions of Word public:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
But deciphering these is only necessary if you want to read the files on a
non-Windows system or you don't have a working copy of Microsoft Word. If
you have Microsoft Word available, the most sensible thing to do is to use
Win32::OLE. Here is a quick program to save a Word file as text:
#! perl
use warnings;
use strict;
use Win32::OLE;
# Import Word's constants, such as wdFormatText.
use Win32::OLE::Const 'Microsoft.Word';
# Input and output files; adjust the directory to suit.
my $dir = "C:/Documents and Settings/bkb/My Documents/scripts/tests/";
my $doc = $dir."test2.doc";
my $txt = $dir."test.txt";
# Start Word; 'Quit' is the method called when $word is destroyed.
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Word problem: ",Win32::OLE->LastError();
my $document = $word->Documents->Open($doc)
    or die "Word problem: ",Win32::OLE->LastError();
# Save a copy of the document as plain text.
$document->SaveAs($txt, wdFormatText);
I wonder how the result compares to the (non-perl, I think)
program "antiword"?
Any experiences?
David