Re: Convert MS-Word to plain text




"Jurgen Exner" <jurgenex@xxxxxxxxxxx> wrote in message news:n7t924pig28pcl1kgcpubip5m73chum672@xxxxxxxxxx
> backpack <curtyoung@xxxxxxxxx> wrote:
>> Are there any perl modules that will allow you to convert MS-Word docs
>> to plain text?
>
> Opposite to earlier formats the DOCX format is an open XML format and
> information about it is available on the Microsoft website. I don't know
> if someone already wrote a parser for it, but at least it should be
> possible now.

The docx format only applies to Word 2007. Microsoft have also made their binary formats for other versions of Word public:

http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

But deciphering these is only necessary if you want to read the files on a non-Windows system or you don't have a working copy of Microsoft Word. If you have Microsoft Word available, the most sensible thing to do is to use Win32::OLE. Here is a quick program to save a Word file as text:
#! perl
use warnings;
use strict;
use Win32::OLE;
# Import Word's constants (wdFormatText and so on) from its type library.
use Win32::OLE::Const 'Microsoft.Word';

my $dir = "C:/Documents and Settings/bkb/My Documents/scripts/tests/";
my $doc = $dir."test2.doc";
my $txt = $dir."test.txt";

# Start Word; the 'Quit' argument makes Word exit when $word is destroyed.
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Word problem: ",Win32::OLE->LastError();
my $document = $word->Documents->Open($doc)
    or die "Word problem: ",Win32::OLE->LastError();
# Save a copy of the document in plain text format.
$document->SaveAs($txt, wdFormatText);
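
For the other case mentioned above, where Word is not available and the file is a docx, here is a rough sketch of one way to get at the text. It is not a real parser: it assumes the main text lives in word/document.xml inside the zip container, with paragraphs in <w:p> elements and text runs in <w:t> elements, and it picks them out with regular expressions. The function names docx_xml_to_text and docx_file_to_text are made up for this example, and Archive::Zip (from CPAN) is just one module that could be used to read the container. Text boxes, numeric character references, tabs and line breaks are not handled, and the text is passed through as the UTF-8 bytes found in the file.

```perl
#! perl
use warnings;
use strict;

# The five predefined XML entities.
my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');

# Turn the XML of word/document.xml into plain text, one line per paragraph.
# Assumption: a regex pass is good enough for simple documents.
sub docx_xml_to_text {
    my ($xml) = @_;
    my @paragraphs;
    # Match either an empty paragraph (<w:p/>) or <w:p ...> ... </w:p>.
    # Requiring whitespace, "/>" or ">" after "<w:p" skips <w:pPr>.
    while ($xml =~ m{<w:p(?:\s[^>]*?)?(?:/>|>(.*?)</w:p>)}sg) {
        my $body = defined $1 ? $1 : '';
        # Join the contents of all the <w:t> text runs in this paragraph.
        my $text = join '', $body =~ m{<w:t(?:\s[^>]*)?>(.*?)</w:t>}sg;
        # Decode entities in a single pass so "&amp;lt;" becomes "&lt;".
        $text =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
        push @paragraphs, $text;
    }
    return join("\n", @paragraphs) . "\n";
}

# Read word/document.xml out of a docx file and convert it.
sub docx_file_to_text {
    my ($file) = @_;
    require Archive::Zip;    # CPAN module, loaded only when needed
    my $zip = Archive::Zip->new($file)
        or die "Cannot read $file as a zip archive\n";
    my $xml = $zip->contents('word/document.xml');
    die "No word/document.xml in $file\n"
        unless defined $xml and length $xml;
    return docx_xml_to_text($xml);
}

# Usage: perl docx2txt.pl test.docx > test.txt
print docx_file_to_text($ARGV[0]) if @ARGV;
```

For anything beyond simple documents a proper XML parser would be a safer choice than the regular expressions used here.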
