Re: clean up html document created by Word



Wow, thanks for all the great responses!

Here's my summary:

- demoronizer (from John Walker) is designed to solve some very
particular problems that could be considered bugs. However, it
doesn't remove the unnecessary html generated by Word.
http://www.fourmilab.ch/webtools/demoroniser/


- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html". The former results in
slightly cleaner html but doesn't include the style *** and so the
rendering isn't as nice; the latter does include the style *** but
it's got slightly more junk in it. Both approaches preserve the
"blank" paragraphs (basically, <p>&nbsp;</p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document.
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN

BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.

- Tidy with Win2000 configuration: It's already bundled in with my
editor (PSPad) so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features). The tidy output could use a more whitespace to
improve html readability, but I assume I can change the config file to
do this. No "blank paragraphs" (better than the Microsoft tool) but
footnotes were messed up.
http://www.w3.org/People/Raggett/tidy/

-- jeff

.


Quantcast