Re: clean up html document created by Word
- From: "jd" <chimalus@xxxxxxxxx>
- Date: 30 Mar 2007 18:17:19 -0700
Wow, thanks for all the great responses!
Here's my summary:
- demoroniser (from John Walker) is designed to fix a few very
specific problems that could be considered bugs. However, it
doesn't remove the unnecessary html generated by Word.
http://www.fourmilab.ch/webtools/demoroniser/
- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html". The former results in
slightly cleaner html but doesn't include the style sheet, so the
rendering isn't as nice; the latter does include the style sheet but
has slightly more junk in it. Both approaches preserve the
"blank" paragraphs (basically, <p> </p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document.
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN
BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.
- Tidy with Win2000 configuration: It's already bundled in with my
editor (PSPad) so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features). The tidy output could use more whitespace to
improve html readability, but I assume I can change the config file to
do this. No "blank paragraphs" (better than the Microsoft tool) but
footnotes were messed up.
http://www.w3.org/People/Raggett/tidy/
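For the "blank" paragraphs that the Microsoft tool leaves behind, a
small post-processing pass might be enough. Below is a minimal Python
sketch, not something any of the tools above ship with. It assumes the
spacer paragraphs contain only whitespace, &nbsp; or &#160;; the names
BLANK_P and strip_blank_paragraphs are made up for illustration. It
will not catch spacers that wrap the space in another tag (Word
sometimes emits <o:p>&nbsp;</o:p> inside the paragraph).

```python
import re

# A paragraph whose only content is whitespace or a non-breaking space
# entity. Attributes on the <p> tag (e.g. class=MsoNormal) are allowed.
# Note: \s also matches a literal U+00A0 character in Python 3 str patterns.
BLANK_P = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|&#160;)*</p>\s*",
    re.IGNORECASE,
)


def strip_blank_paragraphs(html: str) -> str:
    """Remove spacer paragraphs and leave all other markup untouched."""
    return BLANK_P.sub("", html)


sample = (
    "<p>First</p>\n"
    '<p class="MsoNormal">&nbsp;</p>\n'
    "<p> </p>\n"
    "<p>Second</p>"
)
print(strip_blank_paragraphs(sample))
```

A regex is fine for this one narrow pattern, but for anything more
involved a real html parser would be the safer choice.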
-- jeff