Re: Problem round-tripping with xml.dom.minidom pretty-printer



Ben Butler-Cole wrote:
Hello

I have run into a problem using minidom. I have an HTML file that I
want to make occasional, automated changes to (adding new links). My
strategy is to parse it with minidom, add a node, pretty print it and
write it back to disk.

However I find that every time I do a round trip minidom's pretty
printer puts extra blank lines around every element, so my file grows
without limit. I have found that normalizing the document doesn't make
any difference. Obviously I can fix the problem by doing without the
pretty-printing, but I don't really like producing non-human readable
HTML.

Here is some code that shows the behaviour:

import xml.dom.minidom as dom
def p(t):
    d = dom.parseString(t)
    d.normalize()
    t2 = d.toprettyxml()
    print t2
    p(t2)
p('<a><b><c/></b></a>')

Does anyone know how to fix this behaviour? If not, can anyone
recommend an alternative XML tool for simple tasks like this?

Hi,

The last line of p() calls itself. The call is unconditional, so the recursion never terminates on its own; it only ends when Python hits its recursion limit. And since p() also prints something on every call, the output just keeps coming. If you remove that line, you get something like:

<?xml version="1.0" ?>
<a>
    <b>
        <c/>
    </b>
</a>

That seems sensible, imo. Was that what you wanted?

An additional thing to keep in mind is that toprettyxml does not print XML that is identical to the original DOM tree: it adds newlines and tabs. When the output is parsed again, this whitespace ends up in the DOM tree as text nodes. If you run toprettyxml on a document twice in a row, the second pass adds newlines and tabs around the ones added by the first. Since you call toprettyxml over and over, it is expected that lots of whitespace appears.
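To make the growth visible without recursing forever, here is a small sketch that does a fixed number of round trips and prints the size of each result. The helper name round_trip and the choice of three rounds are mine, not from the original code, and it uses print() as a function so it runs on current Python versions.

```python
import xml.dom.minidom as dom

def round_trip(text, rounds):
    """Parse and pretty-print `text` `rounds` times; return every output."""
    outputs = []
    for _ in range(rounds):
        # Each pass re-parses the previous output, whitespace included.
        text = dom.parseString(text).toprettyxml()
        outputs.append(text)
    return outputs

outputs = round_trip('<a><b><c/></b></a>', 3)
for out in outputs:
    print(len(out))  # the length goes up on every round

# After one round trip, the root's first child is a whitespace-only text node.
reparsed = dom.parseString(outputs[0])
first = reparsed.documentElement.firstChild
print(first.nodeType == first.TEXT_NODE, repr(first.data))
```

The exact lengths depend on the minidom version, but each one should be larger than the one before.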

Finally, normalize() is supposed to merge adjacent sibling text nodes; it never removes text content, even if that content is only whitespace. That means that several adjacent text nodes are replaced by a single one whose content is the concatenation of the contents of the original nodes. Clear enough?
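A rough sketch of both points, under some assumptions: the first part shows normalize() merging two whitespace text nodes into one without dropping anything, and the second part is one possible workaround that removes whitespace-only text nodes before pretty-printing. The helpers strip_blank_text and stable_pretty are hypothetical names of mine, not part of minidom.

```python
import xml.dom.minidom as dom

# normalize() merges adjacent text nodes but keeps their whitespace.
d = dom.parseString('<a/>')
root = d.documentElement
root.appendChild(d.createTextNode('  '))
root.appendChild(d.createTextNode('\n'))
d.normalize()
merged = [n.data for n in root.childNodes]
print(merged)  # a single node holding '  \n'

def strip_blank_text(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):  # copy: we mutate while iterating
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        else:
            strip_blank_text(child)

def stable_pretty(text):
    """Parse, drop whitespace-only text nodes, then pretty-print."""
    doc = dom.parseString(text)
    strip_blank_text(doc)
    return doc.toprettyxml()

once = stable_pretty('<a><b><c/></b></a>')
twice = stable_pretty(once)
print(once == twice)  # True for this element-only input
```

This only settles down for element-only content like the example. In mixed content (text and elements side by side, as in real HTML) whitespace between elements can be significant, and toprettyxml still adds whitespace around text that is not blank, so treat it as a starting point rather than a general fix.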

Cheers,
RB