Re: lxml/ElementTree and .tail



Chas Emerick wrote:

>> and keep patting ourselves on the back, while the rest of the world is
>> busy routing around us, switching to well-understood XML subsets or
>> other serialization formats, simpler and more flexible data models,
>> simpler API:s, and more robust code. and Python ;-)
>
> That's flatly unrealistic. If you'll remember, I'm not one of "those people" that are specification-driven -- I hadn't even *heard* of Infoset until earlier this week!

The rant wasn't directed at you or anyone special, but I don't really think you got the point of it either. Which is a bit strange, because it sounded like you *were* working on extracting information from messy documents, so the "it's about the data, dammit" way of thinking shouldn't be news to you.

And the routing around is not unrealistic, it's a *fact*; JSON and POX are killing the full XML/Schema/SOAP stack for communication, XHTML is pretty much dead as a wire format, people are apologizing in public for their use of SOAP, AJAX is quickly turning into AJAJ, few people care about the more obscure details of the XML 1.0 standard (when did you last see a conditional section? or even a DTD?), dealing with huge XML data sets is still extremely hard compared to just uploading the darn thing to a database and doing the crunching in SQL, and nobody uses XML 1.1 for anything.

Practicality beats purity, and the Internet routes around damage, every single time.

> overwhelming majority of the developers out there care for nothing
> but the serialization, simply because that's how one plays nicely
> with others.

The problem is that if you only stare at the serialization, your code *won't* play nicely with others. At the serialization level, it's easy to think that CDATA sections are different from other text, that character references are different from ordinary characters, that you should somehow be able to distinguish between <tag></tag> and <tag/>, that namespace prefixes are more important than the namespace URI, that an &nbsp; in an XHTML-style stream is different from a U+00A0 character in memory, and so on. In my experience, serialization-only thinking (at the receiving end) is the single most common cause of interoperability problems when it comes to general XML interchange.
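A minimal sketch of that point, assuming Python's standard xml.etree.ElementTree module (imported as ET). The sample documents and the http://example.com/ns URI are made up for illustration. It parses several serializations that look different on the wire and compares what ends up in the tree:

```python
import xml.etree.ElementTree as ET

# Different spellings of the same character data: an entity reference,
# a CDATA section, and decimal/hex character references.
variants = [
    "<doc>a &lt; b</doc>",
    "<doc><![CDATA[a < b]]></doc>",
    "<doc>a &#60; b</doc>",
    "<doc>a &#x3C; b</doc>",
]
texts = [ET.fromstring(v).text for v in variants]
print(texts)

# <tag></tag> and <tag/> both parse to an element with no text and no children.
long_form = ET.fromstring("<tag></tag>")
short_form = ET.fromstring("<tag/>")
print(long_form.text, short_form.text, len(long_form), len(short_form))

# The prefix is a serialization detail; the namespace URI is what ends up
# in the element's tag.
prefixed = ET.fromstring('<x:item xmlns:x="http://example.com/ns"/>')
default_ns = ET.fromstring('<item xmlns="http://example.com/ns"/>')
print(prefixed.tag == default_ns.tag)
```

Code that only looks at the parsed tree never has to know which spelling the sender used, since all four variants carry the same text and both namespace forms produce the same tag.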

But when you focus on the data model, and treat the serialization as an implementation detail, to be addressed by a library written by someone who's actually read the specifications a few more times than you have, all those problems tend to just go away. Things just work.

And in practice, of course, most software engineers understand this, and care about this. After all, good software engineering is about abstractions and decoupling and designing things so you can focus on one part of the problem at a time. And about making your customer happy, and having fun while doing that. Not staying up all night to look for an obscure interoperability problem that you finally discover is caused by someone using a CDATA section where you expected a character reference, in 0.1% of all production records, but in none of the files in your test data set.

(By the way, did ET fail to *read* your XML documents? I thought your complaint was that it didn't put the things it read in a place where you expected them to be, and that you didn't have time to learn how to deal with that because you had more important things to do, at the time?)
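For reference, a small sketch of where ElementTree puts mixed-content text, again assuming the standard xml.etree.ElementTree module and a made-up sample document. Text before the first child goes in the parent's .text, and text that follows a child's end tag goes in that child's .tail:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>big</b> world</p>")
b = p[0]

print(repr(p.text))   # text before the first child
print(repr(b.text))   # text inside <b>
print(repr(b.tail))   # text after </b>, stored on the <b> element
print(repr(p.tail))   # nothing follows </p>

# itertext() walks .text and .tail in document order, so nothing is lost.
full_text = "".join(p.itertext())
print(full_text)
```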

</F>
