Re: xml + mmap cross



On Sep 4, 7:54 pm, alex23 <wuwe...@xxxxxxxxx> wrote:
> On Sep 4, 8:31 am, castironpi <castiro...@xxxxxxxxx> wrote:
>
> > Any interest in pursuing/developing/working together on a mmaped-xml
> > class?  Faster, not readable in text editor.
>
> XML is text-based, so it should -always- be readable in a text editor.
> It's part of the definition, I believe.
>
> However, an implementation of one of the alternative binary XML
> formats would probably be very welcome.
>
> Fast Infoset: http://www.itu.int/rec/T-REC-X.891-200505-I/en
> EXI: http://www.w3.org/TR/2007/WD-exi-20070716/
>
> I don't know enough about either format to say if it would be
> possible, but an implementation that conformed to the ElementTree API
> could be a big win.

I was thinking of something much less restrictive than the two links.
Since it's not text, I'm not sure it even counts as structured
markup. More generic, something like hierarchical 'tag-content-child'
pairs.

Here's what the xml.etree.ElementTree API says:

Each element has a number of properties associated with it:

- a tag which is a string identifying what kind of data this element
represents (the element type, in other words).
- a number of attributes, stored in a Python dictionary.
- a text string.
- an optional tail string.
- a number of child elements, stored in a Python sequence

Since all of these would be buffer-based representations, the
attribute list would merely implement the mapping-object protocol
rather than being a true dictionary. The strings would be stored as
offsets to length-prefixed buffer segments.
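
A rough sketch of the length-prefixed string idea, to make it concrete. The 4-byte little-endian prefix, the UTF-8 encoding, and the helper names write_string/read_string are all assumptions for illustration, not a settled format; an anonymous map stands in for a real file.

```python
import mmap
import struct

# Assumption: every string is preceded by a 4-byte little-endian length.
LEN = struct.Struct("<I")

def write_string(buf, offset, text):
    """Store text at offset as length + UTF-8 bytes; return the next free offset."""
    data = text.encode("utf-8")
    LEN.pack_into(buf, offset, len(data))
    start = offset + LEN.size
    buf[start:start + len(data)] = data
    return start + len(data)

def read_string(buf, offset):
    """Read back the length-prefixed string that starts at offset."""
    (length,) = LEN.unpack_from(buf, offset)
    start = offset + LEN.size
    return buf[start:start + length].decode("utf-8")

buf = mmap.mmap(-1, 4096)          # anonymous map standing in for a file
tag_offset = 0
text_offset = write_string(buf, tag_offset, "b")
end_offset = write_string(buf, text_offset, "ab<")
```

A node then only needs to hold tag_offset and text_offset; the string bytes themselves never leave the map until someone asks for them.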

Each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent
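
As a sketch only, the node record could be packed as eight unsigned 32-bit integers. The field width, the use of offset 0 as "no link", and the helper names (write_node, read_node, children) are assumptions of mine, not part of any existing library.

```python
import mmap
import struct
from collections import namedtuple

NODE = struct.Struct("<8I")   # assumption: eight 32-bit offsets per node
NONE = 0                      # assumption: offset 0 is reserved to mean "no link"

Node = namedtuple(
    "Node",
    "tag_offset first_attr text_offset tail_offset "
    "first_child prev_sibling next_sibling parent",
)

def write_node(buf, offset, node):
    NODE.pack_into(buf, offset, *node)

def read_node(buf, offset):
    return Node._make(NODE.unpack_from(buf, offset))

def children(buf, offset):
    """Yield the offsets of a node's children by following next_sibling links."""
    child = read_node(buf, offset).first_child
    while child != NONE:
        yield child
        child = read_node(buf, child).next_sibling

buf = mmap.mmap(-1, 4096)
root, first, second = 32, 64, 96   # hand-picked, non-overlapping record offsets
write_node(buf, root, Node(NONE, NONE, NONE, NONE, first, NONE, NONE, NONE))
write_node(buf, first, Node(NONE, NONE, NONE, NONE, NONE, NONE, second, root))
write_node(buf, second, Node(NONE, NONE, NONE, NONE, NONE, first, NONE, root))
child_offsets = list(children(buf, root))
```

Walking the children touches only the 32-byte records, so no Python element objects have to be built until they are actually requested.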

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

These are all integers representing offsets elsewhere into the map.
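
To show what "implement the mapping-object protocol" might look like over such records, here is a minimal read-only sketch. AttrView, put_string, get_string, the 32-bit field width, and 0 as the "no link" value are all hypothetical choices; the owner-node field is left unset for brevity.

```python
import mmap
import struct
from collections.abc import Mapping

LEN = struct.Struct("<I")     # assumed 4-byte length prefix for strings
ATTR = struct.Struct("<5I")   # key_offset, value_offset, prev_attr, next_attr, node
NONE = 0                      # assumption: offset 0 means "no link"

def put_string(buf, offset, text):
    data = text.encode("utf-8")
    LEN.pack_into(buf, offset, len(data))
    buf[offset + LEN.size:offset + LEN.size + len(data)] = data

def get_string(buf, offset):
    (length,) = LEN.unpack_from(buf, offset)
    return buf[offset + LEN.size:offset + LEN.size + length].decode("utf-8")

class AttrView(Mapping):
    """Read-only mapping over a linked chain of attribute records."""

    def __init__(self, buf, first_attr):
        self.buf = buf
        self.first_attr = first_attr

    def _records(self):
        offset = self.first_attr
        while offset != NONE:
            key, value, _prev, nxt, _node = ATTR.unpack_from(self.buf, offset)
            yield key, value
            offset = nxt

    def __getitem__(self, name):
        for key, value in self._records():
            if get_string(self.buf, key) == name:
                return get_string(self.buf, value)
        raise KeyError(name)

    def __iter__(self):
        for key, _value in self._records():
            yield get_string(self.buf, key)

    def __len__(self):
        return sum(1 for _ in self._records())

buf = mmap.mmap(-1, 4096)
put_string(buf, 100, "id")
put_string(buf, 110, "7")
put_string(buf, 120, "lang")
put_string(buf, 130, "en")
ATTR.pack_into(buf, 32, 100, 110, NONE, 52, NONE)   # first attribute record
ATTR.pack_into(buf, 52, 120, 130, 32, NONE, NONE)   # second attribute record
attrs = AttrView(buf, 32)
```

Lookup is a linear scan of the chain, which seems acceptable for the handful of attributes a typical element carries; get(), items(), and "in" come for free from the Mapping base class.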

A short observation:

>>> import xml.etree.ElementTree as e
>>> a = e.XML('<a><b>abc</b></a>')
>>> a.getchildren()[0].text
'abc'
>>> a.getchildren()[0].text = 'ab<'
>>> e.tostring(a)
'<a><b>ab&lt;</b></a>'
>>> e.XML(_)
<Element a at c2c3f0>
>>> _.getchildren()[0].text
'ab<'

The current implementation supports round trips between special
characters '<' and markup '&lt;', which I propose to support as well.
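
One way to read this, and it is only my assumption about the design: the map would hold the raw characters, and escaping would happen only when serializing to or parsing from text. The standard library already has helpers for that conversion:

```python
from xml.sax.saxutils import escape, unescape

stored = "ab<"                      # raw text as it would sit in the buffer
serialized = escape(stored)         # markup form used when writing out text
restored = unescape(serialized)     # back to the raw form when reading text in
```

With that split, the binary representation never needs to know about entities at all.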

Of course, you'd have to garbage collect removed nodes by hand
whenever anything is deleted.
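
A hand-rolled free list is one possible way to do that reclamation. The sketch below is hypothetical throughout: it assumes fixed 32-byte node records, a small header at offset 0 holding the free-list head and the next unused offset, and made-up function names (init, alloc, free).

```python
import mmap
import struct

NODE_SIZE = 32                 # assumed fixed record size
HEAD = struct.Struct("<II")    # assumed header: free_list_head, next_unused
LINK = struct.Struct("<I")     # a freed record stores the next free offset
NONE = 0                       # offset 0 is the header, so it can mean "none"

def init(buf):
    # The first record starts one record-size past the header.
    HEAD.pack_into(buf, 0, NONE, NODE_SIZE)

def alloc(buf):
    """Return the offset of a record, reusing a freed one when possible."""
    free_head, next_unused = HEAD.unpack_from(buf, 0)
    if free_head != NONE:
        (next_free,) = LINK.unpack_from(buf, free_head)
        HEAD.pack_into(buf, 0, next_free, next_unused)
        return free_head
    if next_unused + NODE_SIZE > len(buf):
        raise MemoryError("map is full; a real version would grow the file")
    HEAD.pack_into(buf, 0, free_head, next_unused + NODE_SIZE)
    return next_unused

def free(buf, offset):
    """Push a removed record onto the free list."""
    free_head, next_unused = HEAD.unpack_from(buf, 0)
    LINK.pack_into(buf, offset, free_head)
    HEAD.pack_into(buf, 0, offset, next_unused)

buf = mmap.mmap(-1, 4096)
init(buf)
a = alloc(buf)
b = alloc(buf)
free(buf, a)
c = alloc(buf)     # reuses the record that a occupied
d = alloc(buf)     # free list is empty again, so this is fresh space
```

Variable-length string segments would need something more elaborate (size classes or compaction); this only covers the fixed-size records.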

Also, possibly change the subject to: ElementTree + mmap cross.