Re: Fast XML filtering

From: Maarten Wiltink (maarten_at_kittensandcats.net)
Date: 11/21/03


Date: Fri, 21 Nov 2003 14:01:58 +0100


"Martin Kofoed" <inzide@hot.mail.com> wrote in message
news:3fbde622$0$198$edfadb0f@dread11.news.tele.dk...

> I'm writing a DLL that should be able to sweep through XML data at
> sizes up to 10 MB per call.
>
> Basically the user will send the XML as a string and pass another
> string containing a start-tag that indicates which parts to filter
> out from the XML (starttag and corresponding end-tag AND all the
> elements and data between them).

I take it that you're not much concerned with the data actually being
XML, then?

> Of course, "performance" is the key word here. I started out sweeping
> the string using standard string handling functions, but I'm not
> impressed with performance.
>
> Which approach would be the best seen from a performance point of view?

Locate the starting position as a PChar. Same for the ending position.
Compute difference as a simple integer. Copy the element into a buffer
or simply leave it where it is and work with a start pointer and length.
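
A minimal sketch of the pointer-and-length idea, written in C because a
PChar behaves much like a C char pointer. The function name find_span
and its parameters are made up for illustration, and it assumes the
caller passes both the start-tag and the end-tag as plain strings. It
does plain substring matching and nothing more.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: locate the first span that begins with start_tag
   and ends with the first end_tag found after it.
   On success, stores a pointer to the span and its length in bytes and
   returns 1; returns 0 if either tag is missing.
   Nothing is copied: the span stays where it is inside xml. */
static int find_span(const char *xml,
                     const char *start_tag, const char *end_tag,
                     const char **start, size_t *len)
{
    assert(xml != NULL && start_tag != NULL && end_tag != NULL);

    const char *s = strstr(xml, start_tag);   /* starting position */
    if (s == NULL)
        return 0;

    const char *e = strstr(s, end_tag);       /* ending position */
    if (e == NULL)
        return 0;
    e += strlen(end_tag);                     /* include the end tag itself */

    *start = s;
    *len = (size_t)(e - s);                   /* difference as a simple integer */
    return 1;
}
```

The caller can then skip over the span (from start to start + len) when
writing the filtered output, or copy it into a buffer with memcpy if a
separate copy is really needed.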

There is that algorithm for very quickly looking for substrings in a
string that I can never remember the name of; the one that looks for
the last letter first and uses a table for how many letters it can skip.
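
That description matches the Boyer-Moore family of searches; the
simplified variant with a single skip table is usually called
Boyer-Moore-Horspool. A rough sketch in C follows, assuming the text and
pattern are byte strings with known lengths. The function name horspool
is only illustrative.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search (sketch).
   Returns a pointer to the first occurrence of pat (length m) in
   text (length n), or NULL if there is none. */
static const char *horspool(const char *text, size_t n,
                            const char *pat, size_t m)
{
    size_t skip[UCHAR_MAX + 1];

    assert(text != NULL && pat != NULL);
    if (m == 0)
        return text;
    if (m > n)
        return NULL;

    /* A byte that does not occur in the pattern allows a full-length skip. */
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        skip[i] = m;
    /* Otherwise skip by the distance from its last occurrence
       (excluding the final position) to the end of the pattern. */
    for (size_t i = 0; i + 1 < m; i++)
        skip[(unsigned char)pat[i]] = m - 1 - i;

    size_t pos = 0;
    while (pos <= n - m) {
        /* Look at the text byte under the pattern's last letter first. */
        unsigned char last = (unsigned char)text[pos + m - 1];
        if (last == (unsigned char)pat[m - 1] &&
            memcmp(text + pos, pat, m - 1) == 0)
            return text + pos;
        pos += skip[last];
    }
    return NULL;
}
```

For a tag such as "</item>" the window can often move forward several
bytes per comparison, which is where the speedup over a byte-by-byte
scan comes from. Whether it beats the library's own substring search on
real data is something to measure rather than assume.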

Beware. This method skips an awful lot of steps, any of which can ruin
your day. Character entities allow different encodings of the same
document; looking for a closing _tag_ means you may find one for a
different _element_; many other horrible things are possible.
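
To illustrate the closing-tag trap: in "<a><a>x</a>y</a>" the first
"</a>" closes the inner element, not the outer one. A sketch in C that
counts nesting depth handles that one case. The name matching_end is
hypothetical, and the function still ignores comments, CDATA sections,
self-closing tags, attributes, and element names that merely share a
prefix, so it is not a substitute for a real parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: open_pos points at an occurrence of open_tag.
   Returns a pointer just past the close_tag that balances it, counting
   nested occurrences of the same tag, or NULL if the text is unbalanced. */
static const char *matching_end(const char *open_pos,
                                const char *open_tag, const char *close_tag)
{
    size_t open_len = strlen(open_tag);
    size_t close_len = strlen(close_tag);
    assert(open_len > 0 && close_len > 0);
    assert(strncmp(open_pos, open_tag, open_len) == 0);

    const char *p = open_pos + open_len;
    int depth = 1;

    while (depth > 0) {
        const char *next_close = strstr(p, close_tag);
        if (next_close == NULL)
            return NULL;                      /* ran out of closing tags */

        const char *next_open = strstr(p, open_tag);
        if (next_open != NULL && next_open < next_close) {
            depth++;                          /* a nested element opens first */
            p = next_open + open_len;
        } else {
            depth--;                          /* this close ends one level */
            p = next_close + close_len;
        }
    }
    return p;
}
```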

Groetjes,
Maarten Wiltink


