Re: XML CDATA special characters

From: Terence (tk.lists_at_fastmail.fm)
Date: 11/19/03


Date: Wed, 19 Nov 2003 10:39:20 +1100

John van Terheijden wrote:

> I didn't mention SAX, is that the standard PHP parser I'm using now? I
> thought it was Expat. Thanks for making this even more confusing ;)
>

Yeah, it's a bit like that. I didn't want to include too much
explanation or I'd be in danger of writing a huge article. Trust me,
restraint is a good thing for me. When you're on the newbie end of a
technology, it's best just to pretend you never read or heard the
stuff that confused you (initially, of course).

Simple API for XML (SAX) is indeed what PHP's inadequately named
"XML extension" provides. And yes, it is based on Expat (a product
name), which parses in the SAX style: it fires events as it reads.
SAX is a standard; Expat is a product that implements that style of API.
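
To make the "events" idea concrete, here is a minimal sketch using the XML extension's own functions. The sample document and the variables $events and $text are made up for illustration, and the handlers simply record what the parser reports. Note that the character data arrives with the entities already turned back into plain characters.

```php
<?php
// Sketch only: the sample XML, $events and $text are invented for this example.
$xml = '<note><body>Fish &amp; chips &lt; 5 &quot;quid&quot;</body></note>';

$events = [];   // element events, in document order
$text   = '';   // character data; the parser may deliver it in several chunks

$parser = xml_parser_create('UTF-8');
// Keep element names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$events) { $events[] = "start:$name"; },
    function ($p, $name) use (&$events) { $events[] = "end:$name"; }
);
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$text) { $text .= $data; }
);

// Third argument true means "this is the final (and only) chunk of input".
if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

print_r($events);
echo $text, "\n";
```

The parser never hands you a tree; if you want one, you have to build it yourself from the events, which is the main practical difference from DOM.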

DOM is a standard; PHP uses the libxml product, which implements that
standard. PHP 5 is slated to use libxml2, which is very exciting indeed :)
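
A minimal sketch of the DOM side, assuming the PHP 5+ DOMDocument class rather than the PHP 4 DOM XML functions discussed in this thread. The element name "snippet" and the sample string are hypothetical. It shows that the DOM escapes special characters when writing and that reading the file back returns the original text, with no CDATA section involved.

```php
<?php
// Sketch only: uses the PHP 5+ DOMDocument class; names and data are invented.
$raw = 'if (a < b && b > c) { echo "R&D"; }';

$doc  = new DOMDocument('1.0', 'UTF-8');
$item = $doc->createElement('snippet');
$item->appendChild($doc->createTextNode($raw)); // no manual escaping here
$doc->appendChild($item);

// Serialised form: the special characters are written as entities.
$stored = $doc->saveXML();

// Reading it back: the parser converts the entities to the original characters.
$copy = new DOMDocument();
$copy->loadXML($stored);
$roundTrip = $copy->documentElement->textContent;

echo $stored;
var_dump($roundTrip === $raw);
```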

If you don't know anything about XSLT, then ignore the tip I gave to
XSLT users who might take my advice on the [no need to use] CDATA issue.
XSLT is a whole new kettle of fish; don't go there until you have a firm
grasp on XML.
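
For readers who already use XSLT, the tip in question can be sketched roughly as follows. This assumes PHP 5+ with the optional xsl extension (XSLTProcessor) installed; the function renderStoredHtml() and the element name "page" are hypothetical, not anything from the original posts.

```php
<?php
// Sketch only: requires the optional xsl extension; all names are invented.
function renderStoredHtml(string $htmlSource): string
{
    // Store the markup as ordinary text; the DOM escapes it on the way in.
    $data = new DOMDocument('1.0', 'UTF-8');
    $page = $data->createElement('page');
    $page->appendChild($data->createTextNode($htmlSource));
    $data->appendChild($page);

    $xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/page">
    <div><xsl:value-of select="." disable-output-escaping="yes"/></div>
  </xsl:template>
</xsl:stylesheet>
XSL;

    $sheet = new DOMDocument();
    $sheet->loadXML($xslSource);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($sheet);

    // With disable-output-escaping, the stored text is written out as markup.
    return (string) $proc->transformToXml($data);
}

if (class_exists('XSLTProcessor')) {
    echo renderStoredHtml('<b>bold</b>');
}
```

Without the disable-output-escaping attribute, the same stylesheet would write the stored text out escaped, so a browser would show the tags literally.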

I recommend familiarising yourself with the XML "infoset". You will find
the "infoset" standard on the W3C website. Do not panic, it is a
relatively short document that can be skimmed quite readily. Don't get
depressed if it doesn't all stick the first time. At least *familiarise*
yourself with the *concept* of the infoset. There should be an
introduction/primer type article there.

> Ok, I'll just dive into DOM now and see where this will all end up. I'll
> probably come across all the terms again, in time. B.t.w. I don't understand
> much of your XSL note, probably because I know very little about XSL. I'm
> using XML to store data while avoiding databases.
>
> Thanks!
>
> "Terence" <tk.lists@fastmail.fm> wrote in message
> news:3fb9969e$1@herald...
>
>>For a start, if you are "creating" XML content, then you need to use the
>>DOM API and not the SAX API. As far as I am aware, the SAX API is just
>>for "reading" XML data and not for writing it. Someone please correct me
>>if I am wrong.
>>
>>The DOM API will conveniently do all special character escaping for you
>>so you don't have to worry about using functions *like* htmlentities().
>>On that point, XML itself has only five predefined entities. Off the
>>top of my head, I think they are:
>>
>> > -- &gt;
>>< -- &lt;
>>" -- &quot;
>>& -- &amp;
>>[insert fifth one here]
>>
>>The other one escapes me (no pun intended). If you try to use HTML
>>entities, then you will likely create XML documents that are not
>>well-formed, because HTML has entities that are undefined in the
>>default XML set.
>>
>>When you use an XML parser (be it SAX or DOM) to get the data back from
>>the XML storage files, everything (including entities) will be converted
>>back (un-escaped). So you really do not need to use CDATA sections.
>>CDATA sections do have their uses, but they are strictly necessary in
>>only a very few cases.
>>
>>SPECIAL NOTE ON XSL STYLESHEETS:
>>If you are using XSL templates to extract HTML markup contained
>>(escaped) in the XML storage files, set disable-output-escaping="yes"
>>on the xsl:value-of element. This is useful if you have done
>>something like this...
>>$element->set_content($htmlSource);
>>and you wish the output tree to contain unescaped HTML.
>>
>>As for character encoding (UTF-8 etc.), it depends on what sort of data
>>you are putting in there. Odds are you needn't concern yourself with
>>this unless you know that your source data is UTF-16 or something. Just
>>try using the DOM XML functions and see how you go.
>>
>
>
>
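
On the entity list quoted above: the fifth predefined entity is &apos; (the apostrophe). The sketch below checks which entity references a parser accepts when no DTD is supplied. It assumes the PHP 5+ DOMDocument class, and wellFormed() is a hypothetical helper written for this example, not part of PHP.

```php
<?php
// Sketch only: wellFormed() is an invented helper, not a PHP function.
function wellFormed(string $xml): bool
{
    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $doc = new DOMDocument();
    $ok  = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok !== false;
}

// The five predefined entities need no declaration.
var_dump(wellFormed('<p>&lt; &gt; &amp; &quot; &apos;</p>'));   // bool(true)

// &eacute; and &nbsp; come from HTML; with no DTD declaring them, the parser rejects them.
var_dump(wellFormed('<p>caf&eacute;&nbsp;bar</p>'));            // bool(false)

// Numeric character references always work instead.
var_dump(wellFormed('<p>caf&#233;&#160;bar</p>'));              // bool(true)
```

This is why running text through htmlentities() before storing it tends to break the file, while letting the DOM do the escaping does not.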