Re: non SGML character escape



On Mar 13, 1:44 pm, Tom Anderson <t...@xxxxxxxxxxxxxxx> wrote:
On Fri, 13 Mar 2009, Srini wrote:
I have some typographical/special characters in our database which
come from user input pasted from documents. I have to take that
data and create an XML file. When I run the XML through the W3C
validator, it fails, saying:

"Line 37231, Column 135: non SGML character number 25

You have used an illegal character in your text. HTML uses the
standard UNICODE Consortium character repertoire, and it leaves
undefined (among others) 65 character codes (0 to 31 inclusive and 127
to 159 inclusive) that ...... and so on"

I am using the StringEscapeUtils class from the Apache Commons Lang
package. I tried StringEscapeUtils.escapeXml() and also
StringEscapeUtils.escapeHtml(). Both of them fail to escape these
characters.

I think what the error report is saying is that there's no way to escape
the characters, because they're control characters that HTML and XML 1.0
just don't allow, escaped or not. It's a bit like having Klingon
characters in your database.

Your solution is to remove the characters, and either replace them with
something equivalent that is allowed, or forget about them. ASCII
character 25 is EM, 'end of medium' - what does that mean in your system?
How on earth are your users entering it?
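
For what it's worth, removing the characters could look something like the sketch below. It keeps only code points matching the "Char" production of the XML 1.0 spec and drops everything else. The class and method names (XmlSanitizer, stripIllegalXmlChars) are made up for illustration and are not part of Commons Lang; the idea would be to run the text through this before handing it to escapeXml().

```java
// Hypothetical helper, not part of Apache Commons Lang.
final class XmlSanitizer {
    private XmlSanitizer() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point XML 1.0 forbids removed. */
    static String stripIllegalXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);      // handles surrogate pairs
            i += Character.charCount(cp);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Dropping is the simplest policy; if you would rather substitute a space or a placeholder, append that in the branch where the code point is rejected.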

Can someone point me in the right direction? Is there a utility that
I can use for this?
Even if the XML validator fails, can XSLT processing bypass these
characters when it parses this XML?

It's likely but not certain that XML parsers will choke on the characters
(a standards-compliant parser will), and since parsing is a prerequisite
for XSLT processing, you can't rely on that being possible.
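
If you want to see how your own parser behaves, a quick check along these lines would tell you. It's only a sketch using the JAXP parser that ships with the JDK; the class and method names (ParseCheck, isWellFormed) are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for trying a snippet of XML against the JDK parser.
final class ParseCheck {
    private ParseCheck() {}

    /** Returns true if the JDK's default parser accepts the document. */
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        }
    }
}
```

Calling isWellFormed("<a>" + (char) 25 + "</a>") should come back false with the stock JDK parser, and so should the character-reference form "<a>&#25;</a>", which is why escaping alone doesn't help.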

tom

--
THE DRUMMER FROM DEF LEPPARD'S ONLY GOT ONE ARM!

I believe these characters come from users copying and pasting
from applications like Word. So would the solution be to just
ignore that particular element when the parser chokes, and to ask users
not to cut and paste from a word processor? But how can you control
users?
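
Rather than controlling users, one option is to clean the text when it comes in. The sketch below covers one common case only, and it rests on an assumption I can't confirm from this thread: that some of the bad characters in the 128 to 159 range are Windows-1252 "smart" punctuation that was decoded as ISO-8859-1 somewhere along the way. If that's what happened, they can be mapped back to the real Unicode characters. The class name Cp1252Repair is made up, and the table lists just a few common entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; assumes C1 control codes are mis-decoded Windows-1252.
final class Cp1252Repair {
    private static final Map<Character, Character> C1_TO_UNICODE = new HashMap<>();
    static {
        C1_TO_UNICODE.put((char) 0x85, '\u2026'); // horizontal ellipsis
        C1_TO_UNICODE.put((char) 0x91, '\u2018'); // left single quote
        C1_TO_UNICODE.put((char) 0x92, '\u2019'); // right single quote
        C1_TO_UNICODE.put((char) 0x93, '\u201C'); // left double quote
        C1_TO_UNICODE.put((char) 0x94, '\u201D'); // right double quote
        C1_TO_UNICODE.put((char) 0x96, '\u2013'); // en dash
        C1_TO_UNICODE.put((char) 0x97, '\u2014'); // em dash
    }

    private Cp1252Repair() {}

    /** Replaces the listed C1 codes with their Windows-1252 meanings. */
    static String repair(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character mapped = C1_TO_UNICODE.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }
}
```

This does nothing for character 25 itself, which is below 32 and has no Windows-1252 meaning; that one still has to be removed or replaced by hand.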
