Re: editing perl script through TEXTAREA

From: Alan J. Flavell (flavell_at_ph.gla.ac.uk)
Date: Wed, 18 Aug 2004 20:20:19 +0100

On Wed, 18 Aug 2004, Gunnar Hjalmarsson wrote:

> There is a lot of confusion here.

Nothing new there, then... :-}

> A '&' character that is submitted via a textarea control is URI
> encoded, not converted to the corresponding HTML entity. Accordingly,
> after URI decoding, it's still a '&'.

Agreed.
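
A minimal sketch of that round trip, assuming a hypothetical request body
as a browser would send it for an application/x-www-form-urlencoded form
in which the user typed "Smith & Son Co.". The hand-rolled url_decode
helper is only there to make the step visible; in real code CGI.pm or
URI::Escape would do the decoding.

```perl
use strict;
use warnings;

# Hypothetical helper: undo application/x-www-form-urlencoded encoding.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                                # '+' stands for a space
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX -> the byte XX
    return $s;
}

# Assumed request body for a textarea named "demo".
my $body = 'demo=Smith+%26+Son+Co.';

my %param;
for my $pair (split /&/, $body) {
    my ($name, $value) = split /=/, $pair, 2;
    $param{ url_decode($name) } = url_decode($value);
}

print $param{demo}, "\n";    # prints: Smith & Son Co.
```

The '&' travels as %26 and comes back as a plain '&'; no HTML entity is
involved at any point in the submission.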

> It's when you want to display content initially in a textarea field
> that you should first convert certain characters to HTML entities, but
> if you for instance have:
>
> <textarea name="demo">Smith &amp; Son Co.</textarea>
>
> the browser will convert the '&amp;' to '&' and display it as '&'
> right away, i.e. before submitting.

Spot on.
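
The output direction can be sketched like this. The escape_html helper
and the $stored value are made up for illustration; in real code
HTML::Entities::encode_entities is the usual tool for the job.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
# '&' must be handled first, or the '&' of the later replacements
# would itself get escaped again.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

my $stored = 'Smith & Son Co.';    # raw value, e.g. read from a file
my $html   = '<textarea name="demo">' . escape_html($stored) . '</textarea>';

print $html, "\n";
# prints: <textarea name="demo">Smith &amp; Son Co.</textarea>
```

The escaping is applied once, at the moment the value is written into
the HTML page, and the browser undoes it when it parses the page.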

> So my point, which I also tried to illustrate with a little program
> in another post in this thread, is that there is never a need for the
> Perl program to do any "reverse conversion" of HTML entities.

As a matter of principle you're correct here. In practice it isn't
quite that simple, as I'll explain in a moment.

As usual, it's all a matter of dividing the problem up into its
component parts, and understanding how each one works separately,
before assembling them into a working application.

But, over and above this, if folks go pasting weird characters into
their form submission (and there's no way you can stop them doing so),
then browsers do strange things with them. As the Perl Encode
documentation so engagingly remarks:

  It is beyond the power of words to describe the way HTML
  browsers encode non-ASCII form data.

   - http://www.perldoc.com/perl5.8.0/lib/Encode/Supported.html

And there are some browsers (or should I say "browser-like operating
system components"?) which, when the user feeds into a form a
character which cannot be represented in the prevailing character
encoding, will turn it into &#number; or even into &entityname; format
for submission.

On arrival at the server, of course, the server-side process can have
no idea whether the user typed just a curly-quote character, or typed
the ASCII string &#8220; (ampersand, hash, 8, 2, 2, 0, semicolon). By
that time they are indistinguishable. What a browser should do with a
character that the submission encoding cannot represent is undefined
anyway, and browser developers have addressed it in various different
ways as they saw fit.
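
The ambiguity can be reproduced with a small simulation. This is only a
sketch under stated assumptions: simulate_submission is a made-up
stand-in for one particular browser behaviour (submit in ISO-8859-1 and
replace anything unrepresentable with a decimal character reference),
not a description of what any given browser does. It relies on the
FB_HTMLCREF fallback of the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: mimic a browser that submits in ISO-8859-1 and turns
# unrepresentable characters into &#number; references.
sub simulate_submission {
    my ($typed) = @_;
    return encode('iso-8859-1', $typed, Encode::FB_HTMLCREF());
}

my $curly   = simulate_submission("\x{201C}");   # user typed a curly quote
my $literal = simulate_submission('&#8220;');    # user typed seven ASCII characters

print "curly:   $curly\n";
print "literal: $literal\n";
print $curly eq $literal ? "indistinguishable\n" : "different\n";
```

Both inputs arrive as the same seven bytes, so any "reverse conversion"
on the server would be a guess: it would repair the first case and
corrupt the second.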

But Perl is only a small part of this problem - the major issues
really need to be hammered out on a suitable WWW-related group, where
one might even get referred to my no-longer-quite-new tutorial-ish
page on the topic:
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

have fun


