Re: URL encoding api in Java 1.4.2
- From: Tom Anderson <twic@xxxxxxxxxxxxxxx>
- Date: Thu, 29 Jan 2009 18:35:27 +0000
On Wed, 28 Jan 2009, angrybaldguy@xxxxxxxxx wrote:
On Jan 28, 3:05 pm, Tom Anderson <t...@xxxxxxxxxxxxxxx> wrote:
Fantastic explanation.
I'm hoping you mean that as a compliment, rather than as an assertion that it's a fantasy! :)
To clarify, when you see a URL like:
http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search
There are *two* *different* layers of syntax here. First is the URI/URL
syntax, which breaks the string down to:
Scheme: http
Authority: www.google.co.uk
Path: /search
Query: hl=en&safe=off&q=my+query&btnG=Search
Second is x-www-form-urlencoding of the query part, which breaks it down
to:
hl: en
safe: off
q: my query
btnG: Search
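The two layers can be sketched in Java like this. It's only an illustration: the class and method names (TwoLayers, formDecode, parseQuery) are made up, and it assumes UTF-8 for the form decoding. java.net.URI and the two-argument URLDecoder.decode have both been there since 1.4.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo class: splits a URL by URI syntax, then form-decodes the query.
class TwoLayers {
    // Layer 2 helper: '+' becomes a space, %hh becomes the octet it names.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8"); // UTF-8 is an assumption
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Split the raw query on '&' and '=' first, and only then decode each piece.
    static String[][] parseQuery(String rawQuery) {
        String[] pairs = rawQuery.split("&");
        String[][] fields = new String[pairs.length][2];
        for (int i = 0; i < pairs.length; i++) {
            int eq = pairs[i].indexOf('=');
            String name = eq < 0 ? pairs[i] : pairs[i].substring(0, eq);
            String value = eq < 0 ? "" : pairs[i].substring(eq + 1);
            fields[i][0] = formDecode(name);
            fields[i][1] = formDecode(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search");

        // Layer 1: generic URI syntax.
        System.out.println("Scheme:    " + uri.getScheme());
        System.out.println("Authority: " + uri.getAuthority());
        System.out.println("Path:      " + uri.getPath());
        System.out.println("Query:     " + uri.getRawQuery());

        // Layer 2: x-www-form-urlencoding of the query part.
        String[][] fields = parseQuery(uri.getRawQuery());
        for (int i = 0; i < fields.length; i++) {
            System.out.println(fields[i][0] + ": " + fields[i][1]);
        }
    }
}
```

The order matters: the delimiters are found in the raw query before anything is decoded, so an escaped & or = inside a value can't be mistaken for a separator.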
Note that it is permitted to have raw + signs in the query part: they're
reserved characters in URI syntax, but in the lesser 'subcomponent
delimiter' set, rather than the greater 'generic delimiter' set, and that
means that they can be used unescaped in a part, provided that the syntax
for that part permits it. I can't find anything in a specification of the
http URL scheme that forbids + from the query part, and thus, applying
ancient Anglo-Saxon legal principles, it's permitted. If you don't like
it, you can always escape them:
http://www.google.co.uk/search?hl=en&safe=off&q=my%2bquery&btnG=Search
I sincerely believe that that URL is exactly equivalent to the one above.
Although i note that Google doesn't think so. Hmm.
Google's right. The query part of that URL expands to
hl: en
safe: off
q: my+query <-- Note the plus.
btnG: Search
The query part is encoded using x-www-form-urlencoding. Or rather, it's encoded using the encoding specified in the form's enctype attribute, which has a default value of application/x-www-form-urlencoded. The specification for that says that spaces are escaped as pluses:
http://www.w3.org/TR/html4/interact/forms.html#h-17.13
So that plus *is* a space.
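A quick way to see the difference from Java, as a sketch only: the class name is made up, UTF-8 is assumed, and URLDecoder is standing in for whatever the server really does with the query.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo class: a raw '+' and an escaped '+' decode differently.
class PlusVersusEscapedPlus {
    // Form-decode one value: '+' becomes a space, %hh becomes the octet it names.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8"); // UTF-8 is an assumption
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(formDecode("my+query"));   // my query
        System.out.println(formDecode("my%2bquery")); // my+query
        System.out.println(formDecode("my%20query")); // my query

        // java.net.URI keeps the distinction only in the raw form of the query.
        URI uri = URI.create("http://www.google.co.uk/search?q=my%2bquery");
        System.out.println(uri.getRawQuery()); // q=my%2bquery
        System.out.println(uri.getQuery());    // q=my+query
    }
}
```

Note that getQuery() has already undone the %-escape, so form-decoding its result would turn the literal plus into a space. For form data, getRawQuery() is the safer starting point.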
Clearly, it's not handled that way in practice. Furthermore, i'm a bit dubious about the interaction between x-www-form-urlencoding and URI escaping. For instance, what happens when a form value has a & or a = in it? Those are used as delimiters in the x-www-form-urlencoding syntax, so they have to be escaped. But what the spec says to do is to encode them using %hh, which is the URI escape notation. Under my model for the interaction of x-www-form-urlencoding and URI escaping, that would mean that a form dataset like this:
text: this is &lt;html&gt;
Would be x-www-form-urlencoded as:
text=this+is+%26lt%3bhtml%26gt%3b
And using that as a query part would make a URI like:
http://example.com/search?text=this%2bis%2b%2526lt%253bhtml%2526gt%253b
That is, with the %s escaped again!
What actually happens is that the %-escaped characters in the query part are not %-escaped again - and nor are the +s.
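Here is the same thing as a Java sketch, assuming the form value is the literal text "this is &lt;html&gt;" (which is what the encoded form above implies) and assuming UTF-8. The class name is made up. URLEncoder writes its hex digits in upper case, so the output differs from the strings above in case only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical demo class: form-encode once (what happens) versus twice (the layered model).
class EncodeOnceOrTwice {
    // x-www-form-urlencode one value: space becomes '+', reserved characters become %HH.
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8"); // UTF-8 is an assumption
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String value = "this is &lt;html&gt;";

        String once = formEncode(value);
        String twice = formEncode(once); // the %s and +s escaped a second time

        System.out.println(once);  // this+is+%26lt%3Bhtml%26gt%3B
        System.out.println(twice); // this%2Bis%2B%2526lt%253Bhtml%2526gt%253B

        // The once-encoded string is already a legal query part; parsing it
        // with java.net.URI leaves the raw query exactly as it was written.
        URI uri = URI.create("http://example.com/search?text=" + once);
        System.out.println(uri.getRawQuery()); // text=this+is+%26lt%3Bhtml%26gt%3B
    }
}
```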
I assume that what happened is that URI encoding and x-www-form-urlencoding have been treated as a single process, with the structure of the query part being considered part of the structure of the URI, despite what the specifications say. Or rather, that they really are part of the same process, and my reading of the specifications is wrong. This may be covered under the slightly handwavey bits in the URI spec that talk about scheme-specific syntax; if we consider x-www-form-urlencoding part of the http scheme's syntax, rather than a separate layer of encoding on top, i think it makes sense. Even though that's not what the HTML spec says.
Although in fact, what the HTML spec says about form submission is:
If the method is "get" and the action is an HTTP URI, the user agent
takes the value of action, appends a `?' to it, then appends the form
data set, encoded using the "application/x-www-form-urlencoded" content
type. The user agent then traverses the link to this URI.
Note that it *doesn't* say that the encoded string is used as a query part - it says it's appended directly to the action URL. Which sort of means that URLs carrying form data are not strictly URLs at all ...
Anyway, i've gone mad now, so i'll leave it at that.
In practice, a web app which treats
GET /foo?a+b HTTP/1.1
and
GET /foo?a%20b HTTP/1.1
differently is going to break one way or another. However, when comparing URLs, those are two distinct URLs.
No, those URLs are equivalent. From RFC 3986, in the section about how to compare URIs for equality:
6.2.2.2. Percent-Encoding Normalization
The percent-encoding mechanism (Section 2.1) is a frequent source of
variance among otherwise identical URIs. In addition to the case
normalization issue noted above, some URI producers percent-encode octets
that do not require percent-encoding, resulting in URIs that are
equivalent to their non-encoded counterparts. These URIs should be
normalized by decoding any percent-encoded octet that corresponds to an
unreserved character, as described in Section 2.3.
And on this, for once, Google at least agrees with me - try these:
http://www.google.co.uk/search?q=a+b
http://www.google.co.uk/search?q=a%20b
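For anyone who wants to poke at this from Java, here is a small sketch (made-up class name, UTF-8 assumed). It only shows how java.net.URI and URLDecoder treat the two URLs; it is one library's behaviour, not a ruling on what the RFC means.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo class: compares the two URLs as URIs and as form data.
class SameOrDifferent {
    // Form-decode a string: '+' becomes a space, %hh becomes the octet it names.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8"); // UTF-8 is an assumption
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        URI plus = URI.create("http://www.google.co.uk/search?q=a+b");
        URI percent = URI.create("http://www.google.co.uk/search?q=a%20b");

        // URI.equals compares the raw query text, so these are not equal.
        System.out.println(plus.equals(percent)); // false

        // After form-decoding, both queries carry the same data.
        String a = formDecode(plus.getRawQuery());
        String b = formDecode(percent.getRawQuery());
        System.out.println(a);           // q=a b
        System.out.println(a.equals(b)); // true
    }
}
```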
tom
--
Mr. Cadbury's Parrot impressions go down surprisingly well during
lovemaking! -- D