Re: URL encoding api in Java 1.4.2



On Jan 29, 1:35 pm, Tom Anderson <t...@xxxxxxxxxxxxxxx> wrote:
On Wed, 28 Jan 2009, angrybald...@xxxxxxxxx wrote:
On Jan 28, 3:05 pm, Tom Anderson <t...@xxxxxxxxxxxxxxx> wrote:

Fantastic explanation.

I'm hoping you mean that as a compliment, rather than as an assertion that
it's a fantasy! :)

It was. :)

To clarify, when you see a URL like:

http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search

There are *two* *different* layers of syntax here. First is the URI/URL
syntax, which breaks the string down to:

Scheme: http
Authority: www.google.co.uk
Path: /search
Query: hl=en&safe=off&q=my+query&btnG=Search

Second is x-www-form-urlencoding of the query part, which breaks it down
to:

hl: en
safe: off
q: my query
btnG: Search
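
For anyone following along in Java, here is a minimal sketch of the first
layer only, using java.net.URI (which I believe has been in the JDK since
1.4). The class name UriLayers is made up for illustration. java.net.URI
reports the path with its leading slash, and it knows nothing about form
encoding, so the + in the query comes back untouched.

```java
import java.net.URI;

// Hypothetical demo class: splits a URL with java.net.URI, i.e. the
// URI/URL layer only. Form decoding of the query is a separate step.
class UriLayers {
    static final String EXAMPLE =
        "http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search";

    public static void main(String[] args) {
        URI uri = URI.create(EXAMPLE);
        System.out.println("Scheme:    " + uri.getScheme());
        System.out.println("Authority: " + uri.getAuthority());
        System.out.println("Path:      " + uri.getPath());
        // The raw query is exactly what appeared after the '?'.
        System.out.println("Query:     " + uri.getRawQuery());
    }
}
```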

Note that it is permitted to have raw + signs in the query part: they're
reserved characters in URI syntax, but in the lesser 'subcomponent
delimiter' set, rather than the greater 'generic delimiter' set, and that
means that they can be used unescaped in a part, provided that the syntax
for that part permits it. I can't find anything in a specification of the
http URL scheme that forbids + from the query part, and thus, applying
ancient Anglo-Saxon legal principles, it's permitted. If you don't like
it, you can always escape them:

http://www.google.co.uk/search?hl=en&safe=off&q=my%2bquery&btnG=Search

I sincerely believe that that URL is exactly equivalent to the one above.
Although i note that Google doesn't think so. Hmm.

Google's right. The query part of that URL expands to

hl: en
safe: off
q: my+query <-- Note the plus.
btnG: Search

The query part is encoded using x-www-form-urlencoding. Or rather, it's
encoded using the encoding specified in the form's enctype attribute,
which has a default value of application/x-www-form-urlencoded. The
specification for that says that spaces are escaped as pluses:

http://www.w3.org/TR/html4/interact/forms.html#h-17.13

So that plus *is* a space.

You are in a maze of twisty little standards, all different.

I can see how that reading is supported by the HTML spec, which is
*disgustingly* vague, but that's not what happens. The spec also
supports my reading.

The encoding process that's actually used in most browsers could be
accurately described as:

1. Convert all the reserved characters *except spaces* to their
%-encoded equivalents. (This converts +s in form fields to %2b, &s to
%26, and ?s to %3f, among others.)
2. Convert spaces to +s.
3. Join each key to its value using =.
4. Join each key-value pair in the order they appear in the form using
& as a separator.
5(get). Append the resulting string to the submission URL, offset by
an unescaped ?.
5(post). Submit the resulting string as the request body.
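
The steps above can be sketched in Java roughly as follows. This is only
an illustration, not what any browser runs: FormEncoder and its method
names are hypothetical, UTF-8 is an assumed charset, and
java.net.URLEncoder's exact set of escaped characters may differ a little
from a given browser's. It also emits upper-case hex digits (%2B rather
than %2b).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a form data set following steps 1 to 5(get),
// leaning on java.net.URLEncoder for the escaping.
class FormEncoder {

    // Steps 1 and 2: URLEncoder %-escapes reserved characters and turns
    // spaces into +. UTF-8 is an assumption; a form may use another charset.
    static String escape(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not supported: " + e);
        }
    }

    // Steps 3 and 4: join each key to its value with =, and the pairs
    // with &, in the order given.
    static String encode(String[][] pairs) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < pairs.length; i++) {
            if (i > 0) {
                out.append('&');
            }
            out.append(escape(pairs[i][0])).append('=').append(escape(pairs[i][1]));
        }
        return out.toString();
    }

    // Step 5(get): append to the action URL after an unescaped ?.
    static String toGetUrl(String action, String[][] pairs) {
        return action + "?" + encode(pairs);
    }

    public static void main(String[] args) {
        String[][] form = { { "q", "my query" }, { "x", "a+b&c?" } };
        // prints http://www.google.co.uk/search?q=my+query&x=a%2Bb%26c%3F
        System.out.println(toGetUrl("http://www.google.co.uk/search", form));
    }
}
```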

Yes, this sucks. Removing the exception for spaces in step 1 and
removing step 2 entirely gives a simpler encoding process with
equivalent power.

I'm fairly sure the process I just described is a result of
incremental growth, and it's probably Mosaic's fault. The initial
encoding was probably "convert spaces to +s," before someone noticed
that sometimes people want to enter strings with +s (and ?s and &s) on
forms.

Clearly, it's not handled that way in practice. Furthermore, i'm a bit
dubious about the interaction between x-www-form-urlencoding and URI
escaping.

The x-www-form-urlencoding process is done instead of URI escaping,
rather than as well as. The result is a valid URI and can be correctly
converted back to the form data mostly via the URI unescaping rules:

1. Split the query into key-value pairs at every &.
2. Split the key-value pairs into keys and values at the first =.
3. Convert +s to %20.
4. URI-unescape the keys and values.
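
A rough Java sketch of those four steps, for illustration only.
FormDecoder is a made-up name and UTF-8 is an assumed charset.
java.net.URLDecoder does steps 3 and 4 in one go, since it treats + as a
space while undoing the %-escapes; it throws IllegalArgumentException on
a malformed escape, which this sketch does not handle.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: turns a raw query string back into key/value pairs.
class FormDecoder {

    // Steps 3 and 4: URLDecoder maps + to a space and undoes %-escapes.
    static String unescape(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not supported: " + e);
        }
    }

    static String[][] decode(String rawQuery) {
        if (rawQuery.length() == 0) {
            return new String[0][];
        }
        // Step 1: split into pairs at every &.
        String[] parts = rawQuery.split("&");
        String[][] pairs = new String[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            // Step 2: split at the FIRST = only; a missing = means an
            // empty value.
            int eq = parts[i].indexOf('=');
            String key = eq < 0 ? parts[i] : parts[i].substring(0, eq);
            String value = eq < 0 ? "" : parts[i].substring(eq + 1);
            pairs[i] = new String[] { unescape(key), unescape(value) };
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[][] pairs = decode("hl=en&safe=off&q=my+query&btnG=Search");
        for (int i = 0; i < pairs.length; i++) {
            System.out.println(pairs[i][0] + ": " + pairs[i][1]);
        }
    }
}
```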

Although in fact, what the HTML spec says about form submission is:

  If the method is "get" and the action is an HTTP URI, the user agent
  takes the value of action, appends a `?' to it, then appends the form
  data set, encoded using the "application/x-www-form-urlencoded" content
  type. The user agent then traverses the link to this URI.

Note that it *doesn't* say that the encoded string is used as a query part
- it says it's appended directly to the action URL. Which sort of means
that URLs carrying form data are not strictly URLs at all ...

They are: the resulting strings fit the syntax requirements for URLs
and URIs; they also fit the structural requirements for HTTP URLs.

In practice, a web app which treats

GET /foo?a+b HTTP/1.1
and
GET /foo?a%20b HTTP/1.1

differently is going to break one way or another. However, when
comparing URLs, those are two distinct URLs.

No, those URLs are equivalent. From RFC 3986, in the section about how to
compare URIs for equality:

  6.2.2.2. Percent-Encoding Normalization

  The percent-encoding mechanism (Section 2.1) is a frequent source of
  variance among otherwise identical URIs. In addition to the case
  normalization issue noted above, some URI producers percent-encode octets
  that do not require percent-encoding, resulting in URIs that are
  equivalent to their non-encoded counterparts. These URIs should be
  normalized by decoding any percent-encoded octet that corresponds to an
  unreserved character, as described in Section 2.3.

6.2.2 (Syntax-based Normalization) begins with "Implementations may",
not "Implementations must". RFC 3986 doesn't specify which way HTTP
implementations fall on this one. RFC 2616 (HTTP 1.1) 3.2.3 URI
Comparison states:

When comparing two URIs to decide if they match or not, a client SHOULD
use a case-sensitive octet-by-octet comparison of the entire URIs, with
these exceptions:

- A port that is empty or not given is equivalent to the default
port for that URI-reference;
- Comparisons of host names MUST be case-insensitive;
- Comparisons of scheme names MUST be case-insensitive;
- An empty abs_path is equivalent to an abs_path of "/".

Characters other than those in the "reserved" and "unsafe" sets (see
RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
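
As a partial sketch of that comparison in Java: the class HttpUriMatch is
hypothetical, it assumes absolute http URIs that have a host (so 80 is
the default port), it ignores userinfo and fragments, and it does NOT
implement the final rule about %-encoded characters outside the
"reserved" and "unsafe" sets. Everything not listed as an exception is
compared octet by octet.

```java
import java.net.URI;

// Hypothetical helper: approximates the RFC 2616 3.2.3 comparison for
// absolute http URIs. The %-encoding equivalence rule is left out.
class HttpUriMatch {

    static boolean matches(String a, String b) {
        URI x = URI.create(a);
        URI y = URI.create(b);
        return x.getScheme().equalsIgnoreCase(y.getScheme())   // scheme: case-insensitive
            && x.getHost().equalsIgnoreCase(y.getHost())       // host: case-insensitive
            && port(x) == port(y)
            && path(x).equals(path(y))                         // path: case-sensitive
            && same(x.getRawQuery(), y.getRawQuery());         // query: octet by octet
    }

    // A port that is not given is equivalent to the default (assumed 80).
    static int port(URI u) {
        return u.getPort() == -1 ? 80 : u.getPort();
    }

    // An empty abs_path is equivalent to "/".
    static String path(URI u) {
        String p = u.getRawPath();
        return (p == null || p.length() == 0) ? "/" : p;
    }

    // Null-safe string equality, since a URI may have no query at all.
    static boolean same(String s, String t) {
        return s == null ? t == null : s.equals(t);
    }

    public static void main(String[] args) {
        System.out.println(matches("HTTP://Example.COM", "http://example.com:80/"));
        System.out.println(matches("http://example.com/foo?a+b", "http://example.com/foo?a%20b"));
    }
}
```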

The "reserved" set is section 2.2 of RFC 3986:

reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="

So, ...?q=a%2bb and ...?q=a+b are *not* equivalent, since + is in the
"reserved" set. Neither are ...?q=a%20b and ...?q=a+b, since %20 is
not a hex encoding of +. (RFC 3986 does not impose that the +
separator must be used for spaces, just that + is a separator.)
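
For what it's worth, java.net.URI behaves the same way, as far as I can
tell from its documentation: equals() compares the raw query, ignoring
only the case of hex digits in escapes. The sketch below (QueryEquivalence
is a made-up name, and formValue assumes the query is exactly "q=<value>"
in UTF-8) shows the URI-level comparison next to the form-level decoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo: URI-level equality versus form-level equality.
class QueryEquivalence {

    // URI layer: compares raw components, so + and %20 stay distinct.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    // Form layer: decodes the value of a query of the form "q=<value>".
    static String formValue(String url) {
        String raw = URI.create(url).getRawQuery();
        try {
            return URLDecoder.decode(raw.substring("q=".length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not supported: " + e);
        }
    }

    public static void main(String[] args) {
        String plus = "http://www.google.co.uk/search?q=a+b";
        String pct20 = "http://www.google.co.uk/search?q=a%20b";
        String pct2b = "http://www.google.co.uk/search?q=a%2bb";
        System.out.println(sameUri(plus, pct20));                        // false
        System.out.println(sameUri(plus, pct2b));                        // false
        System.out.println(formValue(plus).equals(formValue(pct20)));    // true
        System.out.println(formValue(plus).equals(formValue(pct2b)));    // false
    }
}
```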

And since RFC 2616 only says "a client SHOULD" regarding the
exceptions, you can only reliably trust byte-for-byte identical URIs
to be treated as equivalent. Byte-for-byte comparison is also what is
used in some places where http: URIs show up outside of HTTP, like XML
namespaces (see http://www.w3.org/TR/xml-names/#NSNameComparison).

And on this, for once, Google at least agrees with me - try these:

http://www.google.co.uk/search?q=a+b
http://www.google.co.uk/search?q=a%20b

Different URLs, logically-different resources (cached separately,
except Google results are not cacheable), identical content - which is
what I was trying to say was the Right Thing.

The real-world implementations of these rules are usually what I
described. The various RFCs and W3C recommendations are woefully
loose, but in this case it's pretty cut and dried.

-o