Re: urllib interpretation of URL with ".."



John Nagle <nagle@xxxxxxxxxxx> writes:

Duncan Booth wrote:
"Martin v. Löwis" <martin@xxxxxxxxxxx> wrote:


Is "urllib" wrong?

Section 5.2 of RFC 2396 is also relevant here. In particular:


g) If the resulting buffer string still begins with one or more
complete path segments of "..", then the reference is
considered to be in error. Implementations may handle this
error by retaining these components in the resolved path (i.e.,
treating them as part of the final URI), by removing them from
the resolved path (i.e., discarding relative levels above the
root), or by avoiding traversal of the reference.


The common practice seems to be for client-side implementations to
handle this using option 2 (removing them) and servers to use option
3 (avoiding traversal of the reference). urllib uses option 1, which
is also correct but not as useful as it might be.

That's helpful. Thanks.

In Python, of course, "urlparse.urlparse", the main function used to
disassemble a URL, has no idea whether it's being used by a client or a
server, so it, reasonably enough, takes option 1.

(Yet another hassle in processing real-world HTML.)
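
For illustration, the three ways of handling excess ".." segments listed
in step g) above could be sketched roughly as follows. This is not
urllib's code: the function name resolve_dot_segments and the policy
names "keep", "drop" and "refuse" are made up for this example, and it
only handles an absolute path, not a full URL.

```python
def resolve_dot_segments(path, on_excess="keep"):
    """Collapse "." and ".." segments in an absolute path.

    Hypothetical helper, not part of urllib. on_excess selects what
    happens to ".." segments that would climb above the root:
      "keep"   - retain them in the result      (option 1)
      "drop"   - discard them                   (option 2)
      "refuse" - raise ValueError, no traversal (option 3)
    """
    if not path.startswith("/"):
        raise ValueError("expected an absolute path: %r" % (path,))
    if on_excess not in ("keep", "drop", "refuse"):
        raise ValueError("unknown policy: %r" % (on_excess,))

    segments = path.split("/")[1:]  # skip the empty string before the first "/"
    excess = []    # ".." segments above the root, kept only under "keep"
    resolved = []  # ordinary segments resolved so far

    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if resolved:
                resolved.pop()
            elif on_excess == "keep":
                excess.append("..")
            elif on_excess == "refuse":
                raise ValueError("reference climbs above the root: %r" % (path,))
            # under "drop" the segment is silently discarded
            continue
        resolved.append(seg)

    # A trailing "." or ".." names a directory, so keep the trailing slash.
    if segments[-1] in (".", ".."):
        resolved.append("")
    return "/" + "/".join(excess + resolved)
```

With this sketch, resolve_dot_segments("/../../g", "keep") leaves the
path as "/../../g", "drop" gives "/g", and "refuse" raises ValueError.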

Note that RFC 3986 obsoletes RFC 2396, and attempts to codify current
good practice re generic URL syntax (URI and relative reference
syntax, to use the precise terminology of the RFC). It discusses
normalisation at length, quite sensibly and pragmatically. And very
readable and useful it is too.

Somebody submitted a module implementing the URL splitting / joining
algorithms specified in RFC 3986 for inclusion in Python 2.6 -- I
haven't looked at that recently...
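
As a rough idea of what the splitting half involves (this is not the
submitted module, which I haven't looked at), RFC 3986 Appendix B gives
a regular expression for breaking a URI reference into its five
components. A minimal sketch, where the name split_uri and the
dictionary keys are my own choices:

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(uri):
    """Split a URI reference into its five RFC 3986 components.

    Hypothetical helper for illustration. Components that are absent
    come back as None; the path is always a string (possibly empty).
    """
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

Note that splitting alone does nothing about "..": split_uri("http://a/../g")
reports the path as "/../g", and it is the separate resolution step that
decides what to do with it.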

See also RFC 3987.


John