Re: Unrecognized escape sequences in string literals



On Mon, 10 Aug 2009 00:32:30 -0700, Douglas Alan wrote:

In C++, if I know that the code I'm looking at compiles, then I never
need worry that I've misinterpreted what a string literal means.

If you don't know what your string literals are, you don't know what your
program does. You can't expect the compiler to save you from semantic
errors. Adding escape codes into the string literal doesn't change this
basic truth.

Semantics matters, and unlike syntax, the compiler can't check it.
There's a difference between a program that does the equivalent of:

os.system("cp myfile myfile~")

and one which does this

os.system("rm myfile myfile~")


The compiler can't save you from typing 1234 instead of 11234, or 31.45
instead of 3.145, or "My darling Ho" instead of "My darling Jo", so why
do you expect it to save you from typing "abc\d" instead of "abc\\d"?

Perhaps it can catch *some* errors of that type, but only at the cost of
extra effort required to defeat the compiler (forcing the programmer to
type \\d to prevent the compiler complaining about \d). I don't think the
benefit is worth the cost. You and your friend do. Who is to say you're
right?



At
least not if it doesn't have any escape characters in it that I'm not
familiar with. But in Python, if I see, "\f\o\o\b\a\z", I'm not really
sure what I'm seeing, as I surely don't have committed to memory some of
the more obscure escape sequences. If I saw this in C++, and I knew that
it was in code that compiled, then I'd at least know that there are some
strange escape codes that I have to look up.

And if you saw that in Python, you'd also know that there are some
strange escape codes that you have to look up. Fortunately, in Python,
that's really simple:

"\f\o\o\b\a\z"
'\x0c\\o\\o\x08\x07\\z'

Immediately you can see that the \o and \z sequences resolve to
themselves, and the \f \b and \a don't.
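The same check can be scripted rather than typed at the prompt. This is only a sketch: the helper name `literal_value` is made up for illustration, and it silences warnings because recent Python 3 releases warn about unrecognized escapes such as \o and \z.

```python
import ast
import warnings


def literal_value(source):
    """Evaluate the source text of a string literal and return its value."""
    with warnings.catch_warnings():
        # Recent Python versions warn about unrecognized escapes.
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


# The raw string holds the literal exactly as it would appear in source code.
source = r'"\f\o\o\b\a\z"'
value = literal_value(source)
print(repr(value))  # '\x0c\\o\\o\x08\x07\\z'
```

In the repr, every doubled backslash marks a real backslash in the string; \x0c, \x08 and \x07 are the single characters that \f, \b and \a resolved to.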



Unlike with Python, it
would never be the case in C++ code that the programmer who wrote the
code was just too lazy to type in "\\f\\o\\o\\b\\a\\z" instead.

But if you see "abc\n", you can't be sure whether the lazy programmer
intended "abc"+newline, or "abc"+backslash+"n". Either way, the compiler
won't complain.
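To illustrate, both spellings below are accepted without complaint, yet they are different strings. The variable names are only for this example.

```python
with_newline = "abc\n"     # a, b, c, then a newline character
with_backslash = "abc\\n"  # a, b, c, a backslash, then the letter n

print(len(with_newline), repr(with_newline))      # 4 'abc\n'
print(len(with_backslash), repr(with_backslash))  # 5 'abc\\n'
print(with_newline == with_backslash)             # False
```

Nothing in the syntax says which of the two the programmer meant.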



You just have to memorize it. If you don't know what a backslash escape
is going to do, why would you use it?

(1) You're looking at code that someone else wrote, or (2) you forget to
type "\\" instead of "\" in your code (or get lazy sometimes), as that
is okay most of the time, and you inadvertently get a subtle bug.

The same error can occur in C++, if you intend \\n but type \n by
mistake. Or vice versa. The compiler won't save you from that.



This is especially important when reading (as opposed to writing) code.
You read somebody else's code, and see "foo\xbar\n". Let's say you know
it compiles without warning. Big deal -- you don't know what the escape
codes do unless you've memorized them. What does \n resolve to? chr(13)
or chr(97) or chr(0)? Who knows?

It *is* a big deal. Or at least a non-trivial deal. It means that you
can tell just by looking at the code that there are funny characters in
the string, and not just backslashes.

I'm not entirely sure why you think that's a big deal. Strictly speaking,
there are no "funny characters", not even \0, in Python. They're all just
characters. Perhaps the closest is newline (which is pretty obvious).



You don't have to go running for
the manual every time you see code with backslashes, where the upshot
might be that the programmer was merely saving themselves some typing.

Why do you care if there are "funny characters"?

In C++, if you see an escape you don't recognize, do you care? Do you go
running for the manual? If the answer is No, then why do it in Python?

And if the answer is Yes, then how is Python worse than C++?


[...]
Also, it seems that Python is being inconsistent here. Python knows that
the string "\x" doesn't contain a full escape sequence, so why doesn't
it treat the string "\x" the same way that it treats the string "\z"?
[...]
I.e., "\z" is not a legal escape sequence, so it gets left as "\\z".

No. \z *is* a legal escape sequence, it just happens to map to \z.

If you stop thinking of \z as an illegal escape sequence that Python
refuses to raise an error for, the problem goes away. It's a legal escape
sequence that maps to backslash + z.



"\x" is not a legal escape sequence. Shouldn't it also get left as
"\\x"?

No, because it actually is an illegal escape sequence.
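A rough way to see the difference, assuming a Python version that still accepts unrecognized escapes like \z (recent releases warn about them, hence the filter). The helper name `try_literal` is hypothetical.

```python
import ast
import warnings


def try_literal(source):
    """Return the literal's value, or the SyntaxError raised while parsing it."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        try:
            return ast.literal_eval(source)
        except SyntaxError as exc:
            return exc


z_result = try_literal(r'"\z"')  # source text: "\z"
x_result = try_literal(r'"\x"')  # source text: "\x"

print(repr(z_result))            # '\\z', i.e. backslash + z
print(type(x_result).__name__)   # SyntaxError
```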



He's particularly annoyed too, that if he types "foo\xbar" at the
REPL, it echoes back as "foo\\xbar". He finds that to be some sort of
annoying DWIM feature, and if Python is going to have DWIM features,
then it should, for example, figure out what he means by "\" and not
bother him with a syntax error in that case.

Now your friend is confused. This is a good thing. Any backslash you
see in Python's default string output is *always* an escape:

Well, I think he's more annoyed that if Python is going to be so helpful
as to put in the missing "\" for you in "foo\zbar", then it should put
in the missing "\" for you in "\". He considers this to be an
inconsistency.

(1) There is no missing \ in "foo\zbar".

(2) The problem with "\" isn't a missing backslash, but a missing end-
quote.
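Point (2) can be checked with a small sketch; `parses` is a made-up helper. The source text is built by concatenation so that the example itself contains no tricky literals.

```python
import ast


def parses(source):
    """Return True if the source text is a complete, valid literal."""
    try:
        ast.literal_eval(source)
        return True
    except SyntaxError:
        return False


lone = '"' + "\\" + '"'       # the three characters "\"
doubled = '"' + "\\\\" + '"'  # the four characters "\\"

print(parses(lone))     # False: the \" escapes the quote, so the string never ends
print(parses(doubled))  # True: a string holding one backslash
```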





Me, I'd never, ever, EVER want a language to special-case something at
the end of a string, but I can see that from his new-to-Python
perspective, Python seems to be DWIMing in one place and not the other,
and he thinks that it should either do no DWIMing at all, or
consistently DWIM. To not be consistent in this regard is "inelegant",
says he.

Python isn't DWIMing here. The rules are simple and straightforward,
there's no mind-reading or guessing required. There is no heuristic
trying to predict what the user intends. It's a simple rule:

When parsing a string literal (apart from raw strings), if you see a
backslash, then grab the next token (usually a single character, but for
\x and \0 it could be multiple characters). If there is a mapping
available for that token, insert that in the string being built, and if
not, insert the backslash and the token.

(As I said earlier, this may not be precisely how it is implemented, but
functionally, it is what Python does.)
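For the curious, here is a minimal sketch of that rule. It is not how CPython implements it, the names `SIMPLE` and `unescape` are invented for the example, and it leaves out \N{...}, \u, \U and backslash-newline to stay short.

```python
# One-character escapes that have a mapping.
SIMPLE = {
    "\\": "\\", "'": "'", '"': '"',
    "a": "\a", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}
HEX = set("0123456789abcdefABCDEF")
OCT = set("01234567")


def unescape(body):
    """Apply the backslash rule to the text between the quotes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of literal")
        nxt = body[i + 1]
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif nxt == "x":
            # \x must be followed by exactly two hex digits.
            digits = body[i + 2:i + 4]
            if len(digits) != 2 or not set(digits) <= HEX:
                raise ValueError("truncated \\xXX escape")
            out.append(chr(int(digits, 16)))
            i += 4
        elif nxt in OCT:
            # Octal escape: one to three octal digits.
            j = i + 1
            while j < len(body) and j < i + 4 and body[j] in OCT:
                j += 1
            out.append(chr(int(body[i + 1:j], 8)))
            i = j
        else:
            # No mapping available: keep the backslash and the character.
            out.append("\\" + nxt)
            i += 2
    return "".join(out)


print(repr(unescape(r"\f\o\o\b\a\z")))  # '\x0c\\o\\o\x08\x07\\z'
print(repr(unescape(r"foo\zbar")))      # 'foo\\zbar'
```

Note that there is no guessing anywhere in it: each branch is a table lookup or a fixed-format parse, and `unescape(r"\x")` raises because \x without two hex digits has no valid reading.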


And I can see his point that allowing "foo\zbar" and "foo\\zbar" to be
synonymous is a form of DWIMing.

Is it "a form of DWIMing" to consider 1.234e1 and 12.34 synonymous?

What about 68 and 0x44? Is that DWIMing?
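Both kinds of equivalence are easy to confirm. In this sketch the hexadecimal literal 0x44 is compared with its decimal value, 68; the variable names are only illustrative.

```python
same_float = (1.234e1 == 12.34)  # two spellings of the same float
same_int = (0x44 == 68)          # two spellings of the same integer

print(same_float)  # True
print(same_int)    # True
```

Different source text, same value, and a fixed rule deciding it each time.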

I'm sure both you and your friend are excellent programmers, but you're
tossing around DWIM as a meaningless term of opprobrium without any
apparent understanding of what DWIM actually is.




--
Steven
