Re: coerce for arbitrary types



From: Scott Burson <FSet....@xxxxxxxxx>
If I understand the hyperspec correctly, then there is no way to extend
coerce to accept any other result-type than the ones explicetly [sic] listed.
From: Kent M Pitman <pit...@xxxxxxxxxxx>
Why do you need to do this? This is often a sign of a program being
structured wrong--trying to get too much information out of too few bits.
http://www.nhplace.com/kent/PS/EQUAL.html#two-bit_rule

Well maybe that's sorta relevant, but the main essay (below) is
more to the point, IMO.

In fact, the entire article
http://www.nhplace.com/kent/PS/EQUAL.html
in which that rule resides builds up to the relevant point, which
is that in dynamic languages, additional care is required in order
to infer "intent" because the fact of a representation's use is not
the same as the intent of a representation's use, and you can only
dynamically query about the fact of its use, not the intent.

IMO that's primarily because we pass raw objects, rather than
intention-tagged objects, as parameters to functions. We do this
for efficiency, but there's a tradeoff that the runtime system
doesn't have sufficient information to know which of many possible
intentional types is mapped to the given runtime-implementational
type. But this problem isn't limited to dynamic languages, because
even in static languages people use the same declared datatype to
represent several different kinds of data-object, because the
programming language doesn't have enough declaration datatypes to
cover all the possible intentions one-to-one. Consider for example
that most languages have a single datatype to cover:
- Unsigned integer (of some particular length)
- Bitmask (of some particular number of bits)
- Boolean (using some convention to map two integer/bitmask values
to the two values True/False, and somehow gracefully dealing with
other values that may occur in lieu of the "correct" values, often
*deliberately* allowing all those other values to represent one of
the boolean values to allow some shortcuts in coding).

Then there's the most extreme case of void* in C.

Now in Common Lisp, all we'd need to do to demark the boundary
between container and element in structures based on the CONS cell
would be to wrap the overall container with an extra layer of
intent, then the generic function can use that info to decide what
operation we *really* want to do upon some subset of the CONS cells
dangling from the toplevel container pointer. But this increases
the overhead hence slowness of execution, and prevents sharing
common structure between multiple containers of the same type.
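
A minimal sketch of that extra layer of intent, in case it helps. The
names ALIST-BOX, PLIST-BOX and LOOKUP are hypothetical, made up for
this illustration; the point is only that the wrapper's type, not the
CONS cells inside it, tells the generic function what to do.

```lisp
;; Hypothetical wrapper types: each tags the same kind of CONS structure
;; with the intent the programmer has for it.
(defstruct (alist-box (:constructor make-alist-box (cells))) cells)
(defstruct (plist-box (:constructor make-plist-box (cells))) cells)

(defgeneric lookup (key container)
  (:documentation "Find the value stored under KEY, according to CONTAINER's intent."))

;; The wrapped list is treated as ((key . value) ...).
(defmethod lookup (key (container alist-box))
  (cdr (assoc key (alist-box-cells container))))

;; The wrapped list is treated as (key value key value ...).
(defmethod lookup (key (container plist-box))
  (getf (plist-box-cells container) key))

;; (lookup :b (make-alist-box '((:a . 1) (:b . 2))))  ; => 2
;; (lookup :b (make-plist-box '(:a 1 :b 2)))          ; => 2
```

Each wrapper costs one extra object and one extra indirection per
container, which is the overhead mentioned above.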

Likewise an extra layer of intent around a primitive type such as
integer can tell us whether it's to be treated as a number or a
bitmask. Here the overhead of the extra wrapper is perhaps more
extreme.

And in static languages, we might use a typedef to add the
intentional information, which a polymorphic compiler could use to
choose the correct function to apply to the data. (I'm pretty sure
K&R C doesn't support this, whereas C++ does. I'm not sure whether
Objective Think C, which I used briefly in 1992, does or doesn't.)

This is not a bug in dynamic languages because it buys
additional, much needed flexibility in other situations.

Agreed, but not just in dynamic languages. Consider an integer being
used as a bitmask because there is no built-in bitmask datatype, or
because the compiler can't handle two functions by the same name
that take different-typed arguments (i.e. the compiler isn't
polymorphic).

But in the case here, you have to be careful to either pass more
information or not to infer so much information from what is
passed.

Agreed. I think the main lesson is for newbies to realize that pair
and list and alist and property-list and tree all look exactly the
same to the runtime system, a pointer to a single CONS cell that
has more or less "massive quantities of" stuff hanging off it,
perhaps exactly the *same* stuff, perhaps decisively different
stuff but it would require too much CPU time to explore a structure
deeply to decide which stuff hangs off it every time a toplevel
operation is to be performed on "it" that is supposed to do
different things depending on the intention of the programmer as
expressed by "all that stuff" hanging off it.

So the newbie must learn to call different functions to deal with
each of the various kinds of structures all hanging off the same
exact kind of CONS cell, or:

Sometimes more information means "another argument" and sometimes
it means "a different representation of the data than the one
you're trying to coerce".

As for "another argument" as a solution: In general this requires
the programmer write his/her own utility function which takes that
extra argument (parameter) and dispatches to the appropriate
built-in or programmer-defined single-case utility.
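
Something like the following is what I have in mind, as a rough
sketch only. TO-STRING and its three intent keywords are hypothetical
names invented here; the built-ins it dispatches to (SYMBOL-NAME, MAP,
PRINC-TO-STRING) are standard.

```lisp
(defun to-string (object intent)
  "Convert OBJECT to a string; INTENT says how the caller regards OBJECT."
  (ecase intent
    ;; OBJECT is a symbol and we want its print name.
    (:symbol-name (symbol-name object))
    ;; OBJECT is a sequence of characters (possibly empty).
    (:characters  (map 'string #'identity object))
    ;; OBJECT is anything at all and we want its printed form.
    (:printed     (princ-to-string object))))

;; (to-string 'foo :symbol-name)            ; => "FOO"
;; (to-string '(#\F #\O #\O) :characters)   ; => "FOO"
;; (to-string nil :symbol-name)             ; => "NIL"
;; (to-string nil :characters)              ; => ""
```

ECASE signals an error on an unknown intent, so a caller can't fall
through to some default guess silently.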

Your (Pitman's) essay on intention of data structure deals mostly
only with CONS-based structures. But the OP and similar threads
lately have dealt with floating point values converted to
rationals. Here the ambiguity isn't so much the intentional *type*
of the data, but rather the precision and accuracy. After all, a
floating point value is **supposed** to be only an approximation,
not an exact value, whereas a rational is supposed to be an exact
value. So when the runtime system sees only a floating-point value,
how is the system supposed to know how [in]accurate the value is?
Any good programmer (cough cough) performs numerical analysis to
learn the tolerance for every floating-point approximation within
his/her program at every point during execution, right? So at the
point where a floating-point approximation is to be printed out, or
converted to a rational, the programmer can include code to
explicitly take that known error-margin into consideration, right?

Common Lisp provides only two built-in conversions:
- (rational <float>) assumes the tolerance is zero, the value is *exact*.
- (rationalize <float>) assumes the data value was entered by
parsing from a string, and no further processing was done after
parsing, and the string before parsing was exactly the correct
decimal-fraction, and the floating-point precision was sufficient
to distinguish that decimal fraction from any other fraction
whatsoever of equal or smaller denominator.
Obviously neither assumption is correct in most cases where a
floating-point value is converted to a rational. Even when the data
value was obtained immediately from parsing string representation,
we have to beware of the limit on precision available in various
floating-point internal representations:

(rationalize 1.237) => 1237/1000 (good)
(rationalize 1.2373) => 7039/5689 (whoops, not enough bits in single-float)

(rationalize 1.2373d0) => 12373/10000 (good)
...
(rationalize 1.2372583d0) => 12372583/10000000 (good)
(rationalize 1.23725683d0) => 101747482/82236347 (whoops, not enough bits in double-float)
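
To make the "known error-margin" idea concrete, here is a minimal
sketch of a conversion that takes the tolerance as an explicit
argument. RATIONAL-WITHIN is a hypothetical name, not a standard
function. It walks the continued-fraction convergents of the exact
value and returns the first one within the tolerance; that is not
guaranteed to be the rational with the smallest possible denominator.

```lisp
(defun rational-within (x tolerance)
  "Return the first continued-fraction convergent of X within TOLERANCE of X."
  (let ((exact (rational x))
        (tol (rational tolerance)))
    ;; P1/Q1 is the current convergent, P0/Q0 the previous one.
    (do* ((tail exact)
          (p0 0) (q0 1)
          (p1 1) (q1 0))
         (nil)                          ; loop exits only via RETURN
      (multiple-value-bind (a frac) (floor tail)
        ;; Standard recurrence: new = a * current + previous.
        (psetq p0 p1
               q0 q1
               p1 (+ (* a p1) p0)
               q1 (+ (* a q1) q0))
        (when (or (zerop frac)
                  (<= (abs (- (/ p1 q1) exact)) tol))
          (return (/ p1 q1)))
        (setq tail (/ frac))))))

;; (rational-within 3.14159d0 1/100)  ; => 22/7
;; (rational-within 0.5 0)            ; => 1/2
```

With a tolerance of zero this degenerates to (rational <float>), since
the expansion of an exact rational always terminates.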

So, for example:
IMO [not allowing coerce to be extended is] a good idea, because coerce
is already a crock from the start. It is not possible to change something
from the type it was to the type you want to be
This isn't the reason it's a "crock". That's just the reason it has a
bad name. The definition has issues that make it not easily
extendable independent of its name, and those issues derive from the
nature of the language. In brief, COERCE is too short a name and
falsely gives the sense that it is a canonically defined operation
with only one meaning rather than an operation that has several
(perhaps myriad) friends that are all vying for the cool short-name.

Ah, you are more diplomatic than I am. Indeed there simply
shouldn't be a function by that name, or it should be more limited
in cases than it currently is.

If these were domain names, not function names, all the names
coerce.org, coerce.net, coerce.cc, coerce.tv, etc. would all be
bought up and people would be assigning them subtly different
meanings. But when there's no trailing GTLD to remind us, it's easy
to forget that there was a competition in play.

Hmm, nice way of looking at the issue.

The real problem is that something you can customize needs to be
defined in such a way that it's clear what customizing it would mean.

I.e. "coerce" is too vague to clearly say what it's supposed to do
in general when there are multiple reasonable things a function
might do? (Be glad you actually need to call a function to convert
a value from one type to a reasonable value of another type.
Compare with languages such as C where simply assigning a value to
a different type of variable automatically performs some kind of
conversion, or even C++ where this issue was clarified in some
cases but left as automatic conversion in other cases. At least in
Lisp we can argue about what a named function should do, rather
than what an "assignment" operator should also do in addition to
its primary task of assignment.)

For historical reasons, COERCE is already messed up because of
the NIL vs () issue [note well: NOT the issue commonly fussed
over, which is the false vs. empty list issue].

Hmm, on that non-issue, I can imagine that COERCE applied to a
sequence, with target type not any sequence type, might map down
the list like MAP, converting each element into the requested type,
putting the elements into a new sequence of the same type as the
old sequence as much as possible. For example, if (coerce <int>
'character) performed (code-char <int>), then (coerce <list-of-ints>
'character) might perform (map 'vector #'code-char
<list-of-ints>).
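
A sketch of that imagined element-wise behavior, written as a separate
function rather than as a change to COERCE. COERCE-ELEMENTS is a
hypothetical name; the integer-to-character step uses CODE-CHAR because
standard COERCE does not convert integers to characters.

```lisp
(defun coerce-elements (sequence element-type)
  "Hypothetical helper: convert each element of SEQUENCE to ELEMENT-TYPE."
  (map 'vector
       (lambda (element)
         (if (and (integerp element)
                  (subtypep element-type 'character))
             (code-char element)            ; COERCE itself will not do this
             (coerce element element-type)))
       sequence))

;; (coerce-elements '(70 79 79) 'character)
;;   => #(#\F #\O #\O) on implementations with ASCII-compatible character codes
;; (coerce-elements '(1 2) 'float)   ; => #(1.0 2.0)
```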

Back to the main issue: It's indeed unfortunate that early versions
of Lisp tried to get by on very few types of objects (fixnums, CONS
cells, symbols, nothing else that I can recall), so of course the
idea of having yet another data type reserved solely for a single
object that indicated an empty list, which read and printed as (),
would have seemed extravagant. So somebody needed to decide whether
the fixnum 0 or the symbol NIL would denote the empty list (hence
the last CDR of a non-empty list), and NIL won. I suppose the idea
was that it would be useful to distinguish between any number
whatsoever, including zero, and an empty list, and hence it would
be a pain if some particular number, especially a commonly-used
value such as zero were to be the indicator of empty list, whereas
NIL was such a strangely named symbol that it wouldn't turn up in
normal use, so it could take on this role without causing lots of
trouble? Anybody who really needed the symbol NIL for any other
purpose would just have to deal with the consequences, while the
majority wouldn't ever suffer a problem? And after 25 years during
which a huge amount of working code had come to assume that NIL
denotes the empty list, the idea of changing that fact of life when
merging MacLisp and LispMachineLisp etc. into Common Lisp seemed
politically infeasible, right?

The simple case of:
(coerce nil 'string) => ""
(string nil) => "NIL"
illustrates the problem.

IMO since STRING needs to interpret the argument as some sort of
sequence of objects that each can be coerced into a character, it
needs to decide whether NIL is to be interpreted as an empty list
of objects which *would* have been convertible to characters if
there had been any in the list, or as a symbol whose print-name
should be used. If STRING worked like this:
(string '(#\F #\O #\O)) => "FOO"
then it would have been reasonable for this to also work:
(string '()) => ""
i.e.
(string NIL) => ""
But that kind of conversion isn't defined at all, whereas this *is* defined:
(string 'FOO) => "FOO"
So the only reasonable option is:
(string 'NIL) => "NIL"
and since NIL evaluates to itself, the quoting isn't necessary:
(string NIL) => "NIL"
Now if conversion from list of characters to string was in fact
implemented, then we'd have an ambiguity as to what (string NIL)
should return, and I'd prefer it to signal an error, suggesting
alternate code to make clear the intent of the programmer:
(map 'string 'identity '(#\F #\O #\O)) => "FOO"
(map 'string 'identity '()) => ""
(map 'string 'identity nil) => ""
(symbol-name 'FOO) => "FOO"
(symbol-name 'NIL) => "NIL"

The reason this happens by the way is that STRING plainly does
not work on arbitrary sequences, so there's no confusion that
STRING is not asking to operate on sequences.

Agreed.

But COERCE does take sequences, and empty sequences were common
and useful. The progression
(coerce '(a b) 'vector)
=> #(A B)
(coerce '(a) 'vector)
=> #(A)
(coerce '() 'vector)
=> #(#\N #\I #\L)
would have been devastating to getting any work done.

Although coerce may be "nice" as a shortcut for lazy programmers,
really IMO the programmer should have written:
(map 'vector #'identity '(a b))
(map 'vector #'identity '(a))
(map 'vector #'identity '())

Now consider this progression:
(coerce '(#\F #\O #\O) 'vector)
=> #(#\F #\O #\O)
(coerce * 'string)
=> "FOO"
(coerce '(#\F #\O #\O) 'string)
Looks like it should work, by transitivity, right? If the list is
considered "the same" as the vector, except for implementation
type, such that there's a default conversion from one to the other,
and if the vector is considered "the same" as the string, likewise
default conversion, then why aren't the list and string considered
"the same" too??
(coerce "FOO" 'vector)
=> "FOO" ;Because string is a sub-type of vector, so no conversion is needed.
(coerce #(#\F #\O #\O) 'list)
=> (#\F #\O #\O)
(coerce "FOO" 'list)
=> (#\F #\O #\O)
Wow, it works in the opposite direction all the way from string to list!
(coerce "" 'list)
=> NIL
Yes, Virginia, there really is a reverse coercion available!

Putting it all together:
(string (coerce "" 'list)) => "NIL"
That one little snippet of code is terse+deep enough to be a newbie koan!

In effect, you are meant to make the choice between STRING and
(COERCE x 'STRING) somewhat on the basis of understanding how you
want this particular tie to be broken.

IMO that is a perverse bit of trivia for Lisp programmers to need
to memorize and not screw up. IMO it's better that neither STRING
nor COERCE be used at all, ever, in production code. Now since most
of the string-handling functions implicitly apply STRING to their
supposedly-string arguments, we must understand the three cases
where STRING works (string to itself, character to single-character
string, or symbol to print name). But still, directly calling
STRING from application code seems unnecessary. Because the string
functions are somewhat generic, working equally well with actual
strings as well as with characters or symbols, and because they don't
know the intention of the programmer, this kind of "magic" coercion
to the proper type may be necessary there. But when a programmer is
working with characters as if strings, or with symbols as if their
print-names, I see no reason the programmer can't make his/her
intention explicit by coding what is really intended, either
(format nil "~A" <char>) or (symbol-name <sym>) instead of asking
the magic genie STRING to do "what's right" which as we see above
isn't always what's right.

But if we were to make it open season on extending this, in the
presence of a dynamic language with multiple inheritance, this kind
of problem would happen all the time, and the result would be
ragged... absent the intent information I alluded to above.

Indeed, if the previous discussion wasn't enough to convince the
diehard genie-user to stop using the magic genie and clearly say
exactly what data conversion is desired, this nuclear weapon of
clue-by-four should suffice.

Now I was almost going to make a general recommendation to avoid
using shortcuts that disguise intent. For example, you want to do
one thing if the value of x is nil and something else otherwise,
you should say:
(if (null x) <FirstThing> <SecondThing>)
or else
(if (not (null x)) <SecondThing> <FirstThing>)
rather than
(if x <SecondThing> <FirstThing>)
But then I realized I take a similar shortcut frequently when
parsing strings by scanning for successive occurrences of
particular delimiting characters or white-space. IX0 is the starting
index, i.e. the index where I left off the previous time around
the loop, like this:
(unless (setq ix1 (position #\! str :start ix0))
<FinalCode>)
(setq ix2 (or (position #\Space str :start ix1)
(length str)))
(push (subseq str (+ 1 ix1) ix2) revres)
It's the first parameter to OR, which returns NIL if the character
isn't found until end of string, where I'm taking the shortcut.
I don't see any cleaner way to write that kind of code, so maybe I
can think of guidelines for when to use the shortcut that non-NIL
is treated as true and when to make the intent more explicit,
rather than saying to try to always make intent explicit?
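
For comparison, here is a self-contained sketch of that kind of scanning
loop with the end-of-input test spelled out as (null ix1), while the OR
shortcut is kept for the "space or end of string" case. BANG-WORDS is a
hypothetical name, and the convention that a word runs from just after
each exclamation mark to the next space is only an assumption for the
example.

```lisp
(defun bang-words (str)
  "Collect the text between each exclamation mark and the following space."
  (let ((ix0 0)          ; where the previous pass left off
        (result '()))
    (loop
      (let ((ix1 (position #\! str :start ix0)))
        ;; POSITION returns NIL when no more delimiters remain.
        (when (null ix1)
          (return (nreverse result)))
        ;; Shortcut kept here: NIL from POSITION means "runs to the end".
        (let ((ix2 (or (position #\Space str :start ix1)
                       (length str))))
          (push (subseq str (+ 1 ix1) ix2) result)
          (setq ix0 ix2))))))

;; (bang-words "say !foo and !bar")  ; => ("foo" "bar")
```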

Now back to what you said about class inheritance etc. It seems to
me that when somebody is developing a domain-specific bunch of
code, it's reasonable to define some default coercions that make
sense for that specific domain, but *not* expect the underlying
Lisp system to already provide such custom intent built-into
functions such as COERCE. Let the domain-specific programmer write
a domain-specific coercion function him/herself, and make such a
function consistent for just that one domain, and not over-reach.
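
As a rough sketch of what such a domain-specific coercion might look
like: the MONEY structure and the DOMAIN-COERCE generic function below
are hypothetical, invented for this example, and the string format
assumes a non-negative amount.

```lisp
;; Assumed domain: amounts of money stored as a whole number of cents.
(defstruct money (cents 0 :type integer))

(defgeneric domain-coerce (object target)
  (:documentation "Convert OBJECT to the representation named by TARGET, for this one domain only."))

;; Exact value in whole currency units.
(defmethod domain-coerce ((object money) (target (eql 'rational)))
  (/ (money-cents object) 100))

;; Display form with two decimal places (non-negative amounts only).
(defmethod domain-coerce ((object money) (target (eql 'string)))
  (multiple-value-bind (units cents) (floor (money-cents object) 100)
    (format nil "~D.~2,'0D" units cents)))

;; (domain-coerce (make-money :cents 1234) 'string)    ; => "12.34"
;; (domain-coerce (make-money :cents 1234) 'rational)  ; => 617/50
```

Any combination without a method signals a no-applicable-method error,
so the function doesn't over-reach beyond the cases the domain defines.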

Regarding that proposed koan earlier: Does anybody have a
collection of short code snippets which likewise illustrate a whole
mess of deep understanding, where just trying to understand why
this particular snippet of code produces the result it does will
enlighten a newbie?