Re: sort unique
- From: rem642b@xxxxxxxxx (Robert Maas, see http://tinyurl.com/uh3t)
- Date: Sat, 05 Apr 2008 21:00:55 -0700
From: Kent M Pitman <pit...@xxxxxxxxxxx>
... you're betting against the implementation doing better if you
implement it yourself.
Agreed. When in doubt, do benchmark timing tests to see whether the
behaviour provided by the vendor is o(N**2) or l(N * logN) or what.
Then decide whether, on the basis of that info, it's worth
re-inventing a better wheel.
in this case it's better to use hash table:
(defun remove-duplicates-ht (list)
(loop with ht = (hash-table :test 'equal)
for w in list
do (setf (gethash e ht) t)
finally (return (loop for w being each hash-key of ht
collect w))))
Why are you assuming that the system version of remove-duplicates
does not do
(defun remove-duplicates (list &key ... (test #'eql))
(if (or (eq test 'eql) (eq test #'eql)
(eq test 'eq) (eq test #'eq)
(eq test 'equal) (eq test #'equal)
(eq test 'equalp) (eq test #'equalp))
(secretly-use-hash-tables ...)
(...do something more general...)))
Actually in those common cases, given that a hash table is not
*guaranteed* decent speed, you *might* have data where everything
hashes into a single bucket per the vendor-supplied hashing
algorithm, using a hash table might not really be a good thing to
secretly use without the advise and consent of the application
programmer. If the test somehow supports a total ordering on the
possible data values, then sorting per that total ordering (N *
logN) followed by collating adjacent elements in the sorted list,
might be the safest thing for a vendor to secretly supply. If the
test is EQ, then machine address could be secretly used as sort
key. If the test is EQL, then data could be split into classes
(numeric, character, CONS cell, etc.), each such group sorted per
its appropriate total-ordering, adjacent duplicates eliminated
within each group, and finally an APPEND of all the groups. That's
also less set-up overhead (for small datasets) than building a
temporary hash table.
IMO, you're better off just calling the system thing and sending
a bug report if you observe it to be foolishly slow.
IMO if the vendor's algorithm does something "obvious", which is
slower than a really tricky or innovative algorithm would be, and
the result is *correct* per the spec, that's not grounds for a bug
report. Maybe something gentler would be appropriate, such as a
"suggestion for further improvement"?
That way, you get the benefit of improvements as they arrive, you
don't have to maintain as much code, and other people's code gets
faster too.
If they don't get pissed at you for filing a BUG REPORT for
something that is merely a performance issue, and they simply
discard your BUG REPORT as another annoying person they'd rather
not have as a customer anyway. If you can serve 90% of the
potential customers, discard the other 10%, and thereby avoid an
order of magnitude extra cost dealing with nasty customers ...
Now in general it really isn't even possible to make the
remove-duplicates algorithm more efficient than o(N**2) with a
user-supplied comparison function. Imagine for example a comparison
function to eliminate keys that hash to the same bucket per some
user-defined really-weird hashing function. (Why did the
application programmer want this? Because he/she was deliberately
trying to show an example of a "perfect hash" for exposition
purposes in some tutorial. So he/she wanted to generate a large set
of random keys, eliminate duplicates, and thereby obtain a set of
keys with no hash-bucket duplicates which filled or almost filled
the hash table, then if almost-filled run a simple loop to try to
fill that last one or two buckets in the table.) Or maybe there
really was no purpose to the complicated user-defined hashing
function except to provoke the vendor-supplied remove-duplicates
function into o(N**2) timing. For example, make a list of random
strings of random lengths, and two strings (ignoring length) are
considered "the same" if their middle character is the same (for
odd-length strings) or middle two characters are both the same (for
even-length strings). Thus
(remove-duplicates listOfStrings :test #'middle-strings-same)
couldn't be optimized automatically by the vendor's algorithm, but
(remove-duplicates listOfStrings :test #'string= :key #'middle-string)
could indeed by optimized by pre-computing the keys then sorting
via #'string< then collating adjacent elements.
Crap, XLISP-PLUS version 2.1g doesn't have multiple-value-setq!! I
was going to write middle-string, using XLISP to debug it, locally
on my Mac, without needing to go online to use CMUCL, just now
(12:04 PDT) while composing newsgroup articles offline. Nevermind.
.
- Follow-Ups:
- Re: sort unique
- From: Pascal Bourguignon
- Re: sort unique
- Prev by Date: Re: Strings and symbols
- Next by Date: Re: (parse-decimal-fraction <string>) --> <rational> ?
- Previous by thread: MWUAHAHAAHAHAHA!!! All Your Interface Are Belong to Us!
- Next by thread: Re: sort unique
- Index(es):
Relevant Pages
|
Loading