Tcl faster than Perl/Python...but only with tricks...



Hello

There is currently a thread in c.l.python
(http://groups.google.de/group/comp.lang.python/browse_thread/thread/923e34e8466ac920/233f1310151e19f6)
about whether it is possible for Python to beat Perl in a small text-matching
task. Bear with me: a really fast Tcl solution comes at the end, but I'd like
to describe the Perl/Python versions first and then present the Tcl one, along
with some questions about the performance of certain things in Tcl.

The task was to find the word "destroy", case-insensitively, in a text from
gutenberg.org (the King James Bible). The test file was generated
this way:

$ wget http://www.gutenberg.org/files/7999/7999-h.zip
$ unzip 7999-h.zip
$ cd 7999-h
$ cat *.htm > bigfile
$ du -h bigfile
8.2M bigfile

The code there for Perl was:
---
open(F, 'bigfile') or die;
while (<F>) {
    s/[\n\r]+$//;
    print "$_\n" if m/destroy/oi;
}
---

This quickly finds and prints all lines containing "destroy", ignoring case.
On my computer (Linux 2.6.18, 2.6 GHz Pentium 4) it took 0.273s for Perl (for
all measurements I used the average of the last three runs of four, throwing
away the first to warm the cache).
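The measurement protocol above (four runs, discard the first to warm the cache, average the rest) can be sketched in a few lines of Python. This helper is my own illustration, not code from the thread, and the command passed to it is a placeholder:

```python
import subprocess
import time

def timed_runs(cmd, runs=4, discard=1):
    """Run cmd several times, drop warm-up run(s), return the mean of the rest."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    kept = times[discard:]          # discard the first run(s): cold cache
    return sum(kept) / len(kept)

# Example (hypothetical): timed_runs(["perl", "grep.pl"])
```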

The Python-Version was
---
import re
r = re.compile(r'destroy', re.IGNORECASE)
for s in file('bigfile'):
    if r.search(s): print s.rstrip("\r\n")
---

Also fast, I think: 0.622s. After some iterations, the Pythonians came up
with this faster solution, 0.526s:
---
import re
r = re.compile(r'destroy', re.IGNORECASE)
def stripit(x):
    return x.rstrip("\r\n")
print "\n".join(map(stripit, filter(r.search, file('bigfile'))))
---

I asked myself how this would perform in Tcl, so I first wrote the
straightforward version, which resembles the others:

---
set f [open bigfile r]
while {[gets $f line] >= 0} {
    if {[string match -nocase "*destroy*" $line]} {
        puts $line
    }
}
---

0.937s. Ouch... (Tcl 8.4.13; with 8.5a4 I got an even worse 1.2s.)

I asked myself what makes Tcl so slow here. I commented out the
if...puts part, which made the loop twice as fast (and useless, of
course...). But that shows that the matching itself only took half of the
time, which surprised me. I had thought that reading the file and running
through the while loop should take nearly no time...
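The same isolation experiment (timing the bare read loop against the loop plus match) can be sketched in Python; the generated test file and its contents are made up for illustration:

```python
import re
import tempfile
import time

# Build a small stand-in for 'bigfile': 1 matching line per 10 lines.
text = ("The LORD shall destroy this house.\n" + "In the beginning.\n" * 9) * 1000
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

r = re.compile(r"destroy", re.IGNORECASE)

def loop_only(path):
    """Just read the file line by line: measures per-line loop overhead."""
    with open(path) as f:
        for s in f:
            pass

def loop_and_match(path):
    """Read line by line and match: the difference to loop_only is the match cost."""
    hits = 0
    with open(path) as f:
        for s in f:
            if r.search(s):
                hits += 1
    return hits

t0 = time.perf_counter(); loop_only(path); t1 = time.perf_counter()
loop_and_match(path); t2 = time.perf_counter()
print("loop only: %.3fs, loop+match: %.3fs" % (t1 - t0, t2 - t1))
```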

So my question is: why are [gets] and/or [while] so slow, and is there a
chance to improve that? For text processing these are two very central
commands...

I think about all the usenet threads and preconceptions about Tcl's slowness
(just have a look at the current thread in c.l.tcl: "Is Tcl work for large
programs?"). Tcl CAN be really fast, but you need some tricks and knowledge
that are far from obvious... After some thinking, I came up with this:

---
set f [open bigfile r]
puts [join [regexp -all -inline -linestop -nocase {.*destroy.*\n} [read $f]] {}]
---

0.223s (8.5a4: 0.241s). Wow! Faster than Perl and at least as unreadable as
Perl; the Perl guys would love it! ;-)

But I don't. It doesn't look good, and it uses an unfair trick: reading the
whole file into memory. That does not work if the file is too large for
memory, which would be no problem for the Perl/Python versions. The only
good thing about this version is that it shows that Tcl's regexp engine is
nearly as fast as Perl's, which is really good, I think.
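For what it's worth, the bulk-matching trick can be made memory-bounded by reading fixed-size chunks and carrying the trailing partial line over to the next chunk. Here is a sketch of that idea in Python; the helper name and chunk size are my own, not from the thread:

```python
import re

def grep_stream(fileobj, pattern, chunk_size=1 << 20):
    """Yield matching lines while reading at most chunk_size characters at once."""
    r = re.compile(pattern, re.IGNORECASE)
    tail = ""                      # partial last line carried between chunks
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk).split("\n")
        tail = lines.pop()         # last element may be an incomplete line
        for line in lines:
            if r.search(line):
                yield line
    if tail and r.search(tail):    # the file may not end with a newline
        yield tail

# Usage:
# with open("bigfile") as f:
#     for line in grep_stream(f, r"destroy"):
#         print(line)
```

This keeps memory use bounded by the chunk size plus one line, so it would scale to files larger than RAM like the line-by-line Perl/Python versions do.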

So I could beat Perl's and Python's performance with Tcl, but it does not
really make me happy...

Regards
Stephan
