memory leak in (?)... (redux)
From: De Clarke (de_at_zeitaku.ucolick.org)
Date: 03/05/04
- Next message: Gerald Lester: "Re: send prog through /dev/ttyS0"
- Previous message: Donald Arseneau: "Re: [glob -tails] broken in respect to tilde-filenames!"
- Next in thread: Gerald Lester: "Re: memory leak in (?)... (redux)"
- Reply: Gerald Lester: "Re: memory leak in (?)... (redux)"
- Reply: Michael Schlenker: "Re: memory leak in (?)... (redux)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 4 Mar 2004 15:12:49 -0800
well well well, here we are again :-) I'm hunting the last of
the mem leaks in our long-running tcl daemons. this one is a
good test case; the app is pure tcl (well, ok, with tclX), no
homegrown extensions are used. while it is possible that there's
a leak in TclX (the mainloop cmd for example) I find it unlikely:
we use tclX all the time, we have many event-driven tclX apps,
and the others don't leak but this one does.
when actively exercised the app grows continuously. I spliced into it
a proc for auditing variable usage (global and namespace). I'm gonna go
over this proc a few more times looking for stupidity, then perhaps will
post it here so others can use it -- or point out what's wrong with it, as
the case may be. it's supposed to tell us the size of all our variables
(global and namespace) at the time of auditing. it produces a tedious log
file which can be massaged in various ways to find those that are growing.
btw, I confess I am an old fogey and don't use namespaces much,
but part of this app is written by a colleague who really likes
namespaces. is
# scrape the namespaces for vars
set ns [namespace children ::]
foreach n $ns {
set nvars [namespace eval $n info vars]
set nn [string trimleft $n :]
foreach nv $nvars {
if {[lcontain $gvars $nv]} {continue}
lappend nsvars ${nn}::$nv
}
}
this sufficient to collect all the persistent var names out of
all current namespaces?
anyway...
the mystery is that while our audit code does show us a couple of arrays
that grow, down in a namespace, their size is not very impressive and
doesn't begin to account for the VSZ growth rate as seen by top, ps, etc.
after staring at this tcl-introspection output for a while w/o
enlightenment, I finally built tcl 844 on the target host and ran the
app under 844 with TCL_MEM_DEBUG turned on, adding a "memory display"
cmd to the variable auditing. so every 15 minutes, the variable space
and memory are checked (simple "after") and meanwhile externally, at about
the same interval, a watcher process parses ps output to monitor the VSZ
and RSS as seen by the OS. I exercise the app by thrashing the service
it offers w/a script that runs for 5 hrs or so, and then let it rest;
then wade through logfiles later to compare the output during the "busy"
(growing) period vs the "idle" (stable) period.
the output from the "memory display" cmd is not quite as lovely as one
might like :-) for one thing it's rather large, a challenge to filter
and parse. for another, alas all linkage with the interp namespace is
lost, and all it tells me is that N bytes were allocated between addr1
and addr2 by a ckalloc from line NNN of underlying C source file Foo.c
in the distrib. so I don't know which namespace this alloc was invoked
from, what proc caused it to be allocated, or the name of the variable
for which it was needed -- there is not enough evidence to assign blame.
all I know is what *type* of variable or alloc it was :-) which is better
than nothing.
at any rate, after poring over a lot of output files (at 5-6 mb each),
and massaging them to summarise total memory alloc'd at each time point
by routines from each source file, I find the following:
the steepest and highest growth is in space ckalloc'd from tclHash.c --
runner-up is tclVar.c, followed distantly by tclExecute.c and even
more distantly by tclListObj.c. ckallocs from tclHash.c accounted
for 150Kb of net growth over the test period according to the
MEM_DEBUG output. tclVar accumulated about 110Kb of new allocs
during the same period, and tclExecute about 60KB. when the app is
more idle, merely ticking over with routine queued events at 1Hz or
so, all traces flatline as viewed on a scale of 1,150K with about
20 samples of busy-ness and 70 samples of idleness.
it looks like the leak might have something to do with array creation or
extension -- at least that is what I would associate with tclHash.c,
but I am not a Tcl internals hacker and am wary of Leaping to Confusions.
there is indeed a lot of array creation going on during busy periods;
this app participates in an IP messaging system, where incoming messages
trigger callbacks; each incoming message is parsed into an unique
array whose name is its sequential message num (since start of app).
however, these arrays are disposed of by "unset" after the higher levels
of the app process each message. (yes, we have instrumented it and we
see the disposal proc being called, and we don't see these ephemeral
arrays accruing in the namespace during our audits). in theory, the
app should run nearly memory-neutral after startup. some trivial-sized
housekeeping arrays should grow to elem counts in the 5 to 10K range,
but then start re-using element indices. that's the theory anyway :-)
in practise, the darned thing just grows. it starts out around
7 MB VSZ and after a few days will show up at 20 MB; a few days later
it will be 30 or 40 MB and so on. we have not yet seen any indication
of a stopping point :-) and I think if the machine stayed up long
enough, this app would eventually reach gargantuan sizes.
so far, we haven't found where the memory is going. when we tediously
audit the actual numbers of elements in arrays, and the size of each
element (and the same for lists, etc, as discussed above), we don't find
anywhere near the amt of memory that is disappearing. of course this
kind of "it can't do that!" result means I have to go back and re-check
every step we've taken so far :-( and that is what I will be doing
next. but these latest results from the MEM_DEBUG experiment inspire
me to ask a few questions:
1) is there hidden overhead in array creation (the hash table for
example) that means the allocs for hashing might exceed the actual
data size by a significant factor, so that a "small" array growth
in the visible data space seen by tcl code might produce large
memory growth as seen by the OS? what might cause this?
2) is the overhead of array creation exacerbated, perhaps even in
some nonlinear way, by the char length of the indices or the
element count exceeding some threshold? is there a rule by which
we might calculate the OH for, say, 35-char indices and 5 thousand
elements? if there is a reliable formula then my audit proc could
take a whack at computing the invisible overhead and adding it to
the apparent array size. I remember George's (Howlett) assertion
that a Tcl array of 10K elements might use as much memory to store
the indices as to store the data, but surely this depends very much
on the size of the data! can we express the actual memory used
algorithmically?
3) is there any additional complication or overhead when arrays
declared as "variable" inside a namespace, and exported, have
long index strings, or are created and deleted in large numbers
with great rapidity during the app lifetime?
4) is there any good way to close the gap between introspection
from the tcl level (which can only see the visible content of
variables) and the MEM_DEBUG output which shows raw memory
usage (nothing hidden) but is divorced from the tcl namespace?
can TclPro, for example, or some other tool or extension, tell
us the real size (incl overhead) of tcl data objects, from
w/in the interp, in real time or to a periodic log?
has anyone been down this road before us?
at any rate, I'm now going back to the "audit" code to see if there's
a glaring error in it.
right now we have a lot of memory disappearing; we know that a couple
of arrays are growing, but we think they are "small" (i.e. the contents
of each element are small) and that they should stabilise when they
start re-using indices. either something else is growing that we have
overlooked, or these arrays are not really "small" and they somehow do
account for the VSZ bloat.
de
-- ............................................................................. :De Clarke, Software Engineer UCO/Lick Observatory, UCSC: :Mail: de@ucolick.org | Your planet's immune system is trying to get rid : :Web: www.ucolick.org | of you. --Kurt Vonnegut : :1024D/B9C9E76E | F892 5F17 8E0A F095 05CD EE8B D169 EDAA B9C9 E76E:
- Next message: Gerald Lester: "Re: send prog through /dev/ttyS0"
- Previous message: Donald Arseneau: "Re: [glob -tails] broken in respect to tilde-filenames!"
- Next in thread: Gerald Lester: "Re: memory leak in (?)... (redux)"
- Reply: Gerald Lester: "Re: memory leak in (?)... (redux)"
- Reply: Michael Schlenker: "Re: memory leak in (?)... (redux)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|