Re: ALTER design (Was: Code problems with Perform Thru Exit causes fall through)
- From: Alistair <alistair@xxxxxxxxxxxxxxxxxxxxx>
- Date: Fri, 03 Aug 2007 07:00:36 -0700
On 3 Aug, 04:26, "Pete Dashwood" <dashw...@xxxxxxxxxxxxxxxxxxxxxxxxx>
wrote:
"Alistair" <alist...@xxxxxxxxxxxxxxxxxxxxx> wrote in message
As one who has been called out repeatedly overnight (same suite, same
job, same program, same error - abend due to user data input causing
excessive levels of error) I can honestly say:
Hang on a minute... read what you just wrote again, and then pretend you are
a manager. Doesn't it strike you as "odd" that a known "troublespot" was not
addressed during daylight hours? Wouldn't you have people tightening the
data validations in said suite, just for openers?
Yeah, it seems to call for intelligent users and a computer solution
to a computer problem. The problem was users who did not care about
putting in pricing information that was wrong and did not check the
quality of their work. If a user wants to put the price of a loaf of
bread up to $75 then no computer system could stop them. The users'
management claimed to have intelligent users (when we developed a
quick and dirty Godfather Payments reporting system the management
said to not put in any validation on the parameter file because they
had intelligent users who would not make mistakes. We put in some
rudimentary validation and the job fell over on its first run) and
did not accept that they made mistakes. The reasons for the repeated
failures were:
1. The introduction of system managed storage which deprived the
system of a dedicated sort pack on nights when a large insurance
company print run grabbed all SYSDA;
2. The intermediate level of client management, being aware of the
'don't care' attitude of their employees, requested that error
messages be produced and stored with the system abending after x,000
errors had been recorded. They reviewed the errors and then decided to
continue or fix and rerun the suite. It gave me great pleasure to ring
the client manager at home with a cheery 'It's your early morning
call...'. Despite repeatedly waking him up (and his wife) over the
years the errors still continued. The client had asked for a quote for
a report to be developed which could vet the price changes but it
would have been too expensive to do; he certainly never raised a work
request to authorize spending time on thinking the problem through (it
being cheaper to use his 2 days per week of support time up on
callouts than to pay for additional work). <SOUND OF HEAVENLY TRUMPETS
BLOWING> I suspect that, if the client had approved a budget to
develop a solution, I would have been the only company employee
capable of specifying and coding the solution </SOUND OF HEAVENLY
TRUMPETS BLOWING>.
3. We did, however, develop a price change security system which
prevented prices from being altered under certain conditions (like for
major customers), but that didn't stop the errors on other customers'
prices.
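For what it's worth, the arrangement in point 2 (record each error, abend once x,000 have been recorded) could be sketched roughly as below. This is an illustration in Python, not the original code: the record layout, the `max_ratio` plausibility rule and every name in it are assumptions made up for the example.

```python
class ErrorLimitExceeded(Exception):
    """Raised to stop the run once too many bad records have been logged."""

    def __init__(self, errors):
        super().__init__(f"abend: {len(errors)} errors recorded")
        self.errors = errors


def is_plausible(old_price, new_price, max_ratio=3.0):
    # Hypothetical rule: reject non-positive prices and any new price
    # larger than max_ratio times the old price.
    return 0 < new_price <= old_price * max_ratio


def vet_price_changes(changes, max_errors):
    """Return the rejected changes, abending once max_errors is reached.

    changes: iterable of (item, old_price, new_price) tuples (assumed layout).
    """
    errors = []
    for item, old_price, new_price in changes:
        if not is_plausible(old_price, new_price):
            errors.append((item, old_price, new_price))
            if len(errors) >= max_errors:
                raise ErrorLimitExceeded(errors)
    return errors
```

The returned list (or the one carried on the exception) stands in for the stored error messages that the client management reviewed before deciding whether to continue or fix and rerun.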
1. I have only ever implemented one fix that has subsequently been
seen to be at fault - it was an inefficient blocksize on a tape;
Good.
2. Apart from (1), and the programmer who inadvertently wiped the
entire week's data from an input file by restoring the wrong gdg, I am
not aware of any other overnight fixes that were subsequently deemed
to be faulty;
I'd say you were lucky (and independent statistics bear me out:
http://www.springerlink.com/content/g821443230486268/,
http://portal.acm.org/citation.cfm?id=126259&dl=ACM&coll=portal&CFID=...);
software maintenance is the major source of error, and software
maintenance under pressure is the leading category.
Don't confuse CALLOUT and MAINTENANCE. Maintenance does introduce
errors and they may occur at errors-per-line rates far higher than
development rates but, in my experience, callout error rates (a subset
of maintenance error rates) are much lower.
3. I have worked with many people, of various skill levels, and with
the proviso of (2) they have implemented hundreds of overnight fixes
that have been deemed to be correct and permanent solutions to the
problems;
OK, another (better?) question... Why did they have to implement "hundreds
of overnight fixes"?
Because the development teams produced faulty solutions. You may
recall when I twittered on about a developer who put an untested
system into live because he had run out of budget for testing. It is
also an axiomatic truth that developers are not interested in finding
bugs, rather they are interested in proving that their solution is
correct. Therefore, they do not test thoroughly and skimping allows
errors to creep through. Further, isn't it true that there is no such
thing as a bug-free program, rather that any program said to be bug-
free just hasn't fallen over yet?
4. I am very *issed off that the quality of systems moving from
development into operational status (particularly where the move
requires transferring responsibility across teams) is such that the
errors arising from poor developments far outweigh the number of errors
arising from faulty fixes (and that includes 3 am fixes);
That view is not borne out by independent studies. It may have been a
localized effect at the place where you were working.
Quite possibly. The error rates went down after developers took
responsibility for post-implementation fixes. Their managers also saw
the results of their employees' shoddy work, previously hidden by the
fact that ops did not report development errors back to the
developers.
5. I do not know of any manager who is adept at handling the users'
disappointment nor do I know of any user who is happy with a 10
million pound invoicing run failing at 2 am and not being fixed or
rerun because the programmer wanted a good night's sleep;
Then you just don't know any good managers. As for the invoicing run having
to wait 8 hours, given that the payments from it will wait thirty days, it
isn't such a big deal.
My clients would have had a fit if you said that to them. An 8 hour
delay became a 24 hour delay in delivery of the invoice to the client.
If the average debt-age is 30 days then you have just increased the
debt-age by 3.333 percent which would work out as a significant amount
of interest.
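The 3.333 percent figure is simply one extra day over a 30 day average debt-age. A small Python sketch of the arithmetic follows; the function names are made up for the example, the annual interest rate is purely an assumed input (no rate was quoted above), and simple rather than compound interest is assumed.

```python
def debt_age_increase_pct(extra_days, average_debt_age_days):
    # Percentage increase in debt-age caused by the extra delay.
    return 100.0 * extra_days / average_debt_age_days


def extra_interest(invoice_total, annual_rate, extra_days):
    # Simple (non-compounded) interest on the invoiced total for the
    # extra days. annual_rate is an assumed figure, e.g. 0.05 for 5%.
    return invoice_total * annual_rate * extra_days / 365.0
```

For instance, `debt_age_increase_pct(1, 30)` gives the one-in-thirty figure, and `extra_interest(10_000_000, 0.05, 1)` would estimate one day's interest on a 10 million run like the one in point 5, at an assumed 5% a year.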
The call is to get it fixed properly by a team who
are properly rested and up for it, against getting some individual awoken
in the middle of the night and expected to deal with code that he probably
didn't even write, while half asleep. As a manager, I see that as a no
brainer.
As an IT solutions provider, if the client wants callout then he gets
callout. It wasn't life-threatening but if he says jump, then the only
questions I should ask are how high and in which direction? Sorry, I
know that is a trite example, but a manager of mine did once use that
as his justification for agreeing impossible deadlines for
developments which put the team under repeated and unnecessary levels
of stress.
6. I know of many programmers, myself included, who prefer a good
night's sleep rather than repeated call outs (so we agree on that one).
I gave up doing the callout because the money did not compensate for
losing a night's sleep;
Good choice.
7. The quality of delivered systems improved, where I worked, when I
became active in the handover process and started looking at design,
code and test documents and insisted upon the developers providing the
first three months' standby cover.
Another good choice :-)
What is it that history teaches us? Those who fail to study history
are doomed to repeat the errors and those who study history are doomed
to repeat the errors endlessly?
And then there are those who make history, rather than repeat it... :-)
Interesting. A friend once pointed out to me that there had been no
significant developments in firearms technology (pistols and rifles)
in the C20 that had not been previously done in the C19. The only
examples I could come up with were triangular cross-section bullets
and non-metallic guns (plastics and ceramics) but they had all been
tried before.
What history will we be able to make that has not been previously
achieved? Space colonies, inter-stellar travel and marine colonies?