Imagine the following situation: you have come across numerous references to a paper purporting to show that the chances of successfully treating a disease contracted at age 10 are substantially lower if the disease is detected later: somewhat lower at age 20 to very poor at age 50. Every author draws more or less the same bar chart to depict this situation: the picture below, showing rising mortality from left to right.
You search for the original paper, which proves a long quest: the conference publisher have lost some of their archives in several moves, several people citing the paper turn out to no longer have a copy, etc. You finally locate a copy of the paper (let's call it G99) thanks to a helpful friend with great scholarly connections.
And you find out some interesting things.
The most striking is what the author's original chart depicts: the chances of successfully treating the disease detected at age 50 become substantially lower as a function of age when it was contracted; mortality is highest if the disease was contracted at age 10 and lowest if contracted at age 40. The chart showing this is the picture below, showing decreasing mortality from top to bottom, for the same ages on the vertical axis.
Not only is the representation topsy-turvy; the two diagrams can't be about the same thing, since what is constant in the first (age disease detected) is variable in the other, and what is variable in the first (age disease contracted) is constant in the other.
Now, as you research the issue a little more, you find out that authors prior to G99 have often used the first diagram to report their findings; reportedly, several different studies on different populations (dating back to the eighties) have yielded similar results.
But when citing G99, nobody reproduces the actual diagram in G99, they all reproduce the older diagram (or some variant of it).
You are tempted to conclude that the authors citing G99 are citing "from memory"; they are aware of the earlier research, they have a vague recollection that G99 contains results that are not totally at odds with the earlier research. Same difference, they reason, G99 is one more confirmation of the earlier research, which is adequately summarized by the standard diagram.
And then you come across a paper by the same author, but from 10 years earlier. Let's call it G89. There is a strong presumption that the study in G99 is the same that is described in G89, for the following reasons: a) the researcher who wrote G99 was by then already retired from the institution where they obtained their results; b) the G99 "paper" isn't in fact a paper, it's a PowerPoint summarizing previous results obtained by the author.
And in G89, you read the following: "This study didn't accurately record the mortality rates at various ages after contracting the disease, so we will use average rates summarized from several other studies."
So basically everyone who has been citing G99 has been building castles on sand.
Suppose that, far from some exotic disease affecting a few individuals each year, the disease in question was one of the world's major killers (say, tuberculosis, the world's leader in infectious disease mortality), and the reason why everyone is citing either G99 or some of the earlier research is to lend support to the standard strategies for fighting the disease.
When you look at the earlier research, you find nothing to allay your worries: the earlier studies are described only summarily, in broad overview papers or secondary sources; the numbers don't seem to match up, and so on. In effect you are discovering, about thirty years later, that what was taken for granted as a major finding on one of the principal topics of the discipline in fact has "sloppy academic practice" written all over it.
If this story was true, and this was medicine we were talking about, what would you expect (or at least hope for, if you haven't become too cynical), should this story come to light? In a well-functioning discipline, a wave of retractations, public apologies, general embarrassment and a major re-evaluation of public health policies concerning this disease would follow.
The story is substantially true, but the field isn't medicine: it is software engineering.
I have transposed the story to medicine, temporarily, as an act of benign deception, to which I now confess. My intention was to bring out the structure of this story, and if, while thinking it was about health, you felt outraged at this miscarriage of academic process, you should still feel outraged upon learning that it is in fact about software.
The "disease" isn't some exotic oddity, but the software equivalent of tuberculosis - the cost of fixing defects (a.k.a. bugs).
The original claim was that "defects introduced in early phases cost more to fix the later they are detected". The misquoted chart says this instead: "defects detected in the operations phase (once software is in the field) cost more to fix the earlier they were introduced".
Any result concerning the "disease" of software bugs counts as a major result, because it affects very large fractions of the population, and accounts for a major fraction of the total "morbidity" (i.e. lack of quality, project failure) in the population (of software programs).
The earlier article by the same author contained the following confession: "This study didn't accurately record the engineering times to fix the defects, so we will use average times summarized from several other studies to weight the defect origins".
Not only is this one major result suspect, but the same pattern of "citogenesis" turns up investigating several other important claims.
Software engineering is a diseased discipline.
The publication I've labeled "G99" is generally cited as: Robert B. Grady, An Economic Release Decision Model: Insights into Software Project Management, in proceedings of Applications of Software Measurement (1999). The second diagram is from a photograph of a hard copy of the proceedings.
Here is one typical publication citing Grady 1999, from which the first diagram is extracted. You can find many more via a Google search. The "this study didn't accurately record" quote is discussed here, and can be found in "Dissecting Software Failures" by Grady, in the April 1989 issue of the "Hewlett Packard Journal"; you can still find one copy of the original source on the Web, as of early 2013, but link rot is threatening it with extinction.
A more extensive analysis of the "defect cost increase" claim is available in my book-in-progress, "The Leprechauns of Software Engineering".
Here is how the axes were originally labeled; first diagram:
- vertical: "Relative Cost to Correct a Defect"
- horizontal: "Development Phase" (values "Requirements", "Design", "Code", "Test", "Operation" from left to right)
- figure label: "Relative cost to correct a requirement defect depending on when it is discovered"
- vertical: "Activity When Defect was Created" (values "Specifications", "Design", "Code", "Test" from top to bottom)
- horizontal: "Relative cost to fix a defect after release to customers compared to the cost of fixing it shortly after it was created"
- figure label: "Relative Costs to Fix Defects"
Reminds me of the bit in "Cargo Cult Science" by Richard Feynman:
My father, a respected general surgeon and an acute reader of the medical literature, claimed that almost all the studies on early detection of cancer confuse degree of disease at time of detection with "early detection". That is, a typical study assumes that a small cancer must have been caught early, and thus count it as a win for early detection.
An obvious alternate explanation is that fast-growing malignant cancers are likely to kill you even in the unlikely case that you are able detect them before they are large, whereas slow-growing benign cancers are likely to sit there until you get around to detecting them but are not particularly dangerous in any case. My father's claim was that this explanation accounts for most studies' findings, and makes something of a nonsense of the huge push for early detection.
Interesting. Is he (or anyone else) looking at getting something published that either picks apart the matter or encourages others to?
In trying to think of an easy experiment that might help distinguish, all that's coming to mind is adding a delay to one group and not to another. It seems unlikely this could be done ethically with humans, but animal studies may help shed some light.
This is more common than most people would like to think, I think. I experienced this tracking down sunk cost studies recently - everyone kept citing the same studies or reviews showing sunk cost in real world situations, but when you actually tracked back to the experiments, you saw they weren't all that great even though they had been cited so much.
Well it seems to me that disciplines become diseased when people who demand the answers can't check answers or even make use of the answers.
In case of your software engineering example:
The software development work flow is such that some of the work affects the later work in very clear ways, and the flaws in this type of work end up affecting a growing number of code lines. Obviously, the less work you do that you need to re-do to fix the flaw, the cheaper it is.
That is extremely solid reasoning right here. Very high confidence result, high fidelity too - you know what kind of error will cost more and more to fix later, not just 'earlier error'.
The use of statistics from a huge number of projects to conclude this about a specific project is, however, a case of extremely low grade reasoning, giving a very low fidelity result as well.
In presence of extremely high grade reasoning, why do we need the low fidelity result from extremely low grade reasoning? We don't.
To make an analogy:
The software engineer has no more need for this statistical study than barber has need for the data on correlation of head size and distance between the eyes, and correlation of distance between eyes and ... (read more)
"Should Computer Scientists Experiment More? 16 excuses to avoid experimentation":... (read more)
Is it really valid to conclude that software engineering is diseased based on one propagating mistake? Could you provide other examples of flawed scholarship in the field? (I'm not saying I disagree, but I don't think your argument is particularly convincing.)
Can you comment on Making Software by Andy Oram and Greg Wilson (Eds.)? What do you think of Jorge Aranda and Greg Wilson's blog, It Will Never Work in Theory?
To anyone interested in the subject, I recommend Greg Wilson's talk on the subject, which you can view here.
I'm a regular reader of Jorge and Greg's blog, and even had a very modest contribution there. It's a wonderful effort.
"Making Software" is well worth reading overall, and I applaud the intention, but it's not the Bible. When you read it with a critical mind, you notice that parts of it are horrible, for instance the chapter on "10x software developers".
Reading that chapter was in fact largely responsible for my starting (about a year ago) to really dig into some of the most-cited studies in our field and gradually realizing that it's permeated with poorly supported folklore.
In 2009, Greg Wilson wrote that nearly all of the 10x "studies" would most likely be rejected if submitted to an academic publication. The 10x notion is another example of propagation of extremely unconvincing claims, that nevertheless have had a large influence in shaping the discipline's underlying assumptions.
But Greg had no problem including the 10x chapter which rests mostly on these studies, when he became the editor of "Making Software". As you can see from Greg's frosty tone in that link, we don't see eye to eye on this issue. I'm partially to blame for that, insof... (read more)
This strikes me as particularly galling because I have in fact repeated this claim to someone new to the field. I think I prefaced it with "studies have conclusively shown...". Of course, it was unreasonable of me to think that what is being touted by so many as well-researched was not, in fact, so.
Mind, it seems to me that defects do follow both patterns: Introducing defects earlier and/or fixing them later should come at a higher dollar cost, that just makes sense. However, it could be the same type of "makes sense" that made Aristotl... (read more)
You're ascribing diseases to an entity that does not exist.
Software engineering is not a discipline, at least not like physics or even computer science. Most software engineers, out there writing software, do not attend conferences. They do not write papers. They do not read journal articles. Their information comes from management practices, consulting, and the occasional architect, and a whole heapin' helpin' of tribal wisdom - like the statistic you show. At NASA on the Shuttle software, we were told this statistic regularly, to justify all the process ... (read more)
Your core claim is very nearly conventional wisdom in some quarters. You might want to articulate some remedies.
A few thoughts --
One metric for disease you didn't mention is the gap between research and practice. My impression is that in graphics, systems, networking and some other healthy parts of the CS academy, there's an intensive flow of ideas back and forth between researchers and practitioners. That's much rarer in software engineering. There are fewer start-ups by SE researchers. There are few academic ideas and artifacts that have become widely ... (read more)
Discussed in this oft-quoted (here, anyway) talk.... (read more)
While the example given is not the main point of the article, I'd still like to share a bit of actual data. Especially since I'm kind of annoyed at having spouted this rule as gospel without having a source, before.
A study done at IBM shows a defect fixed during the coding stage costs about 25$ to fix (basically in engineer hours used to find and fix it).
This cost quadruples to 100$ during the build phase; presumably because this can bottleneck a lot of other people trying to submit their code, if you happen to break the build.
The cost quadruples again for... (read more)
Very cool analysis, and I think especially relevant to LW.
I'd love to see more articles like this, forming a series. Programming and software design is a good testing ground for rationality. It's about mathematically simple enough to be subject to precise analysis, but its rough human side makes it really tricky to determine exactly what sort of analyses to do, and what changes in behavior the results should inspire.
There is a similar story in The trouble with Physics. I think it was about whether there had been proven to be no infinities in a class of string theory, where the articles cited only proved there weren't any in the low order terms.
As an aside it is an interesting book on how a subject can be dominated by a beguiling theory that isn't easily testable, due to funding and other social issues. Worth reading if we want to avoid the same issues in existential risk reduction research.
The particular anecdote is very interesting. But you then conclude, "Software engineering is a diseased discipline." This conclusion should be removed from this post, as it is not warranted by the contents.
I see that you have a book about this, but if this error is egregious enough, why not submit papers to that effect? Surely one can only demonstrate that Software Engineering is diseased is if, once the community have read your claims, they refuse to react?
What makes you think I haven't?
As far as "official" academic publishing is concerned, I've been in touch with the editors of IEEE Software's "Voice of Evidence" column for about a year now, though on an earlier topic - the so-called "10x programmers" studies. The response was positive - i.e. "yes, we're interested in publishing this". So far, however, we haven't managed to hash out a publication schedule.
You've said so yourself, I'm making these observations publicly - though on a self-published basis as far as the book is concerned. I'm not sure what more would be accomplished by submitting a publication - but I'm certainly not opposed to that.
It's a lot more difficult, as has been noted previously on Less Wrong, to publish "negative" results in academic fora than to publish "positive" ones - one of the failures of science-in-general, not unique to software engineering.
Final 'variable' ought to read 'constant'.
is weak because many are cynical about medicine being able to do this (see I... (read more)
Your "investigating" link is broken.
I see that the there is a problem, but it seems that both charts support the same conclusion: the longer problem goes undetected, the more problems it brings.
Are there any methodological recommendations which are supported by one chart, but not the other?
As Software Engineering is too far from being a science anyway, correct sign of correlation seems to be everything that matters, because exact numbers can always be fine-tuned given the lack of controlled experiments.
Anyone want to come up with a theory about why not bothering to get things right was optimal in the ancestral environment?
Because you couldn't. In the ancestral environment, there weren't any scientific journals where you could look up the original research. The only sources of knowledge were what you personally saw and what somebody told you. In the latter case, the informant could be bullshitting, but saying so might make enemies, so the optimal strategy would be to profess belief in what people told you unless they were already declared enemies, but base your actions primarily on your own experience; which is roughly what people actually do.
Strictly speaking these are bar charts rather than histograms, aren't they?
Does anyone have any good resources for further reading on the claim that the phenomenon that this post describes also applies to second-hand smoke research? I've read several loosely collected pieces like that link which give partially anecdotal and partially convincing accounts. I more or less feel inconclusive about it and wondered whether there's a more cogent summary of that issue which states useful facts.
Well, one comment I would make as a practising software engineer is that this truism, that bugs introduced earlier in the software development process are more expensive to fix, is not all that relevant to modern software engineering. Because at the time this truism first became generally accepted, the prevalent "software development process" was generally some variant of the "waterfall" method, which is so discredited that... just don't get me started:)
Nowadays, everyone who is actually succeeding in shipping working code on time is do... (read more)
Since I think more people should know about this, I have made a question on Stackoverflow about it: http://stackoverflow.com/questions/9182715/is-it-significantly-costlier-to-fix-a-bug-at-the-end-of-the-project
Could you add the true axis labels to your post?
Interesting read to begin with. Nice anology. I do support the thought that claims made (in any field) should have data to back it up.
I do think at this point that , even though there is no 'hard scientific data' to claim it; Don't we have enough experience to know that once software is in operation, when bugs are found they cost more the fix than initially?
(Bugs are also in my opinion features that do not meet the expectations)
Even though the chart may be taken out of context, and a bit taken too far I don't think it belongs to the infamous quotes like &... (read more)
I think the appropriate way to study that system is a two-dimensional distribution (bug made,bug detected)->cost Using that (together with the frequency of the individual bins), it is possible to generate both graphs out of the same source. I know that this is more work to do ;).
Is there anything in software engineering that you rate positively, or would if anyone was doing it? Saying "this is bad" is easy, but what would be better (beyond merely saying "don't do these bad things")? Some have tried, notably Dijkstra (who you mention in a comment), and the whole functional programming tradition comes out of that, but why is it still notable and difficult to make a verifiably correct C compiler, and what would have to happen to make such a thing the obvious, no-brainer thing to do?