When programs have to work-- lessons from NASA

by NancyLebovitz · 2 min read · 31st Jul 2011 · 50 comments


They Write the Right Stuff is about software which "never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have achieved. Consider these stats: the last three versions of the program -- each 420,000 lines long -- had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors."

The programmers work from 8 to 5, with occasional late nights. They wear dressy clothes, not flashy or grungy. I assume there's a dress code, but I have no idea whether conventional clothes are actually an important part of the process. I'm sure that working reasonable numbers of hours is crucial, though I also wonder whether those hours need to be standard office hours.

"And the culture is equally intolerant of creativity, the individual coding flourishes and styles that are the signature of the all-night software world. 'People ask, doesn't this process stifle creativity? You have to do exactly what the manual says, and you've got someone looking over your shoulder,' says Keller. 'The answer is, yes, the process does stifle creativity.'" I have no idea what's in the manual, or if there can be a manual for something as new as self-optimizing AI. I assume there could be a manual for some aspects.

What follows are the main points, quoted from the article:

The important thing is the process:

1. The product is only as good as the plan for the product. About one-third of the process of writing software happens before anyone writes a line of code.

2. The best teamwork is a healthy rivalry. The central group breaks down into two key teams: the coders -- the people who sit and write code -- and the verifiers -- the people who try to find flaws in the code. The two outfits report to separate bosses and function under opposing marching orders. The development group is supposed to deliver completely error-free code, so perfect that the testers find no flaws at all. The testing group is supposed to pummel away at the code with flight scenarios and simulations that reveal as many flaws as possible. The result is what Tom Peterson calls "a friendly adversarial relationship."

I note that it's rivalry between people who are doing different things, not people competing to get control of a project.

3. The database is the software base.

One is the history of the code itself -- with every line annotated, showing every time it was changed, why it was changed, when it was changed, what the purpose of the change was, what specifications documents detail the change. Everything that happens to the program is recorded in its master history. The genealogy of every line of code -- the reason it is the way it is -- is instantly available to everyone.

The other database -- the error database -- stands as a kind of monument to the way the on-board shuttle group goes about its work. Here is recorded every single error ever made while writing or working on the software, going back almost 20 years. For every one of those errors, the database records when the error was discovered; what set of commands revealed the error; who discovered it; what activity was going on when it was discovered -- testing, training, or flight. It tracks how the error was introduced into the program; how the error managed to slip past the filters set up at every stage to catch errors -- why wasn't it caught during design? during development inspections? during verification? Finally, the database records how the error was corrected, and whether similar errors might have slipped through the same holes.

The group has so much data accumulated about how it does its work that it has written software programs that model the code-writing process. Like computer models predicting the weather, the coding models predict how many errors the group should make in writing each new version of the software. True to form, if the coders and testers find too few errors, everyone works the process until reality and the predictions match.
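The error database described above is essentially a structured record per error, which is easy to sketch. The following is a minimal illustration, not the shuttle group's actual schema: the class and field names (`ErrorRecord`, `missed_by`, and so on) are assumptions chosen to mirror the questions the article lists.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Activity(Enum):
    """What was going on when the error was discovered."""
    TESTING = "testing"
    TRAINING = "training"
    FLIGHT = "flight"


@dataclass
class ErrorRecord:
    # Field names are hypothetical; they follow the article's list of questions.
    discovered_on: date
    revealing_commands: list[str]   # what set of commands revealed the error
    discovered_by: str
    activity: Activity
    how_introduced: str
    missed_by: dict[str, str]       # process stage -> why it slipped past that stage
    correction: str
    similar_errors_possible: bool


def stages_that_missed(records):
    """Count how often each process stage let an error through."""
    counts = {}
    for record in records:
        for stage in record.missed_by:
            counts[stage] = counts.get(stage, 0) + 1
    return counts
```

A tally like `stages_that_missed` is the kind of query that points at the process rather than at a person: a stage that shows up often is a candidate for a new checklist question.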

4. Don't just fix the mistakes -- fix whatever permitted the mistake in the first place.

The process is so pervasive, it gets the blame for any error -- if there is a flaw in the software, there must be something wrong with the way it's being written, something that can be corrected. Any error not found at the planning stage has slipped through at least some checks. Why? Is there something wrong with the inspection process? Does a question need to be added to a checklist?

Importantly, the group avoids blaming people for errors. The process assumes blame - and it's the process that is analyzed to discover why and how an error got through. At the same time, accountability is a team concept: no one person is ever solely responsible for writing or inspecting code. "You don't get punished for making errors," says Marjorie Seiter, a senior member of the technical staff. "If I make a mistake, and others reviewed my work, then I'm not alone. I'm not being blamed for this."

 

[-][anonymous]10y 14

They Write the Right Stuff is about software which "never crashes. It never needs to be re-booted...."

This is incorrect in an interesting way.

There's a famous story, among people who study Apollo history, of the 1201 and 1202 program alarms that occurred during Apollo 11, as described here and here. Those links are short and well worth reading in their entirety, but here's a summary:

Apollo 11's guidance computer had incredibly limited hardware by modern standards. When you read its specs, if you know anything about computers, you will not believe that people's lives were trusted to something so primitive. As Neil and Buzz were performing their powered descent to the Moon, the guidance computer started emitting obscure "1201" and "1202" program alarms that they had never seen before. Significant computer problems at this stage, hovering over the moon with only minutes of fuel to spare, mean that the astronauts should abort and return to orbit, instead of attempting to land and crashing due to broken software. The program experts quickly determined that the alarms were ignorable, and the mission proceeded. As it turned out, the astronauts had been incorrectly trained to leave a switch on, which fed radar data to the computer that it shouldn't have been getting (and the switch wasn't connected to a real computer during training so this wasn't noticed). This overloaded the computer, which had too much data to process given its hard real-time constraints. Then it did something that would be amazing in this era, much less 1969:

On Apollo 11, each time a 1201 or 1202 alarm appeared, the computer rebooted, restarted the important stuff, like steering the descent engine and running the DSKY to let the crew know what was going on, but did not restart all the erroneously-scheduled rendezvous radar jobs. The NASA guys in the MOCR knew - because MIT had extensively tested the restart capability - that the mission could go forward.

This auto-restart ability, combined with prioritization, allowed the computer to (literally literally) reboot every 10 seconds, while continuing to handle tasks whose failure would kill the astronauts, and dropping less important tasks.

The thing about space software is that it's enormously, insanely expensive in real terms (i.e. it requires lots of time from lots of skilled people). Ordinary software (desktop, server, phone, console, you name it) is cheaper, bigger, and evolves more rapidly. It's also buggier, but its bugs typically don't kill people and rarely cost a billion dollars. NASA has done things wrong, but their approach to software is perfectly suited to their requirements.

Then it did something that would be amazing in this era, much less 1969: [snip description of reboot]

That's not really amazing. It's par for the course for modern microcontrollers, of the sort that litter the innards of modern cars and tractors and such. They usually keep their programs in NOR Flash memory, so they don't need to be read from a hard drive on start-up, and don't need to keep much state in volatile memory. And they are usually designed to be able to start up in the blink of an eye. There are fairly cheap microcontrollers with better specs than the Apollo Guidance Computer, and they're common in applications that need reliable embedded software. It's a safe bet that the private space industry uses quite a lot of them. And the job prioritization is typical for any system designed to be hard realtime.

Even in big computers like the one on your desk, failing really quickly and well can help with reliability. There's a school of thought in server design which says that servers should consist of large numbers of isolated parts, which crash if anything goes wrong, and can be rebooted very quickly. This is how most web sites stay up despite bugs, random crashes, and server failures.
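The crash-and-restart idea can be shown in miniature. This is a toy sketch under stated assumptions: `Counter` and `supervise` are made-up names, and real servers isolate workers in separate processes or machines rather than in one Python loop.

```python
class Counter:
    """A hypothetical worker that holds in-memory state and crashes on bad input."""

    def __init__(self):
        self.total = 0

    def handle(self, request):
        if not isinstance(request, int):
            raise ValueError(f"bad request: {request!r}")
        self.total += request
        return self.total


def supervise(make_worker, requests):
    """Feed requests to a worker; on any failure, discard it and start a fresh one."""
    worker = make_worker()
    results, restarts = [], 0
    for request in requests:
        try:
            results.append(worker.handle(request))
        except Exception:
            # Fail fast: do not try to repair possibly corrupt state, throw it away.
            worker = make_worker()
            restarts += 1
            results.append(None)
    return results, restarts
```

For example, `supervise(Counter, [1, 2, "oops", 5])` returns `([1, 3, None, 5], 1)`: the bad request costs one restart and the accumulated state, but the requests after it are still served.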

I think what is interesting is not the reboot but the fact that every task was prioritized and unimportant ones were inherently discarded. I do not think this is a feature typical of embedded programming.

That's actually a very common realtime scheduling algorithm: execute the highest-priority task ready to run at any time, and discard the lower-priority tasks if you don't have time for them. It's popular because of situations exactly like the one the Apollo Guidance Computer ran into.
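A rough sketch of that scheduling rule, simplified to a single non-preemptive cycle with a fixed time budget. This is not the Apollo Guidance Computer's actual executive; the task names and costs in the usage note are invented for illustration.

```python
import heapq


def run_cycle(tasks, budget_ms):
    """Run tasks in priority order until the time budget runs out.

    tasks: iterable of (priority, name, cost_ms); a lower number means higher priority.
    Returns (ran, dropped) as lists of task names.
    """
    heap = list(tasks)
    heapq.heapify(heap)
    ran, dropped = [], []
    remaining = budget_ms
    while heap:
        priority, name, cost = heapq.heappop(heap)
        if cost > remaining:
            # Out of time: this task and everything of lower priority is discarded.
            dropped.append(name)
            dropped.extend(n for _, n, _ in sorted(heap))
            break
        remaining -= cost
        ran.append(name)
    return ran, dropped
```

With `tasks = [(5, "rendezvous_radar", 6), (0, "steer_engine", 4), (1, "update_display", 3)]` and a 10 ms budget, the engine and display jobs run and the radar job is dropped, which is the shape of what happened during the descent.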

This wasn't actually a computer software failure. It was a failure of procedure development. It also suggests their training should have been a high-fidelity simulation test, as that would have found this problem on the ground right away. So it's maybe a testing failure, but even then not a testing failure for the software but for the entire landing system (considering hardware, software and human procedures).

[-][anonymous]10y 1

This wasn't actually a computer software failure.

I didn't say it was.

NASA's approach is a local optimum: great reliability, for a problem of moderate complexity, with extreme cost and slowness. There are other ways to get more reliable software more cheaply. Anybody here who's written Haskell code can probably point out one: having a compiler with a really good type-checker has a huge effect on the number of bugs you're able to write, and it does so without being nearly as lumbering and costly as the NASA-style processes that would be needed to get an equivalent bug reduction.

Another more exotic example is Cryptol, software for writing cryptographic code that has to work correctly. Not only does it do strong type checking, it can take a reference implementation of an algorithm and an optimized hardware implementation, and prove whether or not they have the same results for all inputs. Last I checked, they used it to make some shiny new crypto hardware for Rockwell-Collins. It worked, and they finished the whole thing in about three months, which is really good for this sort of project. If they had NASA'd the problem, they would probably still be working on it.
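To make "same results for all inputs" concrete, here is a brute-force stand-in. It is not how Cryptol works (Cryptol proves equivalence with solvers rather than by enumeration), and the two bit-counting functions are assumptions chosen only because their input space is small enough to check exhaustively.

```python
def popcount_reference(x):
    """Obviously-correct reference: count set bits one at a time."""
    count = 0
    while x:
        count += x & 1
        x >>= 1
    return count


def popcount_optimized(x):
    """Bit-parallel version for 16-bit inputs: easy to get subtly wrong."""
    x = x - ((x >> 1) & 0x5555)              # counts per 2-bit pair
    x = (x & 0x3333) + ((x >> 2) & 0x3333)   # counts per 4-bit nibble
    x = (x + (x >> 4)) & 0x0F0F              # counts per byte
    return (x + (x >> 8)) & 0x1F             # total, 0..16


def first_mismatch(f, g, bits=16):
    """Return the first input where f and g disagree, or None if they never do."""
    for x in range(1 << bits):
        if f(x) != g(x):
            return x
    return None
```

`first_mismatch(popcount_reference, popcount_optimized)` returns `None`, meaning the two agree on all 65,536 inputs. Enumeration stops being feasible at realistic input widths, which is where proof tools earn their keep.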

For people looking for a more lightweight place on the effort-reliability curve, just writing decent unit tests can be a big help. Randomized testing can often do even better, since it uncovers weird twisty corner cases. A lot of the bugs in large mature software like web browsers are uncovered by writing a fuzz testing program to chuck random inputs at them and see what happens. This is a lot easier than going over a large code base with a fine-toothed comb.
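A minimal sketch of randomized testing, assuming a toy run-length encoder as the code under test (the function names are made up for this example). Instead of hand-picking cases, it generates many random strings and checks one property: decoding an encoding gives back the original.

```python
import random


def rle_encode(s):
    """Run-length encode a string into (character, count) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out


def rle_decode(pairs):
    """Inverse of rle_encode."""
    return "".join(ch * n for ch, n in pairs)


def random_roundtrip_failures(trials=1000, seed=0):
    """Return every random input for which decode(encode(s)) != s."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    failures = []
    for _ in range(trials):
        # A two-letter alphabet makes long runs, the interesting case, common.
        s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 12)))
        if rle_decode(rle_encode(s)) != s:
            failures.append(s)
    return failures
```

An empty list from `random_roundtrip_failures()` is evidence, not proof. The appeal is the cost: a dozen lines of test exercise empty strings, single characters, and long runs without anyone having to think of them.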

One is the history of the code itself -- with every line annotated, showing every time it was changed, why it was changed, when it was changed, what the purpose of the change was, what specifications documents detail the change. Everything that happens to the program is recorded in its master history. The genealogy of every line of code -- the reason it is the way it is -- is instantly available to everyone.

This sounds impressive (and it is!) but implies that this sort of tooling is limited to NASA developers. It's not. Every halfway decent web startup, and every halfway sane lone coder has access to the same capabilities.

Git

Git keeps a branching history of changes, but it takes humans and a policy to keep records of the reasons for changes.

Also I don't think git is line by line; it is revision by revision.

NASA must have a tool that requires each line-change be committed separately with a comment supplied. Perhaps even a multifield comment with requirement that none be blank, and audit of who made each comment so review of changes quickly points out people who are not reasonably populating the fields.

I'm guessing.

Also I don't think git is line by line; it is revision by revision.

'git blame' will still tell you, line-by-line, which revision introduced a change.

Yes but the point is you don't supply a comment for each line that is changed.

Subversion has the same feature(s) as well. On the other hand, NASA was doing this quite some time ago, so they at least get points for being ahead of the curve and setting a useful trend for others to follow :)

had just one error each

That they know of.

they have a lot of advantages over the rest of the software world. They have a single product: one program that flies one spaceship. They understand their software intimately, and they get more familiar with it all the time. The group has one customer, a smart one. And money is not the critical constraint

Not that the methods here don't have their place, but it seems to me that this is a point by point list of exactly why the methodology used by this team is not used generally.

The average software project may involve many different products and many different programmers, making it difficult for anyone involved to become intimately knowledgeable with the work or for standardized programming practices to be enforced everywhere. There are usually very tight deadline and budget constraints, and the clients may or may not know exactly what they want so specifications are usually impossible to nail down and getting quick user feedback becomes very important.

The software design classes at my university teach Agile software development methods. The main idea is breaking a project down into small iterations, each one implementing a subset of the complete specification. After each iteration, the software can be handed off to testers and clients/users for evaluation, allowing for requirements to change dynamically during the development process. Agile seems to be the exact opposite of what is advocated in this article (invest money and resources at the beginning of the project and have a concrete specification before you write a single line of code).

People are less well understood than physics, so it makes sense that interfaces need to be tested on people at earlier stages of development than a program which is just interacting with non-sentience does, even without getting into the problem of making a product for people who don't quite know what they want until they're shown something that isn't it.

Is there a third team making what looks like real code, with errors hidden as cleverly as possible, that is given to the debugging team and not distinguished from real lines of code?

If not, why not?

I told my PhD advisor that I had inserted the sentence "The whale is the largest living mammal." somewhere in my thesis about superconducting devices. I then did NOT insert the sentence. My theory was if I had put the sentence in then he would have realized he could stop reading so carefully once he found it.

How did this go over with your advisor? (Serious question.)

He was a tenured professor that ran large observatories. He was always more interested in what I could do technically than in whether I approved of him or not. I don't think he paid any attention to my comment about the whale, although he did give me good comments on my thesis.

I did include quotes from "Moby Dick" at the tops of some of the chapters, including "...I hold that the Whale is a fish [not an animal]..." which I had at the top of a chapter which was the quantum theory of the SIS device I was working with. So there were whale quotes in there for him :)

[-][anonymous]10y 16

That would consume development and testing resources that could be put to better use. For example:

The Backup Flight System (BFS) was separately developed software running on the fifth computer, used only if the entire four-computer primary system failed. The BFS was created because although the four primary computers were hardware redundant, they all ran the same software, so a generic software problem could crash all of them. [...] While the BFS could run in parallel with PASS, the BFS never engaged to take over control from PASS during any shuttle mission.

That is called Crazy Prepared.

This very system was responsible for the failure of the first shuttle launch. It turns out that there was a 1/64 chance that the clocks in the primary system and the clocks in the BFS would fail to synchronize on startup, and the very first scheduled launch hit the jackpot: the clocks didn't sync, and NASA had to scrub the mission, and spend weeks figuring out what went wrong.

[-][anonymous]10y 0

Interesting, I didn't know that! Wikipedia says "A launch attempt two days earlier was scrubbed because of a timing problem in one of Columbia’s general-purpose computers." and NASA says "Launch April 10 postponed due to timing skew in orbiter's general purpose computer system. Backup flight software failed to synchronize with primary avionics software system.", but this article explains the problem in extreme detail.

The hard part of programming is figuring out what you want the program to do in the first place. The reason that it's hard is because once you've defined what you want to implement precisely enough to tell a machine, you've already written the code, even if it's just in a highly compressed form inside your head.

This is somewhat implied in the article-- part of the style described is doing a lot of upfront work on planning, rather than charging forward to code something fast.

However, I talked about this with Eric Raymond over the weekend, and he pointed out a problem which was not implied in the article and which I hadn't thought of-- what he calls corner cases. If I remember correctly, these are unexpected interactions between parts of the programs and/or what might happen when a program gets unexpected input. [1] Modularity helps, but not enough, and they increase rapidly as programs get more complex.

It seems to me that self-amplifying AI is an optimal way of creating more corner cases.

[1] Corner cases-- wikipedia seems to give a slightly different definition. Either one is going to be a very complicated challenge.

[-][anonymous]10y 5

I wonder how similar this methodology is to that used by, say, Boeing or a major car manufacturer for their automated systems. You don't hear about a lot of 777s crashing due to software bugs, but I doubt that they go through this same insane process. The software I write is unlikely to kill anyone but it could certainly lose a lot of money if it did the wrong thing. It has bugs all the time and I am always changing it so it will never be anywhere near bug free. What I do to avoid getting destroyed is put in safety features to keep bugs from causing major problems, and test stuff out to try to eliminate serious bugs before production. That works pretty well for me.

Overall, the space shuttle had a lot of bugs if we include mechanical systems, and it killed a bunch of people. Given that, I don't know how impressed we should be that they had a really cumbersome software development process.

This summary (I haven't read the original article) has a lot of what I think of as ordinary advice about software, except the dress code and hours. I'm not sure how widely they're actually followed today, but I think most people who voice an opinion about what to do agree on them. But the article was written in 1996; I think NASA was ahead of its time.

Yes but this sort of thing caused people a lot of trouble. If you've ever had anyone try to inflict CMM on your software shop you can pretty much blame NASA.

Thanks for this. This article gave me an insight. Part of winning is getting better software. Why don't we use our winning approach to develop a rational programming methodology?

A rational programming methodology is just what's outlined there, it seemed to me.

But to simplify it, start with code review — having someone else look over each line of code, attempting to find what's wrong with it. Code review directly addresses a range of cognitive biases simply by adding a second person with different intentions as an obstacle to checking in bad code.

Then when something goes wrong anyway, follow up in a way that actually reduces the chances of future problems. First and foremost, never give anyone an incentive to conceal problems. Don't shame people for writing bugs. Instead, find out why the bug escaped review, and fix that. Second, after deciding what to do to prevent future problems, actually do so. It's easy to say you're adopting new techniques; harder to actually adopt them.

What's outlined there is rational under a particular set of tradeoffs. If a typical software company implemented that methodology correctly, they would go out of business, because they would take longer than their competitors to produce a more expensive product. Most of the things that software is used for simply don't need those extra nines of reliability.

As a user of those apps, I strongly disagree. If I add up the time spent waiting for crashed software, filing bug reports and troubleshooting incompatibilities, I have lost a considerable portion of my lifetime.

Also, if software "just works" my company could save millions on the IT support department. And yes, we would be willing to spend some extra money if a software manufacturer can back up these claims.

You can disagree if you want to. If I disagreed, I would feel bound by rationality to explain why the customers are so stupid as to make these same buggy software products among the most commercially successful endeavors in human history.

I endorse explaining things. That said, you make it sound like the existence of a thriving market for cheap low-quality goods is much stronger evidence against the existence of a market for expensive high-quality goods than it seems to me to be.

Hrm? I had taken mwengler to be making a different point: the lack of a market for high-quality software outside life-critical applications suggests that such software is not cost-effective to produce.

Bingo on asr. Engineers and economists do the same thing: optimize. It is as expensive a mistake to put $1 billion more into something than it is worth as it is to put $1 billion less into something than it is worth.

The overwhelming success of markets for software at its current quality is not indicative of a failure of the market or even of the software. It indicates that the tradeoff between fixing bugs, new features, delay, and more development money is about right where it is, and that higher-quality software might even exist and simply not make money.

It is tremendously important to realize in economics, engineering, and probably other fields, that perfection is infinitely expensive and is therefore provably NOT the goal.

There's one important caveat here, which I want to call attention to. There are externalities here. Some of the cost of bad software is paid by people out across the network who receive spam, DDoS attacks, etc., that would have been prevented if I had run a more secure system. So it might be that the economically optimal level of software quality is higher than the current market would imply.

That said, I agree the optimal level is probably far short of perfection. It happens regularly that some program on my machine will crash (without affecting the rest of the system). I'm not willing to pay very much to reduce the rate of such events.

This still leaves the possibility that people are underestimating the cost to them of fairly unreliable software. Lowering the threshold to effective action can make a big difference.

Yes, but if your company were actually presented with such reliable software, the answer would be "well obviously we meant software that otherwise does what we want. This stuff doesn't have half the features we need, and it's almost completely unusable. We can't deploy this, or we'll be getting five calls about usability issues for every call we used to get about crashes and compatibility problems."

Bottom line: what you trade away with the NASA approach isn't only money. It's also development speed. Okay if the application remains unchanged for three decades and the users spend a few years of their lives doing nothing but training, not so good otherwise.

And yes, we would be willing to spend some extra money if a software manufacturer can back up these claims.

But how would one back up these claims? The difficulty of verification is one reason software markets sometimes resemble lemon markets.

Have there been any studies showing whether code review reduces bugs? Because I can certainly think of a case or two where I have reviewed code and not noticed a bug -- even a non-subtle bug.

Congratulations, you've just made the first mistake. :)

Here's a quick exercise in detecting bias. Can you find the article's most glaring omission?

[This comment is no longer endorsed by its author]

What comes to my mind is that a local optimum is being described -- minimum software errors given the shuttle program, the large budget, the nature of the industry, etc (which are mentioned) -- without consideration of the preferable (to many people) maximum of ubiquitous cheap spaceflight. Developments that have occurred since 1996 seem to highlight this.

My bad, I missed that this went on for several pages. What I had in mind is in fact covered. (But by no means featured as prominently as you'd expect.)

The "cost" part of the tradeoff. How much more should you expect to pay to get the same functionality.

This is alluded to at the end, but too briefly IMO:

the group's $35 million per year budget is a trivial slice of the NASA pie, but on a dollars-per-line basis, it makes the group among the nation's most expensive software organizations.

A secondary part of the same question is whether some of these precautions might in fact be excessive - is there any activity that you could not do and still have the same functionality at the same quality level.

More generally the question is one of credit assignment - an issue not just relevant in "methodology" but in learning theory as well, even up to AI theory: which parts of the "process" are to be held necessary and sufficient for the results, and in fact what ought to count as "process"? (For instance, is it even possible to change the people, holding "process" constant, and still get the same results? If it is possible, what are the relevant characteristics of the individuals?)

The following sentence from the article is obviously a lie, journalistic sensationalism of the same kind that leads to "IBM emulates a cat's brain" headlines:

The process can be reduced to four simple propositions

It's never that simple. And that is a key issue with the article overall: it wants to boil down something intrinsically complex to a few simplistic and easily stated conclusions.

Do you have any ideas about what the lurking complexity might be?

One is social structure. 260 people is a large group, and there has been research suggesting that social structure is a more effective predictor of software defects than "technical" metrics.

With this size group coordination issues are going to loom large, so the way meetings are planned, organized and run will play a critical role in the group's "distributed cognition". The description of the relationship between "coders" and "verifiers" is tantalizing but almost certainly oversimplified.

Another is design. You don't get near zero defects on half-million-line codebases without some strong design principles: modularity, avoidance of data coupling, and so on. Popular press won't mention that because people's eyes will glaze over, but I'm pretty sure that this code isn't sprinkled all over with global variables. The article says "one-third of the process of writing software happens before anyone writes a line of code" - but then goes on to reveal practically nothing about this early part of the process, other than the cliché that it produces a lot of documentation.

The article makes a dangerous claim that its four broad conclusions generalize widely: it claims to "illustrate what almost any team-based operation can do to boost its performance to achieve near-perfect results". The problem is that taking this kind of advice too literally leads directly to "cargo cult" software engineering.

You can easily make developers write thousand-page design specifications but that in no way guarantees defect-free code - in many cases reliance on written documentation is in fact a direct contributor to poor quality, insofar as a more interactive form of communication offers more opportunities for detecting and correcting errors.