Software Design Review

by Philip Greenspun and Andrew Grumet, October 2009
In the spring of 2009, a friend who runs an ecommerce Web site asked one of the authors (Philip) for help explaining why his application was running so slowly. He had paid an MIT-trained programmer with 20 years of experience $200,000 to build it and was paying a hosting service $1100 per month to run it. Despite the site having only one user every hour or so, being in a soft-launched state, pages took up to 5 minutes to load. Philip said "This will be easy. Just show me the design documentation." He replied "What do you mean?" Philip said he wanted the document where the programmer set forth what problem he was trying to solve, how large a data set was being handled, what each server did, what software had been selected and why, where the files and programs were on the server, and what the data model was.

"There isn't any documentation," replied the business guy who had created the idea and written the checks to the programmer. Queries to the programmer revealed that he was almost as ignorant of the answers to the preceding questions as his boss. He knew that he was using Ruby on Rails and MySQL, but not how many gigabytes of data were required to produce all of the public pages of the site. Philip eventually was able to get some good information from the hosting service's sysadmins, e.g., the size of the MySQL database and the amount of RAM on each virtual "slice" being used to run the HTTP servers and the RDBMS. Once the virtualized model was chucked in favor of the cheapest Dell pizza box server with 16 GB of RAM (about $500 worth at the time), the time required to produce a page fell from 5-10 minutes to no more than a few seconds. Hosting costs were reduced from $1100 per month to less than $100. However, our friend was not able to recover the customers lost during the months of poor performance.

What would have saved this business? An external design review.

The Fundamental Problem: Business People Aren't Technical

It is almost impossible for business people to manage technical people. Because the business people have no authority to challenge technical decisions and because there are no published standards for how software development is to be done, the programmers can almost always snow the business people with convoluted stories about why something has to be done a certain way.

Adding to the challenge is America's corporate self-esteem culture. The average programmer does terrible work, producing bug-ridden code with non-existent documentation. However, it is outside the realm of acceptable discourse for a manager to say "this is terrible work."

Best Practices from the Most Successful Software Companies

How do the most successful software companies handle these problems? Many are run by technical people, who cannot be snowed. Bill Gates of Microsoft is an obvious example (and the company has stumbled ever since the accession to the throne of the less technical Steve Ballmer). Sergey Brin and Larry Page of Google provide another. Both Microsoft and Google have cultures of code review in which programmers are required to present designs to others within the organization. The most successful software companies tend to have a fairly blunt corporate culture, in which it is common for harsh criticism to be delivered (see this 2008 newspaper article about Bill Gates and Microsoft).

External Design Reviewers

What if your company doesn't have a technical management team like Microsoft's? Or if your company doesn't have an unbiased group of great software engineers working on a separate project? Or if your company culture doesn't allow for straight criticism?

Bring in an outsider.

Even if you can't attract excellent technical people to work all year every year on your boring IT systems, you can probably find an excellent software developer to come in for a few half-day review sessions. The outsider won't have any bias or preconceived notions about particular divisions of your company. The outsider won't have to worry about hurting anyone's feelings by saying "You need to do X, Y, and Z."

The design review process outlined here is described in terms of the development of a multi-user application program, such as a Web-based service for a group of collaborating employees or for a public Web site. These services are typically backed by a relational database management system such as Oracle or SQL Server. However, the process should be useful for any other kind of computer application where there are decision-makers, programmers, and end-users.

The process outlined here is based on the experience of the authors with more than 300 database-backed Internet application programs and roughly 60 years of experience as computer programmers.

Review Stages

In an ideal world, you'd bring in an external design reviewer at five project stages: scope and tool selection; page flow and data model (or, for a non-Web application, user interface and data structures); post-prototype; pre-launch; and post-launch/maintenance.

At every stage, the developers should prepare for a meeting with the design reviewer by writing draft documentation. The design reviewer should submit questions raised by the documentation prior to the meeting, giving the developers time to revise the documentation. The actual meeting should be a working session in which the documentation is modified in real time, possibly with some sections marked for further research.

Let's go through the stages to see what questions should be answered by documentation.

Scope and Tool Selection

Most software need never have been written at all. Companies will spend tens of thousands of dollars on custom development of a Web application, never having asked "Perhaps we should just use a standard free and open-source Weblog toolkit with our own style sheet and four custom pages." (See "Weblog as Website for the Small Organization".)

The design document produced at this point should answer the following questions:

The document should contain, as attachments, a few user profile pages showing typical expected users of the software and what they will be doing with it. See the Planning chapter of Software Engineering for Internet Applications, Exercise 1b, for more on how to build user profile pages.

"How are specifications to be communicated to the development team?" is an important question. There is nothing more wasteful than a group of skilled programmers building the wrong thing.

The data size and computational intensity questions are important for figuring out what kinds of servers will be appropriate to host the application.
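A back-of-the-envelope version of the data size question can be written down in a few lines. This is a minimal sketch, not a sizing method: the function name `fits_in_ram`, the 50 percent headroom figure, and the sample sizes are all assumptions chosen for illustration, and a real answer depends on the working set, the RDBMS, and everything else running on the machine.

```python
def fits_in_ram(dataset_gb: float, ram_gb: float, headroom: float = 0.5) -> bool:
    """Return True if the data set fits in the RAM left over after headroom.

    headroom is the fraction of RAM reserved for the operating system,
    the HTTP servers, and RDBMS overhead (an assumed figure, not a rule).
    """
    usable_gb = ram_gb * (1.0 - headroom)
    return dataset_gb <= usable_gb


# Hypothetical numbers: a 4 GB database on a 2 GB virtual slice
# versus the same database on a 16 GB physical server.
print(fits_in_ram(dataset_gb=4, ram_gb=2))
print(fits_in_ram(dataset_gb=4, ram_gb=16))
```

If the first answer is "no", the design document should say what happens to page load times when the database has to go to disk for nearly every request.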

Within the "software development tools" section there should be at least one paragraph on version control. Is a standard system such as subversion or git going to be used? How can a programmer restore code as it existed at a previous point in time? Is the repository stored on a separate computer or hard drive so that it may be able to function as a backup copy? Can a big change be isolated from the current production line through branching?

Page flow and data model/User interface and data structures

The data model stage is where a tremendous amount of complexity can be engineered out of a system. For a standard Internet application, every SQL table typically requires the construction of five or more Web pages for the end-user (browse, search, view, add, edit, delete) and another five or so for the administrators (more or less the same functions, but with broader access). That's 10 computer programs, each of which may need to be debugged and maintained. Many very experienced C and Java developers are barely competent in SQL and data modeling. It is common for an expert SQL programmer to be able to reduce a 20-table data model down to 5 or 10 tables.
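The arithmetic behind that claim is simple enough to sketch. The figure of 10 programs per table comes from the paragraph above; the function name `page_programs` is hypothetical, and real applications will deviate from the rule of thumb in both directions.

```python
PAGES_PER_TABLE = 10  # roughly five end-user pages plus five admin pages


def page_programs(table_count: int, pages_per_table: int = PAGES_PER_TABLE) -> int:
    """Estimate how many page scripts a data model implies."""
    return table_count * pages_per_table


# Compare the original 20-table model with the simplified alternatives.
for tables in (20, 10, 5):
    print(tables, "tables ->", page_programs(tables), "page programs")
```

Under this rule of thumb, cutting a data model from 20 tables to 5 removes on the order of 150 programs that would otherwise need to be written, debugged, and maintained.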

Conversely, an expert SQL developer can often tell whether or not the data model stores insufficient information to fulfill the requirements. As a simple example, suppose that in an electronic medical record data model first and last names are stored in one column. A SQL developer can glance at the table definition and observe that it won't be possible to produce a list of patients sorted by last name.
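Here is a minimal sketch of the fix for the patient-name example, using Python's built-in sqlite3 module so that it runs without a database server. The table name, column names, and sample patients are hypothetical; the point is only that a separate last_name column makes the sorted report a one-line query.

```python
import sqlite3


def patients_sorted_by_last_name(rows):
    """Load (first_name, last_name) pairs and return them sorted by last name."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE patients ("
        " patient_id INTEGER PRIMARY KEY,"
        " first_name TEXT NOT NULL,"
        " last_name  TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO patients (first_name, last_name) VALUES (?, ?)", rows
    )
    # Sorting by last name is only possible because it has its own column.
    result = conn.execute(
        "SELECT last_name, first_name FROM patients"
        " ORDER BY last_name, first_name"
    ).fetchall()
    conn.close()
    return result


for last, first in patients_sorted_by_last_name(
    [("Zoe", "Adams"), ("Al", "Young"), ("Bea", "Miller")]
):
    print(f"{last}, {first}")
```

With a single full_name column the same report would require guessing where the first name ends, which fails for multi-word names.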

Page flow more or less determines the complexity of the application for end-users. If it requires 15 steps to accomplish a task, that will be slower and require more training than if it takes 5 steps. For a consumer-facing Internet site, a sufficiently complex page flow will almost guarantee commercial failure. If you can't make money unless every user has an IQ over 130 and is extremely motivated to learn a complex interface, well, you can't make money.

For a non-Web application, the equivalent items to review are the user interface and the data structures in memory and on disk.

In all cases, at this point a draft development standards document should be available to review. This lays out simple questions such as file, URL, and variable naming conventions. It also addresses planned documentation for modules and procedures. The development standards include how configuration variables are named and added. Finally user input data validation and security are addressed.

This is also the stage at which procedures for internal code review should be documented. The external design review process described here is not a substitute for continuous internal reviews. At Google, for example, every check-in to the version control system must be reviewed by at least one other programmer. This sounds cumbersome. What if the change is to fix a typo in a comment? It gets reviewed! But somehow Google has managed to prosper and this blog entry explains how the process is supported. We're not suggesting that Google's process is right for every project, but there should be some documented internal code review process.

Post-Prototype

At this stage, a skeletal version of the application is up and running and some testing has been done with potential users. The design review should be looking at the following questions: The usage and performance tracking plan is important because the people who paid for the application are going to ask "How many people are using this? Why aren't there more? Where are people giving up? How long are pages taking to load?"
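One way to be ready for those questions is to record a timing for every page served. What follows is a minimal sketch under stated assumptions: the class name `PageTimer`, its methods, and the URLs are hypothetical, and a production application would more likely rely on its Web server logs or an existing monitoring tool than on hand-rolled code.

```python
import time
from collections import defaultdict
from statistics import mean


class PageTimer:
    """Collect per-URL page generation times and summarize them."""

    def __init__(self):
        self._samples = defaultdict(list)  # url -> list of seconds

    def record(self, url, seconds):
        """Store one observed page generation time."""
        self._samples[url].append(seconds)

    def timed(self, url, handler, *args, **kwargs):
        """Run a page handler, recording how long it took even if it raises."""
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            self.record(url, time.perf_counter() - start)

    def report(self):
        """Return hits, mean, and worst-case seconds for each URL."""
        return {
            url: {
                "hits": len(samples),
                "mean_seconds": mean(samples),
                "max_seconds": max(samples),
            }
            for url, samples in self._samples.items()
        }


timer = PageTimer()
timer.timed("/home", lambda: "<html>hello</html>")  # hypothetical handler
timer.record("/cart", 1.0)
timer.record("/cart", 3.0)
print(timer.report()["/cart"])
```

The hit counts answer "How many people are using this?" and the worst-case times answer "How long are pages taking to load?" Finding where people give up requires tracking sequences of pages per user, which this sketch does not attempt.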

Pre-Launch

At this point the software is installed on the production servers and the organization is a week or two away from "throwing the big switch". The documentation at this point should be good enough that if all of the programmers who worked on the application were hit by a bus, a replacement team could step in and keep the application running.

A critical set of documentation to review at this point concerns the hosting of the application. Where are the servers? If colocated, how does one get physical access to them? What is the network layout? Firewall configuration? What is each server named and what is its IP address? What software does each server run and in what directories is that software located? What hard drives are in each server and what does each drive do? What single disk drive failures will bring down the application? (The answer to this should be none!) What single machine failures will bring down the application? (Oftentimes the RDBMS server failing will bring down the application and this is more acceptable than the cost of redundant RDBMS servers.)

If using an RDBMS it becomes critical to document the RDBMS server configuration. A small RDBMS server might have 10-20 hard disk drives. Why so many? Consider a single update to a table with two indices. This requires writes to the table, index1, index2, and the transaction log file, i.e., to four separate files. If those four separate files are on four separate hard drives the four writes can be processed in parallel and database updates can proceed approximately four times as fast compared to keeping everything on one hard drive. The four drives will need to be mirrored so that the failure of a single drive does not result in data loss or application downtime. Now we have 8 physical disk drives on the server. You wouldn't want the operating system's day-to-day demands interfering with those of the database nor would you want the OS crashing down in the event of a hard drive failure. So we add two more disks in a mirrored pair to support the OS. Our minimum size server now has 10 physical disks. The design choices for the RDBMS server have huge implications for performance, reliability, recoverability, and maintainability. They need to be documented partly so that they can be reviewed but mostly so that the system can be maintained.
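The disk count above can be checked with a few lines of arithmetic. This is a sketch of the reasoning in the preceding paragraph only: the function name `physical_disks` is hypothetical, and it assumes one drive per table, per index, and for the transaction log, plus one drive for the operating system, with every drive mirrored.

```python
def physical_disks(index_count: int, mirrored: bool = True) -> int:
    """Count drives for one table, its indices, the log, and the OS."""
    data_drives = 1 + index_count + 1  # table + indices + transaction log
    os_drives = 1                      # operating system kept off the data drives
    copies = 2 if mirrored else 1      # mirroring doubles every drive
    return (data_drives + os_drives) * copies


print(physical_disks(index_count=2))                   # 10
print(physical_disks(index_count=2, mirrored=False))   # 5
```

A table with more indices, or a second heavily updated table, pushes the count toward the upper end of the 10-20 range mentioned above.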

A release plan should describe how minor changes and new full releases are pushed to production. How are changes to procedural code and SQL data models to be coordinated? What are the names of the development and staging servers? What steps must be taken and who has to sign off before what is on staging can go to production? What is the procedure for backing out from a new release if things aren't going well?

Quality assurance and performance testing procedures and results should be reviewed at this point. Given that a lot of modifications are likely to be made shortly after launch, it is important that a testing plan is in place to make sure that new bugs aren't introduced when old bugs are fixed and when new features are added.

Conformance to the development standards should be evaluated at this point. Are file and variable names consistent? Are modules, procedures, and data models documented sufficiently and according to the standards?

Post-Launch/Maintenance

At this point the external reviewer should perform an audit to make sure that the hosting documentation is consistent with any new servers that might have been added. This is also the time to review the data recovery (programmer drops a table by mistake) and disaster recovery (server room is destroyed by fire) plans.

A sweep through all of the earlier documents should be made to ensure consistency with the final product. Remember that a new person coming onto the team should be able to go back to the documents produced during the Scope and Tool Selection review and figure out why custom software was built instead of adapting an existing open-source tool.

Finally, the development team should put together a writeup document that, on one Web page, explains what the application does and why it is useful, complete with screen shots so that the reader need not actually be sitting in front of the running application. See the Writeup chapter of Software Engineering for Internet Applications for examples.

Conclusion

Programmers will not keep themselves honest. If left to their own devices, they will skimp on anything that is necessary but not fun. This includes planning, documentation, and testing. Only a review by an unbiased external reviewer can give a non-technical management the ammunition it needs to get programmers to behave like engineers.

The cost of this process should be minimal. All of the documents that are required for the design review are documents that should be produced in any competently executed software development effort. The cumulative number of hours required for an external expert to conduct all five reviews suggested in this document should be roughly 100. With software experts available at anywhere from $100 to $300 per hour, that's $10,000 to $30,000 in costs to guard against the following horrifyingly expensive situations:

About the Authors

Philip Greenspun has spent more than a decade nagging industrial programmers and students to document their design decisions (resume).

Andrew Grumet is the Vice President of Engineering at Mevio, a Kleiner Perkins-funded Internet media company. Grumet has a Ph.D. in Electrical Engineering and Computer Science from M.I.T. (resume).

We are grateful to and have incorporated some thoughtful comments from Arthur Gleckler, a senior engineer at Google, and John Patrick Morgan, a recent graduate of Olin College of Engineering.


philg@mit.edu

Reader's Comments

In my years as a software engineer, it has always been business people that resisted the formal software engineering process. The engineers are the ones who see and fear the complexity of major projects but business people often do not. Perhaps your thesis ought to be that software firms run by engineers succeed due to an appreciation of these risks and hence are willing to finance the risks of mitigating them? Secondly, note that I did not claim that business managers are 'lazy' or incompetent. If one has respect for the professional software engineering regimen, one should at least feign respect for the professionals who also advocate it.

-- Bob Zi bub, October 20, 2009
"Programmers will not keep themselves honest."

I've seen senior, principal, and senior principal programmers who can write beautiful algorithms, construct advanced object hierarchies, and implement what many would consider to be elegantly architected solutions. Those senior programmers will give you a perfect solution; yet still you'll run into issues mentioned where pages will take five minutes to render. Confront these senior programmers with the issue, and the response will be something along the order of "what's the problem, the solution is perfect?". And because these programmers are considered experienced, management will often back them up.

There is a certain class of programmers who will keep themselves honest. This class of programmers has had to deal with customers directly. They have experienced the mistakes of not documenting, not testing, and writing cool, clever, elegant code instead of code that may be ugly but robust. They will provide some documentation, but not enough that it is easily outdated. They will test the solution, but not to the point where the tests become academic. They will write cheap hacks into their code because they realize the end goal of a happy customer is infinitely more valuable than another page in their portfolio of a textbook perfect solution.

-- Fred Moyer, October 20, 2009

Remembering my control systems classes in college, there are three primary faults here: bad control module, bad sensors, and poor feedback loops. The control module takes in data from the sensors (feedback) and adjusts the system's outputs to keep it on track. If any one of the three (control module, sensors, feedback loops) is faulty, the outcomes will not be as expected. If it is an electrical system and the electricity goes out, then the system stops.

Let's use this as an analogy for the ecommerce project in your article and map control system components to the project:

If there is no goal or vision document, then there is no 'why' to the project. I have central heating and cooling for a reason: so I can be comfortable in my own home during hot and cold weather. That's the purpose of my HVAC system, and if I didn't have that goal or purpose, then there would be no reason for me to expend the money and effort to buy and operate the system.

For the ecommerce site, the purpose may be more complex -- at its root the purpose is to make more money than is spent on developing and operating the system, what I'll call the 'generic' purpose; others might call it profit. But for that outcome to occur, the system must provide something the customers who use it actually want and are willing to pay for, what I'll call the 'specific' purpose. Some people call the explanation of this specific purpose a business plan. It's important for the programmer to understand the purpose or business plan so they can assess their own progress in developing the system. The better the programmer understands that purpose, the less tight the communication feedback loop between the programmer and the business person has to be. I'm quite willing to admit that in many cases there is no initial vision or goal; people often putter around and end up with things like twitter and so on kind of by accident. They start somewhere simple, and evolve based on feedback or interest without having some grand scheme or goal in mind at the beginning. That's OK. Serendipity is a fine business plan if it doesn't make you broke, and these days it costs very little to create online applications.

If there is no design document, spec or whatever you want to call it, for the system you're creating (or in the simplest case, a clearly understood goal the system must meet) then there is no basis of measurement by which to determine whether the system needs to adjust itself, and therefore no basis for knowing when your implementation is working properly or 'done'. Projects that have no defined goals can't be considered failures because they have nothing to shoot for, no basis for measuring success. They're simply wastes of money, or jobs programs.

The system you are creating has its own goal or purpose that is separate and distinct from your business plan: the purpose or goal of my HVAC system is to maintain the temperature inside my house as close to what I have it set to as is reasonably possible. My HVAC system does not care whether I'm comfortable or not; its goal is only to maintain the temperature within specified bounds. This is where you can more easily define the boundaries around the system you're building -- you must decouple your business purpose, goals or reasons from the technical goals or reasons that the system will embody. The system doesn't know you want to make money, and it can't tell that you're losing money and come up with ideas for how it can adjust itself to make you happy.

The control module (i.e. thermostat) analyses the data coming in from its feedback loop(s) (the temperature sensor or sensors), compares that to the 'goal' I have given it (maintain the ambient air temperature inside my house at 75 degrees Fahrenheit), and turns on, shuts off or leaves in its current state the heater or A/C depending on whether the temperature is within the bounds I set for it or, if not, whether the measured temperature is too high or too low against what I've set it for.

The control module in our mapping is the programmer: if the programmer doesn't understand the system's technical goals (and ideally the business plan so perhaps he or she may catch where the system's design spec may fail to provide your business outcomes), or ignores the data coming in (which is the same thing -- not knowing the goal or not seeing whether what you're doing is leading to the goal is really the same as not having a goal), then there is no way for the project to be successful from a business plan perspective. And the business person, unless they themselves are technical, will have no chance of figuring out what needs to be done or why things aren't working the way they want.

So, to recap:

If the control module is faulty (not appropriate for the task at hand, ignoring the feedback inputs, etc.), or the feedback loops are bad (the data is being corrupted between the sensors and the control module), or the sensors are faulty or not appropriate to the data collection task (they are collecting stock prices instead of temperature readings), then it's hard to understand how you can possibly have a successful business outcome. Of course, having your system work as you specified doesn't mean your business idea will actually make you money, but that's not the system's problem. And if you cannot establish the parameters the technical system itself must meet, then it is an open-ended project that cannot succeed technically except by sheer luck or because you have a programmer smart enough and self-constrained enough who will figure out what the system's technical goals need to be as they do the work.

Keep in mind that this is only a thought experiment, a way to model the dynamics of the project to provide insight into where the issues might be. If you lose sight of the fact that the programmer, bill payer and possibly other 'parts' of this system are human beings and not machines or parts of a machine then you will likely treat your programmers and other people as machines and the means to your ends instead of as human beings.

/s.



-- Scott Goodwin, October 22, 2009
Great article, couldn't agree more. As a software architect with about 30 years' experience in the field, I also have spent untold hours "nagging" developers and architects alike to document not only their design decisions, but their code! Software development should be easy, but this article is another of many showing how cutting corners leads to disaster. Cutting corners almost *always* lowers value and ROI for software projects.

In my own most recent experience, inside one of the largest and (formerly) most successful online advertising networks, I saw at close hand the truly epic failures of a company-wide initiative to refresh the entire adserving platform (legacy UNIX to .NET). This involved a technology division of at least some 150 folks. Management made all the *classic* software development mistakes: hiring outside consultants to do the job, for instance; having *way* too few architects, for another. They soldiered on through YEARS of poor designs, missed dates, browbeaten teams, fired consultants, and literally tens of millions of dollars burned into thin air. I personally expended most of my political capital trying to get them to do the right thing, in as many ways as I knew how. Now this company, recently worth billions of dollars, is having its workforce slashed and is reduced to a mere shadow of its former self.

How did this happen? A total disregard for creating and fostering any type of software engineering culture. REAMS of requirements were drawn up -- but no one really understood them. Designs were presented -- but they weren't conceptually coherent and didn't work. Project plans were constructed, time and time again -- but they almost never delivered on their commitments. Software development was started and re-started -- but it was almost always late, and much of it had to be re-written. And so on. Classic.

And *why* did this happen? There was a management culture of arrogance, especially on the technical side. One simply cannot labor on for so many years without asking the simple question: "why could we not deliver on our commitments?" These hard questions *are* asked, as Phil points out, in companies like the Microsofts, the Googles, the Amazons, where the leadership is steeped in technical knowledge and cannot have the wool pulled over their eyes.

Personally, I'm proud of my record of success on projects: I have learned a lot of lessons from a lot of great folks. I lead teams with relative ease to meet committed dates, with high-quality, well factored, scalable, maintainable, monitorable and deployable software solutions.

To add to the points made by this article, it's a simple matter of discipline: knowing WHAT to pay attention to, namely all the elements of a software lifecycle and the value it creates for the business. The entire team must be clear on requirements. The design must be simple and justified against its alternatives. An appropriate level of test suites must prove out that the requirements have been met. The software must be easy to deploy, and when running it should be easy to monitor that it is, indeed, running correctly.

All these elements of software discipline are borne out of a single, simple, focused practice: the virtuous cycle of continuous improvement. It's OK to fail, once; but then one must stop and ask a sincere "why," decide how to prevent it in the future, and incorporate this literally into the DNA of the technology teams via an updated software process. Likewise, we want to also ask why we succeed and incorporate that as well. The result is that the team or organization always continues to get better and better with time.

So, an emphatic "yes!" to this article: most software is horribly written, and, unlike hardware design (circuit boards, bridges, etc), the creation of software machines is largely practiced as more of a black art than a science.

The one point I would take issue with is that "programmers will not keep themselves honest." While I agree with the result -- poor quality software -- most software developers really do want to do the right thing. It's just that they are not given the proper guidance on *how* to do it. Most developers are happy if not eager to understand how to frame, design, develop, and deploy software solutions that deliver real value to the business.

In the end, if your software project is having trouble, it's likely you may need more and better architects. No one would attempt to build a house of any note without an architect - imagine telling master bricklayers, carpenters, and plumbers to "just do it" without first drawing up detailed plans. It's the same with software.

In the end, you have to measure yourself by your results.

- Keith

-- Keith Bluestone, December 21, 2009
