How failed: the architectural aspects

(Read parts 12 and 4.)

If you were building a house that you didn’t quite know how it would turn out when you started, one strategy would be to build a house made of Lego. Okay, not literally as it would not be very livable. But you might borrow the idea of Lego. Each Lego part is interchangeable with each other. You press pieces into the shape you want. If you find out half way through the project that it’s not quite what you want, you might break off some of the Lego and restart that part, while keeping the part that you liked.

The architects of had some of this in their architecture: a “data hub” that would be a big and common message broker. You need something like this because to qualify someone for health insurance you have to verify a lot of facts against various external data sources. A common messaging system makes a lot of sense, but it apparently wasn’t built quite right. For one thing, it did not scale very well under peak demand. A messaging system is only as fast as its slowest component. If the pipe is not big enough you install a bigger pipe. Even the biggest pipe won’t be of much use if the response time to an external data source is slow. This is made worse because generally an engineer cannot control aspects of external systems. For example, the system probably needs to check a person’s adjusted gross income from their last tax return to determine their subsidy. However, the IRS system may only support ten queries per second. Throw a thousand queries per second at it and the IRS computer is going to say “too busy!” if it says anything at all and the transaction will fail. From the error messages seen on, a lot of stuff like this was going on.

There are solutions to problems like these and they lay in fixing the system’s architecture. The general solution is to replicate the data from these external sources inside the system where you can control them, and query replicas instead of querying the external sources directly. For each data source, you can also architect it so that new instances of it can be spawned on increased demand. Of course, this implies that you can acquire the information from the source. Since most of these are federal sources, it was possible, providing the Federal Chief Technology Officer used his leverage. Most likely, currency of these data is not a critical concern. Every new tax filing that came into the IRS would not have to be instantly replicated into a cloned instance. Updating the source once a day was probably plenty, and updating it once a month likely would have sufficed as well.

The network itself was almost certainly a private and encrypted network given that privacy data traverses it. A good network engineer will plan for traffic ten to a hundred times as large as the maximum anticipated in the requirements, and make sure that redundant circuits with failover detection and automatic switchover are engineered in too. In general, it’s good to keep this kind of architecture as simple as possible, but bells and whistles certainly were possible: for example, using message queues to transfer the data and strict routing rules to handle priority traffic.

When requirements arrive late, this can introduce big problems for software engineers. Based on what you do know though, it is possible to run simulations of system behavior early in the life cycle of the project. You can create a pretend data source for IRS data that, for example, always returns an “OK” while you test basic functionality of the system. I have no idea if something like this was done early on, but I doubt it. It should have been if it wasn’t. Once the interaction between these pretend external data sources was simulated, complexity could be added to the information returned by each source, perhaps error messages or messages like “No such tax record exists for this person” to see how the system itself would behave, with attention to the user experience through the web interface as well. The handshake with these external data sources has to be carefully defined. Using a common protocol is a smart way to go for this kind of messaging. Some sort of message broker on an application server probably has the business logic to order the sequence of calls. This too had to be made scalable so that multiple instances could be spawned based on demand.

This stuff is admittedly pretty hard to engineer, and is not the sort of systems engineering that is done every day, and probably not by a vendor like CGI Federal. But the firms and the talent are out there to do these things and would have been done with the proper kind of system engineer in charge. This kind of architecture also allows for business rule changes to be centralized, allowing for the introduction of different data sources late in the life cycle. Properly architected, this is one way to handle changing requirements, providing a business-rules server using business rules software is used.

None of this is likely to be obvious to a largely non-technical federal staff groomed for management and not systems engineering. So a technology advisory board filled with people who understand these advanced topics certainly was needed from project inception. Any project of sufficient size, scope, cost or of high political significance needs a body with teeth like this.

Today at a Congressional hearing officials at CGI Federal unsurprisingly declared that they were not at fault: their subsystems all met the specifications. It’s unclear if these subsystems were also engineered to be scalable on demand as well. The crux of the architectural problem though was clearly in message communications between these system components, as that is where it seems to break down.

A lesson to learn from this debacle is that as much effort needs to go into engineering a flexible system as goes into the engineering of each component. Testing the system early under simulated conditions, then as it matures under more complex conditions and higher loads would have detected these problems earlier. Presumably there would then have been time to address them before the system went live because it would have been a visible problem. System architecture and system testing is thus vital for complex message based systems like, and a top notch system engineering plan needed to be have been at its centerpiece, particularly since the work was split up between multiple vendors with each responsible for their subsystem.

Technical mistakes will be discussed in the last post on this topic.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: