Like a bad yearbook picture, epic software failures are one of those things that never go away. I was reminded of this when I saw in my news feed that the Federal Communications Commission (FCC) had released the findings of its audit of a failure Level 3 Communications experienced in October 2016. Here we are 18 months later, and all the details are again being hashed over for all to see.
The “Level 3 Nationwide Outage Report”1 states that it was the largest outage reported via the Network Outage Reporting System (NORS) in FCC history. As a result of the outage, an estimated 111 million calls were blocked, as were fifteen 911 calls. Wow. That is not typically the kind of news that is good for business.
What Went Wrong?
According to the report, the issue was caused by an employee in the fraud department doing a routine update of numbers that were suspected of being associated with malicious activity. The process for blocking calls consisted of updating a series of fields in the company’s vendor-supplied network management software. While doing this, the employee left one of the fields blank. For some unknown reason the software read the blank field as a “wildcard” rather than as a null entry, and instead of alerting the user that they had a blank field, it blocked all non-native numbers in Level 3’s database. Ouch!
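To make the failure mode concrete, here is a minimal sketch of how a blank entry can behave like a wildcard. It is purely illustrative: the report does not describe the vendor's code, so the prefix-matching approach, the `is_blocked` function, and the sample numbers are all assumptions.

```python
def is_blocked(number: str, blocked_prefixes: list[str]) -> bool:
    """Return True if the number starts with any blocked prefix (hypothetical logic)."""
    return any(number.startswith(prefix) for prefix in blocked_prefixes)


# Made-up sample traffic, not data from the report.
calls = ["13035550100", "442079460000", "19005551234"]

# A well-formed rule set blocks only the suspect prefix.
blocked_without_blank = [n for n in calls if is_blocked(n, ["1900555"])]

# The same rule set plus one field left blank. Every string starts with the
# empty string, so the blank entry silently matches every number.
blocked_with_blank = [n for n in calls if is_blocked(n, ["1900555", ""])]

print(blocked_without_blank)
print(blocked_with_blank)
```

In this sketch nothing crashes and no error is raised. The blank is simply a valid input with a very broad meaning, which is why this kind of defect is easy to miss.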
Could Testing Have Prevented the Failure?
The answer depends on the type of testing. As more business has gone digital, a growing number of companies are making testing part of their continuous delivery process. When adopting agile and DevOps, DevTest teams move testing from QA to development (“Shift-Left”): the earlier a bug is found, the easier and cheaper it is to fix. As more testing has shifted to development, more focus has been put on API and unit testing vs. UI testing. Many developers like to tout the advantages of API testing: it’s faster, you don’t have to worry about changes to the UI, and you don’t even need a UI. You test the API directly, and in doing so you also test the functionality of the application. But is this really the case? Is testing the API “just as good” as testing the UI? Would API testing have prevented Level 3’s software defect from stopping 111 million phone calls? Probably not.
The Case for API Testing
APIs have their own special format and language; they are designed to be efficient, not necessarily to be easy to understand and test. APIs are built to enable easy integration and data sharing between systems. Each API is written to its own set of requirements and typically uses a common specification format (think Swagger 2.0 or WSDL) to document its data and formatting. For systems that process high volumes of data, like a VoIP phone system, there is an ongoing debate over the trade-off between a well-documented, verbose JSON or XML document and a simple binary payload that is smaller and more efficient. The API calls behind a UI are literally behind the UI. Their purpose is to move data as fast and efficiently as possible. Thus, the logical place to test the API is at the API level.
If the purpose of an API is to allow developers to consume and use the API, then it makes sense that unit and functional tests should accompany the integration code. This is just good hygiene in development. Since I have used this API before, I should have a test case that validates it works. This way I ensure the API works in the future.
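As a rough illustration of that hygiene, the sketch below pairs a piece of integration code with the tests that pin down its behavior. `InMemoryBlockList` is a hypothetical stand-in for a call-blocking API client, not any real vendor interface, and the tests are written as plain functions that a runner such as pytest could collect.

```python
class InMemoryBlockList:
    """Hypothetical stand-in for a call-blocking API client."""

    def __init__(self) -> None:
        self._prefixes: set[str] = set()

    def add_rule(self, prefix: str) -> None:
        """Register a number prefix that should be blocked."""
        self._prefixes.add(prefix)

    def is_blocked(self, number: str) -> bool:
        """Return True if the number matches any registered prefix."""
        return any(number.startswith(p) for p in self._prefixes)


def test_explicit_prefix_blocks_only_matching_numbers():
    api = InMemoryBlockList()
    api.add_rule("1900555")
    assert api.is_blocked("19005551234")
    assert not api.is_blocked("13035550100")


def test_nothing_is_blocked_without_rules():
    api = InMemoryBlockList()
    assert not api.is_blocked("13035550100")
```

Both tests pass, and both use well-formed input, because that is what the author of an API test naturally reaches for. Neither one says anything about what a user might actually send through the UI.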
The Case for UI Testing
When testing commercial off-the-shelf (COTS) software like the network management system responsible for the Level 3 outage, the logic built into the UI must also be considered. That logic sits on top of the API, and the choices the user makes within the UI determine what data the API receives. The software must therefore be tested the same way a user interacts with it. The Level 3 failure is a prime example of what happens when UI testing is not adequately done. The translation of data from the UI to the API can only be tested and documented by running tests from the UI. Vendors delivering software will have APIs that are externally accessible and documented for the purpose of integration. The APIs that deliver the data for a UI, however, are usually not documented and not meant for outside consumption; they are internal APIs used by the UI. This means there is probably no nice Swagger 2.0 spec for the UI-to-API interaction, because the software vendor kept that internal. End-to-end or user journey testing for COTS software should therefore happen at the UI layer.
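One way to picture the UI-to-API translation is a test that drives the form handler and records what the API receives. The sketch below assumes a hypothetical `submit_block_form` handler and a `RecordingApi` test double; a real COTS product would be driven through its actual UI with a browser or desktop automation tool instead.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingApi:
    """Test double standing in for the internal, undocumented API."""

    received: list[dict] = field(default_factory=list)

    def post_block_rule(self, payload: dict) -> None:
        """Record the payload instead of sending it anywhere."""
        self.received.append(payload)


def submit_block_form(form: dict[str, str], api: RecordingApi) -> None:
    """Hypothetical UI handler that maps form fields onto the API payload."""
    api.post_block_rule({
        "pattern": form.get("number", "").strip(),
        "reason": form.get("reason", "").strip(),
    })


# Drive the "UI" the way a user might: the number field holds only spaces.
api = RecordingApi()
submit_block_form({"number": "   ", "reason": "suspected fraud"}, api)
payload = api.received[0]
print(payload)
```

Here the recorded payload carries an empty `pattern`, which documents that this imagined UI forwards a blank field to the API unchanged. That is the kind of fact an API-only test never surfaces.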
Modern UIs Are Applications unto Themselves with Logic and Code Built into Them
This effort to make UIs easier and more intuitive and provide higher functioning for looking up and decoding data means the UIs are complex applications within themselves.
We may never know whether treating a blank as a wildcard in the phone number field was documented functionality for the API, but if it was, it was either not well documented or not well understood from the user’s perspective in the UI. As a pretty savvy computer guy, I would expect a blank in a required field to trigger a warning in the UI. The user should have received a message explaining that if they wanted a wildcard, they needed to use a “*” or “%” to denote it. It is not terribly intuitive to a user that the lack of a character implies a wildcard. UI testing would likely have uncovered this problem, either by exercising the UI and inspecting the outcome or by running a set of basic UI tests that enter incomplete data and check for validation rules in the UI.
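A set of basic incomplete-data tests could look something like the sketch below. The `validate_block_form` function, its messages, and the choice of “*” as the explicit wildcard are all assumptions made for illustration; they are not taken from the vendor's product.

```python
def validate_block_form(form: dict[str, str]) -> list[str]:
    """Return validation messages; an empty list means the form may be submitted."""
    errors = []
    number = form.get("number", "").strip()
    if not number:
        # A blank is rejected, so a wildcard has to be an explicit choice.
        errors.append("Number is required. Enter '*' to match all numbers.")
    elif number != "*" and not number.rstrip("*").isdigit():
        errors.append("Number must contain digits only, optionally ending in '*'.")
    return errors


# Incomplete or malformed entries a tester would try on purpose.
incomplete_inputs = [{}, {"number": ""}, {"number": "   "}, {"number": "19a"}]
rejected = [bool(validate_block_form(form)) for form in incomplete_inputs]

# Entries that express the user's intent explicitly.
accepted = [validate_block_form({"number": n}) == [] for n in ["1900555", "1900555*", "*"]]
```

The design choice worth noting is that the wildcard becomes something the user has to type on purpose, so an accidental blank produces a message instead of a nationwide block.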
There will probably always be a debate about who’s responsible for testing the UI for COTS software – the vendor, the business or QA. Fortunately for the workers at Level 3, the FCC’s findings did not blame the team responsible for testing the application, and the corrective actions were published to help prevent this type of problem in the future.
When it comes to COTS software, preventing epic failures means testing the actual business processes and workflows (journeys) through the UI, the way a real user would use the system. Testing at the API layer covers only part of the potential risk, and as the Level 3 case shows, the uncovered part can result in an epic-size failure. Treating the API as if it were the UI invites exactly this kind of large-scale system failure.