It's been reported that 16,000 positive cases of Covid-19 were missed from the government's Test and Trace programme because of Microsoft Excel. Here's a data engineer's perspective on using Excel in data pipelines.
A big tech-based news headline emerged yesterday morning. The BBC News article explained how 16,000 positive Covid-19 cases went unreported through Test and Trace because of a data handling issue within Public Health England (PHE). This issue (and subsequent fierce criticism) centres around the use of Microsoft Excel as a data processing tool.
The issue was that PHE unintentionally limited the amount of data it passed through to the Test and Trace system because of an unknown data cap within the outdated Excel file format. A huge raft of critics are now emerging to lambaste government policy and handling. Whilst it will be difficult to deny that failings occurred - and that there are many tools far better suited to such an important job - as a data systems developer with almost 20 years’ experience, I wanted to be a little more constructive with my feedback.
From the details within the news article it seems that data was being provided in Comma Separated Values format or CSV for short. This data was then being manually opened in Excel so that it could be engineered into a template which was then passed into to the NHS Test and Trace system.
Excel has a default file association with CSV files which means that when you double click such a file, it will open within Excel. Unfortunately, Excel will then go on to interpret, and in doing so, sometimes change the data within that file. I could recount stories about this that would make people’s hair stand on end. But perhaps this would be better for another time.
A lot of flash talk is attributed to data processing pipelines these days and indeed cloud computing does open many interesting new opportunities for refinement of data. But whether we would like to admit it or not, the relatively simple diagram above does represent a pipeline. We will break that pipeline into two topics of discussion…
Whilst going through the process of automating a data processing pipeline, we would look at how the manual process worked, identify common and irritating failings, and then look to address these with the automation. Data validation is the order of business here. For example, let’s say we were provided with a data feed file of basket transactions from a customer website, and were informed that the site routinely processes an average of 10 thousand transactions a day. Would it then be appropriate for us to pick up their CSV data file and blindly pass it through the pipeline if it only contained five transactions? How about 100 million? Clearly not. Both of these examples represent huge departures from the expected data volume. Data validations can and should be set up to flag large variances, alerting the human users to investigate what’s going on before they become costly errors.
In the case of PHE and the Test and Trace system, there were clearly no data validations in place. The amount of data sent was never checked quantitatively against the volume of data entered into the system, and so the additional 16,000 positive cases of Covid-19 were simply missed. This is not a problem with Excel, per se.
In my personal experience it is somewhat of a rarity that data is provided in any kind of guaranteed usable quality. Further continuing our example above, what if we are provided with 10,000 records and 2% of those transactions have no value or customer associated with them. Should those records just be passed blindly through the pipeline? Any data engineer will agree that the answer is “No”. These records should be logged, excluded from downstream processes, and separately reported so that the upstream data provider can chase their resolution without erroneous data points compromising the downstream reporting.
Whether your pipeline operates in a manual or an automated way, best practice demands that you must maintain an audit record on your data. Much like an accountant checking that the books tally up, data handlers must be able to show that the data provided is equal to the data reported. Sadly for PHE (and the UK public), in their haste they overlooked this basic step.
We have briefly spoken about good practice for handling data. But orchestration refers to the particular technology choices we make to decide how that data is processed and transformed. In the Public Health England case, despite referring to it as an “automatic process” it appears that manual processing of the data must have been required at some point. Otherwise why would the data have been placed into such a conveniently human-processable format as Excel?
Sometimes human processing of data is unavoidable and must be included in a pipeline. But generally speaking, this should be avoided wherever possible. People labour is expensive, usually much slower than an automated pipeline and often prone to non-deterministic errors. Having said that, where a pipeline must utilise people effort, your data handling and validation practices must be both robust and comprehensive. Perhaps even more so than the automated equivalent.
In terms of the pipeline choices made by PHE, of course there are many more robust alternatives than Excel that could have been deployed. From SQL Server in a server-based context to Azure Data Factory in a Serverless context, there are many on-premise and cloud-based options. Using Excel supplemented with manual labour can often be seen as a fast, light and cheap option. But once you take a step back and look at your costs holistically, it’s easy to see that the Excel approach for data engineering is not feasible.
We have some sympathy for PHE, of course. They had to set up a pipeline unusually quickly, to perform a particularly high profile task. In retrospect, I’m sure that they would have preferred to make different decisions. However, blaming this fault on a "legacy system" (as the Health Secretary did in Parliament yesterday) seems to be making an overly simple scapegoat of Excel. New technology may have encouraged the PHE team to adopt best practices, that's true. But the real fault here cannot be blamed on technology, and certainly not solely on Excel. It was, as it is in most cases, a very human error.
Our new data science report, The Ambitious Marketer's Guide to Customer Retention Analytics, discusses three practical ways marketers can use analytics to retain more customers. Download your free copy now.Get the Free Report