Conflating inaccuracy in data with imprecision in software

I once worked on a job where the input data for a machine learning implementation was dirty. The data were encoded using various schemes and were incomplete, inconsistent, and sometimes just plain wrong.

This lack of accuracy and integrity in the source data was used over and over again as a bludgeon to “shortcut” the programming. The reasoning was that dirty data didn’t warrant clean code.

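But these are two entirely different notions: the accuracy of the data is one thing, and the precision of the software that processes it is another.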

When the data are dirty, that’s even more reason for airtight, bulletproof, squeaky-clean code. Especially in science.
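
To make that concrete, here is a minimal sketch of what "clean code for dirty data" might look like. It is not the code from that job. The record fields (`subject_id`, `age`, `smoker`), the accepted yes/no tokens, and the age range are all hypothetical, chosen only for illustration. The idea is that every record is either parsed into a well-defined value or rejected with a stated reason, and nothing is silently guessed.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical encodings of a yes/no field, as dirty input might mix them.
TRUE_TOKENS = {"y", "yes", "true", "1"}
FALSE_TOKENS = {"n", "no", "false", "0"}


@dataclass(frozen=True)
class Reading:
    subject_id: str
    age: int
    smoker: Optional[bool]  # None means "missing", never "guessed"


def _text(value) -> str:
    """Normalize a raw field to stripped text; None becomes the empty string."""
    return "" if value is None else str(value).strip()


def parse_reading(raw: dict) -> Reading:
    """Turn one raw record into a Reading, or raise ValueError with a reason."""
    subject_id = _text(raw.get("subject_id"))
    if not subject_id:
        raise ValueError("missing subject_id")

    age_text = _text(raw.get("age"))
    try:
        age = int(age_text)
    except ValueError:
        raise ValueError(f"age is not an integer: {age_text!r}") from None
    if not 0 <= age <= 120:  # assumed plausible range, for illustration only
        raise ValueError(f"age out of range: {age}")

    token = _text(raw.get("smoker")).lower()
    if token in TRUE_TOKENS:
        smoker = True
    elif token in FALSE_TOKENS:
        smoker = False
    elif token == "":
        smoker = None
    else:
        raise ValueError(f"unrecognized smoker value: {token!r}")

    return Reading(subject_id, age, smoker)


def clean(rows):
    """Split rows into parsed readings and (row index, reason) rejections."""
    good, rejected = [], []
    for i, raw in enumerate(rows):
        try:
            good.append(parse_reading(raw))
        except ValueError as exc:
            rejected.append((i, str(exc)))
    return good, rejected
```

Calling `clean(rows)` on a list of raw dictionaries returns the usable readings alongside a list of rejections, so the dirt in the data is counted and reported rather than hidden inside the code.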

Creating such code requires an engineering process with management backing.

There are no shortcuts.

