Data Science Platforms

Feb 26, 2021

Data science is fundamentally different from software engineering. Data science is a process of discovery, while software engineering is a process of implementation. This distinction is overshadowed by the large overlap in the tools, programming languages and libraries used by both groups. One has the word science in it, which refers to the scientific method. The other has the word engineering in it, which refers to building. By that I mean that the inputs and outputs of a data science project are unknown at the start. In software engineering we do know the inputs and outputs at the start. Much of the agile methodology, for example, is focused on making the inputs and outputs explicit before commencing the work. This is not true for data science. We might know the desired characteristics of the output (say, 10% more accuracy) and we might have hypotheses about which input variables to consider, but that's it. If we have more certainty about both the inputs and outputs then perhaps it's not a data science project ;)

For the record, software engineering does involve discovery: we build proofs of concept, we write pseudocode implementations, we sketch diagrams, and much more. But I would argue those activities are all part of getting ready to start the process of implementation, or a form of requirements gathering. And of course, data science involves implementation, but that is almost a by-product of the process of discovery. Once the uncertainty has been overcome, the answer is clear and the implementation straightforward.

Let's take as a starting point that framing data science as a process of discovery and software engineering as a process of implementation is useful. In the remainder of this note I will focus on what the process of discovery means and what the implications are for the next iteration of data science platforms that is hitting the market. For the purpose of this note, I will limit the definition of a data science project to a project where the fundamental output is one of prediction.

A Stylized Data Science Workflow

I think every data scientist's workflow shares a few common characteristics, i.e. there are similarities in how data scientists work irrespective of the tools they use. The most pronounced similarities are:

A data science project is highly iterative (as illustrated in Figure 1). A major reason for this is that a data scientist is confronted with multiple levels of a-priori uncertainty:
- uncertainty about the actual problem that is being solved
- uncertainty about the underlying business process(es) that generate the data
- uncertainty about the quality of data
- uncertainty about the explanatory / predictive power of independent variables
- uncertainty about the distributions of the (in)dependent variables
- uncertainty about the first and second order correlations among the independent variables
- uncertainty about the appropriate algorithm
- ...
A data science project is highly collaborative. It's rare for a single data scientist to do *all* the cleaning, modeling and iterating by themselves. Both tacit and non-tacit knowledge are generated as each layer of uncertainty is conquered. It's the non-tacit knowledge that so easily gets lost, particularly in a collaborative setting. But even on single data scientist projects, people easily forget what they did and why they did it.
A data science project generates more knowledge than is apparent from the implementation. The implemented model or algorithm will tell you which variables and approach work well. It doesn’t tell you about all the dead ends that you faced during your process of discovery.
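To make the point about dead ends concrete, here is a minimal sketch of a log that records every iteration, including the ones that did not work out. Everything in it is an assumption for illustration: `Iteration` and `DiscoveryLog` are hypothetical names, not part of any existing library, and the two example entries are invented.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class Iteration:
    """One step in the process of discovery, successful or not."""
    hypothesis: str   # what we expected to find
    outcome: str      # what actually happened
    kept: bool        # False marks a dead end
    author: str       # who ran this iteration
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class DiscoveryLog:
    """Append-only log of iterations (hypothetical helper, not a real library)."""

    def __init__(self):
        self.iterations = []

    def record(self, hypothesis, outcome, kept, author):
        entry = Iteration(hypothesis, outcome, kept, author)
        self.iterations.append(entry)
        return entry

    def dead_ends(self):
        # The entries that never show up in the final implementation.
        return [it for it in self.iterations if not it.kept]

    def to_json(self):
        # Serialize the full history so it can be shared with collaborators.
        return json.dumps([asdict(it) for it in self.iterations], indent=2)


# Invented example entries, for illustration only.
log = DiscoveryLog()
log.record("Customer tenure predicts churn",
           "No improvement over the baseline", kept=False, author="alice")
log.record("Recent support tickets predict churn",
           "Improvement on the holdout set", kept=True, author="bob")

print(len(log.dead_ends()), "dead end(s) recorded")
```

The final model would only reflect the second entry. The log keeps the first one as well, together with who tried it and when.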

How to Support a Data Scientist Workflow?

I define a Data Science Platform as:

“An integrated set of software tools to create, share, document and deploy the artefacts of the process of discovery.”

So ideally, a Data Science Platform would support the following:

A Data Science Platform should support reproducible workflows.
A Data Science Platform should support annotatable workflows.
A Data Science Platform should support social workflows.
A Data Science Platform should support shareable workflows.
A Data Science Platform should support deployable workflows.
A Data Science Platform should support data lineage across and within workflows.
A Data Science Platform should support compliant workflows (privacy regulations and regulatory policies).
A Data Science Platform should support algorithm agnostic workflows.
A Data Science Platform should support both small and big data workflows.
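As a rough illustration of two of these requirements, reproducibility and data lineage, the sketch below fixes the random seed of a workflow step and stores a hash of the input data next to the result. It is a minimal sketch under stated assumptions, not how any particular platform does it: `fingerprint` and `run_step` are hypothetical names, and the rows are toy data.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a stable SHA-256 hash of the input rows (a simple lineage marker)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(rows, seed, test_fraction=0.25):
    """Split rows into train/test reproducibly and record where they came from."""
    rng = random.Random(seed)   # local RNG, so global random state is untouched
    shuffled = rows[:]          # copy, so the caller's data is left unchanged
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return {
        "test": shuffled[:n_test],
        "train": shuffled[n_test:],
        # Enough metadata to rerun this step and to trace its input.
        "lineage": {
            "input_sha256": fingerprint(rows),
            "seed": seed,
            "test_fraction": test_fraction,
        },
    }


# Toy data, for illustration only.
rows = [{"id": i, "x": i * 2} for i in range(8)]
first = run_step(rows, seed=42)
second = run_step(rows, seed=42)

print("identical reruns:", first == second)
```

Rerunning the step with the same data and seed gives the same split, and any change to the input data changes the recorded hash.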

The big question, to me at least, is whether we will see the emergence of a single Data Science Platform that supports all these requirements, or whether we will see a mesh that connects best-in-class tools / libraries into a single platform.

I do expect that the next generation of tools will have at least these two features:

Social Workflows: sharing and collaborating will be native to the tools. No more passing around datasets, files and snippets of code, or wasting time installing the right set of dependencies (hello virtualenv and friends). An example of what this could look like is repl.it
No Code Deployment Workflows: The ultimate feedback for any model is to put it into production and use it. This is often where data scientists struggle, because the tooling is insufficient and they have to rely on help from either a DevOps team or a willing engineer from one of the engineering teams. I expect that we will see the emergence of more No Code Deployment workflows such as streamlit.io

Let me know what you think! What are the core workflows that a Data Science Platform should support?
