Unicorns, Reproducibility, and other Machine Learning Myths

“Everybody is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.”
While commonly attributed to Einstein, the above quote is likely apocryphal. Nonetheless, it’s instructive in pointing out the fact that, despite obvious misalignment, we in the tech industry often persevere in trying to fit square pegs into round holes.
Nowhere is this more endemic these days than in Machine Learning (ML) initiatives. Two of the most common “fish climbing a tree” scenarios are “data science unicorns” and “data-driven hypotheses.”

Avoiding the Reproducibility Crisis

According to the classic scientific method, we should first form a hypothesis and then gather data to prove or disprove it. In the field of ML, however, we typically do the exact opposite: first we gather the data, and then we create hypotheses to fit it.
As a number of recent blog posts have pointed out, ML is not alone in taking this data-driven approach. Studies in fields as varied as medicine, psychology and economics also use a data-first paradigm. And as in ML, many of the experimental results in those fields have been found to suffer from poor reproducibility.
In ML, sometimes poor reproducibility can be attributed to p-hacking or overfitting. But when looked at through the lens of the reproducibility crisis in the sciences, there may be a more fundamental common denominator: the use of induction rather than deduction. In other words, we’re finding specific results in a dataset that look like insights but are really just spurious signals, and then drawing general conclusions from them.
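To see how easily induction manufactures “insight,” consider the following sketch (our illustration, not drawn from any particular study). Using NumPy and SciPy, it scans 200 random features against a random outcome; at a 0.05 significance threshold, roughly ten will look “significant” by chance alone:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_samples, n_features = 100, 200

# Pure noise: random "features" and a random "outcome" with no real relationship
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Induction in action: test every feature and keep whatever looks "significant"
p_values = [pearsonr(X[:, i], y)[1] for i in range(n_features)]
hits = [i for i, p in enumerate(p_values) if p < 0.05]

# At alpha = 0.05 across 200 independent tests, ~10 false positives are expected
print(f"{len(hits)} 'significant' features found in pure noise")
```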
Big data analysis is here to stay, so the only real answer is to be more careful. Some general guidelines can help here, including:

  • Do as little inference as possible to minimize reliance on induction
  • Rely on statisticians to examine initial results, and design procedures to prove the validity of those results (see the sketch after this list)
  • Data can be manipulated to tell us anything we want – stop torturing your data!
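As a sketch of the second guideline above (again our illustration, reusing the synthetic-noise setup from the previous example), one simple validation procedure is to hold out half the data before exploring, then check whether the “best” feature found during exploration survives on the holdout. With pure noise, it almost never does:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples, n_features = 200, 200

X = rng.normal(size=(n_samples, n_features))  # noise "features"
y = rng.normal(size=n_samples)                # noise "outcome"

# Split BEFORE exploring; the holdout plays no part in hypothesis hunting
explore, holdout = slice(0, 100), slice(100, 200)

# Exploratory phase: pick the feature that looks best on the first half
p_explore = [pearsonr(X[explore, i], y[explore])[1] for i in range(n_features)]
best = int(np.argmin(p_explore))

# Confirmatory phase: a spurious "insight" almost always evaporates here
p_confirm = pearsonr(X[holdout, best], y[holdout])[1]
print(f"feature {best}: p={p_explore[best]:.4f} on explore, "
      f"p={p_confirm:.4f} on holdout")
```

Pre-registering the confirmation step in this way is the coding equivalent of stating your hypothesis before gathering the data.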


Unicorns versus Full Stack ML Teams

We all know that the best teams feature talented individuals who find ways to collaborate toward a common goal. But for some reason, ML initiatives get stuck on hiring “unicorns” – a single individual who’s an expert not only in data science and data engineering, but also in coding and your business domain.
While the ridiculousness of such a job posting should be obvious, we shouldn’t even expect our data scientists to be developers. Conversely, we shouldn’t expect developers to understand the models our data scientists hand them for implementation. The two groups will need to work closely together to overcome each other’s domain knowledge limitations.
Building the right team to bridge the data science and engineering disciplines is difficult, but some companies have made a start in this direction. Most commonly called “full stack machine learning,” this approach assembles a range of personnel whose knowledge spans everything from building machine learning models and performing data wrangling, to software development, product management and production deployment.
This trend goes a long way toward ensuring your ML project has all the right skills in place to succeed, but it can also promote silos. Fortunately, common, easy-to-use tooling like Jupyter and Python is breaking down those silos and encouraging collaboration.
To ensure collaboration and eliminate silos, you’ll want to consider the following tactics:

  • For prototyping, Jupyter Notebook’s polyglot support (Python, R, Julia, and many more) means all team members can contribute to prototypes in the language they prefer
  • For development work, install a standard version of Python across all teams to reduce the number of “works on my machine” errors (see the sketch after this list)
  • Due to the popularity of ML, Python is one of the fastest moving languages in software, which means new package versions – and vulnerabilities – are coming out every day. Ensure all stakeholders can identify and resolve vulnerabilities across dev, test and production in order to avoid unsecured silos.
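As a minimal sketch of the standardization bullet above (our example; the interpreter version and package pins below are hypothetical placeholders, not ActiveState recommendations), a short startup check can catch interpreter and dependency drift before it becomes a “works on my machine” bug:

```python
import sys
from importlib.metadata import PackageNotFoundError, version

REQUIRED_PYTHON = (3, 10)                        # hypothetical team standard
PINNED = {"numpy": "1.26.4", "pandas": "2.2.2"}  # hypothetical pins

# Fail fast if the interpreter deviates from the team standard
if sys.version_info[:2] != REQUIRED_PYTHON:
    sys.exit(f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]} required, "
             f"found {sys.version_info.major}.{sys.version_info.minor}")

# Fail fast if any pinned dependency is missing or has drifted
for pkg, pinned in PINNED.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        sys.exit(f"{pkg} is not installed (expected {pinned})")
    if installed != pinned:
        sys.exit(f"{pkg}=={installed} does not match pinned version {pinned}")

print("Environment matches the team standard")
```

In practice you would generate the pins from a lockfile rather than hard-coding them, but the fail-fast principle is the same.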

The ActiveState Platform was designed to address many of these needs, from polyglot management to standard Python distro deployment to centralized security information and updates for all stakeholders.
For more information, see our ActiveState Platform overview.
