How to Ensure Supply Chain Security for AI / ML Apps

Machine Learning (ML) is at the heart of the boom in Artificial Intelligence (AI) apps, driving much of the innovation in chatbots (e.g., the Large Language Model [LLM] behind ChatGPT), text-to-image generators (e.g., the diffusion models behind services like Stable Diffusion), as well as rapid advances in multiple fields from genetics to medicine to finance. It’s not hyperbole to say that ML will change your life, if it hasn’t already.

And yet, in an effort to be first to market, many of the ML solutions in these fields have relegated security to an afterthought. Take ChatGPT, for example, which only recently reinstated users’ query history after fixing a bug in an open source library (the redis-py client) that could allow one user to view snippets of another user’s queries: a fairly worrying prospect if you had shared personal information with the chatbot.

Despite this software supply chain security issue, ChatGPT has had one of the fastest adoption rates of any commercial service in history:

[Figure: ChatGPT adoption curve]

Obviously, for most users, ChatGPT’s open source security issue didn’t even register. And despite generating misinformation, malinformation and even outright lies, the reward of using ChatGPT was seen as far greater than the risk.

But would you fly in a space shuttle designed by NASA yet built by a random mechanic in their home garage? For some, the opportunity to go into space might outweigh the risks, despite the fact that, short of disassembling it, there’s really no way to verify that everything inside was built to spec. What if the mechanic didn’t use aviation-grade welding equipment? Worse, what if they purposely missed tightening a bolt in order to sabotage your flight? 

Passengers would need to trust that the manufacturing process was as rigorous as the design process. The same principle applies to the open source software fueling the ML revolution. 

The AI Software Supply Chain Risk

In some respects, open source software is considered inherently safe because the entire world can scrutinize the source code: it’s not compiled, and is therefore human-readable. However, issues arise when authors who lack a rigorous process compile that code into machine language, aka binaries. Binaries are extremely hard to take apart once assembled, making them a great place to inadvertently or even deliberately hide malware, as proven by SolarWinds, Kaseya, and 3CX.

In the context of the Python ecosystem, which underlies the vast majority of ML/AI/data science implementations, pre-compiled binaries are combined with human-readable Python code in a bundle called a wheel. The compiled components are usually derived from C or C++ source code and are employed to speed up the mathematical business logic that would otherwise be too slow if executed by the Python interpreter. Wheels for Python are generally assembled by the community and uploaded to public repositories like the Python Package Index (PyPI). Unfortunately, these publicly available wheels have become an increasingly common way to obfuscate and distribute malware.
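You can see the problem for yourself: a wheel is just a ZIP archive, and any compiled extension modules inside it are opaque machine code that no amount of source review on PyPI will reveal. Here’s a minimal sketch that lists those components (the wheel filename below is a placeholder; point it at any wheel you’ve downloaded):

```python
import zipfile
from pathlib import Path

def list_compiled_components(wheel_path: str) -> list[str]:
    """Return the compiled (non-human-readable) files bundled in a wheel."""
    compiled_suffixes = {".so", ".pyd", ".dll", ".dylib"}
    with zipfile.ZipFile(wheel_path) as wheel:  # a wheel is just a ZIP archive
        return [name for name in wheel.namelist()
                if Path(name).suffix in compiled_suffixes]

# Placeholder filename: substitute any wheel you've downloaded
print(list_compiled_components("numpy-1.24.2-cp310-cp310-manylinux2014_x86_64.whl"))
```

Run this against a scientific computing wheel like NumPy or SciPy and you’ll typically find dozens of compiled extension modules, none of which can be meaningfully audited after the fact.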

Additionally, the software industry as a whole is poor at managing software supply chain risk even in traditional software development, let alone in the free-for-all gold rush to launch AI apps prematurely. The consequences can be disastrous:

  • The SolarWinds hack in 2020 exposed to attack:
    • 80% of the Fortune 500
    • Top 10 US telecoms
    • Top 5 US accounting firms
    • CISA, FBI, NSA and all 5 branches of the US military
  • The Kaseya hack in 2021 spread REvil ransomware to:
    • 50 Managed Service Providers (MSPs), and from there to
    • 800–1,500 businesses worldwide
  • The 3CX hack in March 2023 affected the softphone VoIP system at:
    • 600,000 companies worldwide with
    • 12 million daily users

And the list continues to grow. Obviously, as an industry, we have learned nothing.

The implications for ML are dire, considering the real-world decisions being made by ML models, such as evaluating creditworthiness, detecting cancer or guiding a missile. As ML moves from playground development environments into production, the time has come to address these risks. The US government has also recognized this, and is demanding that its software suppliers secure their supply chains by June 2023.

Speed and Security: AI Software Supply Chain Security At Scale

The recent call to pause AI innovation for six months was met with a resounding “No.” Similarly, any call for a pause to fix our software supply chain is unlikely to gain traction, which leaves security-sensitive industries like defense, healthcare, and finance/banking at a crossroads: either accept an unreasonable amount of risk, or stifle innovation by disallowing the latest and greatest ML tools. Given that their competitors (like the vast majority of organizations that create their own software) depend on open source to build their ML applications, speed and security need to become compatible instead of competitive.

We strongly believe that security and innovation can coexist. This shared conviction is why we have partnered with Cloudera to bring trusted, open source ML runtimes to Cloudera Machine Learning (CML). Unlike other cloud-based ML platforms, which continue to use publicly available Python (such as SageMaker on AWS) or else build their Python distribution by hand (such as Anaconda), Cloudera can now help ensure that the data analysis, ML and visualization routines its customers execute in the Cloudera cloud are secure from concept to deployment.

This is because we automatically build Python from vetted PyPI source code using our ActiveState Platform, which adheres to the highest standard (Level 4) of Supply-chain Levels for Software Artifacts (SLSA). In effect, the ActiveState Platform acts as a secure factory that manufactures the Python components you require, rather than asking you to blindly trust the wheels built by the community. The Platform also provides tools to monitor, maintain and verify the integrity of your open source components. We even offer supporting SBOMs and software attestations that enable compliance with US government regulations.
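Whatever your build source, the underlying discipline is the same: never install an artifact whose cryptographic digest you can’t tie back to a vetted record. Here’s a minimal sketch of that check, assuming you recorded trusted SHA-256 digests when each artifact was vetted (the lockfile contents below are placeholders, not real digests):

```python
import hashlib
import sys

# Placeholder record of trusted SHA-256 digests, captured when each
# artifact was originally vetted (the value below is illustrative only).
TRUSTED_DIGESTS = {
    "example_pkg-1.0.0-py3-none-any.whl": "0" * 64,
}

def verify_artifact(path: str) -> bool:
    """Compare an artifact's SHA-256 digest against the trusted record."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    expected = TRUSTED_DIGESTS.get(path.rsplit("/", 1)[-1])
    return expected is not None and digest == expected

if __name__ == "__main__":
    for artifact in sys.argv[1:]:
        print("OK" if verify_artifact(artifact) else "UNTRUSTED", artifact)
```

In practice you rarely need custom code for this: pip’s hash-checking mode (pip install --require-hashes -r requirements.txt) enforces the same guarantee at install time.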

Thanks to Cloudera’s new Powered by Jupyter (PBJ) Workbench, integrating your ActiveState Platform-built ML runtime with CML could not be easier. Simply use the ActiveState Platform to generate a custom ML Runtime as a Docker image that you can import directly into CML. The days of data scientists needing to pull dangerous prebuilt wheels from PyPI are over, and the days of streamlined management, observability, and a secure software supply chain are here.
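As a quick sanity check before importing, you can pull the runtime image and inspect its metadata. The sketch below uses the community Docker SDK for Python (pip install docker); the registry path and tag are hypothetical placeholders, not CML-specific APIs:

```python
import docker  # community Docker SDK for Python: pip install docker

client = docker.from_env()

# Hypothetical registry path and tag for an ActiveState-built ML runtime image
image = client.images.pull("registry.example.com/acme/ml-runtime", tag="2023.05")

# Inspect before handing the image to CML: ID, size, and any build labels
print(image.id)
print(image.attrs["Size"], "bytes")
print(image.labels)
```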

Next steps:

Sign up for a free ActiveState Platform account so you can use it to automatically build an ML runtime for your project. 
