The Data-First Enterprise AI Revolution: A Software 2.0 Conversation with Snorkel AI’s CEO Alex Ratner
Alex Ratner and the Snorkel AI team have spent over five years developing a radically new approach to how enterprises build AI applications. The approach is documented in 40+ peer-reviewed research publications and is deployed in high-impact applications at Google, Apple, Intel, Stanford Medicine, the Department of Justice, and many other organizations, large and small. It is also used by thousands of data scientists and developers worldwide. Large enterprises across finance, government, telecommunications, insurance, and more have now adopted it via Snorkel AI, the company spun out of Stanford. The approach defines a radically faster and more practical way to fuel a new era of data-driven AI application development, often called “Software 2.0.” The following is a conversation with Alex about Software 2.0 and what’s coming next for enterprise technology.
What does the future of enterprise software look like? How is AI powering that transformation?
We are seeing one of the biggest transformations in enterprise software in our lifetimes: from “Software 1.0,” specified by hand-written code, to a new wave of “Software 2.0” systems that learn from data using AI. These Software 2.0 systems handle nuanced tasks over complex data that would have been impossible to specify in manually written code, by instead learning directly from labeled examples, often called training data. This shift unlocks applications that were never before possible, with less engineering work needed than ever before, but it all relies on large volumes of carefully and custom-curated training data.
Driven by this promise of more powerful, adaptive enterprise software that goes beyond the capabilities of hand-written code, enterprises are spending billions of dollars putting AI and Software 2.0 strategies to use, though to date with mostly mixed results. One widely cited figure is that 87% of Software 2.0 projects never make it into production. With so much technical progress, and so much of it available in commoditized, robustly supported open-source form, why is there so little real enterprise success? All too often, the answer is that enterprises remain bottlenecked by one key ingredient: the large amounts of labeled data needed to train these new systems.
In fact, over the last five-plus years, we’ve observed even organizations with the most sophisticated AI/ML technology and talent, the Googles and Microsofts of the world, struggling to overcome the training data bottleneck. The real challenge is tied to the difficulties of labeling and managing training datasets for most real-world applications: data often requires highly trained subject matter experts (e.g., doctors, or legal or financial experts) to label; it is often highly private or regulated and therefore must stay on-premises; and it generally changes frequently, necessitating re-labeling. Together, these factors block even the most well-resourced organizations from building and maintaining the training datasets needed to fuel Software 2.0.
The key idea behind our platform, Snorkel Flow, is a novel approach developed over years at the Stanford AI Lab: turning subject matter knowledge into high-quality training data programmatically. This frees data scientists, developers, and domain experts to code with data rather than instructions, moving them into a Software 2.0 way of working. Snorkel’s approach has already shortened the time to value for a set of radiology triage systems at Stanford Hospital from 8 person-months to 8 hours, replaced hundreds of thousands of hand-labeled examples at Google, and automated contract processing with 99% accuracy in under 24 hours at a top-3 US bank.
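To make the programmatic idea concrete, here is a minimal sketch using the open-source snorkel library that grew out of the same Stanford research (not Snorkel Flow itself); the contract-relevance task, the heuristics, and the toy data are illustrative assumptions rather than details from any of the deployments above.

```python
# A minimal sketch of programmatic labeling with the open-source `snorkel` library.
# The task (flagging relevant contract passages), heuristics, and data are assumptions.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, NOT_RELEVANT, RELEVANT = -1, 0, 1

@labeling_function()
def lf_mentions_termination(x):
    # Domain-expert heuristic: termination language suggests a relevant clause.
    return RELEVANT if "termination" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_too_short(x):
    # Very short passages are usually boilerplate rather than substantive clauses.
    return NOT_RELEVANT if len(x.text.split()) < 10 else ABSTAIN

# Apply the labeling functions to unlabeled documents to build a label matrix.
df = pd.DataFrame({"text": [
    "Termination: this agreement may be terminated by either party upon thirty days notice.",
    "Page 4 of 12",
]})
applier = PandasLFApplier(lfs=[lf_mentions_termination, lf_too_short])
L_train = applier.apply(df)  # shape: (num_examples, num_labeling_functions)
```

Each labeling function captures one piece of expert knowledge; none needs to be perfect, because many such noisy signals are combined downstream.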
Can you define in more detail what you mean by Software 1.0 vs. 2.0?
In a “Software 1.0” approach, software is built by manually writing lines of code that spell out the exact instructions for a machine to carry out. This has been, and will continue to be, a major paradigm for software development. However, it is fundamentally limited when it comes to more complex data and problem types, and it is also fundamentally constrained by the amount of developer time an enterprise has.
In “Software 2.0,” i.e., AI approaches driven by modern machine learning methods, the software is written in the weights learned by algorithms and specified declaratively by showing labeled examples to those algorithms, which evolves the role of a developer into that of a teacher who curates data and analyzes results. Training data itself becomes the programming interface. But developing Software 2.0 applications requires more than just engineers: it requires subject matter experts to transfer their knowledge into labeled data, and data scientists to train, tune, and monitor machine learning models, often based on neural networks, to produce accurate results.
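As a purely illustrative contrast (a sketch, not something from the interview), the same toy task can be written both ways: in Software 1.0 the developer encodes the rule directly, while in Software 2.0 the developer supplies labeled examples and the learned weights become the program. The data, keyword, and scikit-learn pipeline below are assumptions made for the example.

```python
# Illustrative contrast between Software 1.0 and Software 2.0 on a toy spam task.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Software 1.0: the developer writes the instructions explicitly.
def is_spam_v1(message: str) -> bool:
    return "free money" in message.lower()

# Software 2.0: the developer curates labeled examples; the model's learned
# weights are the "program," and changing the data changes the behavior.
texts = ["free money now!!!", "team meeting at 3pm",
         "claim your free money today", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
is_spam_v2 = make_pipeline(CountVectorizer(), LogisticRegression())
is_spam_v2.fit(texts, labels)

print(is_spam_v1("Free money inside"), is_spam_v2.predict(["Free money inside"])[0])
```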
What challenges do you see for enterprises getting access to the data needed to train software?
Over the years at Stanford and now Snorkel AI, we’ve talked to hundreds of enterprises. We’ve seen that enterprises in most verticals today face challenges around getting and maintaining training data. These challenges block even organizations with the world’s largest technology budgets from using ML. In particular, organizations face four key challenges with training data:
- The need for subject matter expertise: Most training data requires highly trained experts to label, e.g., doctors, legal analysts, network technicians, etc., who often also need to be well-versed in a specific organization’s goals and datasets.
- The need for privacy: Most organizations cannot ship data off-premises to be labeled, making it impossible to use hand-labeling services. As a result, development teams are stuck for months waiting for training datasets to be built internally.
- The need for auditability: Most organizations need to be able to audit how their data is being labeled, and therefore what their AI systems are learning from. Even when outsourcing labeling is an option, performing basic audits on hand-labeled data is a near impossibility.
- The need for adaptivity: Most organizations deal with constant change, both in input data and upstream systems and processes, and in downstream goals and business objectives, which renders existing training data obsolete and forces enterprises to constantly re-label it.
How would you recommend enterprises wrap their arms around developing AI-powered applications?
One of the most important lessons of the ML/AI space over the last few years is that while models, infrastructure, and of course teams are critical, AI-powered systems are made or broken by the data they are trained on: how it is labeled and managed, its quality, and its volume. The key takeaway is that a successful AI strategy must be data-first.
The team at Snorkel has spent over five years developing Snorkel Flow, an end-to-end ML platform that centrally focuses on data. Snorkel Flow uses a unique programmatic approach to create training data, enabling rapid development and deployment of custom AI applications. It drastically reduces the time to value for AI-powered solutions and addresses many of the practical challenges to adopting AI.
Rather than spending weeks or months painstakingly labeling data by hand, Snorkel Flow gives subject matter experts (SMEs) a no-code interface for generating massive amounts of training data in hours using labeling functions. For data scientists and developers, the platform is also deeply configurable, allowing them to train, deploy, monitor, and retrain models in minutes. Enterprises can adapt AI applications to changing inputs and objectives by simply modifying labeling functions instead of repeating a painful hand-labeling process. Enterprises can also trace an ML model’s output back to the specific labeling functions created by individual SMEs; this provenance and lineage help with auditability, explainability, and other compliance requirements.
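As a rough sketch of how such a pipeline fits together, again using the open-source snorkel library rather than Snorkel Flow itself, and assuming the L_train label matrix and df DataFrame from the earlier labeling-function sketch (scaled up to a realistic number of documents), the noisy labeling-function votes are aggregated into training labels that then train an ordinary downstream model:

```python
# A minimal sketch with the open-source `snorkel` library: aggregate noisy
# labeling-function votes into training labels, then train a downstream model.
# Assumes `L_train` (label matrix) and `df` (documents) from the earlier sketch.
from snorkel.labeling.model import LabelModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Learn how much to trust each labeling function; produce one label per example.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
labels = label_model.predict(L=L_train)  # -1 where every labeling function abstained

# Drop uncovered examples and train an ordinary classifier on the rest.
covered = labels != -1
features = TfidfVectorizer().fit_transform(df.text[covered])
clf = LogisticRegression().fit(features, labels[covered])

# When inputs or objectives change, edit or add labeling functions and rerun
# this pipeline instead of hand re-labeling the dataset.
```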
Our mission at Snorkel is to make AI practical for all enterprises.
What does inclusion on the Enterprise Tech 30 mean for Snorkel?
We are honored to be nominated and selected for the Enterprise Tech 30 list. For years, the Enterprise Tech 30 has been a definitive list of the 30 most promising private companies in enterprise technology, as determined by leading venture capitalists. We’ve consistently heard from Fortune 500 CIOs that they have been disappointed with their progress on AI, primarily because they get stuck on the data. We believe this year’s list represents the companies on track to tectonically change how enterprises operate in the future, with their focus on data, automated business processes, APIs, and no-code platforms.
The views and opinions expressed herein are the views and opinions of the author and do not necessarily reflect those of Nasdaq, Inc.