Data Infrastructure and the AI-Era Investment Landscape

In 2011, Marc Andreessen wrote that software was eating the world. In 2025, data is eating AI. The companies that will determine which AI applications succeed and which fail are not the ones building the most impressive models -- they are the ones solving the unglamorous, expensive, and genuinely hard problem of getting clean, well-structured, properly governed data to those models at production scale.

This thesis has informed a significant fraction of our investment activity since we founded Milestone AI Ventures. We have made eight investments in data infrastructure companies over three years, and this category represents our single highest-conviction area for continued investment from our second fund. In this piece, we want to explain exactly why we believe data infrastructure is a $200B opportunity that most investors are systematically underestimating, and where we see the most compelling companies being built within the category.

The Scale of the Data Problem

To understand the data infrastructure opportunity, you need to start with an honest accounting of how difficult it actually is to deploy AI in enterprise production environments. The public narrative about AI deployment is dominated by impressive demos and proofs of concept -- which are, frankly, easy. The hard part is getting AI out of the lab and into reliable, auditable, governable production deployment at enterprise scale.

Every enterprise AI deployment we have observed -- across our portfolio and through diligence conversations with hundreds of enterprise technology buyers -- follows the same pattern. Proof-of-concept with clean, curated, pre-formatted demo data: works great, takes two weeks, impressive demo for the executive sponsor. Moving to production with actual enterprise data: six to eighteen months of painful data integration work that no one planned for, significant infrastructure investment, and multiple false starts before something reliable is running in production.

The source of this pain is structural: enterprise data is distributed across dozens of systems in incompatible formats, rarely up-to-date, inconsistently labeled, subject to complex access controls, and full of quality issues that only become apparent when they cause an AI model to produce confidently wrong outputs in front of a customer. Solving this problem is not a software engineering challenge -- it is a systems engineering challenge that requires specialized tools, deep enterprise integration expertise, and patient partnership with customers who have been burned by previous data integration projects.

The Five Categories of Data Infrastructure We Are Investing In

The data infrastructure landscape for AI is not a single market -- it is a collection of distinct markets at different stages of maturity, addressing different layers of the data-to-AI pipeline. Here is how we organize our thinking about the five most compelling investment categories within data infrastructure.

1. Vector Databases and Semantic Search Infrastructure

The emergence of retrieval-augmented generation (RAG) as the dominant architecture for enterprise LLM applications has created an enormous and fast-growing market for vector databases and semantic search infrastructure. RAG enables AI applications to pull relevant context from large document corpora in real time, dramatically improving response quality and reducing the hallucination problems that plague pure language model responses.
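For readers less familiar with the mechanics, here is a minimal sketch of the retrieval step in a RAG pipeline. The embed() function and the tiny in-memory index are illustrative stand-ins for a real embedding model and vector database; this shows the pattern, not any particular vendor's implementation.

    import numpy as np

    def embed(text: str, dim: int = 8) -> np.ndarray:
        # Stand-in embedding: deterministic pseudo-vector seeded by the text hash.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(dim)
        return v / np.linalg.norm(v)

    corpus = [
        "Refunds are issued within 14 days of a return request.",
        "The standard warranty covers parts and labor for one year.",
        "Orders ship within two business days.",
    ]
    index = np.stack([embed(doc) for doc in corpus])   # rows are unit document vectors

    def retrieve(query: str, k: int = 2) -> list[str]:
        scores = index @ embed(query)                  # cosine similarity (unit vectors)
        top = np.argsort(scores)[::-1][:k]
        return [corpus[i] for i in top]

    question = "How do I get a refund?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"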

The vector database market was essentially nonexistent three years ago. Today, we estimate it at $2-3B in annual spend and growing at over 100% year-over-year as enterprise RAG deployments accelerate. The market is large enough to support multiple significant companies -- there is room for specialized solutions in different segments of the enterprise market, different scales of deployment, and different technical tradeoffs between query latency, index freshness, and cost.

Our portfolio company DataNexus is competing in this market with a focus on the largest-scale enterprise deployments -- petabyte-scale document corpora where existing solutions struggle with query latency and index update performance. Their technical differentiation lies in an approximate nearest-neighbor algorithm that is particularly well suited to hybrid dense-sparse retrieval workloads, and this is exactly the kind of technical depth that creates durable competitive advantage in infrastructure markets.
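For context on what "hybrid dense-sparse retrieval" combines: a common approach runs a dense (vector) search and a sparse (keyword) search separately, then fuses the two rankings. The sketch below uses reciprocal rank fusion, a standard textbook technique, not DataNexus's proprietary algorithm; the document IDs are made up.

    def rrf(dense_ranking: list[str], sparse_ranking: list[str], k: int = 60) -> list[str]:
        # Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per document.
        scores: dict[str, float] = {}
        for ranking in (dense_ranking, sparse_ranking):
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # dense_ranking would come from the vector index, sparse_ranking from BM25 keyword search.
    merged = rrf(["doc3", "doc7", "doc1"], ["doc7", "doc2", "doc3"])   # doc7 and doc3 rise to the top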

2. Data Labeling and Curation Platforms

The quality of AI model outputs is fundamentally limited by the quality of training data. For enterprise AI applications trained or fine-tuned on proprietary data, the data labeling and curation step is often the primary bottleneck to improving model performance. Traditional labeling approaches -- hiring large teams of annotators, building internal labeling platforms -- are slow, expensive, and produce inconsistent quality at scale.

The opportunity is to dramatically accelerate and reduce the cost of the data labeling and curation process using AI-assisted approaches. Active learning -- where the model identifies the examples it is most uncertain about and prioritizes those for human labeling -- can reduce the number of labels required to reach a target accuracy by 50-80%. Semi-supervised and weakly supervised techniques can bootstrap labels for new categories from a small set of existing labeled examples. And specialized labeling quality-control systems can detect and correct systematic errors in human-generated labels that would otherwise degrade model performance.
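To make the active learning idea concrete, the sketch below shows uncertainty sampling, its simplest form: score the unlabeled pool with the current model and send the least-confident examples to human annotators. The probability matrix stands in for any probabilistic classifier; this illustrates the general technique, not any vendor's proprietary algorithm.

    import numpy as np

    def select_for_labeling(probs: np.ndarray, budget: int) -> np.ndarray:
        # probs: (n_examples, n_classes) predicted probabilities over the unlabeled pool.
        confidence = probs.max(axis=1)            # confidence of each top prediction
        return np.argsort(confidence)[:budget]    # least-confident examples go to annotators first

    pool_probs = np.array([[0.97, 0.03], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]])
    to_label = select_for_labeling(pool_probs, budget=2)   # indices 3 and 1: the most ambiguous rows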

OmniLabel, our portfolio company in this space, has built a platform that addresses multimodal labeling -- images, video, text, and 3D point clouds -- with a unified workflow that reduces end-to-end labeling cost by 80% compared to traditional approaches. Their competitive advantage is their proprietary active learning algorithm, which generates better quality training data with fewer human labels by identifying the most information-rich examples for human annotation. For the autonomous vehicle, robotics, and medical imaging companies that represent their core customer segments, better training data translates directly into higher model accuracy and faster development cycles.

3. Real-Time Feature Stores and Data Pipelines for AI

Machine learning models in production make decisions based on features -- computed representations of raw data that the model was trained to interpret. In simple batch inference scenarios, features can be computed ahead of time and cached. But as AI applications move toward real-time personalization, dynamic pricing, fraud detection, and other latency-sensitive use cases, the feature store must deliver freshly computed features in milliseconds while handling millions of concurrent requests.

Existing general-purpose data infrastructure was not designed for this pattern. Batch ETL pipelines introduce hours-long latency. Traditional databases cannot handle the query volume of high-throughput inference systems. And maintaining consistency between the offline features a model was trained on and the online features it is served at inference time -- the so-called "training-serving skew" problem -- is operationally complex; inconsistency here causes subtle but consequential model degradation over time that is difficult to diagnose and fix.
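One common mitigation, sketched below in Python under illustrative assumptions about the event schema and feature names, is to define each feature transformation exactly once and call the same code from both the offline training path and the online serving path, parameterized by the point in time the features should reflect.

    from datetime import datetime, timezone

    def user_features(raw_event: dict, as_of: datetime) -> dict:
        # Single source of truth for feature logic, imported by both the batch
        # training pipeline (as_of = label timestamp) and the online serving
        # path (as_of = request time).
        age_days = max((as_of - raw_event["signup_date"]).days, 1)
        return {
            "account_age_days": age_days,
            "orders_per_day": raw_event["order_count"] / age_days,
        }

    event = {"signup_date": datetime(2024, 1, 15, tzinfo=timezone.utc), "order_count": 42}

    # Offline: replayed over historical events with their original timestamps.
    train_row = user_features(event, as_of=datetime(2024, 6, 1, tzinfo=timezone.utc))
    # Online: the same function, called at request time just before inference.
    serve_row = user_features(event, as_of=datetime.now(timezone.utc))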

The market for real-time AI data pipelines is in its early innings but growing rapidly as enterprises move from batch AI to real-time AI. We have one portfolio company in this space and are actively evaluating several others. The key differentiator we look for is a clear technical solution to the training-serving skew problem -- the core reliability issue that prevents enterprises from deploying more real-time AI -- combined with a data model simple enough for ML engineers to use without deep infrastructure expertise.

4. AI Observability and Model Monitoring

Once an AI application is in production, the hard work of maintaining it begins. Language models drift as the world changes and user behavior evolves. Embedding models become stale as new vocabulary and concepts emerge. Feature distributions shift in ways that cause predictions to degrade silently. And hallucinations and toxicity failures can cause reputational and regulatory consequences that far exceed the cost of preventing them.
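To make one of these failure modes concrete, the sketch below computes a population stability index (PSI) between a feature's training distribution and its recent production distribution, a common way to flag silent distribution shift. The thresholds in the comments are rules of thumb rather than standards, and the data is synthetic.

    import numpy as np

    def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
        # Population stability index between the training-time ("expected") and
        # recent production ("actual") distributions of one feature or score.
        edges = np.histogram_bin_edges(expected, bins=bins)
        e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        e_pct = np.clip(e_pct, 1e-6, None)
        a_pct = np.clip(a_pct, 1e-6, None)
        return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

    train_scores = np.random.default_rng(0).normal(0.0, 1.0, 10_000)
    prod_scores = np.random.default_rng(1).normal(0.4, 1.1, 10_000)   # production has drifted
    print(psi(train_scores, prod_scores))   # roughly 0.1 means "watch", roughly 0.25 means "investigate"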

AI observability -- the ability to understand in real time how an AI system is performing, where it is failing, and why -- is rapidly becoming a mandatory capability for any enterprise deploying AI in customer-facing or high-stakes internal applications. The regulatory environment is accelerating this trend: the EU AI Act and emerging US AI regulations are requiring enterprises to maintain audit trails, bias monitoring, and explainability documentation for AI systems deployed in sensitive domains.

Guardrail Systems, our portfolio company in this space, addresses the highest-stakes segment of the AI observability market: financial services and healthcare applications where AI errors have material regulatory and reputational consequences. They have built a real-time monitoring platform that provides hallucination detection, bias monitoring, prompt injection detection, and regulatory compliance reporting for production LLM applications. Their technology is defensible because their detection algorithms are trained on millions of anonymized production AI interactions -- a proprietary dataset that gets stronger with each new customer deployment.

5. Data Governance and AI Compliance Infrastructure

The final category in our data infrastructure thesis is the governance and compliance layer that enterprise AI deployments require. As AI systems handle sensitive customer data, make consequential decisions, and interact with regulated industries, the governance infrastructure -- access controls, lineage tracking, consent management, audit logging, and regulatory reporting -- becomes a critical path dependency for deployment.

This market is in the earliest stages of development. Most enterprises today are cobbling together governance solutions from general-purpose tools that were not designed for AI-specific compliance requirements. As AI regulations crystallize -- and they are crystallizing faster than most people in the technology industry expect -- demand for purpose-built AI governance infrastructure will accelerate dramatically. We are actively looking for companies in this space and would be excited to meet teams building the governance layer for enterprise AI.

The $200B Market Sizing Rationale

We estimate the total addressable market for AI data infrastructure at $200B within a decade based on a straightforward bottom-up analysis. Global enterprise software spending is approximately $900B annually. AI-specific software spending is currently $50-60B; it is growing at 40-50% per year today, and even as that growth rate decelerates, it will likely reach $300-400B within a decade as AI permeates every software category. Within that AI software spend, infrastructure has historically represented 40-50% of the total -- consistent with the pattern in cloud computing, where infrastructure spending on AWS, Azure, and GCP exceeds application spending.

Applying the infrastructure fraction to our projected AI software market gives us $120-200B in AI data infrastructure spend by the early 2030s. We view this as a conservative estimate: it does not account for entirely new categories of infrastructure spending -- AI governance and compliance, multimodal data infrastructure, agent coordination infrastructure -- that do not exist at meaningful scale today but will likely be sizable markets by the end of the decade.
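Spelled out as arithmetic, using the figures quoted above rather than any independent data:

    ai_software_spend = (300e9, 400e9)   # projected annual AI software spend, low and high ($)
    infra_share = (0.40, 0.50)           # historical infrastructure fraction of software spend

    low = ai_software_spend[0] * infra_share[0]    # $120B
    high = ai_software_spend[1] * infra_share[1]   # $200B
    print(f"AI data infrastructure TAM: ${low / 1e9:.0f}B to ${high / 1e9:.0f}B per year")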

Why Most Investors Are Getting This Wrong

Despite the compelling market opportunity, we observe that most VC investors are systematically underweighting data infrastructure relative to the application layer. Several factors explain this pattern. Data infrastructure companies are less exciting in pitches -- it is harder to build a compelling demo for a vector database query optimizer than for a chatbot that writes marketing copy. Enterprise sales cycles for infrastructure are long, which means revenue traction at seed stage is minimal. And the technical depth required to evaluate infrastructure companies appropriately is higher than for application companies.

We believe these factors create a persistent opportunity for investors with the technical depth to evaluate infrastructure companies accurately. At Milestone, we have two general partners with deep infrastructure experience -- Marcus Rivera from his time at Databricks, and our principal David Park from Microsoft Research -- which gives us an unusual ability to assess infrastructure companies' technical differentiation with confidence. This is a structural advantage in a category that most investors approach with insufficient technical depth.

The data infrastructure companies in our portfolio are consistently the ones with the highest follow-on rates, the lowest churn, and the most dramatic improvement in metrics from seed to Series A. This is not a coincidence -- infrastructure companies with genuine technical differentiation are extremely defensible businesses once they achieve deep enterprise integration, and that defensibility compounds over time in ways that application companies rarely match. We expect data infrastructure to represent at least 40% of our Fund II deployment, and we are actively looking for more companies in this category.


Marcus Rivera is a General Partner and Co-Founder of Milestone AI Ventures. He previously served as CTO at Databricks. The views expressed here are his own and do not constitute investment advice.
