Insurance at the Forefront
The Berkeley study's 20 case studies explicitly include "Insurance claims workflow automation" as a primary use case under Business Operations, validating that this domain sits at the leading edge of production AI deployment. Finance and Banking is the largest application domain at 39.1%, far outpacing every other industry.
The 95% Failure Rate is Real
The Berkeley study confirms what the industry has long suspected: 95% of AI agent deployments fail to reach production scale. But the research reveals something more important: the failures are not caused by inadequate AI models or insufficient orchestration frameworks. They are caused by missing infrastructure.
Production Agent Characteristics (UC Berkeley MAP Study, Dec 2025)
- Off-the-shelf models only
- Bounded autonomy (10 steps or fewer)
- Minimal autonomy (<5 steps)
- LLM-as-judge evaluation plus human verification
Why Teams Build Agents
The study reveals that productivity gains drive adoption, not novel capabilities. Among practitioners with deployed agents:
- 72.7% cite "Increasing Productivity": speed of task completion over previous systems
- 63.6% cite "Reducing Human Task-Hours": direct labor cost reduction
- 50.0% cite "Automating Routine Labor": freeing experts for higher-value work
- Only 12.1% cite "Risk Mitigation": harder-to-measure benefits remain underexplored
This pattern explains why measurable document automation delivers faster ROI than speculative "AI transformation" initiatives.
The Framework Paradox
Perhaps the most striking finding: 85% of successful production deployments bypassed third-party agent frameworks entirely. Teams that started with LangChain, CrewAI, or similar frameworks during prototyping frequently migrated to custom implementations for production. The reason? Control and simplicity.
Practitioners find that prompt engineering with frontier models is already sufficient for many target use cases. Teams prefer building minimal, purpose-built scaffolds rather than managing the dependency bloat and abstraction layers of large frameworks.
UC Berkeley, "Measuring Agents in Production," December 2025
Open-Source Models: Regulatory Driver
Only 3 of 20 case studies use open-source models, but the reasons matter. Open-source adoption is driven by specific constraints rather than general preference: high-volume workloads where inference costs become prohibitive, and regulatory requirements preventing data sharing with external providers. For regulated industries like insurance, data sovereignty requirements make on-premises open-source deployment increasingly attractive.
What Actually Works
The Berkeley research reveals a clear pattern: successful production agents are simple, constrained, and human-supervised:
- 92.5% serve human users directly (not other agents or automated systems). Agents augment human decision-making rather than replacing it.
- 66% allow response times of minutes or longer. Production agents do not need real-time speed; they need reliable outputs.
- 79% rely on manual prompt construction. Automated prompt optimization remains rare; teams prioritize controllability over sophistication.
- 80% use predefined static workflows over open-ended autonomous planning. Reliability over flexibility.
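To illustrate the static-workflow pattern in the last point, here is a minimal sketch of a predefined pipeline for a claims document: the step order is fixed in code, and the model is never asked to plan. The step names, prompts, and the call_llm() helper are illustrative assumptions.

```python
# Predefined static workflow: the step sequence is fixed in code,
# not planned by the model at runtime. Sketch only; call_llm() and
# the prompts are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a single model completion call."""
    raise NotImplementedError

def classify(doc: str) -> str:
    return call_llm(f"Classify this claim document by type:\n{doc}")

def extract_fields(doc: str) -> str:
    return call_llm(f"Extract claimant, policy number, and loss date as JSON:\n{doc}")

def summarize(doc: str, fields: str) -> str:
    return call_llm(f"Summarize this claim for an adjuster.\nFields: {fields}\nDocument:\n{doc}")

def run_claim_workflow(doc: str) -> dict:
    # Same three steps, in the same order, every time: reliability over flexibility.
    doc_type = classify(doc)
    fields = extract_fields(doc)
    summary = summarize(doc, fields)
    return {"type": doc_type, "fields": fields, "summary": summary}
```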
The Insurance Evaluation Challenge
The Berkeley study highlights a critical challenge for regulated industries: "In regulated fields like insurance underwriting, the absence of public data forces teams to handcraft benchmark datasets from scratch." As a result, 75% of teams forgo benchmark creation entirely, relying on human feedback instead.
Even more concerning: agents deployed in insurance receive true correctness signals only through real consequences such as financial losses or delayed patient approvals. These signals arrive slowly and in forms difficult to automate. This is why schema-driven extraction with confidence scores and human verification loops is essential for regulated deployments.
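One possible shape for that pattern, assuming pydantic v2 for the schema and a hypothetical extract_with_confidence() model call: fields that fall below a confidence threshold are routed to a human reviewer instead of being trusted automatically. The schema, threshold, and function names are illustrative, not from the study.

```python
# Schema-driven extraction with per-field confidence scores and a
# human verification loop. Sketch only; extract_with_confidence()
# and the 0.9 threshold are illustrative assumptions.
from pydantic import BaseModel

class ClaimFields(BaseModel):
    claimant_name: str
    policy_number: str
    loss_date: str           # ISO date string, kept simple for the sketch
    claimed_amount: float

CONFIDENCE_THRESHOLD = 0.9   # below this, a human verifies the field

def extract_with_confidence(document: str) -> tuple[dict, dict[str, float]]:
    """Hypothetical model call returning extracted fields plus a
    confidence score per field (e.g., from log-probs or a judge model)."""
    raise NotImplementedError

def process_claim(document: str) -> dict:
    raw_fields, confidences = extract_with_confidence(document)
    claim = ClaimFields.model_validate(raw_fields)   # schema enforcement
    needs_review = [
        name for name, score in confidences.items()
        if score < CONFIDENCE_THRESHOLD
    ]
    return {
        "claim": claim.model_dump(),
        "needs_human_review": needs_review,   # routed to a reviewer queue
        "auto_approved": not needs_review,
    }
```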
The Common Thread: Data Quality
Across all 20 case studies, a consistent pattern emerges: the teams that succeed are the ones that solve the data problem first. Cleanlab's 2025 survey of 95 engineering leaders found that 70% rebuild their AI stack every 3 months. Not because of model limitations, but because the data and reliability layers keep shifting underneath them.
The challenge is not building the agent. It's building on a surface that doesn't stop moving.
Cleanlab, "AI Agents in Production," August 2025
Galileo's December 2025 analysis goes further: poor data quality transforms agents from assets into "unpredictable liabilities." Agents cannot distinguish between missing data and intentionally empty fields, which leads to dangerous assumptions. The conclusion is stark: Garbage In, Failure Out.
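One way to remove that missing-versus-empty ambiguity at the data layer, sketched here assuming pydantic v2, is to record which fields the source document actually supplied, so downstream logic can tell "never extracted" apart from "explicitly empty."

```python
# Distinguishing "field absent from the source" from "field present
# but intentionally empty." Sketch assuming pydantic v2; the schema
# and values are illustrative.
from typing import Optional
from pydantic import BaseModel

class ExtractedClaim(BaseModel):
    claimant_name: Optional[str] = None
    prior_claims: Optional[int] = None   # None alone cannot say "unknown" vs "none filed"

doc_a = ExtractedClaim(claimant_name="J. Doe")                     # prior_claims never extracted
doc_b = ExtractedClaim(claimant_name="J. Doe", prior_claims=None)  # explicitly reported as empty

# model_fields_set records which fields the source actually provided,
# so "missing" can be flagged for review while "empty" passes through.
print("prior_claims" in doc_a.model_fields_set)  # False -> missing, flag for review
print("prior_claims" in doc_b.model_fields_set)  # True  -> intentionally empty
```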
Research Conclusion
The Berkeley study validates the "Missing Layer" thesis: AI failures in regulated industries are infrastructure failures, not model failures. Solving the data layer (converting unstructured documents to structured, schema-driven outputs) is the prerequisite for everything else.