What Really Is Agentic AI?
tl;dr
- Most "agentic AI" is mislabeled workflow automation: Many systems marketed as autonomous agents are deterministic processes with LLMs in one or two steps
- True autonomy exists on a spectrum, not as binary: Real-world deployments operate at different levels, from tool-using assistants to continuous action loops like Waymo
- The Practical Taxonomy clarifies five distinct levels: From non-agentic assistants (Level 0) to full physical-world autonomy (Level 5), helping cut through vendor claims
- Apply the Crisp Test to evaluate any agent: Five questions reveal whether a system is meaningfully agentic: goal pursuit, planning, tool execution, exception handling, and operational independence
- Software engineering shows the clearest wins: Nubank's 12x efficiency gain demonstrates where bounded autonomy actually delivers ROI in deterministic environments
- Customer experience highlights the limits: Klarna's January 2026 reversal from AI-only to human-hybrid shows that efficiency can't replace empathy in high-stakes interactions
If you've been following enterprise AI discussions lately, you've probably noticed "agentic AI" has become the new buzzword. Every vendor claims their system is autonomous. Every press release promises agents that work while you sleep. Every demo shows something remarkable.
But here's what we're seeing in actual deployments: a lot of what's marketed as "agentic" is workflow automation with an LLM in a couple of steps. The agent doesn't plan; a human designer did. The agent doesn't choose actions; a rules engine does. The AI fills slots in a predetermined sequence, then someone calls it autonomous.
This matters because the gap between marketing claims and operational reality affects both technology investment decisions and reasonable expectations about what AI can actually do. So let's establish a practical framework for separating genuine autonomy from rebranded automation.
Autonomy Isn't Binary, It's a Spectrum
Autonomy and agency aren't yes/no propositions. They exist on a spectrum, and most production systems sit somewhere in the middle for good operational reasons: risk management, compliance requirements, reversibility needs, and cost control.
Even Waymo operates within strict constraints: geofenced areas, defined operating conditions, fallback protocols, and remote support infrastructure. If your definition of autonomy is "no constraints," it doesn't exist in production.
A better definition: An agent is autonomous to the degree that it can choose actions, sequence them over time, and recover from setbacks in pursuit of a goal, with limited or no human intervention, within explicit safety and policy boundaries. This separates "rule set" from "autonomy." Rules define boundaries. Autonomy shows up in how the system behaves inside those boundaries.
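To make that concrete, here is a minimal sketch of bounded autonomy, in Python, with invented action names and a random draw standing in for real-world execution. The POLICY table is the rule set; the selection, retry, and escalation logic is where the autonomy lives.

```python
import random
from dataclasses import dataclass

# The boundary: policy rules define what the agent MAY do.
POLICY = {"max_steps": 8, "allow_irreversible": False}

@dataclass
class Action:
    name: str
    reversible: bool
    success_rate: float  # toy stand-in for real-world uncertainty

def policy_allows(action: Action) -> bool:
    return action.reversible or POLICY["allow_irreversible"]

# The autonomy: how the agent behaves INSIDE that boundary.
def run_agent(goal: str, actions: list[Action]) -> str:
    progress = 0
    for step in range(POLICY["max_steps"]):
        allowed = [a for a in actions if policy_allows(a)]
        if not allowed:
            return "escalated: no permitted action left"
        action = max(allowed, key=lambda a: a.success_rate)  # choose an action
        if random.random() < action.success_rate:            # execute it
            progress += 1
        else:
            # Recover from a setback: drop the failing approach and replan.
            actions = [a for a in actions if a.name != action.name]
        if progress >= 3:  # toy stand-in for "goal achieved"
            return f"goal '{goal}' reached in {step + 1} steps"
    return "escalated: step budget exhausted"

print(run_agent("reconcile invoices", [
    Action("query_ledger", reversible=True, success_rate=0.9),
    Action("draft_summary", reversible=True, success_rate=0.7),
    Action("post_payment", reversible=False, success_rate=0.99),  # blocked by policy
]))
```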
The Practical Taxonomy: Five Levels of Agency
Based on patterns emerging across verified deployments, here's a taxonomy that maps to actual operational characteristics. Use this to evaluate vendor claims:
Level 0 — Assistant (Non-Agentic)
Pure question-and-answer or content generation. No tool use. No execution. ChatGPT in its basic form, Claude answering questions, most conversational interfaces. These systems respond to prompts but don't pursue goals.
The defining characteristic: it waits for your next instruction. There's no continuity of purpose between interactions. Each prompt starts fresh.
Level 1 — Tool-Using Assistant (Minimally Agentic)
The system can call tools, but typically under user micro-direction. You tell it "run this query" or "send this payload" and it executes that specific instruction. Autonomy is low because initiative sits entirely with the user.
Think of coding assistants that execute commands you specify. The user remains the planner and decision-maker. The system is a sophisticated tool, not an agent.
Level 2 — Orchestrated Workflow (Often Mislabeled "Agentic")
This is where the marketing confusion starts. You see a deterministic workflow engine, essentially a rules-based process, with an LLM used for classification, summarization, or content generation at specific steps. The planning happens in the workflow designer's head, not in the model.
These systems deliver value, but calling them autonomous is a stretch. The "agent" follows a predetermined path: if the email contains X, route it to queue Y, generate response Z from a template. There's no dynamic replanning based on results. It's automation with AI components, not an autonomous agent.
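A minimal sketch of this pattern, where the two *_with_llm functions are hypothetical stand-ins for real model calls: notice that the plan lives entirely in the human-authored branches, and the model never chooses the path.

```python
# A Level 2 "agent" in miniature: the branching below IS the plan, and a
# human wrote it. The LLM only fills slots at fixed steps.

def classify_with_llm(email: str) -> str:
    # Stand-in for an LLM classification step.
    return "refund" if "refund" in email.lower() else "other"

def generate_with_llm(template: str, email: str) -> str:
    # Stand-in for an LLM generation step.
    return template.format(snippet=email[:40])

def handle_email(email: str) -> str:
    category = classify_with_llm(email)         # LLM step 1: classify
    if category == "refund":                    # human-authored branch
        queue, template = "billing", "We received your refund request: '{snippet}'"
    else:
        queue, template = "general", "Thanks for writing in: '{snippet}'"
    reply = generate_with_llm(template, email)  # LLM step 2: generate
    return f"[{queue}] {reply}"                 # fixed path, no replanning

print(handle_email("Hi, I'd like a refund for order 4412."))
```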
Level 3 — Supervised Agent (Common in Enterprise)
Now we're getting to real agency. The agent plans and executes multi-step tasks. Human approval gates exist at critical points: payments, customer communications, record updates. Between those gates, real autonomy exists. Risk is bounded by the approval architecture.
Most enterprise "agents" land here, and it's a rational design choice. When actions can't be easily reversed, when liability matters, when auditability is required, you gate the autonomy. That's not a failure; it's appropriate engineering.
Healthcare's clinical documentation systems exemplify this level. Ambience Healthcare's assistant at Cleveland Clinic achieved 75% voluntary adoption across 4,000 physicians after a single use of the tool.[1] The system listens to patient visits, extracts diagnoses, maps them to billing codes, and generates clinical notes. But a physician reviews and approves before anything enters the medical record. The approval gate maintains accountability while the agent delivers measurable time savings.
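A minimal sketch of the approval-gate pattern, with invented step names rather than any vendor's actual pipeline: the agent runs freely between gates, and irreversible writes pause for a human.

```python
# Level 3 supervised agent: autonomous between gates, paused for human
# sign-off before irreversible steps. Step names are illustrative only.

IRREVERSIBLE = {"write_to_medical_record", "submit_billing"}

def approve(step: str) -> bool:
    # Stand-in for a human approval gate (a physician sign-off, say).
    return input(f"Approve '{step}'? [y/N] ").strip().lower() == "y"

def run_supervised(plan: list[str]) -> None:
    for step in plan:
        if step in IRREVERSIBLE and not approve(step):
            print(f"halted at gate: {step}")
            return
        print(f"executed: {step}")  # stand-in for the real tool call

run_supervised([
    "transcribe_visit",          # autonomous
    "extract_diagnoses",         # autonomous
    "map_billing_codes",         # autonomous
    "write_to_medical_record",   # gated: requires human approval
])
```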
Level 4 — Semi-Autonomous Agent
High autonomy in a narrow domain. The system runs continuously or repeatedly without per-task human initiation. It handles exceptions by escalating to humans rather than failing. Common patterns include invoice triage, security alert handling, and IT remediation with rollback capabilities.
The key distinction from Level 3: the agent initiates work on its own schedule rather than waiting for human prompts. It monitors, detects conditions, and acts within its bounded domain, with escalation paths for edge cases.
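In code, the difference is a loop the agent runs on its own schedule. A sketch under stated assumptions: the alert shapes, remediation names, and polling cadence below are all invented for illustration.

```python
import time

# Known-safe fixes define the bounded domain; everything else escalates.
KNOWN_REMEDIATIONS = {"disk_full": "rotate_logs", "cert_expiring": "renew_cert"}

def poll_alerts() -> list[dict]:
    # Stand-in for a real monitoring API returning a batch of alerts.
    return [{"kind": "disk_full", "host": "db-3"},
            {"kind": "unknown_spike", "host": "web-1"}]

def run(cycles: int, poll_seconds: float = 1.0) -> None:
    for _ in range(cycles):  # production would loop indefinitely on a schedule
        for alert in poll_alerts():  # self-initiated: no human prompt per task
            fix = KNOWN_REMEDIATIONS.get(alert["kind"])
            if fix:
                print(f"remediating {alert['host']} via {fix}")  # bounded action
            else:
                print(f"escalating to on-call: {alert}")  # edge case -> human
        time.sleep(poll_seconds)

run(cycles=2)
```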
Level 5 — Full Autonomy in Operational Domain
Continuous perception, planning, and action loop. Hard safety constraints and fail-safes built in. Waymo's paid robotaxi service represents this level within defined operating domains. It makes thousands of micro-decisions per trip without human intervention, though remote support infrastructure exists for edge cases.
Level 5 exists almost exclusively in physical autonomy applications. In software domains, the complexity of business rules, liability concerns, and auditability requirements keep most systems at Level 3 or 4.
The Crisp Test: Five Questions to Cut Through Claims
When someone tells you their system is an "autonomous agent," ask these five questions. A system is meaningfully agentic if it answers yes to at least three, and ideally four, of them:
1. Goal Pursuit: Does it accept a goal, not just a prompt, and work until that goal is achieved?
2. Planning: Does it create and revise a plan based on intermediate outcomes, or just follow a predetermined sequence?
4. Tool Execution: Does it take actions in external systems (API calls, database writes, UI automation), or does it only generate text?
4. Exception Handling: Does it detect failure states and adapt through retries, alternate paths, or intelligent escalation?
5. Operational Independence: Can it complete tasks end-to-end without continuous user steering, even if occasional approvals are required?
Apply this test to any claimed agent. You'll quickly separate genuine autonomy from glorified chatbots.
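As a quick rubric, the test reduces to a score. The thresholds below follow the three-of-five rule of thumb above; the question labels are this article's, not a standard.

```python
CRISP_QUESTIONS = [
    "goal_pursuit", "planning", "tool_execution",
    "exception_handling", "operational_independence",
]

def crisp_verdict(answers: dict[str, bool]) -> str:
    score = sum(answers.get(q, False) for q in CRISP_QUESTIONS)
    if score >= 4:
        return f"{score}/5: meaningfully agentic"
    if score == 3:
        return f"{score}/5: borderline, likely agentic"
    return f"{score}/5: automation with AI components"

# A Level 2 workflow engine that only calls tools at fixed steps:
print(crisp_verdict({"tool_execution": True}))
```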
Where Real Agentic AI Is Actually Working
The State of Agentic AI 2026 report analyzing Q4 2025 and Q1 2026 deployments reveals a clear split: spectacular success in deterministic domains, painful corrections in contexts requiring human judgment.
Software Engineering: The First "Solved" Vertical
Code is functional and testable. It either compiles or it doesn't. This binary feedback loop allows agents to plan, execute, debug, and iterate autonomously.
Nubank faced a classic tech-debt crisis: an eight-year-old ETL monolith with dependencies 70 layers deep. Manual refactoring estimates suggested 1,000 engineers working for 18 months. Instead, they deployed Cognition's Devin agents, which completed the migration in weeks, achieving a 12x improvement in engineering efficiency and 20x cost savings compared to the human baseline.[2]
The agents didn't just write code; they wrote tests to verify that the new code produced outputs identical to the legacy system's, creating a self-validating loop. This pattern of translating logic from format A to format B with automated verification suggests that legacy modernization (banks off COBOL, healthcare off legacy ERPs) will be the first trillion-dollar industry disrupted by agentic AI.
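The core of that self-validating loop is differential testing: run both implementations over the same inputs and diff the outputs. A minimal sketch, with toy transforms standing in for real ETL logic:

```python
# Differential test harness: any record where the legacy and migrated
# outputs diverge is a regression fed back to the agent to fix.

def legacy_transform(record: dict) -> dict:
    return {"id": record["id"], "total": record["price"] * record["qty"]}

def migrated_transform(record: dict) -> dict:
    return {"id": record["id"], "total": record["qty"] * record["price"]}

def differential_test(records: list[dict]) -> list[dict]:
    return [r for r in records
            if legacy_transform(r) != migrated_transform(r)]

sample = [{"id": 1, "price": 9.99, "qty": 3}, {"id": 2, "price": 4.50, "qty": 0}]
mismatches = differential_test(sample)
print("migration verified" if not mismatches else f"regressions: {mismatches}")
```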
Even in this success domain, Cognition's own assessment reveals constraints. Devin has "senior-level knowledge" in that it knows entire codebases instantly, but it only has "junior-level execution." It fails when requirements are vague, struggles with long tasks where goals shift, and lacks institutional awareness about team dynamics.[3] The operational reality: software engineering is shifting from "writing code" to "managing agents."
Customer Experience: The Efficiency Paradox
While software engineering celebrated, customer service faced a reckoning. Klarna's trajectory provides the most significant longitudinal data on automation limits.
Throughout 2025, Klarna aggressively pursued automation ahead of its IPO. By Q4, their OpenAI-powered agent handled two-thirds of customer interactions, equivalent to 853 full-time employees. The company reported $60 million in annualized savings and froze hiring.[4]
But actual costs told a different story. Customer service and operations costs rose despite theoretical savings. The agent excelled at simple tasks, but failed catastrophically at complex, emotional issues: financial hardship discussions, fraud disputes, escalation scenarios.
In January 2026, Klarna reversed course. The CEO admitted the strategy led to "lower quality" service, acknowledging that "cost seems to have been a too predominant evaluation factor."[5] The company began rehiring humans as remote "super-users" handling complex queries. A new tiering emerged: AI for routine queries (80% of volume), humans for disputes and emotional contexts.
The lesson: AI agents can't replace human empathy in high-stakes service recovery. Access to human agents is becoming a premium feature while AI handles mass-market interactions, and this is a fundamental shift in service delivery economics.
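The tiering itself is simple to express. A sketch of the routing split described above, where the keyword heuristic is a deliberately crude stand-in for a real classifier:

```python
# Route routine volume to the AI tier; send high-stakes or emotionally
# loaded contexts to humans. Topic list and heuristic are illustrative.

ESCALATE_TOPICS = {"fraud", "dispute", "hardship", "complaint", "bereavement"}

def route(query: str) -> str:
    words = set(query.lower().split())
    if words & ESCALATE_TOPICS:
        return "human"   # premium human tier
    return "ai"          # automated tier (~80% of volume)

print(route("Where is my order?"))                    # -> ai
print(route("I want to dispute this fraud charge."))  # -> human
```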
When "Autonomous" Means "Supervised"—And Why That's OK
The most common critique: "But these systems have guardrails, approval gates, and human oversight. They're not really autonomous."
This misunderstands production systems. All real autonomy runs on constraints. Waymo operates in defined weather conditions, geofenced areas, with remote support. Boeing autopilots have numerous safety limits and pilot override capabilities. Medical devices have fail-safes and monitoring requirements.
The question isn't "Are there constraints?" but "Within the constraints, does the system choose and execute actions autonomously?" A supervised agent that plans multi-step workflows, retries failed attempts with different approaches, and escalates intelligently demonstrates real autonomy, even if policy rules constrain what it can do.
The critique becomes valid when the "agent" is just a deterministic state machine where an LLM fills slots. That's workflow automation with AI components, not an autonomous agent. The distinction: Does the system select among tools dynamically? Does it sequence steps based on intermediate results? Does it retry with alternative strategies when the first approach fails? If yes, it's agentic. If it just follows a flowchart, it's automation.
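The distinguishing behavior fits in a few lines. A sketch with invented strategy names and a simulated failure: when one approach fails, an agent selects an alternative rather than halting at a fixed flowchart edge.

```python
def attempt(strategy: str, customer_id: str) -> bool:
    # Stand-in for a real tool call; here the API path always fails.
    return strategy != "api_call"

def fetch_record(customer_id: str) -> str:
    for strategy in ("api_call", "database_query", "ui_automation"):
        if attempt(strategy, customer_id):           # dynamic tool selection
            return f"{customer_id}: fetched via {strategy}"  # retry succeeded
    return f"{customer_id}: escalated to a human"    # intelligent escalation

print(fetch_record("C-1042"))
```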
Final Thoughts
The state of agentic AI in early 2026 reveals a clear pattern: spectacular success in deterministic environments where verification is possible, significant limits in contexts requiring empathy and judgment, and a production reality far more nuanced than vendor marketing suggests.
True autonomy exists, but almost always as bounded autonomy with escalation paths and approval gates. That's not a limitation to apologize for; it's appropriate engineering for production systems where reliability, reversibility, and accountability matter.
The Practical Taxonomy and Crisp Test provide tools to evaluate claims. Most enterprise deployments land at Level 3 (supervised agents) or Level 4 (semi-autonomous in narrow domains). Level 5 exists almost exclusively in physical autonomy applications, and even there, within strict operational boundaries.
For business leaders evaluating agentic AI investments: ignore the autonomy rhetoric. Focus on the five Crisp Test questions. Look for deployments where verification is possible, where the domain is well-defined, and where human oversight can be appropriately architected. The winners aren't replacing people, they're building infrastructure that allows people to command machine workforces effectively.
The hype will continue. The term "agentic" will be applied to everything from chatbots to thermostats. But now you have a framework to separate genuine autonomy from rebranded automation, and to make technology decisions based on operational reality rather than marketing claims.
References
1. Cracking the Medical Code: Why Cleveland Clinic Doctors Love Their Ambience Healthcare AI Scribe — The Cognitive Revolution
2. Devin AI Autonomous Coding review 2025 — Devin AI
3. Devin's 2025 Performance Review: Learnings From 18 Months — Cognition
4. Klarna says its AI agent is doing the work of 853 employees — CX Dive
5. Klarna changes its AI tune and again recruits humans for customer service — CX Dive