Multi-Agent AI Systems
Human-Agent Collaboration in Business Applications
THR770 was a 30-minute theatre session that covered more architectural ground than most 75-minute breakouts. The thesis: enterprise AI agents will not succeed by replacing humans. They will succeed by collaborating with them through well-defined patterns — oversight, escalation, approval, and in-app assistance. The session presented these collaboration patterns with concrete UX demonstrations and implementation guidance, and made the most honest case I heard at Ignite for why the human-in-the-loop requirement is not a limitation but a design feature.
Session: THR770 — Human-Agent Collaboration in Business Applications
Date: Tuesday, Nov 18, 2025
Time: 1:30 PM - 2:00 PM PST
Location: Moscone South, The Hub, Theater D
The collaboration spectrum nobody talks about
Every conversation about AI agents in the enterprise eventually arrives at the same binary question: "Should agents act autonomously or should humans approve everything?" This is the wrong question. Autonomy is not binary — it is a spectrum, and the right position on that spectrum varies by task, context, risk, and trust level.
THR770 presented this spectrum explicitly:
Level 1 — Fully supervised: Agent suggests, human acts. The agent is a recommendation engine with no execution capability.
Level 2 — Approval-gated: Agent proposes an action, human reviews and approves, agent executes. The agent can act but only with explicit permission.
Level 3 — Exception-based: Agent acts autonomously within defined boundaries. Humans are notified only when the agent encounters something outside its boundaries or confidence thresholds.
Level 4 — Fully autonomous with audit: Agent acts without human intervention. All actions are logged for post-hoc review. Humans intervene only when audit reveals problems.
Level 5 — Fully autonomous without oversight: Agent acts without logging or review. This level was presented solely to argue that it should never exist in enterprise applications.
The session's key argument: most enterprises should operate at Level 3 for mature, well-understood processes and Level 2 for everything else. Level 4 is appropriate only for low-risk, high-volume tasks where the cost of occasional errors is lower than the cost of human review. Level 1 is where every agent deployment should start, regardless of the target level, because trust must be earned through observed behaviour.
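The spectrum maps naturally onto a dispatch rule: given an autonomy level and the agent's confidence in a proposed action, decide whether the action is merely suggested, queued for approval, or executed. The names and the 0.75 threshold below are illustrative, not from the session — a minimal sketch:

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    FULLY_SUPERVISED = 1    # agent suggests, human acts
    APPROVAL_GATED = 2      # human approves, agent executes
    EXCEPTION_BASED = 3     # agent acts, escalates exceptions
    AUTONOMOUS_AUDITED = 4  # agent acts, all actions logged

def route_action(level: AutonomyLevel, confidence: float,
                 threshold: float = 0.75) -> str:
    """Decide how a proposed action is handled at a given autonomy level."""
    if level == AutonomyLevel.FULLY_SUPERVISED:
        return "suggest_only"
    if level == AutonomyLevel.APPROVAL_GATED:
        return "queue_for_approval"
    if level == AutonomyLevel.EXCEPTION_BASED:
        # act autonomously inside boundaries, escalate low confidence
        return "execute" if confidence >= threshold else "escalate"
    return "execute_and_audit"  # Level 4; Level 5 deliberately unsupported
```

Note that Level 5 has no branch: per the session's argument, unaudited autonomy should not be representable in an enterprise system.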
Pattern 1: Agent oversight dashboard
The first pattern demonstrated was a dedicated oversight interface where business users monitor autonomous agents assigned to their workflows.
What the demo showed:
A Dynamics 365 business application with an "Agent Activity" panel showing:
- Active agents and their current tasks
- Recent actions taken autonomously
- Pending actions awaiting approval
- Confidence scores for recent decisions
- Exceptions flagged for human review
The critical design choice: Progressive disclosure.
The dashboard does not show every model call, every token, every internal reasoning step. It shows a human-readable summary of actions. Users who want deeper detail can expand an action to see the underlying data, the model's reasoning, and the specific records accessed. But the default view is comprehensible, not comprehensive.
[14:23] Research Agent - Queried customer database for Q3 sales data
[14:24] Research Agent - Retrieved 1,247 records matching criteria
[14:25] Analysis Agent - Started trend analysis on Q3 dataset
[14:26] Analysis Agent - Identified 3 anomalies in regional sales data
[14:27] Analysis Agent - Flagged anomaly for review (confidence: 67%)
[14:28] Analysis Agent - Status changed to "Needs Attention"
The UX detail that mattered: Each agent action showed a "reasoning trace" — a plain-language explanation of why the agent took a specific action. Not the model's internal chain-of-thought, but a structured summary: "Approved expense claim for £142.50 because: amount is within policy limit (£250), expense category (travel) is pre-approved for this employee, receipt was verified by OCR."
{
  "action": "expense_approved",
  "amount": 142.50,
  "currency": "GBP",
  "reasoning": {
    "policy_check": "Amount £142.50 is within £250 travel limit",
    "category_check": "Travel expenses pre-approved for employee",
    "receipt_check": "Receipt verified via OCR, merchant matches claim",
    "anomaly_check": "No unusual patterns detected in recent claims"
  },
  "confidence": 0.94,
  "autonomy_level": "exception_based",
  "human_review_required": false
}
Why reasoning traces change the collaboration dynamic:
Without reasoning traces, the oversight dashboard is just a log. Users see that the agent approved an expense but do not know why. Trust builds slowly because every action is a black box.
With reasoning traces, users can evaluate the agent's judgement, not just its actions. When a user sees the agent's reasoning and agrees with it repeatedly, trust calibrates upward. When the reasoning is wrong — "approved because amount is within policy limit" when the policy limit recently changed — the user catches it immediately and corrects the boundary.
Why this matters: Most agent observability tools are built for developers. They show traces, spans, token counts, and latency metrics. These are useless for a sales manager overseeing an agent that analyses her team's pipeline. The oversight UX needs to speak the user's language, not the developer's language.
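One way to produce traces like the JSON above is to run each policy check as a (passed, plain-language explanation) pair and assemble the summary from those pairs. The check names and thresholds here are hypothetical, sketched from the expense example rather than any API shown in the session:

```python
def expense_reasoning_trace(amount: float, limit: float,
                            category_preapproved: bool,
                            receipt_verified: bool) -> dict:
    """Build a plain-language reasoning trace for an expense decision."""
    checks = {
        "policy_check": (
            amount <= limit,
            f"Amount £{amount:.2f} is within £{limit:.0f} limit"
            if amount <= limit
            else f"Amount £{amount:.2f} exceeds £{limit:.0f} limit"),
        "category_check": (
            category_preapproved,
            "Expense category pre-approved for employee"
            if category_preapproved else "Category requires review"),
        "receipt_check": (
            receipt_verified,
            "Receipt verified via OCR"
            if receipt_verified else "Receipt could not be verified"),
    }
    approved = all(passed for passed, _ in checks.values())
    return {
        "action": "expense_approved" if approved else "expense_escalated",
        "reasoning": {name: text for name, (_, text) in checks.items()},
        "human_review_required": not approved,
    }
```

The key property: the explanation text is generated alongside the decision, not reconstructed afterwards, so the trace cannot drift out of sync with the logic that produced it.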
Pattern 2: Escalation patterns
The second pattern addressed what happens when an agent encounters something it cannot handle confidently.
The escalation taxonomy presented:
Confidence-based escalation: Agent's confidence score falls below a configured threshold. This is the most common escalation trigger.
# Escalation configuration
escalation_rules:
  confidence_threshold: 0.75
  escalation_target: "team_queue"
  include_context: true
  include_reasoning: true
  timeout_minutes: 30
  fallback_action: "hold_and_notify_manager"
Scope-based escalation: Agent encounters a task outside its defined scope. It does not attempt the task at all.
Agent: "The customer is requesting a contract modification. Contract
modifications are outside my scope. Routing to legal review team."
Novelty-based escalation: Agent encounters a request or situation that does not match any pattern in its training or tool capabilities. Rather than improvising (which is what language models do by default), the agent recognises the novel scenario and escalates.
Safety-based escalation: Agent detects a potential compliance, legal, or safety issue and immediately stops. No ambiguity. The escalation explicitly states the safety concern and what the agent has (and has not) done with the data.
The escalation UX demonstrated:
When an agent escalates, the human reviewer receives:
- The original request or trigger
- What the agent attempted and where it got stuck
- The agent's best-guess recommendation (if it has one)
- All relevant context gathered during the agent's work
- One-click actions: approve the agent's recommendation, modify and approve, reject and provide alternative, or return to the agent with additional guidance
The critical design decision: Escalations should be actionable, not informational. The session was emphatic about this: if an escalation requires the human to re-research the problem from scratch, the agent has not escalated — it has abdicated. A good escalation hands the human a nearly-complete decision with all context attached.
Anti-pattern discussed: Escalation fatigue. If an agent escalates too frequently, humans stop reviewing carefully and rubber-stamp approvals. The session compared this to alert fatigue in operations — the fix is tuning the escalation thresholds based on observed false-positive rates, not pressuring users to review more diligently.
Pattern 3: Approval workflows
The third pattern covered structured approval workflows where agents propose actions that require explicit human sign-off before execution.
The distinction from escalation: Escalation is unplanned — the agent did not expect to need human help. Approval is planned — the agent knows from the outset that certain actions require human authorisation, and the workflow is designed around that requirement.
The demo scenario: A procurement agent that can autonomously approve purchase orders under £1,000 but must request approval for orders above that threshold.
The approval workflow:
- Agent receives a purchase request for £3,500
- Agent validates the request (budget available, vendor approved, specifications match)
- Agent prepares the purchase order with all validated details
- Agent submits the order for approval with a recommendation: "Approve — all validations passed, vendor has preferred status, delivery timeline meets project schedule"
- Approver receives the prepared order in their normal workflow (Teams notification, email, in-app notification)
- Approver reviews, approves with one click (or modifies/rejects)
- Agent executes the approved order
The design principles for approval workflows:
Pre-validate before requesting approval. The agent should do all possible validation before presenting the approval request. The approver should be reviewing a fully prepared action, not a half-formed request that requires additional research.
Provide a recommendation. The agent should state whether it recommends approval and why. This frames the approver's review around confirming or overriding the agent's judgement rather than making the decision from scratch.
Make approval frictionless. If the agent's recommendation is correct 95% of the time, the approval action should be one click. Do not force approvers through a multi-step form for routine approvals. Reserve detailed forms for cases where the agent flags concerns.
Support delegation. Approval workflows must handle the approver being unavailable — auto-delegation to backup approvers, timeout escalation, and out-of-office handling.
# Approval workflow configuration
approval_config = {
    "trigger": "purchase_order_above_threshold",
    "threshold": 1000.00,
    "currency": "GBP",
    "approvers": [
        {"primary": "line_manager", "timeout_hours": 4},
        {"backup": "department_head", "timeout_hours": 8},
        {"final_escalation": "finance_team", "timeout_hours": 24}
    ],
    "agent_recommendation": True,
    "auto_approve_if_expired": False,
    "notification_channels": ["teams", "email", "in_app"]
}
The approval UX demonstrated:
+----------------------------------------------------------+
| Agent Action Pending Approval                            |
|                                                          |
| Agent: Procurement Assistant                             |
| Proposed Action: Purchase order for Contoso Supplies     |
| Amount: £3,500.00                                        |
|                                                          |
| Validation Summary:                                      |
|   Budget available: Yes (£12,400 remaining in Q4 budget) |
|   Vendor approved: Yes (preferred supplier since 2023)   |
|   Specifications: Match requirements document            |
|   Delivery: Within project timeline                      |
|                                                          |
| Recommendation: APPROVE                                  |
|                                                          |
| [Approve] [Edit & Approve] [Reject] [Ask Agent]          |
+----------------------------------------------------------+
The critical UX decision: Edit & Approve. The session emphasised that approval should not be binary. Humans should be able to modify the agent's proposed action before approving it. "Approve" means "the agent got it right." "Edit & Approve" means "the agent got it mostly right, and I am making it fully right." "Reject" means "start over." And "Ask Agent" means "I need more information before deciding."
Pattern 4: In-app AI assistance
The fourth pattern shifted from agents acting autonomously to agents assisting users within an application context.
The distinction: Patterns 1-3 cover agents that do work on behalf of users. Pattern 4 covers agents that help users do work more effectively. The agent is not a delegate — it is a collaborator embedded in the user's workflow.
The demo scenarios:
Form completion assistance: A CRM application where the agent pre-fills form fields based on conversation context, previous entries, and organisational data. The user can accept, modify, or ignore each suggestion.
Data exploration: A business intelligence dashboard where the agent suggests analyses based on the data the user is viewing. "You are looking at Q3 revenue by region. Your team typically compares this with Q3 last year — shall I add that comparison?"
Document drafting: A contract management tool where the agent drafts clauses based on the deal parameters, previous similar contracts, and current policy. The human reviewer edits the draft, and the agent adapts its future drafting based on the edits.
The design principle that makes in-app assistance work: Suggestions must be ignorable without friction. If dismissing an agent suggestion requires clicking a button, closing a modal, or explaining why the suggestion was rejected, the assistance becomes an annoyance. The best in-app assistance is like autocomplete — present when useful, invisible when not.
The anti-pattern the session warned against: Clippy syndrome. An agent that interrupts the user's workflow with unsolicited suggestions is worse than no agent at all. The session was specific: in-app assistance should respond to context changes (the user opens a form, views a dataset, starts a document) rather than interrupting unprompted.
The honest assessment of in-app assistance: This is the pattern most likely to drive immediate adoption because it does not require users to trust agent autonomy. The agent assists within the user's existing workflow. The user maintains full control. The risk is minimal — bad suggestions are ignored, not executed. This is the on-ramp to broader agent adoption in organisations that are not yet comfortable with autonomous agents.
Pattern 5: Trust calibration over time
The fifth pattern was the most conceptually important: the idea that agent autonomy should increase over time as trust is established through observed behaviour.
The trust calibration model:
Phase 1 — Observation (weeks 1-4): Agent operates at Level 1 (fully supervised). Every action is a suggestion. Humans make all decisions. The system records agreement rates — how often the human's decision matches the agent's suggestion.
Phase 2 — Supervised autonomy (weeks 5-8): If agreement rate exceeds 90%, the agent is promoted to Level 2 (approval-gated) for the action categories where agreement was highest. Low-agreement categories remain at Level 1.
Phase 3 — Exception-based autonomy (weeks 9+): If approval rate at Level 2 exceeds 95% for a category, the agent is promoted to Level 3 (exception-based) for that category. The agent now acts autonomously within those boundaries, escalating only exceptions.
Phase 4 — Steady state: The agent operates at different autonomy levels for different action types. High-confidence, well-understood actions are Level 3. Novel or high-risk actions remain at Level 2. Truly sensitive decisions may stay permanently at Level 1.
# Trust calibration configuration
trust_model:
  initial_level: 1  # Fully supervised
  promotion_criteria:
    level_1_to_2:
      agreement_rate: 0.90
      minimum_observations: 50
      evaluation_period_days: 28
    level_2_to_3:
      approval_rate: 0.95
      minimum_observations: 100
      evaluation_period_days: 28
  demotion_criteria:
    error_threshold: 0.05  # Demote if error rate exceeds 5%
    review_period_days: 7
  category_specific: true  # Different levels for different action types
Why this model matters for enterprise adoption:
The biggest barrier to enterprise AI agent adoption is not technology — it is trust. Business leaders are reluctant to give agents autonomy because they cannot predict agent behaviour. The trust calibration model addresses this directly: start with zero autonomy, earn trust through demonstrated reliability, and always retain the ability to demote an agent that starts making errors.
The demotion mechanism is as important as the promotion mechanism. If an agent's error rate exceeds the configured threshold, it is automatically demoted to a lower autonomy level. This is the safety net that makes increasing autonomy acceptable to risk-conscious organisations.
The organisational implications nobody wants to discuss
The session's final section was its most important and most uncomfortable.
Agent oversight is someone's job
If agents work autonomously, someone must monitor them. Not occasionally glance at a dashboard, but actively oversee agent actions as part of their role. This has workforce planning implications. Do you hire "agent supervisors"? Do existing roles expand to include agent oversight?
The session's position: Agent oversight should be embedded in existing roles, not centralised in a new team. The sales manager who oversees her team's pipeline should also oversee the agent that analyses that pipeline. This requires training, tooling, and workload adjustment — not a new department.
Escalation volume determines ROI
If agents escalate 5% of decisions, they are useful. If they escalate 50% of decisions, they are generating work, not reducing it. The escalation rate is the single most important metric for agent ROI, and it is almost never discussed in agent marketing.
Scenario: Customer service agent handling 1,000 queries/day

Escalation rate: 5%
- Agent handles: 950 queries autonomously
- Human handles: 50 escalations (plus oversight of 950)
- Net human workload: Significantly reduced

Escalation rate: 50%
- Agent handles: 500 queries autonomously
- Human handles: 500 escalations (plus oversight of 500)
- Net human workload: Possibly increased
  (escalations require context-switching, which is more
  cognitively expensive than handling queries directly)
The insight: An agent with a high escalation rate can make human workload worse, not better. Every escalation requires the human to context-switch, understand the agent's partial work, make a decision, and potentially complete the task the agent started.
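The context-switching point can be made quantitative with a simple model: weight each escalation at more than one task-equivalent, because picking up an agent's partial work costs more than handling the query fresh. The 1.5 multiplier is illustrative, not a figure from the session:

```python
def net_human_tasks(queries: int, escalation_rate: float,
                    escalation_cost: float = 1.5) -> float:
    """Estimate human task-equivalents under an agent deployment.

    escalation_cost > 1 models the context-switching overhead of
    understanding the agent's partial work versus handling a query
    directly. (The 1.5 default is an assumption for illustration.)
    """
    escalations = queries * escalation_rate
    return escalations * escalation_cost
```

At 1,000 queries/day, a 5% escalation rate costs roughly 75 task-equivalents against a no-agent baseline of 1,000; at 50%, the cost is 750 and the agent is barely helping even before oversight time is counted.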
Approval fatigue is real
If every agent action requires approval, users will stop reading the approval requests and start clicking "Approve" reflexively. This is not hypothetical — it is the well-documented pattern behind alert fatigue in security operations.
The design guidance: Monitor approval times — instant approvals suggest the human is not actually reviewing. Periodically audit approved actions to catch reflexive approval patterns. Group low-risk approvals into batch reviews rather than individual notifications.
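The "monitor approval times" guidance lends itself to a simple heuristic: flag an approver when too large a share of their approvals happen faster than anyone could plausibly read the request. Both thresholds below are assumptions to be tuned per workflow, not numbers from the session:

```python
def flag_reflexive_approvals(review_seconds: list[float],
                             floor_seconds: float = 5.0,
                             reflexive_share: float = 0.5) -> bool:
    """Flag an approver whose review times suggest rubber-stamping.

    Heuristic: if more than `reflexive_share` of approvals complete in
    under `floor_seconds`, the approver is likely not reading requests.
    """
    if not review_seconds:
        return False
    quick = sum(1 for t in review_seconds if t < floor_seconds)
    return quick / len(review_seconds) > reflexive_share
```

A flagged approver is a signal to redesign the workflow (batch low-risk approvals, raise the agent's autonomy for that category), not a reason to reprimand the person; the fatigue is a design failure.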
Agent errors are different from human errors
When a human makes a mistake, it is one mistake. When an agent makes a systematic mistake, it makes the same mistake across every instance until someone catches it. Agent errors are not random — they are patterned. A prompt injection vulnerability, a misinterpreted policy, or a data quality issue will produce consistent, repeatable errors at machine speed.
The oversight implication: Monitoring agents requires looking for patterns, not individual errors. Pattern detection in agent actions is a different skill from reviewing individual decisions.
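Because agent errors are patterned, the monitoring primitive is a group-and-count over error records rather than case-by-case review. A minimal sketch, with hypothetical field names and an illustrative recurrence threshold:

```python
from collections import Counter

def systematic_error_categories(errors: list[dict],
                                min_count: int = 3) -> list[str]:
    """Surface (action type, failure reason) pairs that recur.

    Recurrence is the signature of a systematic fault (a misread policy,
    a data quality issue) rather than a one-off mistake. Field names
    "action" and "reason" are illustrative.
    """
    counts = Counter((e["action"], e["reason"]) for e in errors)
    return [f"{action}: {reason}"
            for (action, reason), n in counts.items() if n >= min_count]
```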
What I would add to these patterns
The session covered the core patterns well but missed several areas that matter in practice.
Feedback loops: The trust calibration model assumes errors are detected. But in many business processes, the correctness of an agent's action is not immediately apparent. An expense approval might be incorrect but not discovered until an audit months later. The patterns need mechanisms for retrospective feedback, not just real-time escalation.
Multi-stakeholder approval: The procurement demo showed single-approver workflows. Enterprise procurement typically requires multiple approvers (budget holder, compliance, legal for large contracts). Multi-stakeholder agent approval workflows are significantly more complex and were not addressed.
Cross-agent collaboration patterns: The session focused on human-to-agent collaboration. But in multi-agent systems, the collaboration patterns between agents and between different agents' human stakeholders also matter. If Agent A escalates to Human A, and Agent B escalates to Human B, and the escalations are related, who coordinates?
Cultural variation: The trust calibration model assumes a universal appetite for automation. In practice, different business units, geographies, and cultures have different comfort levels with agent autonomy. A UK finance team might be comfortable at Level 3 whilst a German counterpart demands Level 2 for regulatory reasons. The configuration model supports this, but the session did not discuss multi-tenant trust calibration.
The uncomfortable truth about human-in-the-loop
The session presented human-in-the-loop as a design feature. I want to push back slightly on the framing.
Human-in-the-loop is often a workaround for model unreliability, not a collaboration pattern. We add human review because agents make mistakes at rates that are unacceptable for business processes. If agents were reliable enough, many approval and escalation patterns would be unnecessary overhead.
The honest framing is: human-in-the-loop is necessary today because agent reliability is insufficient for fully autonomous operation in most business contexts. As agent reliability improves — better models, better guardrails, better validation — the optimal autonomy level will shift upward. The collaboration patterns presented in THR770 are well-designed for the current state of technology. They may not be the permanent architecture.
That said, some human-in-the-loop requirements are permanent, not temporary. Decisions with significant ethical implications, irreversible consequences, or regulatory requirements for human oversight will always require human review regardless of model capability. The challenge is distinguishing "human review because the model is not reliable enough yet" from "human review because this decision should always involve a human."
The verdict
THR770 packed more practical architectural value into 30 minutes than most full-length sessions. The five collaboration patterns — oversight, escalation, approval, in-app assistance, and trust calibration — form a coherent framework for deploying AI agents in enterprise contexts.
The trust calibration model is the standout contribution. Starting at full supervision, earning autonomy through demonstrated reliability, and maintaining the ability to demote underperforming agents addresses the primary adoption barrier for enterprise AI. It turns the vague question "how much should we trust the agent?" into a measurable, configurable, auditable process.
The patterns are strongest where they are most specific — the escalation taxonomy, the approval workflow design principles, the reasoning trace requirement. They are weakest where they remain abstract — cross-agent collaboration, multi-stakeholder approval, and retrospective feedback mechanisms.
For anyone building AI agents for enterprise deployment, these five patterns should be your starting framework. Not because they are perfect, but because they address the organisational and governance realities that pure technology solutions ignore.
What to watch
Escalation rate benchmarks: As agent deployments scale, watch for published data on escalation rates by domain. These numbers will determine which use cases are actually viable for agent automation versus which generate more work than they save.
Approval fatigue research: Academic and industry research on how users interact with agent approval workflows over time. Do approval patterns degrade? How quickly? What design interventions prevent reflexive approval?
Agent oversight tooling: Purpose-built tools for non-technical users to oversee agents. Current observability tools are developer-focused. The market gap is tools for business users who need to monitor agents without understanding traces and spans.
Regulatory guidance on agent oversight: Watch for GDPR, SOX, and industry-specific regulators issuing guidance on agent oversight requirements. This will shape how approval and escalation patterns are implemented in regulated industries.
Organisational readiness assessments: Frameworks for evaluating whether an organisation is ready to progress from agent-as-assistant to agent-as-delegate. This is a change management challenge, not a technology challenge.
Related Coverage:
- Agent-Framework Unified Platform — LAB513 Series
- AI Agent Governance
- Building Multi-Agent Systems with Azure AI Foundry
- Multi-Agent MCP Application with Azure Cosmos DB