Cloud Operations
AKS AI Ops - Self-Healing Kubernetes with AI Agents
Hands-on lab notes from LAB517-R1: AKS AI Ops.
The promise: Kubernetes that fixes itself
Every Kubernetes operator has lived the same nightmare. It is 2am, PagerDuty fires, and you are staring at a wall of pod evictions caused by a node that silently ran out of memory. You drain the node, scale the pool, watch the pods reschedule, and go back to bed knowing it will happen again next week.
LAB517-R1 at Ignite 2025 proposed a different future: AKS clusters that detect problems, diagnose root causes, and heal themselves using AI agents. This was not a slide deck session. It was a hands-on lab where participants simulated production failures and deployed AI-driven agents to handle them autonomously. Having managed AKS clusters in production for years, I walked in sceptical and walked out genuinely impressed by the direction -- though with clear reservations about production readiness.
Lab structure: From chaos to self-healing
The lab followed a deliberate progression that mirrors how you would actually adopt AI ops in production: start with observability, add intelligence, then extend to autonomous action.
Phase 1: Simulating production chaos
The lab began by deploying a sample application to AKS and then deliberately breaking it. The traffic simulation was well-designed -- not a simple load test, but a realistic pattern of traffic spikes combined with resource constraints that exposed the kind of cascading failures you see in real production environments.
Scenarios simulated:
- Sudden traffic spikes overwhelming pod resource limits
- Node memory pressure causing pod evictions across multiple workloads
- Slow cascading failures where upstream timeouts propagate downstream
- Resource quota exhaustion preventing new pod scheduling
Why the simulation matters: Most Kubernetes demos show clean failures with obvious causes. Production failures are messy. A pod crash might be caused by a memory leak, a noisy neighbour on the same node, a misconfigured resource request, or a genuine traffic spike. The lab's simulation captured that ambiguity, which is essential for evaluating whether AI-driven detection actually works.
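The lab's exact simulation tooling was not published, but the traffic shape it used -- a slowly varying baseline with a sudden multiplicative spike layered on top -- is easy to sketch. Everything below (function name, parameters, spike timing) is a hypothetical illustration, not the lab's code:

```python
import math
import random

def traffic_pattern(minute: int, baseline_rps: float = 200.0,
                    spike_start: int = 30, spike_peak: float = 5.0) -> float:
    """Requests/sec at a given minute: a daily-style sine baseline with a
    10-minute multiplicative spike layered on top, plus small jitter."""
    daily = baseline_rps * (1.0 + 0.3 * math.sin(2 * math.pi * minute / 1440))
    spike = spike_peak if spike_start <= minute < spike_start + 10 else 1.0
    jitter = random.uniform(0.95, 1.05)
    return daily * spike * jitter

# Inside the spike window, load is several times the quiet baseline --
# enough to push pods past their resource limits.
quiet = traffic_pattern(5)
spiky = traffic_pattern(35)
```

Feeding a pattern like this into a cluster whose pods have tight resource limits reproduces the cascading-eviction behaviour the lab relied on.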
Phase 2: AI-driven alerting and detection
With failures underway, the lab introduced AI-driven monitoring that goes beyond standard Kubernetes metrics. Traditional AKS monitoring gives you CPU, memory, and pod status. Useful, but not intelligent. You still need an engineer to correlate those signals and determine what is actually wrong.
What the AI layer adds:
- Pattern correlation -- Instead of individual alerts for each symptom (high CPU on node A, pod evictions on node B, request timeouts on service C), the AI correlates these into a single incident: "Memory pressure on node pool 'system' causing cascading pod evictions affecting services A, B, and C"
- Anomaly detection -- Distinguishing between normal traffic variation and genuine anomalies based on historical patterns, not static thresholds
- Predictive alerting -- Identifying resource exhaustion trajectories before they cause failures. "Node pool will exhaust memory in approximately 12 minutes at current consumption rate"
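The predictive alert quoted above ("approximately 12 minutes") reduces, at its simplest, to a linear projection of current consumption against remaining capacity. A minimal sketch, with hypothetical numbers chosen to reproduce that 12-minute figure:

```python
from typing import Optional

def minutes_until_exhaustion(capacity_mib: float, used_mib: float,
                             rate_mib_per_min: float) -> Optional[float]:
    """Linear projection of when a node pool exhausts memory.
    Returns None if usage is flat or falling."""
    if rate_mib_per_min <= 0:
        return None
    return (capacity_mib - used_mib) / rate_mib_per_min

# e.g. a 64 GiB pool with 58 GiB used, growing ~512 MiB/min -> 12 minutes
eta = minutes_until_exhaustion(65536, 59392, 512)
```

Real implementations fit the rate over a window rather than using an instantaneous slope, but the alert semantics are the same: fire before the failure, not after.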
The practical difference: In a traditional setup, the traffic spike scenario would generate dozens of individual alerts. Engineers waste the first 15 minutes just triaging which alerts are symptoms and which represent root causes. The AI layer compressed that triage into a single, contextual alert with a confidence score and recommended action.
The critical question: How well does anomaly detection work for clusters with irregular workload patterns? Batch processing clusters, clusters running ML training jobs, clusters with periodic data pipeline spikes -- these all have "normal" patterns that look anomalous to naive AI. The lab used a relatively uniform workload, which is the easy case. Production clusters are rarely that clean.
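To make the concern concrete, here is a toy history-based detector (a simple z-score, which is far cruder than anything the lab would deploy). The point it illustrates: with a spiky batch-workload history, the variance inflates and a new spike is not flagged, whereas the same value against a flat history is. All data below is invented:

```python
from statistics import mean, stdev

def is_anomalous(history: list, current: float, z_cutoff: float = 3.0) -> bool:
    """Flag a value only if it deviates strongly from the workload's own
    history, rather than from a fixed static threshold."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_cutoff

# A nightly batch job's CPU spike looks "normal" once history contains spikes...
batch_history = [10, 12, 11, 95, 10, 13, 96, 11]   # periodic batch runs
# ...but the identical value against a flat web-serving history is anomalous.
flat_history = [10, 11, 10, 12, 11, 10, 11, 12]
```

A detector that handles both cases needs seasonality-aware baselines, which is exactly where irregular production workloads get hard.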
The aks-mcp server: Natural language meets kubectl
The most technically interesting component of the lab was the aks-mcp server -- an open-source Model Context Protocol server that enables natural-language interaction with AKS clusters.
What aks-mcp actually does
MCP (Model Context Protocol) provides a standardised way for AI agents to interact with external systems. The aks-mcp server implements this protocol for Azure Kubernetes Service, exposing cluster operations as capabilities that an AI agent can discover and invoke.
Instead of this:
kubectl get pods -n production --field-selector=status.phase!=Running -o json | \
jq '.items[] | {name: .metadata.name, status: .status.phase, reason: .status.containerStatuses[0].state.waiting.reason}'
You can say this:
"Show me all failing pods in the production namespace and explain why each one is failing."
And the agent returns:
3 pods in 'production' are not running:
1. order-processor-7b4d9-x2k4f — CrashLoopBackOff
- Container 'processor' OOMKilled 4 times in last 30 minutes
- Current memory limit: 256Mi, peak usage before kill: 254Mi
- Recommendation: Increase memory limit to 512Mi
2. cache-warmer-5c8a2-m9j3h — ImagePullBackOff
- Image 'acr.azurecr.io/cache-warmer:v2.3.1' not found
- Tag v2.3.1 does not exist in registry
- Recommendation: Check deployment manifest, latest available tag is v2.3.0
3. metrics-collector-9d1e6-p4n7q — Pending
- Insufficient CPU: requested 500m, available on schedulable nodes: 0
- Node pool 'workload' at capacity (5/5 nodes)
- Recommendation: Scale node pool or reduce CPU request
Beyond kubectl translation
The aks-mcp server is not just a natural-language wrapper around kubectl. It combines cluster state queries with contextual intelligence:
- Cross-references events with metrics -- "This pod was OOMKilled" becomes "This pod was OOMKilled because its actual memory consumption has been trending upward since deployment v2.3.0 three days ago"
- Suggests actions with impact analysis -- "Scaling this node pool will add approximately 4 minutes of provisioning time and increase monthly cost by an estimated $X"
- Understands cluster topology -- "This node pool hosts 3 critical services; draining it will cause 2 minutes of degraded performance for the payment processing pipeline"
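The event-plus-metrics cross-referencing in the first bullet is, mechanically, a join: split the pod's memory series at the deployment timestamp and compare the before/after averages. A rough sketch under invented names and thresholds:

```python
def enrich_oom_event(pod: str, memory_series: list,
                     deploy_version: str, deploy_time: str) -> str:
    """Combine a raw OOMKilled event with a memory trend to produce a
    contextual diagnosis. memory_series is (iso_timestamp, mib) pairs;
    ISO-8601 strings compare correctly as plain strings."""
    before = [m for t, m in memory_series if t < deploy_time]
    after = [m for t, m in memory_series if t >= deploy_time]
    if before and after and sum(after) / len(after) > 1.2 * (sum(before) / len(before)):
        return (f"{pod} was OOMKilled; memory consumption has trended upward "
                f"since deployment {deploy_version} at {deploy_time}")
    return f"{pod} was OOMKilled"
```

The 1.2x "trending upward" cutoff is arbitrary here; the value is in the join itself, which turns a bare event into a hypothesis about cause.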
The operational use case
For experienced Kubernetes operators, the value is not in avoiding kubectl commands. We can write those in our sleep. The value is in reducing cognitive load during incidents. During a complex incident affecting multiple services across multiple node pools, being able to ask "What changed in the last 30 minutes that could explain this failure pattern?" and getting a synthesised answer is genuinely faster than running a dozen kubectl and az commands and correlating the output mentally.
For less experienced team members, the value is more fundamental: they can investigate and potentially resolve Kubernetes issues without deep kubectl expertise. That has real implications for on-call rotation depth.
Self-healing agents: The autonomy spectrum
The lab's self-healing component deployed agents that could detect and remediate specific failure scenarios without human intervention.
What the agents handled autonomously
Node-level healing:
- Detecting nodes with persistent resource pressure
- Cordoning affected nodes to prevent new scheduling
- Draining workloads to healthy nodes
- Triggering node pool scale-up to maintain capacity
- Uncordoning nodes after resource pressure resolves
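The ordering of those node-level steps matters (cordon before drain, scale before capacity is needed, uncordon only after pressure clears). A sketch of the decision logic as a pure function -- action names and signature are hypothetical, and a real agent would issue Kubernetes API calls instead of strings:

```python
def heal_node(node: str, under_pressure: bool,
              pool_size: int, pool_max: int) -> list:
    """Return the ordered actions for the node-level healing flow:
    cordon -> drain -> scale up (within the pool ceiling), or uncordon
    once the resource pressure has resolved."""
    actions = []
    if under_pressure:
        actions.append(f"cordon {node}")          # stop new scheduling
        actions.append(f"drain {node}")           # move workloads off
        if pool_size < pool_max:                  # respect the guardrail
            actions.append(f"scale pool to {pool_size + 1}")
    else:
        actions.append(f"uncordon {node}")        # return node to service
    return actions
```

Keeping the decision separate from execution is also what makes dry-run mode cheap: the same function output can be logged as a recommendation instead of executed.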
Pod-level healing:
- Detecting CrashLoopBackOff patterns and correlating with resource limits
- Adjusting resource requests/limits within defined bounds
- Restarting pods with transient failures (not crash loops)
- Scaling deployments up/down based on traffic patterns
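"Adjusting resource limits within defined bounds" implies a bounded policy rather than an open-ended one. A plausible sketch (the doubling rule, peak multiplier, and ceiling are all hypothetical policy choices, not the lab's):

```python
def next_memory_limit_mib(current: int, peak: int, ceiling: int = 2048) -> int:
    """Propose a raised memory limit after repeated OOMKills: at least
    double the current limit or 1.5x the observed peak, whichever is
    larger, but never above the configured ceiling."""
    proposed = max(current * 2, int(peak * 1.5))
    return min(proposed, ceiling)

# A 256Mi limit with a ~254Mi peak before the kill -> propose 512Mi,
# matching the recommendation in the earlier aks-mcp diagnosis.
```

The ceiling is the important part: without it, an agent chasing a genuine memory leak would ratchet the limit upward forever instead of surfacing the leak to a human.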
Cluster-level healing:
- Adjusting Horizontal Pod Autoscaler (HPA) configurations based on observed patterns
- Scaling node pools proactively based on predicted demand
- Rebalancing workloads across node pools after scaling events
The guardrails that matter
The lab was careful to constrain autonomous action, and this is where the design shows maturity:
- Action boundaries -- Agents cannot delete namespaces, modify RBAC, or change network policies
- Resource ceilings -- Scaling operations have configurable maximums (e.g., node pool cannot exceed 10 nodes)
- Cost guardrails -- Estimated cost impact is calculated before scaling actions
- Approval gates -- High-impact actions (scaling beyond threshold, modifying production deployments) require human approval
- Dry-run mode -- Agents can operate in advisory mode, recommending actions without executing them
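Those five guardrails compose naturally into an ordered policy check: hard boundaries first, then resource ceilings, then dry-run, then cost-based approval gates. A sketch with invented thresholds and action names:

```python
def evaluate_action(action: str, cost_delta_usd: float, new_pool_size: int,
                    pool_max: int = 10, cost_limit_usd: float = 500.0,
                    dry_run: bool = False) -> str:
    """Apply the guardrails in order of severity and return a verdict:
    'deny', 'recommend_only', 'needs_approval', or 'execute'."""
    forbidden = {"delete_namespace", "modify_rbac", "change_network_policy"}
    if action in forbidden:                 # action boundaries
        return "deny"
    if new_pool_size > pool_max:            # resource ceilings
        return "deny"
    if dry_run:                             # advisory mode
        return "recommend_only"
    if cost_delta_usd > cost_limit_usd:     # cost guardrail -> approval gate
        return "needs_approval"
    return "execute"
```

Ordering the checks this way means a dry-run agent still exercises the hard boundaries, so advisory-mode logs reveal what would have been denied in production.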
The practitioner's view: These guardrails are essential and well-designed. But the lab environment is controlled. In production, the interesting failures are the ones the guardrails did not anticipate. What happens when a self-healing action triggers a different failure? What happens when two agents recommend conflicting actions? What happens when the "correct" healing action is to do nothing and let a deployment rollback complete?
The lab acknowledged these edge cases exist but did not deeply explore them. For a hands-on session, that is acceptable. For production deployment, those edge cases define whether self-healing agents are trustworthy.
Pre-built MCP integrations: The ecosystem play
Beyond the aks-mcp server, the lab demonstrated integration with pre-built MCP servers for broader operational tooling:
- Azure Monitor MCP -- Querying metrics and logs via natural language
- Azure Resource Manager MCP -- Managing Azure resources that support the Kubernetes cluster
- GitHub MCP -- Correlating cluster issues with recent code deployments
The integration value: In production, Kubernetes issues rarely exist in isolation. A pod crash might be caused by a bad deployment (GitHub), detected via metrics anomaly (Azure Monitor), and require infrastructure scaling (ARM). The MCP integrations allow the AI agent to investigate across these boundaries without switching tools.
The open question: MCP is an open standard, but the implementations demonstrated were Azure-specific. If your operational toolchain includes Prometheus, Grafana, ArgoCD, or other non-Azure components, the MCP integration story is less clear. The protocol supports third-party implementations, but the ecosystem maturity is early.
What this means for AKS operations
The shift in operational model
Traditional AKS operations follow a reactive pattern: monitor, alert, investigate, fix. AI-driven operations shift this to: predict, prevent, auto-remediate, report. That is a meaningful change in how teams structure on-call rotations, incident response playbooks, and capacity planning.
The skills evolution
Self-healing Kubernetes does not eliminate the need for Kubernetes expertise. It shifts where that expertise is applied. Instead of spending time on routine node drains and pod restarts, engineers focus on:
- Designing workloads that are amenable to self-healing (proper resource requests, health probes, graceful shutdown)
- Configuring agent guardrails that match their risk tolerance
- Tuning detection thresholds for their specific workload patterns
- Building the observability that feeds the AI layer
The trust-building process
The lab implicitly demonstrated the right adoption path:
- Deploy in observe-only mode -- Let the agents detect and recommend without acting
- Validate recommendations -- Compare agent suggestions against what your team would have done
- Enable low-risk automation -- Start with pod restarts and node pool scaling
- Gradually extend scope -- Add more autonomous actions as confidence builds
- Monitor the monitors -- Track agent action success rates and false positive rates
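The "monitor the monitors" step needs concrete numbers to gate each stage of that progression. A minimal scorecard sketch -- the outcome categories are hypothetical labels you would assign during post-incident review:

```python
def agent_scorecard(outcomes: list) -> dict:
    """Summarise reviewed agent actions into the two rates that gate
    expanding autonomy: success rate and false-positive rate.
    Outcomes are labels like 'resolved', 'failed', or 'unnecessary'
    (the agent acted when doing nothing was correct)."""
    total = len(outcomes)
    if total == 0:
        return {"success_rate": None, "false_positive_rate": None}
    return {
        "success_rate": outcomes.count("resolved") / total,
        "false_positive_rate": outcomes.count("unnecessary") / total,
    }
```

A team might, for example, require a sustained success rate above some agreed bar in observe-only mode before enabling the low-risk automations in step 3.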
This is the same process you would use to onboard a new team member to on-call: observe, shadow, handle easy issues, gradually take on complexity. The mental model is right.
The honest assessment
What worked well in the lab
- The traffic simulation was realistic enough to be credible
- The aks-mcp server genuinely reduced investigation time for complex scenarios
- Self-healing agents handled the scripted failure scenarios cleanly
- The guardrail design showed thoughtful constraint of autonomous action
What remains unproven
- Performance on real production clusters with heterogeneous workloads
- Agent behaviour during novel failure modes outside training patterns
- Interaction between multiple agents and potential conflicting recommendations
- Long-term learning and adaptation to evolving cluster configurations
- Cost implications of running AI agents alongside existing monitoring tooling
Who should pay attention
Teams running AKS at scale with frequent operational incidents will see immediate value from the AI detection and diagnosis capabilities. The self-healing agents are compelling for well-understood failure patterns. The aks-mcp server is useful today, regardless of whether you adopt the full AI ops stack.
Teams running small clusters with infrequent incidents should wait. The overhead of deploying and configuring AI agents does not justify the investment unless your incident volume and engineering costs create a clear ROI case.
Bottom line
LAB517-R1 demonstrated that AI-driven Kubernetes operations have moved from concept to working implementation. The aks-mcp server is genuinely useful as a natural-language interface to cluster management. The self-healing agents are promising but need production hardening. The detection and correlation layer is the most immediately valuable component.
The direction is right. Kubernetes operational toil -- the repetitive node drains, pod restarts, scaling adjustments, and resource tuning -- is exactly the kind of pattern-based work that AI can handle. The question is not whether AI will manage Kubernetes clusters. It is how quickly the tooling matures to the point where experienced operators trust it with production workloads.
Based on what this lab demonstrated, that maturity is closer than I expected.
Related coverage:
- Azure SRE Agent: AI-Powered SRE
- Azure SRE Agent Deep Dive: Pricing and ROI
- Microsoft Ignite 2025 Keynote Review
Session: LAB517-R1 | Nov 21, 2025 | Moscone West, Level 3, Room 3001