Azure AI Platform
Foundry Local: Microsoft's AI democratization play from cloud to edge
Microsoft Foundry Local brings AI inference to personal devices, mobile phones, and edge infrastructure—addressing privacy concerns, latency requirements, cost optimization, and offline scenarios that cloud-only AI cannot solve.
Session demonstrated expansion from Windows/macOS (4.5B+ devices) to Android mobile and Kubernetes edge deployments. PhonePe case study shows on-device transaction insights for 600M+ users under strict Indian financial regulations. Live demo: Foundry Local containers with GitOps-driven model versioning on GPU-enabled edge clusters.
Session context
Foundry Local: Bringing AI from Cloud to Edge
Speakers:
- Lior Kamrat (Microsoft)
- Sam Kemp (Microsoft, Product Manager - Foundry Local)
- Welton (PhonePe, Product Manager - Consumer Platforms)
- Dor Yitzhak (Microsoft, Azure Arc)
When: November 19, 2025
Where: Microsoft Ignite 2025, San Francisco
Format: Technical session with live demos
Session focus:
"Microsoft Foundry Local enables AI on your phone, laptop, desktop, in containers, and on your own infrastructure. Outside of the public cloud. Because we believe that local AI is critical for democratizing AI, making AI available to every single person on the planet."
The four use cases driving local AI
Use case 1: Privacy and security
Challenge:
Highly sensitive data—financial records, healthcare information—that cannot traverse the internet but still requires AI processing.
Solution:
Data never leaves user's device. AI inference happens locally. No round-trip to cloud.
Enterprise example:
Financial services firms handling trade communications. Specific, sensitive data that requires AI-powered translation into actual trades without exposing data to cloud services.
Use case 2: Latency
Challenge:
Instantaneous responses required. Speaking into device expecting real-time translation or transcription appearing immediately.
Solution:
Local inference eliminates network latency. Model responds within milliseconds on-device.
Consumer example:
Voice prompting travel assistant on device. Audio transcription happens instantly without waiting for cloud response.
Use case 3: Cost efficiency
Challenge:
Many inference calls happening throughout the month. Need to optimize costs using combination of cloud AI and local AI.
Solution:
Route simple queries to local models. Reserve cloud for complex reasoning requiring larger models.
Operational pattern:
Hybrid cloud-local architecture. Application decides routing based on query complexity and cost thresholds.
Use case 4: Bandwidth constraints and offline scenarios
Challenge:
Spotty airplane Wi-Fi. Warehouses and distribution centers in remote locations. Need AI functionality regardless of connectivity.
Solution:
Fully functional AI on-device. Works without internet connection.
Real-world scenario:
Traveling on an airplane with unreliable Wi-Fi. Still want the AI travel assistant answering questions about the destination without connection errors.
Foundry Local architecture
Cross-platform stack
Foundation: ONNX Runtime
- Introduced 2017
- High-performance inference engine
- Cross-platform (Windows, macOS, Android, Linux, Kubernetes)
- Supports multiple device architectures
Foundry Local Management Service
Responsibilities:
- Hosting and managing models on local device
- Bringing models from Foundry catalog on cloud to device
- Developers don't package models into application (keeps app size small)
- Models downloaded on-demand from cloud catalog
Developer interfaces:
Foundry Local CLI:
- Exploring models
- Testing which model provides right accuracy, response time, performance
- Prototyping phase
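During that prototyping loop, a CLI session might look like the following (command names follow the public Foundry Local CLI; the model alias is illustrative):

```shell
# List models available in the Foundry catalog for this device
foundry model list

# Download a model and chat with it interactively to gauge accuracy and latency
foundry model run phi-3.5-mini

# Confirm the local inference service is running
foundry service status
```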
Foundry Local SDK:
- Integrating Foundry Local into application
- Production deployment
- Simplified from earlier versions (more on this below)
Platform-specific optimizations
Windows:
Foundry Local available as part of Microsoft Foundry on Windows. Uses Windows ML to intelligently select execution provider (NPU, GPU, CPU) based on device capabilities.
macOS:
Same developer experience as Windows. Abstracts heterogeneous hardware underneath.
Android:
New at Ignite 2025. Expands reach across billions of mobile devices.
Key advantage:
As developer, experience identical across platforms. Windows and macOS ML layers abstract hardware heterogeneity. Rich ecosystem supporting NPU, GPU, CPU—but developer doesn't worry about exact silicon on user device.
New announcements at Ignite 2025
1. Foundry Local on Android
Reach: Billions of Android devices now addressable
Use case: On-device AI for mobile apps without cloud dependency
Example: PhonePe financial insights (detailed case study below)
2. Speech-to-text models support
First in class: The first local AI platform to support speech-to-text on-device
Capability: Voice prompting AI applications. More natural interaction.
Model: Whisper running locally on device
Technical advantage: Audio transcription without sending audio to cloud. Privacy-preserving voice interfaces.
3. Improved SDK
Design principles:
Simple deployment:
- Single library import (foundry_local)
- No extra dependencies or installers
- Model inference happens within app
Small component size:
- SDK adds ~20MB to application size
- Faster download for end users
- Meets app store size constraints
Familiar cross-platform APIs:
- OpenAI-compatible input/output format
- Seamless cloud-to-edge transition
- Same code structure for cloud or local models
Beyond chat completions:
- Audio transcription support
- Image processing capabilities
- Multimodal on-device inference
Momentum and customer adoption
Scale achieved
4.5 billion+ Windows and macOS devices within Foundry Local's reach
Model support expanded:
- GPT-4o-small
- Phi models
- Llama models
- Custom fine-tuned models
NPU coverage:
- Almost every major NPU in market now supported
- Qualcomm, Intel, AMD, Apple Silicon
Customer integrations
Enterprise:
- Fidelity (wealth management)
- Great Lakes (trade communications for institutional customers)
- Dell Technologies (integrated into their applications)
Developer tools:
- Anything LLM (local AI assistant)
Mobile:
- PhonePe (600M+ users in India)
Live demo: Travel assistant with audio transcription
Scenario
Sam Kemp demonstrated adding speech-to-text to travel assistant app with minimal code changes.
Application: Travel assistant for British family booking holiday to Tresco island
Critical holiday requirements (for British travelers):
- Is there a pub?
- Can you get fish and chips?
Initial state
Chat completions working: Using cloud LLM for text queries
Audio disabled: Microphone button disabled when Wi-Fi unavailable
Problem: Users hit "cannot connect" errors when offline but still need AI functionality
Adding audio transcription
Code required: 6 lines
Step 1: Load Whisper model
# Load the Whisper model onto the device, into memory
whisper_model = load_model("whisper-small")
# Create an audio client that speaks the OpenAI-compatible format
audio_client = OpenAIAudioClient()
Step 2: Transcribe audio
# Stream the audio file through the local Whisper model
response = audio_client.transcribe_audio_streaming(audio_file)
# Response comes back in the OpenAI-compatible format
Key technical detail:
Input/output format matches OpenAI API. Switching from cloud Whisper to local Whisper requires no format changes. Same code works for cloud or edge deployment.
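A minimal sketch of that cloud-to-edge switch, assuming the local service exposes an OpenAI-compatible endpoint (the port and placeholder key below are illustrative, not the SDK's actual defaults):

```python
def client_kwargs(use_local: bool,
                  local_endpoint: str = "http://localhost:5273/v1") -> dict:
    """Build kwargs for an OpenAI-compatible client; only the endpoint changes."""
    if use_local:
        # Foundry Local serves an OpenAI-compatible API on the device;
        # no real API key is needed for a loopback endpoint.
        return {"base_url": local_endpoint, "api_key": "local"}
    # Empty kwargs fall through to the cloud defaults (OPENAI_API_KEY from env).
    return {}

# Application code stays identical either way:
#   client = OpenAI(**client_kwargs(use_local=offline_mode))
#   client.audio.transcriptions.create(...)
```

The design choice mirrors the session's claim: routing is a configuration decision, not a code rewrite.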
Demo result
Voice prompt: "Is there a pub on Tresco?"
Transcription: Happened instantly on-device
Chat response: Sent transcribed text to local chat model, received answer
User experience: Seamless AI interaction even when offline
Developer takeaways
No user installation required: Everything bundled in application
Small binary: SDK adds ~20MB to app size
Easy integration: 6 lines of code to add audio transcription
Cloud-to-edge transition: Same API format, seamless migration
PhonePe case study: On-device financial insights for 600M+ Indians
PhonePe background
Scale:
- One of India's largest fintech platforms
- 600M+ users
- Think: Cash App + Venmo + Credit Karma + Stripe + Robinhood combined
- Mission: Digital payments and financial services for every Indian
Regulatory environment:
India has strict regulations on data storage, usage, and sharing. Financial data extremely personal and critical. Cannot compromise on data security, data privacy, and reliability.
The challenge
User need: Derive meaningful insights from transaction history
Technical constraints:
- All offerings exposed as mobile app
- Need edge deployment (on-device AI)
- Highly regulated industry
- Data cannot leave device
- Must work offline
Business requirement:
Users transact multiple times daily. Want to understand spending patterns without manual analysis. Where is extra money going? Utility payments up? Unusual expenses?
Foundry Local implementation
Architecture:
Foundry Local app for Android:
- Installed as part of simple setup process
- Pick standard models or custom fine-tuned models
- Run inference on-device
- PhonePe app leverages for transaction insights
User experience:
Transaction insights:
- "How much did I spend on electricity this month?" → Model detects recent bill significantly higher than usual
- "How much money did I send to [person]?" → Model analyzes transaction history
- Rich visualizations of spending patterns
- Natural language queries over personal financial data
Privacy preservation:
All analysis happens on-device. Transaction data never leaves phone. Foundry Local processes locally. Insights generated without cloud upload.
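Insight queries like these reduce to on-device aggregation plus a small model for language understanding. A toy sketch of the aggregation half, with invented transaction data (nothing here is PhonePe's actual schema):

```python
from datetime import date

# Hypothetical on-device transaction log: (category, date, amount in rupees)
transactions = [
    ("electricity", date(2025, 10, 5), 1200),
    ("electricity", date(2025, 11, 4), 2100),
    ("groceries",   date(2025, 11, 10), 800),
]

def monthly_spend(category: str, year: int, month: int) -> int:
    """Sum spend for one category in one month, entirely on-device."""
    return sum(amt for cat, d, amt in transactions
               if cat == category and d.year == year and d.month == month)

current = monthly_spend("electricity", 2025, 11)   # 2100
previous = monthly_spend("electricity", 2025, 10)  # 1200
# The local model would then phrase the finding ("your electricity bill is
# noticeably higher than last month") -- no transaction leaves the phone.
```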
Technical enablement
Full spectrum of AI capabilities:
Standard models: Out-of-the-box models from Foundry catalog
Custom models: PhonePe fine-tunes own models for financial domain
Data layers: Seamless data handling at the scale PhonePe operates
Partnership value:
Not just technology provider, but roadmap collaboration. PhonePe and Microsoft working together on technical direction, experimenting with emerging technologies.
Welton (PhonePe Product Manager):
"We were looking for partners who could really solve a bunch of different problems for us, and Foundry Local has really helped us tackle the most complex problem that we have. As a partner, Foundry Local has enabled the full spectrum of AI capabilities for us."
Edge deployment: Foundry Local on Kubernetes with Azure Arc
The edge use case
Customer requirements:
1. Bring your own Kubernetes distribution
- Don't break existing patterns
- DevOps teams, developers, platform engineers work in certain way
- Foundry Local must complement, not replace
2. Microservices architecture
- Container as atomic unit in Kubernetes
- Foundry Local needs containerized distribution
3. Rich Kubernetes ecosystem
- Leverage existing tools and patterns
- Complement ecosystem, not compete
4. Bring your own model
- Custom fine-tuned models
- LoRA adapters
- Proprietary models not in catalog
5. Unified control plane
- Azure Arc provides single control plane across edge deployments
Foundry Local containerized architecture
What's containerized:
Foundry Local SDK: Bundled as OCI-compliant container
Models: Packaged as OCI artifacts (not just Docker images)
Deployment target: Any Kubernetes distribution (connected, disconnected, air-gapped)
Key technical detail:
OCI (Open Container Initiative) registry enables bundling different file formats together. Model files packaged as OCI artifacts alongside container image.
Live demo: GitOps-driven model versioning
Setup:
Consumer-grade edge devices running single-node Kubernetes clusters on Ubuntu with NVIDIA RTX 4080 GPUs.
Architecture:
GitOps pattern:
- Helm release in Git repository (source of truth)
- Flux GitOps operator watches repository
- Automated deployment when configuration changes
OCI registry:
- Holds Foundry Local container
- Holds model packaged as OCI artifact
- Llama model versions (v1, v2)
Demo flow:
Phase 1: Initial deployment
- Kubernetes cluster with Foundry Local pod
- OpenWebUI pod for interaction
- Llama model v1 in Foundry Local cache folder
- Model ready for inference
Phase 2: Model upgrade
- Push Llama model v2 to OCI registry using ORAS tool
- Update Helm release in Git repo to point to v2
- GitOps operator detects change
- Initiates rolling upgrade of pods
- New pod spins up with v2 model
- Old pod terminates
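The Git-side change driving that upgrade can be sketched as a Flux HelmRelease (field names follow the Flux v2 HelmRelease API; the chart name, registry URL, and value keys are invented for illustration):

```yaml
# Illustrative HelmRelease: committing the tag bump from v1 to v2
# is what triggers the rolling upgrade.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: foundry-local
  namespace: ai
spec:
  interval: 5m
  chart:
    spec:
      chart: foundry-local
      sourceRef:
        kind: HelmRepository
        name: edge-registry
  values:
    model:
      artifact: oci://registry.example.com/models/llama
      tag: v2   # was v1; Flux detects the commit and rolls the pods
```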
Technical detail:
ORAS (OCI Registry As Storage): an open-source tool, originated at Microsoft and now a CNCF project, for pushing and pulling OCI artifacts. Think of it as the Docker CLI for arbitrary artifacts rather than for running containers.
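Pushing a model version as an OCI artifact might look like this with ORAS (registry URL, repository path, and file name are illustrative):

```shell
# Authenticate to the OCI registry (same credential model as docker login)
oras login registry.example.com

# Push the model file as an OCI artifact, tagged v2
oras push registry.example.com/models/llama:v2 model.onnx

# Pull it back on an edge node to verify the artifact round-trips
oras pull registry.example.com/models/llama:v2
```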
GPU utilization monitoring:
Demo showed nvidia-smi monitoring GPU usage during inference. Model configured to use CUDA libraries, hammering GPU during question answering (nostalgic 80s cartoon trivia: Optimus Prime vs. Megatron).
Production patterns demonstrated
Disconnected scenarios: Works air-gapped, no internet required
Sovereignty compliance: European customers with data residency requirements
Rolling upgrades: Zero-downtime model versioning
GPU acceleration: Automatic execution provider selection (CUDA in demo)
GitOps automation: Infrastructure-as-code for model deployments
SDK improvements: Developer experience
Four design principles
1. Simple deployment
Developers: Easy to deploy AI-infused applications
End users: Easy to consume AI-infused applications
Implementation: Single library import. No complex dependency chains.
2. Small component size
Why it matters:
- Faster download for end users
- App stores have size constraints
- Better application experience
Result: SDK adds ~20MB to application
3. Familiar cross-platform APIs
OpenAI-compatible format: Seamless cloud-to-edge transition
Example:
# Same code works for cloud or local
audio_client = OpenAIAudioClient()
response = audio_client.transcribe(audio_file)
# Format identical whether using OpenAI cloud or Foundry Local
4. Beyond chat completions
Not just text chat. Audio, image, multimodal capabilities on-device.
What wasn't demonstrated
Cost modeling for hybrid cloud-local
Challenge:
Session emphasized cost efficiency as use case #3, but no concrete guidance on:
- How to calculate break-even point cloud vs. local inference
- Cost of model downloads to devices
- Storage costs for on-device models
- Energy consumption on mobile devices
Production question:
At what query volume does local inference become cheaper than cloud? How do you measure this per application?
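A back-of-envelope break-even model helps frame the question (all prices below are placeholders, not real Azure or carrier figures):

```python
def breakeven_queries(cloud_cost_per_query: float,
                      model_download_mb: float,
                      bandwidth_cost_per_gb: float,
                      local_marginal_cost: float = 0.0) -> float:
    """Queries after which local inference is cheaper than cloud.

    Local pays a one-time distribution cost (the model download); cloud
    pays per query. Energy and storage fold into local_marginal_cost.
    """
    fixed_local = (model_download_mb / 1024) * bandwidth_cost_per_gb
    saving_per_query = cloud_cost_per_query - local_marginal_cost
    if saving_per_query <= 0:
        return float("inf")  # local never pays off
    return fixed_local / saving_per_query

# Example: $0.002/query in the cloud, a 2 GB model, $0.05/GB bandwidth
# -> break-even around 50 queries per device
```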
Model selection and accuracy tradeoffs
Challenge:
Foundry Local supports smaller models (Phi, GPT-4o-small). How do developers evaluate accuracy tradeoffs?
Unanswered:
- Benchmark comparisons local vs. cloud models
- Accuracy degradation quantified
- Guidance on which tasks suitable for local models
Security and model tampering
Challenge:
Models cached on user devices. What prevents:
- Model extraction and IP theft
- Model tampering by malicious actors
- Adversarial inputs exploiting local models
Production concern:
Enterprise deploying Foundry Local needs threat model for on-device AI security.
Multi-tenant edge deployments
Challenge:
Kubernetes demo showed single-tenant edge cluster. Production edge environments often multi-tenant.
Unanswered:
- Resource isolation between tenants
- Cost allocation per tenant
- Security boundaries for multi-tenant edge AI
Battery life and thermal impact on mobile
Challenge:
PhonePe demo showed on-device inference on Android. What about:
- Battery drain from local LLM inference
- Thermal impact on phone performance
- User experience degradation during heavy AI usage
Production concern:
Mobile users will abandon apps that kill battery. How do you measure and optimize this?
The honest assessment
What's genuinely valuable
Privacy-first architecture:
PhonePe case study proves local AI solves real regulatory and privacy constraints. 600M+ users getting AI-powered financial insights without data leaving device is legitimate use case, not marketing.
Offline functionality:
Travel assistant demo showed practical benefit. AI shouldn't stop working when Wi-Fi fails. Local inference solves this.
Platform maturity:
A reach of 4.5B+ devices, with Foundry Local shipping as part of the platform, indicates real deployment rather than a lab prototype. Windows ML integration shows Microsoft leveraging OS capabilities properly.
Kubernetes integration:
GitOps-driven edge deployment pattern is production-ready. Containerized Foundry Local with OCI artifact models enables enterprise edge AI at scale.
SDK improvements:
20MB overhead, single library import, OpenAI-compatible APIs—these are developer-friendly decisions that reduce adoption friction.
What's concerning
Model catalog limitations:
Smaller models on-device means accuracy tradeoffs. Session didn't address how developers evaluate when local models insufficient.
Android in private preview:
Mobile is critical distribution channel. Keeping Android in private preview (not GA) limits immediate adoption.
Cost modeling gaps:
Emphasized cost efficiency but provided no tools or guidance for calculating cloud vs. local cost tradeoffs.
Battery life unaddressed:
Running LLMs on mobile drains battery. Session ignored thermal and power concerns that will impact user experience.
Security model unclear:
Models on user devices create attack surface. No discussion of model protection, extraction prevention, or adversarial input defense.
Production considerations
When local AI makes sense
Strong use cases:
Regulated industries: Healthcare, financial services where data cannot leave device
Offline-critical: Remote locations, unreliable connectivity, airplane use
Latency-sensitive: Real-time transcription, instant translation, responsive UI
High-volume simple queries: Cost optimization by routing simple queries locally
When cloud AI still necessary
Complex reasoning: Large models (GPT-4 class) cannot run on consumer devices
Rapidly changing knowledge: Local models have stale training data
Computational intensity: Image generation, video processing require cloud GPUs
Compliance and auditability: Some regulations require cloud-based audit trails
Hybrid architecture pattern
Routing logic in application:
- Classify query complexity
- If simple → route to Foundry Local
- If complex → route to cloud LLM
- If offline → degrade gracefully to local model
Operational complexity:
Maintaining two inference paths increases testing surface, monitoring complexity, and failure modes.
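The routing logic above can be sketched in a few lines (the complexity heuristic and thresholds are illustrative, not a recommended production classifier):

```python
def route(query: str, online: bool, max_local_words: int = 100) -> str:
    """Decide which inference path serves a query."""
    # Crude complexity heuristic: long or multi-step prompts go to the cloud.
    complex_query = (len(query.split()) > max_local_words
                     or "step by step" in query.lower())
    if not online:
        return "local"   # degrade gracefully: always answer on-device
    if complex_query:
        return "cloud"   # reserve large-model reasoning for the cloud
    return "local"       # cheap, low-latency path for simple queries
```

Every branch here is a failure mode to test and monitor, which is the operational cost of the hybrid pattern.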
Strategic implications
Microsoft's edge AI bet
Platform play:
Foundry Local positions Microsoft as enabler of edge AI, not just cloud AI provider. Windows ML integration shows leveraging OS capabilities as competitive moat.
Open standards:
OpenAI-compatible APIs and OCI artifacts signal commitment to interoperability. Reduces vendor lock-in concerns.
Ecosystem expansion:
PhonePe partnership (India), Dell integration (enterprise PCs), Kubernetes support (edge infrastructure)—Microsoft covering distribution channels systematically.
Competitive landscape
Apple: On-device AI with Apple Intelligence, tight hardware-software integration
Google: Android ecosystem, TensorFlow Lite for mobile
Qualcomm/MediaTek: NPU silicon vendors enabling on-device AI
Microsoft differentiation:
Cross-platform (Windows, macOS, Android, Linux, Kubernetes). Not tied to single hardware vendor. Foundry catalog simplifies model distribution.
What to watch
Android GA timeline: Private preview limits adoption. Watch for general availability announcement.
Battery life optimizations: Mobile success depends on power efficiency. Monitor updates to SDK for battery optimization.
Model catalog expansion: Limited models on-device today. Track Foundry catalog additions optimized for edge.
Enterprise case studies: PhonePe proves mobile use case. Watch for enterprise edge deployments (manufacturing, healthcare, retail).
Kubernetes adoption: Arc-enabled edge with Foundry Local enables industrial IoT scenarios. Track customer deployments.
Hybrid cloud-local patterns: Watch for reference architectures showing routing logic between cloud and local inference.
Learn more
Official resources:
- Foundry Local - Getting started documentation
- Android Private Preview Signup
- Foundry Local SDK documentation
- Foundry Local on Kubernetes
Technologies demonstrated:
- ONNX Runtime (high-performance inference engine)
- Windows ML (execution provider abstraction)
- Whisper (speech-to-text on-device)
- Azure Arc (unified edge control plane)
- GitOps with Flux (automated Kubernetes deployments)
- ORAS (OCI Registry As Storage)
Customer examples:
- PhonePe (600M+ users, financial insights on Android)
- Fidelity (wealth management)
- Great Lakes (institutional trade communications)
- Dell Technologies (integrated into applications)
- Anything LLM (developer tools)
Related Ignite sessions:
- AI Fleet Operations (Foundry)
- Building Multi-Agent Systems with Azure AI Foundry
- Pizza Ordering Agent Lab
- A2A and MCP Systems Lab