Azure AI Platform
Foundry Local: Microsoft's AI democratization play from cloud to edge
Microsoft Foundry Local brings AI inference to personal devices, mobile phones, and edge infrastructure—addressing privacy concerns, latency requirements, cost optimization, and offline scenarios that cloud-only AI cannot solve.
Session demonstrated expansion from Windows/macOS (4.5B+ devices) to Android mobile and Kubernetes edge deployments. PhonePe case study shows on-device transaction insights for 600M+ users under strict Indian financial regulations. Live demo: Foundry Local containers with GitOps-driven model versioning on GPU-enabled edge clusters.
Session context
Foundry Local: Bringing AI from Cloud to Edge
Speakers:
- Lior Kamrat (Microsoft)
- Sam Kemp (Microsoft, Product Manager - Foundry Local)
- Welton (PhonePe, Product Manager - Consumer Platforms)
- Dor Yitzhak (Microsoft, Azure Arc)
When: November 19, 2025
Where: Microsoft Ignite 2025, San Francisco
Format: Technical session with live demos
Session focus:
"Microsoft Foundry Local enables AI on your phone, laptop, desktop, in containers, and on your own infrastructure. Outside of the public cloud. Because we believe that local AI is critical for democratizing AI, making AI available to every single person on the planet."
The four use cases driving local AI
Use case 1: Privacy and security
Challenge:
Highly sensitive data—financial records, healthcare information—that cannot traverse the internet but still requires AI processing.
Solution:
Data never leaves user's device. AI inference happens locally. No round-trip to cloud.
Enterprise example:
Financial services firms handling trade communications. Specific, sensitive data that requires AI-powered translation into actual trades without exposing data to cloud services.
Use case 2: Latency
Challenge:
Instantaneous responses required. Speaking into device expecting real-time translation or transcription appearing immediately.
Solution:
Local inference eliminates network latency. Model responds within milliseconds on-device.
Consumer example:
Voice prompting travel assistant on device. Audio transcription happens instantly without waiting for cloud response.
Use case 3: Cost efficiency
Challenge:
Many inference calls happening throughout the month. Need to optimize costs using combination of cloud AI and local AI.
Solution:
Route simple queries to local models. Reserve cloud for complex reasoning requiring larger models.
Operational pattern:
Hybrid cloud-local architecture. Application decides routing based on query complexity and cost thresholds.
Use case 4: Bandwidth constraints and offline scenarios
Challenge:
Spotty airplane Wi-Fi. Warehouses and distribution centers in remote locations. Need AI functionality regardless of connectivity.
Solution:
Fully functional AI on-device. Works without internet connection.
Real-world scenario:
Traveling on an airplane with unreliable Wi-Fi. Still want the AI travel assistant answering questions about the destination without connection errors.
Foundry Local architecture
Cross-platform stack
Foundation: ONNX Runtime
- Introduced 2017
- High-performance inference engine
- Cross-platform (Windows, macOS, Android, Linux, Kubernetes)
- Supports multiple device architectures
Foundry Local Management Service
Responsibilities:
- Hosting and managing models on local device
- Bringing models from Foundry catalog on cloud to device
- Developers don't package models into application (keeps app size small)
- Models downloaded on-demand from cloud catalog
Developer interfaces:
Foundry Local CLI:
- Exploring models
- Testing which model provides right accuracy, response time, performance
- Prototyping phase
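During that prototyping loop, a CLI session might look like the following (command names follow the public Foundry Local CLI; the model alias is illustrative):

```shell
# List models available in the Foundry catalog for this device
foundry model list

# Download a model and chat with it interactively to gauge accuracy and latency
foundry model run phi-3.5-mini

# Confirm the local inference service is running
foundry service status
```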
Foundry Local SDK:
- Integrating Foundry Local into application
- Production deployment
- Simplified from earlier versions (more on this below)
Platform-specific optimizations
Windows:
Foundry Local available as part of Microsoft Foundry on Windows. Uses Windows ML to intelligently select execution provider (NPU, GPU, CPU) based on device capabilities.
macOS:
Same developer experience as Windows. Abstracts heterogeneous hardware underneath.
Android:
New at Ignite 2025. Expands reach across billions of mobile devices.
Key advantage:
As developer, experience identical across platforms. Windows and macOS ML layers abstract hardware heterogeneity. Rich ecosystem supporting NPU, GPU, CPU—but developer doesn't worry about exact silicon on user device.
New announcements at Ignite 2025
1. Foundry Local on Android
Reach: Billions of Android devices now addressable
Use case: On-device AI for mobile apps without cloud dependency
Example: PhonePe financial insights (detailed case study below)
2. Speech-to-text models support
First in class: The first local AI platform to support speech-to-text on-device
Capability: Voice prompting AI applications. More natural interaction.
Model: Whisper running locally on device
Technical advantage: Audio transcription without sending audio to cloud. Privacy-preserving voice interfaces.
3. Improved SDK
Design principles:
Simple deployment:
- Single library import (foundry_local)
- No extra dependencies or installers
- Model inference happens within app
Small component size:
- SDK adds ~20MB to application size
- Faster download for end users
- Meets app store size constraints
Familiar cross-platform APIs:
- OpenAI-compatible input/output format
- Seamless cloud-to-edge transition
- Same code structure for cloud or local models
Beyond chat completions:
- Audio transcription support
- Image processing capabilities
- Multimodal on-device inference
Momentum and customer adoption
Scale achieved
4.5 billion+ Windows and macOS devices within Foundry Local's reach
Model support expanded:
- GPT-4o-small
- Phi models
- Llama models
- Custom fine-tuned models
NPU coverage:
- Almost every major NPU in market now supported
- Qualcomm, Intel, AMD, Apple Silicon
Customer integrations
Enterprise:
- Fidelity (wealth management)
- Great Lakes (trade communications for institutional customers)
- Dell Technologies (integrated into their applications)
Developer tools:
- Anything LLM (local AI assistant)
Mobile:
- PhonePe (600M+ users in India)
Live demo: Travel assistant with audio transcription
Scenario
Sam Kemp demonstrated adding speech-to-text to travel assistant app with minimal code changes.
Application: Travel assistant for British family booking holiday to Tresco island
Critical holiday requirements (for British travelers):
- Is there a pub?
- Can you get fish and chips?
Initial state
Chat completions working: Using cloud LLM for text queries
Audio disabled: Microphone button disabled when Wi-Fi unavailable
Problem: Users hit "cannot connect" errors when offline but still need AI functionality
Adding audio transcription
Code required: 6 lines
Step 1: Load Whisper model
# Load the Whisper model onto the device, into memory
whisper_model = load_model("whisper-small")
# Create an audio client that speaks the OpenAI-compatible format
audio_client = OpenAIAudioClient()
Step 2: Transcribe audio
# Stream the audio file through the local Whisper model
response = audio_client.transcribe_audio_streaming(audio_file)
# Response comes back in the OpenAI-compatible format
Key technical detail:
Input/output format matches OpenAI API. Switching from cloud Whisper to local Whisper requires no format changes. Same code works for cloud or edge deployment.
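A minimal sketch of that cloud-to-edge switch, assuming the local service exposes an OpenAI-compatible endpoint (the port and placeholder key below are illustrative, not the SDK's actual defaults):

```python
def client_kwargs(use_local: bool,
                  local_endpoint: str = "http://localhost:5273/v1") -> dict:
    """Build kwargs for an OpenAI-compatible client; only the endpoint changes."""
    if use_local:
        # Foundry Local serves an OpenAI-compatible API on the device;
        # no real API key is needed for a loopback endpoint.
        return {"base_url": local_endpoint, "api_key": "local"}
    # Empty kwargs fall through to the cloud defaults (OPENAI_API_KEY from env).
    return {}

# Application code stays identical either way:
#   client = OpenAI(**client_kwargs(use_local=offline_mode))
#   client.audio.transcriptions.create(...)
```

The design choice mirrors the session's claim: routing is a configuration decision, not a code rewrite.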
Demo result
Voice prompt: "Is there a pub on Tresco?"
Transcription: Happened instantly on-device
Chat response: Sent transcribed text to local chat model, received answer
User experience: Seamless AI interaction even when offline
Developer takeaways
No user installation required: Everything bundled in application
Small binary: SDK adds ~20MB to app size
Easy integration: 6 lines of code to add audio transcription
Cloud-to-edge transition: Same API format, seamless migration
PhonePe case study: On-device financial insights for 600M+ Indians
PhonePe background
Scale:
- One of India's largest fintech platforms
- 600M+ users
- Think: Cash App + Venmo + Credit Karma + Stripe + Robinhood combined
- Mission: Digital payments and financial services for every Indian
Regulatory environment:
India has strict regulations on data storage, usage, and sharing. Financial data extremely personal and critical. Cannot compromise on data security, data privacy, and reliability.
The challenge
User need: Derive meaningful insights from transaction history
Technical constraints:
- All offerings exposed as mobile app
- Need edge deployment (on-device AI)
- Highly regulated industry
- Data cannot leave device
- Must work offline
Business requirement:
Users transact multiple times daily. Want to understand spending patterns without manual analysis. Where is extra money going? Utility payments up? Unusual expenses?
Foundry Local implementation
Architecture:
Foundry Local app for Android:
- Installed as part of simple setup process
- Pick standard models or custom fine-tuned models
- Run inference on-device
- PhonePe app leverages for transaction insights
User experience:
Transaction insights:
- "How much did I spend on electricity this month?" → Model detects recent bill significantly higher than usual
- "How much money did I send to [person]?" → Model analyzes transaction history
- Rich visualizations of spending patterns
- Natural language queries over personal financial data
Privacy preservation:
All analysis happens on-device. Transaction data never leaves phone. Foundry Local processes locally. Insights generated without cloud upload.
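Insight queries like these reduce to on-device aggregation plus a small model for language understanding. A toy sketch of the aggregation half, with invented transaction data (nothing here is PhonePe's actual schema):

```python
from datetime import date

# Hypothetical on-device transaction log: (category, date, amount in rupees)
transactions = [
    ("electricity", date(2025, 10, 5), 1200),
    ("electricity", date(2025, 11, 4), 2100),
    ("groceries",   date(2025, 11, 10), 800),
]

def monthly_spend(category: str, year: int, month: int) -> int:
    """Sum spend for one category in one month, entirely on-device."""
    return sum(amt for cat, d, amt in transactions
               if cat == category and d.year == year and d.month == month)

current = monthly_spend("electricity", 2025, 11)   # 2100
previous = monthly_spend("electricity", 2025, 10)  # 1200
# The local model would then phrase the finding ("your electricity bill is
# noticeably higher than last month") -- no transaction leaves the phone.
```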
Technical enablement
Full spectrum of AI capabilities:
Standard models: Out-of-the-box models from Foundry catalog
Custom models: PhonePe fine-tunes own models for financial domain
Data layers: Seamless data handling at the scale PhonePe operates
Partnership value:
Not just technology provider, but roadmap collaboration. PhonePe and Microsoft working together on technical direction, experimenting with emerging technologies.
Welton (PhonePe Product Manager):
"We were looking for partners who could really solve a bunch of different problems for us, and Foundry Local has really helped us tackle the most complex problem that we have. As a partner, Foundry Local has enabled the full spectrum of AI capabilities for us."
Edge deployment: Foundry Local on Kubernetes with Azure Arc
The edge use case
Customer requirements:
1. Bring your own Kubernetes distribution
- Don't break existing patterns
- DevOps teams, developers, platform engineers work in certain way
- Foundry Local must complement, not replace
2. Microservices architecture
- Container as atomic unit in Kubernetes
- Foundry Local needs containerized distribution
3. Rich Kubernetes ecosystem
- Leverage existing tools and patterns
- Complement ecosystem, not compete
4. Bring your own model
- Custom fine-tuned models
- LoRA adapters
- Proprietary models not in catalog
5. Unified control plane
- Azure Arc provides single control plane across edge deployments
Foundry Local containerized architecture
What's containerized:
Foundry Local SDK: Bundled as OCI-compliant container
Models: Packaged as OCI artifacts (not just Docker images)
Deployment target: Any Kubernetes distribution (connected, disconnected, air-gapped)
Key technical detail:
OCI (Open Container Initiative) registry enables bundling different file formats together. Model files packaged as OCI artifacts alongside container image.
Live demo: GitOps-driven model versioning
Setup:
Consumer-grade edge devices running single-node Kubernetes clusters on Ubuntu with NVIDIA RTX 4080 GPUs.
Architecture:
GitOps pattern:
- Helm release in Git repository (source of truth)
- Flux GitOps operator watches repository
- Automated deployment when configuration changes
OCI registry:
- Holds Foundry Local container
- Holds model packaged as OCI artifact
- Llama model versions (v1, v2)
Demo flow:
Phase 1: Initial deployment
- Kubernetes cluster with Foundry Local pod
- OpenWebUI pod for interaction
- Llama model v1 in Foundry Local cache folder
- Model ready for inference
Phase 2: Model upgrade
- Push Llama model v2 to OCI registry using ORAS tool
- Update Helm release in Git repo to point to v2
- GitOps operator detects change
- Initiates rolling upgrade of pods
- New pod spins up with v2 model
- Old pod terminates
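The Git-side change driving that upgrade can be sketched as a Flux HelmRelease (field names follow the Flux v2 HelmRelease API; the chart name, registry URL, and value keys are invented for illustration):

```yaml
# Illustrative HelmRelease: committing the tag bump from v1 to v2
# is what triggers the rolling upgrade.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: foundry-local
  namespace: ai
spec:
  interval: 5m
  chart:
    spec:
      chart: foundry-local
      sourceRef:
        kind: HelmRepository
        name: edge-registry
  values:
    model:
      artifact: oci://registry.example.com/models/llama
      tag: v2   # was v1; Flux detects the commit and rolls the pods
```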
Technical detail:
ORAS (OCI Registry As Storage): an open-source tool, originated at Microsoft and now a CNCF project, for pushing and pulling OCI artifacts. Think of it as the Docker CLI for arbitrary artifacts rather than for running containers.
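Pushing a model version as an OCI artifact might look like this with ORAS (registry URL, repository path, and file name are illustrative):

```shell
# Authenticate to the OCI registry (same credential model as docker login)
oras login registry.example.com

# Push the model file as an OCI artifact, tagged v2
oras push registry.example.com/models/llama:v2 model.onnx

# Pull it back on an edge node to verify the artifact round-trips
oras pull registry.example.com/models/llama:v2
```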
GPU utilization monitoring:
Demo showed nvidia-smi monitoring GPU usage during inference. Model configured to use CUDA libraries, hammering GPU during question answering (nostalgic 80s cartoon trivia: Optimus Prime vs. Megatron).
Production patterns demonstrated
Disconnected scenarios: Works air-gapped, no internet required
Sovereignty compliance: European customers with data residency requirements
Rolling upgrades: Zero-downtime model versioning
GPU acceleration: Automatic execution provider selection (CUDA in demo)
GitOps automation: Infrastructure-as-code for model deployments
SDK improvements: Developer experience
Four design principles
1. Simple deployment
Developers: Easy to deploy AI-infused applications
End users: Easy to consume AI-infused applications
Implementation: Single library import. No complex dependency chains.
2. Small component size
Why it matters:
- Faster download for end users
- App stores have size constraints
- Better application experience
Result: SDK adds ~20MB to application
3. Familiar cross-platform APIs
OpenAI-compatible format: Seamless cloud-to-edge transition
Example:
# Same code works for cloud or local
audio_client = OpenAIAudioClient()
response = audio_client.transcribe(audio_file)
# Format identical whether using OpenAI cloud or Foundry Local
4. Beyond chat completions
Not just text chat. Audio, image, multimodal capabilities on-device.
What wasn't demonstrated
Cost modeling for hybrid cloud-local
Challenge:
Session emphasized cost efficiency as use case #3, but no concrete guidance on:
- How to calculate break-even point cloud vs. local inference
- Cost of model downloads to devices
- Storage costs for on-device models
- Energy consumption on mobile devices
Production question:
At what query volume does local inference become cheaper than cloud? How do you measure this per application?
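A back-of-envelope break-even model helps frame the question (all prices below are placeholders, not real Azure or carrier figures):

```python
def breakeven_queries(cloud_cost_per_query: float,
                      model_download_mb: float,
                      bandwidth_cost_per_gb: float,
                      local_marginal_cost: float = 0.0) -> float:
    """Queries after which local inference is cheaper than cloud.

    Local pays a one-time distribution cost (the model download); cloud
    pays per query. Energy and storage fold into local_marginal_cost.
    """
    fixed_local = (model_download_mb / 1024) * bandwidth_cost_per_gb
    saving_per_query = cloud_cost_per_query - local_marginal_cost
    if saving_per_query <= 0:
        return float("inf")  # local never pays off
    return fixed_local / saving_per_query

# Example: $0.002/query in the cloud, a 2 GB model, $0.05/GB bandwidth
# -> break-even around 50 queries per device
```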
Model selection and accuracy tradeoffs
Challenge:
Foundry Local supports smaller models (Phi, GPT-4o-small). How do developers evaluate accuracy tradeoffs?
Unanswered:
- Benchmark comparisons local vs. cloud models
- Accuracy degradation quantified
- Guidance on which tasks suitable for local models
Security and model tampering
Challenge:
Models cached on user devices. What prevents:
- Model extraction and IP theft
- Model tampering by malicious actors
- Adversarial inputs exploiting local models
Production concern:
Enterprise deploying Foundry Local needs threat model for on-device AI security.
Multi-tenant edge deployments
Challenge:
Kubernetes demo showed single-tenant edge cluster. Production edge environments often multi-tenant.
Unanswered:
- Resource isolation between tenants
- Cost allocation per tenant
- Security boundaries for multi-tenant edge AI
Battery life and thermal impact on mobile
Challenge:
PhonePe demo showed on-device inference on Android. What about:
- Battery drain from local LLM inference
- Thermal impact on phone performance
- User experience degradation during heavy AI usage
Production concern:
Mobile users will abandon apps that kill battery. How do you measure and optimize this?
The honest assessment
What's genuinely valuable
Privacy-first architecture:
PhonePe case study proves local AI solves real regulatory and privacy constraints. 600M+ users getting AI-powered financial insights without data leaving device is legitimate use case, not marketing.
Offline functionality:
Travel assistant demo showed practical benefit. AI shouldn't stop working when Wi-Fi fails. Local inference solves this.
Platform maturity:
A reach of 4.5B+ devices, with Foundry Local shipping as part of the platform, indicates real deployment rather than a lab prototype. Windows ML integration shows Microsoft leveraging OS capabilities properly.
Kubernetes integration:
GitOps-driven edge deployment pattern is production-ready. Containerized Foundry Local with OCI artifact models enables enterprise edge AI at scale.
SDK improvements:
20MB overhead, single library import, OpenAI-compatible APIs—these are developer-friendly decisions that reduce adoption friction.
What's concerning
Model catalog limitations:
Smaller models on-device means accuracy tradeoffs. Session didn't address how developers evaluate when local models insufficient.
Android in private preview:
Mobile is critical distribution channel. Keeping Android in private preview (not GA) limits immediate adoption.
Cost modeling gaps:
Emphasized cost efficiency but provided no tools or guidance for calculating cloud vs. local cost tradeoffs.
Battery life unaddressed:
Running LLMs on mobile drains battery. Session ignored thermal and power concerns that will impact user experience.
Security model unclear:
Models on user devices create attack surface. No discussion of model protection, extraction prevention, or adversarial input defense.
Production considerations
When local AI makes sense
Strong use cases:
Regulated industries: Healthcare, financial services where data cannot leave device
Offline-critical: Remote locations, unreliable connectivity, airplane use
Latency-sensitive: Real-time transcription, instant translation, responsive UI
High-volume simple queries: Cost optimization by routing simple queries locally
When cloud AI still necessary
Complex reasoning: Large models (GPT-4 class) cannot run on consumer devices
Rapidly changing knowledge: Local models have stale training data
Computational intensity: Image generation, video processing require cloud GPUs
Compliance and auditability: Some regulations require cloud-based audit trails
Hybrid architecture pattern
Routing logic in application:
- Classify query complexity
- If simple → route to Foundry Local
- If complex → route to cloud LLM
- If offline → degrade gracefully to local model
Operational complexity:
Maintaining two inference paths increases testing surface, monitoring complexity, and failure modes.
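The routing logic above can be sketched in a few lines (the complexity heuristic and thresholds are illustrative, not a recommended production classifier):

```python
def route(query: str, online: bool, max_local_words: int = 100) -> str:
    """Decide which inference path serves a query."""
    # Crude complexity heuristic: long or multi-step prompts go to the cloud.
    complex_query = (len(query.split()) > max_local_words
                     or "step by step" in query.lower())
    if not online:
        return "local"   # degrade gracefully: always answer on-device
    if complex_query:
        return "cloud"   # reserve large-model reasoning for the cloud
    return "local"       # cheap, low-latency path for simple queries
```

Every branch here is a failure mode to test and monitor, which is the operational cost of the hybrid pattern.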
Strategic implications
Microsoft's edge AI bet
Platform play:
Foundry Local positions Microsoft as enabler of edge AI, not just cloud AI provider. Windows ML integration shows leveraging OS capabilities as competitive moat.
Open standards:
OpenAI-compatible APIs and OCI artifacts signal commitment to interoperability. Reduces vendor lock-in concerns.
Ecosystem expansion:
PhonePe partnership (India), Dell integration (enterprise PCs), Kubernetes support (edge infrastructure)—Microsoft covering distribution channels systematically.
Competitive landscape
Apple: On-device AI with Apple Intelligence, tight hardware-software integration
Google: Android ecosystem, TensorFlow Lite for mobile
Qualcomm/MediaTek: NPU silicon vendors enabling on-device AI
Microsoft differentiation:
Cross-platform (Windows, macOS, Android, Linux, Kubernetes). Not tied to single hardware vendor. Foundry catalog simplifies model distribution.
What to watch
Android GA timeline: Private preview limits adoption. Watch for general availability announcement.
Battery life optimizations: Mobile success depends on power efficiency. Monitor updates to SDK for battery optimization.
Model catalog expansion: Limited models on-device today. Track Foundry catalog additions optimized for edge.
Enterprise case studies: PhonePe proves mobile use case. Watch for enterprise edge deployments (manufacturing, healthcare, retail).
Kubernetes adoption: Arc-enabled edge with Foundry Local enables industrial IoT scenarios. Track customer deployments.
Hybrid cloud-local patterns: Watch for reference architectures showing routing logic between cloud and local inference.
Learn more
Official resources:
- Foundry Local - Getting started documentation
- Android Private Preview Signup
- Foundry Local SDK documentation
- Foundry Local on Kubernetes
Technologies demonstrated:
- ONNX Runtime (high-performance inference engine)
- Windows ML (execution provider abstraction)
- Whisper (speech-to-text on-device)
- Azure Arc (unified edge control plane)
- GitOps with Flux (automated Kubernetes deployments)
- ORAS (OCI Registry As Storage)
Customer examples:
- PhonePe (600M+ users, financial insights on Android)
- Fidelity (wealth management)
- Great Lakes (institutional trade communications)
- Dell Technologies (integrated into applications)
- Anything LLM (developer tools)
Related Ignite sessions:
- AI Fleet Operations (Foundry)
- Building Multi-Agent Systems with Azure AI Foundry
- Pizza Ordering Agent Lab
- A2A and MCP Systems Lab