Azure AI Platform

Azure Search OpenAI Demo: The RAG reference implementation everyone copies

The azure-search-openai-demo repository is Microsoft's canonical example of building ChatGPT-like experiences over private enterprise data using Retrieval-Augmented Generation (RAG). It's also a masterclass in the gap between "sample code" and production readiness.

This Python-based demo showcases Azure AI Search for document retrieval and Azure OpenAI for generation, deployed via Azure Container Apps. The scenario uses the fictitious company "Contoso": employees asking questions about benefits and policies. Microsoft warns explicitly that it "strongly advise[s] customers not to make demo code part of production environments." It's worth understanding why.


What the demo provides

Core RAG pattern implementation

Retrieval-Augmented Generation architecture:

User query → Azure AI Search (retrieval) → Azure OpenAI GPT (generation) → Response with citations

Not just chat:

  • Multi-turn conversational interface
  • Q&A mode for single questions
  • Citation rendering showing source documents
  • Thought process visibility (chain-of-thought reasoning)

Multimodal capabilities:

  • Speech input (Azure Speech Service)
  • Speech output (text-to-speech)
  • GPT-4 Vision for image analysis in documents

Technical stack

Backend:

  • Python (primary implementation)
  • Alternative implementations: JavaScript, .NET, Java

Azure services:

  • Azure AI Search (document indexing, vector search, hybrid search)
  • Azure OpenAI Service (GPT models for generation)
  • Azure Container Apps or Azure App Service (hosting)
  • Optional: Azure Speech Service, Azure Document Intelligence

Developer tooling:

  • Azure Developer CLI (azd) for one-command deployment
  • Bicep for infrastructure-as-code
  • Dev containers for consistent development environments

Deployment options:

  • GitHub Codespaces (instant cloud development)
  • VS Code Dev Containers (local development with container)
  • Local environment (Python 3.10-3.14, Node.js 20+)

The "azd up" magic

Single command deployment:

azd auth login
azd env new
azd up

What happens:

  1. Provisions Azure resources (AI Search, OpenAI, Container Apps)
  2. Deploys application code
  3. Builds search index from sample documents
  4. Returns URL to running application

Time to working demo: ~10-15 minutes

Why this matters:

Cuts the journey from "interested in RAG" to "working RAG application" from days to minutes. Critical for developer adoption.


The fictitious company pattern

Contoso employee benefits scenario

Use case:

Employees ask questions about:

  • Benefits and compensation
  • Internal policies
  • Job descriptions and roles

Sample documents:

Benefits handbooks, policy PDFs, role descriptions—realistic enterprise content types.

Example queries:

  • "What is the company policy on remote work?"
  • "How much vacation time do I get?"
  • "What are the qualifications for Senior Engineer role?"

Why this scenario:

HR and policy documents are:

  • Common enterprise use case
  • Manageable document corpus for demo
  • Relatable to anyone who's worked at a mid-to-large company
  • Low risk (fictional data, no real PII/PHI)

What Contoso hides

Real enterprise complexity not demonstrated:

Document heterogeneity: Contoso docs are clean PDFs and text. Real enterprises have scanned images, handwritten notes, legacy formats, inconsistent structures.

Access control: All Contoso employees see all documents. Real enterprises need row-level security, role-based access, data classification.

Compliance: No GDPR, HIPAA, SOC 2 considerations. No audit trails, data residency requirements, or retention policies.

Scale: Contoso has hundreds of documents. Real enterprises have millions, with daily updates.

Multi-tenancy: Single tenant demo. Real SaaS providers need customer data isolation.


Architecture deep dive

Classic RAG flow

Step 1: Document ingestion

Documents → Azure Document Intelligence → Chunks with embeddings → Azure AI Search index

Chunking strategy:

  • Split documents into manageable segments (default: 1000 tokens with 100 token overlap)
  • Preserve semantic boundaries (paragraphs, sections)
  • Generate embeddings for each chunk
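
A minimal sketch of the overlap idea (illustrative only, not the repo's splitter, which also respects sentence and section boundaries; the tokenizer choice and defaults here are assumptions):

import tiktoken  # BPE tokenizer used by OpenAI models

def chunk_text(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max_tokens - overlap
    # Slide a window over the token stream; consecutive windows share
    # `overlap` tokens so no fact is stranded at a chunk boundary.
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), step)]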

Step 2: Query processing

User query → Generate embedding → Vector search + keyword search (hybrid)

Hybrid search:

  • Vector search: Semantic similarity using embeddings
  • Keyword search: Traditional full-text search
  • Combined ranking: Best of both approaches
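
Roughly what a hybrid query looks like with the azure-search-documents SDK (a sketch, not the demo's exact code: the index fields, key auth, and the embed helper are assumptions, and the demo itself uses managed identity rather than keys):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="docs",
    credential=AzureKeyCredential("<key>"))

user_query = "What is the company policy on remote work?"
query_embedding = embed(user_query)  # hypothetical helper returning list[float]

results = search_client.search(
    search_text=user_query,                    # keyword (BM25) leg
    vector_queries=[VectorizedQuery(
        vector=query_embedding,                # semantic leg, same query
        k_nearest_neighbors=50,
        fields="embedding")],                  # assumed vector field name
    top=5)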

Step 3: Context assembly

Top K retrieved chunks → Formatted as context → Injected into LLM prompt
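
Continuing the hybrid search sketch above, context assembly can be as simple as concatenating the chunks tagged with their source (field names are assumptions):

context = "\n\n".join(
    f"Source: {doc['sourcefile']}\n{doc['content']}" for doc in results)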

Step 4: Generation

Azure OpenAI GPT receives:

  • User query
  • Retrieved document context
  • System prompt defining behavior

Generates response grounded in provided context.
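
A hedged sketch of the generation call with the openai SDK against Azure OpenAI, reusing user_query and context from the sketches above (deployment name, API version, and prompt wording are assumptions):

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-02-01")

SYSTEM_PROMPT = (
    "Answer ONLY using the sources below. Cite each fact with its "
    "source file in square brackets. If the answer is not in the "
    "sources, say you don't know.")

response = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    temperature=0.2,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{user_query}\n\nSources:\n{context}"}])
answer = response.choices[0].message.content

The low temperature and the "say you don't know" instruction are what keep responses grounded in retrieved context rather than the model's parametric memory.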

Step 5: Citation rendering

Response includes references to source chunks, rendered as clickable citations in UI.

Modern RAG (agentic retrieval)

What changed:

LLM acts as query planner, breaking complex questions into subqueries.

Example:

User: "Compare remote work policies before and after 2020"

Classic RAG: Single search for "remote work policies 2020"

Agentic RAG:

  1. LLM generates subqueries: "remote work policy before 2020", "remote work policy after 2020"
  2. Executes searches in parallel
  3. LLM synthesizes comparison from multiple result sets
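
One way to sketch this pattern (not the demo's implementation: run_search is a hypothetical retrieval helper returning chunks as dicts, the planner prompt is illustrative, and JSON mode assumes a model/API version that supports it):

import json
from concurrent.futures import ThreadPoolExecutor

def agentic_answer(question: str) -> str:
    # Step 1: ask the model to plan subqueries as JSON.
    plan = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            'Return JSON {"queries": [...]} listing the search queries '
            f'needed to answer: {question}'}])
    subqueries = json.loads(plan.choices[0].message.content)["queries"]

    # Step 2: run the searches in parallel.
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(run_search, subqueries))

    # Step 3: one synthesis call over the merged result sets.
    merged = "\n\n".join(doc["content"] for rs in result_sets for doc in rs)
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"{question}\n\nSources:\n{merged}"}])
    return final.choices[0].message.content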

Advantage:

Better handling of multi-faceted questions requiring information synthesis across document corpus.


What the demo does well

Developer experience optimization

One-command deployment works:

azd up genuinely provisions everything and results in a working application. Not marketing; it's actually functional.

Local development simplified:

Dev containers ensure consistent Python/Node versions, dependencies, environment configuration. No "works on my machine" problems.

Clear documentation:

README walks through deployment, configuration, troubleshooting. Code comments explain RAG pattern implementation.

RAG pattern education

Visibility into reasoning:

UI shows retrieved documents, chunk text, citations. Developers see how retrieval affects generation quality.

Experimentation-friendly:

Easy to swap models, adjust chunk sizes, tune search parameters. Learn by modifying and observing results.

Multi-language implementations:

Python, JavaScript, .NET, Java versions teach same patterns in different ecosystems.

Azure integration showcase

Service orchestration:

Demonstrates how Azure AI Search, OpenAI, Container Apps work together. Infrastructure-as-code (Bicep) shows production provisioning patterns.

Managed identity:

Uses Azure AD authentication between services. No hardcoded keys in code (critical security pattern).

Monitoring integration:

Application Insights traces requests, errors, performance. Shows telemetry integration from start.


The production gap

What Microsoft explicitly warns against

From the repository:

"This sample is designed to be a starting point only. We strongly advise customers not to make demo code part of production environments without implementing additional security features."

Why this warning exists:

1. Authentication is optional

Demo ships with no authentication. Anyone with the URL can access it. Production requires:

  • User authentication (Azure AD, OAuth)
  • Authorization (who can see which documents)
  • Audit logging (who accessed what, when)

2. No document-level security

All indexed documents accessible to all users. Production needs:

  • Row-level security (users see only documents they're authorized for)
  • Dynamic filtering based on user identity
  • Security trimming in search results
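
Azure AI Search supports security trimming via OData filters over a permissions field. A sketch, reusing the search_client from the earlier example and assuming each chunk is indexed with a "groups" collection of allowed Azure AD group IDs (the field name is an assumption, and the group list must come from the validated server-side token, never from client input):

# Group IDs extracted server-side from the user's validated token claims.
user_groups = ["b67cbf9e-0000-0000-0000-000000000001",
               "4fe5a1b2-0000-0000-0000-000000000002"]

results = search_client.search(
    search_text=user_query,
    # Only return chunks whose "groups" field intersects the user's groups.
    filter="groups/any(g: search.in(g, '{}'))".format(",".join(user_groups)),
    top=5)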

3. Minimal input validation

Demo trusts user input. Production requires:

  • Prompt injection defense
  • Input sanitization
  • Rate limiting per user
  • Cost controls (token usage caps)

4. No PII/PHI handling

Contoso documents contain no real sensitive data. Production with actual PII/PHI requires:

  • Data classification and labeling
  • Encryption at rest and in transit
  • DLP (Data Loss Prevention) policies
  • Compliance controls (GDPR, HIPAA, etc.)

5. Limited error handling

Demo shows happy path. Production needs:

  • Graceful degradation when services unavailable
  • Retry logic with exponential backoff
  • Circuit breakers for failing dependencies
  • User-friendly error messages (not stack traces)
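
For the retry piece, a common approach is tenacity around the OpenAI call (a sketch, not the demo's code; the exception types are from the openai v1 SDK, and the client is the one from the generation sketch earlier):

from openai import APIConnectionError, RateLimitError
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)

@retry(retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
       wait=wait_exponential(multiplier=1, min=1, max=30),  # 1s, 2s, 4s... capped at 30s
       stop=stop_after_attempt(5))
def generate(messages: list[dict]) -> str:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content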

Microsoft's guidance:

Use the chat-with-your-data-solution-accelerator instead. It includes:

  • Production security controls
  • Multi-tenant isolation
  • Advanced monitoring and observability
  • Enterprise-grade deployment patterns
  • Best practices for compliance

Or:

Follow the Azure OpenAI Landing Zone reference architecture for:

  • Network isolation (VNets, private endpoints)
  • WAF and API Management for API security
  • Key Vault for secrets management
  • RBAC and managed identities throughout
  • DR and backup strategies

Real-world customization challenges

Document diversity problem

Demo assumption: Clean PDFs and text files

Reality:

Enterprises have:

  • Scanned images requiring OCR
  • Tables and charts requiring layout understanding
  • Multi-language documents requiring translation
  • Legacy formats (WordPerfect, Lotus Notes)
  • Email threads with attachments
  • SharePoint sites with permissions inheritance

Solution complexity:

Azure Document Intelligence helps, but requires:

  • Custom preprocessing pipelines
  • Format-specific handling
  • Quality validation (OCR errors)
  • Metadata extraction and preservation

Retrieval quality tuning

Demo uses defaults:

Default chunk size, default embedding model, default search parameters.

Production requires experimentation:

Chunk size optimization:

  • Too small: Fragments lack context
  • Too large: Dilutes semantic meaning
  • Domain-specific: Legal contracts need different chunking than chat logs

Embedding model selection:

  • text-embedding-ada-002 vs. text-embedding-3-large
  • Domain-specific fine-tuning for specialized vocabulary
  • Multilingual embedding models for global enterprises
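
Swapping embedding models is mechanically trivial with the openai SDK (deployment name is an assumption; the hard part is re-indexing the corpus and re-evaluating retrieval quality afterward):

emb = client.embeddings.create(
    model="text-embedding-3-large",   # your embedding deployment name
    input=["What is the company policy on remote work?"])
vector = emb.data[0].embedding        # list[float], one entry per input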

Search parameter tuning:

  • Hybrid search weighting (vector vs. keyword)
  • Semantic reranking thresholds
  • Top K value (how many chunks to retrieve)
  • Relevance scoring adjustments

Measurement:

You need a golden question set with known correct answers. Measure precision, recall, and NDCG. Iterate on configuration. This is ongoing work, not a one-time setup; a minimal sketch follows.
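
A sketch of recall@k over a golden set (run_search is the hypothetical retrieval helper from earlier; the document IDs are made up):

golden = [
    {"q": "How much vacation time do I get?",
     "relevant": {"benefits-handbook-12", "pto-policy-03"}},
]

def recall_at_k(golden: list[dict], k: int = 5) -> float:
    # Fraction of questions where at least one known-relevant chunk
    # appears in the top-k retrieved results.
    hits = sum(
        1 for item in golden
        if {doc["id"] for doc in run_search(item["q"])[:k]} & item["relevant"])
    return hits / len(golden)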

Cost management at scale

Demo cost: Negligible (sample documents, low query volume)

Production cost drivers:

1. Azure AI Search:

  • Index size (storage cost scales with document volume)
  • Query volume (pay per search query)
  • Replica count for availability

2. Azure OpenAI:

  • Prompt tokens (retrieved chunks add significant context)
  • Completion tokens (generated responses)
  • Model tier (GPT-4 vs. GPT-3.5 pricing difference)

3. Azure Document Intelligence:

  • Page processing charges for document ingestion
  • Volume scales with document corpus size

Cost optimization strategies:

Caching:

  • Cache frequent queries and responses
  • TTL based on document update frequency
  • Reduces redundant LLM calls
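
An in-memory sketch of the idea (production would typically use Redis; the TTL value is an assumption to tune against document update frequency):

import hashlib
import time

CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600

def _key(query: str) -> str:
    # Normalize so trivially different phrasings of the same query collide.
    return hashlib.sha256(query.lower().strip().encode()).hexdigest()

def get_cached(query: str) -> str | None:
    entry = CACHE.get(_key(query))
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]
    return None

def put_cached(query: str, answer: str) -> None:
    CACHE[_key(query)] = (time.time(), answer)

Exact-match keys only catch verbatim repeats; semantic caching (keying on query embeddings) catches paraphrases, at the cost of more machinery.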

Query routing:

  • Simple questions → GPT-3.5
  • Complex reasoning → GPT-4
  • Threshold-based routing
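
A deliberately crude routing heuristic to illustrate the shape (deployment names and thresholds are assumptions; production routing might use a small classifier instead):

def pick_model(question: str) -> str:
    complex_markers = ("compare", "why", "explain", "difference", "versus")
    if (len(question.split()) > 25
            or any(m in question.lower() for m in complex_markers)):
        return "gpt-4o"         # complex reasoning, higher cost
    return "gpt-35-turbo"       # simple lookups, lower cost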

Chunk deduplication:

  • Don't retrieve duplicate chunks
  • Remove redundant context before LLM call
  • Reduces prompt token costs
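
Exact-duplicate removal is a few lines; near-duplicate detection (e.g. by embedding similarity) is more involved. A sketch:

import hashlib

def dedupe(chunks: list[dict]) -> list[dict]:
    seen: set[str] = set()
    out = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["content"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(chunk)
    return out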

Monitoring:

  • Cost per query tracking
  • Anomaly detection for runaway costs
  • Alerting when thresholds exceeded

Advanced patterns not in demo

Session management and personalization

Demo:

Stateless conversation. Each query independent.

Production needs:

Conversation history:

  • Store chat history per user
  • Inject relevant prior context into prompts
  • Manage context window limits (can't include infinite history)
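
A sketch of token-budget pruning, keeping the most recent turns that fit (the budget and tokenizer are assumptions):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def prune_history(history: list[dict], budget: int = 3000) -> list[dict]:
    kept: list[dict] = []
    used = 0
    # Walk newest-to-oldest; stop once the token budget is exhausted.
    for msg in reversed(history):
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))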

Personalization:

  • User preferences (response style, verbosity)
  • Role-based content filtering
  • Previous interaction learning

Implementation:

Azure Cosmos DB or Azure SQL for conversation state. Redis for session caching. Logic to prune old history based on relevance.

Hybrid cloud-local deployment

Demo:

Cloud-only deployment.

Enterprise reality:

Some data cannot leave premises (regulatory, contractual).

Hybrid pattern:

On-premises:

  • Sensitive document storage
  • Document processing and chunking
  • Embedding generation

Cloud:

  • Azure OpenAI for generation
  • Orchestration logic
  • UI hosting

Challenge:

Network latency between on-prem retrieval and cloud generation. Bandwidth costs for large context transmission.

Multi-tenant isolation

Demo:

Single tenant (one Contoso company).

SaaS reality:

Thousands of customers, each with own document corpus.

Isolation options:

1. Index-per-tenant:

  • Separate Azure AI Search index per customer
  • Complete data isolation
  • Scales poorly (Azure limits on index count)

2. Shared index with filtering:

  • Single index, documents tagged with tenant ID
  • Filter queries by tenant ID
  • Risk of filter bypass vulnerabilities
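
The filter itself is one line; the discipline is deriving the tenant ID from the authenticated session server-side so a client can never supply it. A sketch, reusing the search_client from earlier (the field name and helper are assumptions):

tenant_id = get_authenticated_tenant_id()    # hypothetical: from server-side auth context

results = search_client.search(
    search_text=user_query,
    filter=f"tenant_id eq '{tenant_id}'",    # assumed tag field on every chunk
    top=5)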

3. Search service per tenant:

  • Complete service isolation
  • Expensive at scale
  • Operational complexity

Tradeoff:

Cost vs. isolation vs. operational complexity. No perfect answer.


What developers actually do with this demo

Common customization paths

1. Replace Contoso documents with own data

Most immediate step. Upload the company's actual PDFs, re-run indexing, test retrieval quality.

Lesson learned: Retrieval quality often poor on first try. Leads to chunking experiments.

2. Add authentication

Integrate Azure AD. Restrict access to authenticated users.

Lesson learned: Row-level security harder than expected. Demo doesn't show document ACL patterns.

3. Customize UI

Replace generic ChatGPT interface with company branding, specific workflows.

Lesson learned: Frontend is React. Requires JavaScript skills beyond Python backend.

4. Integrate with enterprise systems

Connect to SharePoint, Confluence, internal wikis as document sources.

Lesson learned: Each system has different API, permissions model, update patterns. Significant integration work.

Where projects get stuck

Retrieval quality plateau:

Developers tune parameters but hit ceiling. Need domain experts to evaluate answer quality, identify failure patterns. Requires systematic evaluation framework.

Security implementation:

Adding authentication is easy. Implementing proper authorization (who sees what) is complex. Requires understanding Azure RBAC and custom security trimming logic.

Cost runaway:

Initial testing cheap. Production query volume reveals costs. Scramble to implement caching, optimize prompts, reduce token usage.

Production deployment:

Demo uses Container Apps. Enterprise might require App Service, AKS, or on-prem Kubernetes. Adapting infrastructure-as-code non-trivial.


Chat-with-your-data solution accelerator

Difference from demo:

Production-focused from start. Includes:

  • Security controls
  • Multi-source connectors (SharePoint, blob, SQL)
  • Admin interface for configuration
  • Advanced telemetry

Tradeoff:

More complex, harder to understand internals. Less educational, more operational.

When to use:

When goal is production deployment, not learning RAG patterns.

Semantic Kernel integration

Pattern:

Use demo's retrieval logic, but Semantic Kernel for orchestration and agent capabilities.

Advantage:

Extends RAG with function calling, plugins, multi-agent patterns.

Complexity:

Adds orchestration layer. Useful for complex workflows beyond simple Q&A.

LangChain/LlamaIndex alternatives

Community preference:

Some developers prefer LangChain or LlamaIndex over Microsoft-specific patterns.

Compatibility:

Azure AI Search integrates with both frameworks. Can use demo's indexing strategy with different orchestration.

Consideration:

Vendor lock-in vs. ecosystem flexibility tradeoff.


The honest assessment

What the demo accomplishes

Lowers barrier to RAG experimentation:

azd up eliminating setup friction is a genuine achievement. Developers go from zero to working RAG in minutes, not days.

Teaches core pattern clearly:

Retrieval → Context → Generation flow well-demonstrated. Code structure clean, educational.

Azure service integration showcase:

Shows how AI Search, OpenAI, Container Apps work together. Infrastructure-as-code valuable reference.

Multi-language implementations:

Python, JavaScript, .NET, Java versions help developers in their preferred ecosystem.

What the demo doesn't prepare you for

Production security requirements:

Gap between "no auth" demo and enterprise security substantial. Demo doesn't show the hard parts.

Retrieval quality optimization:

Demo uses defaults. Real-world retrieval tuning is iterative, domain-specific, requires evaluation framework.

Cost management at scale:

Demo cost negligible. Production cost optimization requires architecture changes, not configuration tweaks.

Document processing complexity:

Clean PDF assumption breaks on real enterprise documents. Preprocessing becomes significant project.

Operational concerns:

Monitoring, alerting, incident response, DR/backup—not addressed. Production requires operational maturity.


Production readiness checklist

Before deploying RAG system built on this demo to production:

Security

  • [ ] User authentication implemented (Azure AD, OAuth)
  • [ ] Document-level authorization (who can see what)
  • [ ] Prompt injection defenses
  • [ ] Input validation and sanitization
  • [ ] Rate limiting per user
  • [ ] Audit logging of all access
  • [ ] PII/PHI handling controls
  • [ ] Data encryption at rest and in transit

Reliability

  • [ ] Error handling and graceful degradation
  • [ ] Retry logic with exponential backoff
  • [ ] Circuit breakers for dependencies
  • [ ] Health check endpoints
  • [ ] Disaster recovery plan
  • [ ] Backup and restore procedures
  • [ ] Load testing completed

Observability

  • [ ] Application Insights integration
  • [ ] Custom metrics for retrieval quality
  • [ ] Cost tracking per query
  • [ ] Alerting for anomalies
  • [ ] Dashboard for operational metrics
  • [ ] Log aggregation and search

Performance

  • [ ] Caching strategy implemented
  • [ ] Query optimization based on load testing
  • [ ] CDN for static assets
  • [ ] Database query optimization
  • [ ] Auto-scaling configured

Compliance

  • [ ] GDPR/CCPA compliance controls
  • [ ] Data retention policies
  • [ ] Right to be forgotten implementation
  • [ ] Data residency requirements met
  • [ ] Compliance audit trail
  • [ ] Legal review completed

Cost Management

  • [ ] Cost per query tracking
  • [ ] Budget alerts configured
  • [ ] Query routing optimization
  • [ ] Token usage optimization
  • [ ] Reserved capacity vs. pay-as-you-go analysis

Learn more

Official repository:

https://github.com/Azure-Samples/azure-search-openai-demo

Production-ready alternatives:

https://github.com/microsoft/chat-with-your-data-solution-accelerator

Microsoft Learn:

https://learn.microsoft.com/azure/search/retrieval-augmented-generation-overview
Related architectures:

  • Azure AI Document Intelligence for document processing
  • Azure Semantic Kernel for orchestration
  • Azure API Management for API governance
