Azure AI Platform

Azure Search OpenAI Demo: The RAG reference implementation everyone copies

The azure-search-openai-demo repository is Microsoft's canonical example of building ChatGPT-like experiences over private enterprise data using Retrieval-Augmented Generation (RAG). It's also a masterclass in the gap between "sample code" and production readiness.

This Python-based demo showcases Azure AI Search for document retrieval and Azure OpenAI for generation, deployed via Azure Container Apps. The scenario uses the fictitious company "Contoso": employees asking questions about benefits and policies. Microsoft warns explicitly that it "strongly advise[s] customers not to make demo code part of production environments." It's worth understanding why.


What the demo provides

Core RAG pattern implementation

Retrieval-Augmented Generation architecture:

User query → Azure AI Search (retrieval) → Azure OpenAI GPT (generation) → Response with citations

Not just chat:

  • Multi-turn conversational interface
  • Q&A mode for single questions
  • Citation rendering showing source documents
  • Thought process visibility (chain-of-thought reasoning)

Multimodal capabilities:

  • Speech input (Azure Speech Service)
  • Speech output (text-to-speech)
  • GPT-4 Vision for image analysis in documents

Technical stack

Backend:

  • Python (primary implementation)
  • Alternative implementations: JavaScript, .NET, Java

Azure services:

  • Azure AI Search (document indexing, vector search, hybrid search)
  • Azure OpenAI Service (GPT models for generation)
  • Azure Container Apps or Azure App Service (hosting)
  • Optional: Azure Speech Service, Azure Document Intelligence

Developer tooling:

  • Azure Developer CLI (azd) for one-command deployment
  • Bicep for infrastructure-as-code
  • Dev containers for consistent development environments

Deployment options:

  • GitHub Codespaces (instant cloud development)
  • VS Code Dev Containers (local development with container)
  • Local environment (Python 3.10-3.14, Node.js 20+)

The "azd up" magic

Single command deployment:

azd auth login
azd env new
azd up

What happens:

  1. Provisions Azure resources (AI Search, OpenAI, Container Apps)
  2. Deploys application code
  3. Builds search index from sample documents
  4. Returns URL to running application

Time to working demo: ~10-15 minutes

Why this matters:

Cuts the journey from "interested in RAG" to "working RAG application" from days to minutes. Critical for developer adoption.


The fictitious company pattern

Contoso employee benefits scenario

Use case:

Employees ask questions about:

  • Benefits and compensation
  • Internal policies
  • Job descriptions and roles

Sample documents:

Benefits handbooks, policy PDFs, role descriptions—realistic enterprise content types.

Example queries:

  • "What is the company policy on remote work?"
  • "How much vacation time do I get?"
  • "What are the qualifications for Senior Engineer role?"

Why this scenario:

HR and policy documents are:

  • Common enterprise use case
  • Manageable document corpus for demo
  • Relatable to anyone who's worked at a mid-to-large company
  • Low risk (fictional data, no real PII/PHI)

What Contoso hides

Real enterprise complexity not demonstrated:

Document heterogeneity: Contoso docs are clean PDFs and text. Real enterprises have scanned images, handwritten notes, legacy formats, inconsistent structures.

Access control: All Contoso employees see all documents. Real enterprises need row-level security, role-based access, data classification.

Compliance: No GDPR, HIPAA, SOC 2 considerations. No audit trails, data residency requirements, or retention policies.

Scale: Contoso has hundreds of documents. Real enterprises have millions, with daily updates.

Multi-tenancy: Single tenant demo. Real SaaS providers need customer data isolation.


Architecture deep dive

Classic RAG flow

Step 1: Document ingestion

Documents → Azure Document Intelligence → Chunks with embeddings → Azure AI Search index

Chunking strategy:

  • Split documents into manageable segments (default: 1000 tokens with 100 token overlap)
  • Preserve semantic boundaries (paragraphs, sections)
  • Generate embeddings for each chunk
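
A minimal sketch of the overlap idea (illustrative only, not the repo's splitter, which also respects sentence and section boundaries; the tokenizer choice and defaults here are assumptions):

import tiktoken  # BPE tokenizer used by OpenAI models

def chunk_text(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max_tokens - overlap
    # Slide a window over the token stream; consecutive windows share
    # `overlap` tokens so no fact is stranded at a chunk boundary.
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), step)]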

Step 2: Query processing

User query → Generate embedding → Vector search + keyword search (hybrid)

Hybrid search:

  • Vector search: Semantic similarity using embeddings
  • Keyword search: Traditional full-text search
  • Combined ranking: Best of both approaches
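
Roughly what a hybrid query looks like with the azure-search-documents SDK (a sketch, not the demo's exact code: the index fields, key auth, and the embed helper are assumptions, and the demo itself uses managed identity rather than keys):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="docs",
    credential=AzureKeyCredential("<key>"))

user_query = "What is the company policy on remote work?"
query_embedding = embed(user_query)  # hypothetical helper returning list[float]

results = search_client.search(
    search_text=user_query,                    # keyword (BM25) leg
    vector_queries=[VectorizedQuery(
        vector=query_embedding,                # semantic leg, same query
        k_nearest_neighbors=50,
        fields="embedding")],                  # assumed vector field name
    top=5)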

Step 3: Context assembly

Top K retrieved chunks → Formatted as context → Injected into LLM prompt
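
Continuing the hybrid search sketch above, context assembly can be as simple as concatenating the chunks tagged with their source (field names are assumptions):

context = "\n\n".join(
    f"Source: {doc['sourcefile']}\n{doc['content']}" for doc in results)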

Step 4: Generation

Azure OpenAI GPT receives:

  • User query
  • Retrieved document context
  • System prompt defining behavior

Generates response grounded in provided context.
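
A hedged sketch of the generation call with the openai SDK against Azure OpenAI, reusing user_query and context from the sketches above (deployment name, API version, and prompt wording are assumptions):

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-02-01")

SYSTEM_PROMPT = (
    "Answer ONLY using the sources below. Cite each fact with its "
    "source file in square brackets. If the answer is not in the "
    "sources, say you don't know.")

response = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    temperature=0.2,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{user_query}\n\nSources:\n{context}"}])
answer = response.choices[0].message.content

The low temperature and the "say you don't know" instruction are what keep responses grounded in retrieved context rather than the model's parametric memory.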

Step 5: Citation rendering

Response includes references to source chunks, rendered as clickable citations in UI.

Modern RAG (agentic retrieval)

What changed:

LLM acts as query planner, breaking complex questions into subqueries.

Example:

User: "Compare remote work policies before and after 2020"

Classic RAG: Single search for "remote work policies 2020"

Agentic RAG:

  1. LLM generates subqueries: "remote work policy before 2020", "remote work policy after 2020"
  2. Executes searches in parallel
  3. LLM synthesizes comparison from multiple result sets
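
One way to sketch this pattern (not the demo's implementation: run_search is a hypothetical retrieval helper returning chunks as dicts, the planner prompt is illustrative, and JSON mode assumes a model/API version that supports it):

import json
from concurrent.futures import ThreadPoolExecutor

def agentic_answer(question: str) -> str:
    # Step 1: ask the model to plan subqueries as JSON.
    plan = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            'Return JSON {"queries": [...]} listing the search queries '
            f'needed to answer: {question}'}])
    subqueries = json.loads(plan.choices[0].message.content)["queries"]

    # Step 2: run the searches in parallel.
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(run_search, subqueries))

    # Step 3: one synthesis call over the merged result sets.
    merged = "\n\n".join(doc["content"] for rs in result_sets for doc in rs)
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"{question}\n\nSources:\n{merged}"}])
    return final.choices[0].message.content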

Advantage:

Better handling of multi-faceted questions requiring information synthesis across document corpus.


What the demo does well

Developer experience optimization

One-command deployment works:

azd up genuinely provisions everything and results in a working application. Not marketing; it's actually functional.

Local development simplified:

Dev containers ensure consistent Python/Node versions, dependencies, environment configuration. No "works on my machine" problems.

Clear documentation:

README walks through deployment, configuration, troubleshooting. Code comments explain RAG pattern implementation.

RAG pattern education

Visibility into reasoning:

UI shows retrieved documents, chunk text, citations. Developers see how retrieval affects generation quality.

Experimentation-friendly:

Easy to swap models, adjust chunk sizes, tune search parameters. Learn by modifying and observing results.

Multi-language implementations:

Python, JavaScript, .NET, Java versions teach same patterns in different ecosystems.

Azure integration showcase

Service orchestration:

Demonstrates how Azure AI Search, OpenAI, Container Apps work together. Infrastructure-as-code (Bicep) shows production provisioning patterns.

Managed identity:

Uses Azure AD authentication between services. No hardcoded keys in code (critical security pattern).

Monitoring integration:

Application Insights traces requests, errors, performance. Shows telemetry integration from start.


The production gap

What Microsoft explicitly warns against

From the repository:

"This sample is designed to be a starting point only. We strongly advise customers not to make demo code part of production environments without implementing additional security features."

Why this warning exists:

1. Authentication is optional

Demo ships with no authentication. Anyone with the URL can access it. Production requires:

  • User authentication (Azure AD, OAuth)
  • Authorization (who can see which documents)
  • Audit logging (who accessed what, when)

2. No document-level security

All indexed documents accessible to all users. Production needs:

  • Row-level security (users see only documents they're authorized for)
  • Dynamic filtering based on user identity
  • Security trimming in search results
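
Azure AI Search supports security trimming via OData filters over a permissions field. A sketch, reusing the search_client from the earlier example and assuming each chunk is indexed with a "groups" collection of allowed Azure AD group IDs (the field name is an assumption, and the group list must come from the validated server-side token, never from client input):

# Group IDs extracted server-side from the user's validated token claims.
user_groups = ["b67cbf9e-0000-0000-0000-000000000001",
               "4fe5a1b2-0000-0000-0000-000000000002"]

results = search_client.search(
    search_text=user_query,
    # Only return chunks whose "groups" field intersects the user's groups.
    filter="groups/any(g: search.in(g, '{}'))".format(",".join(user_groups)),
    top=5)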

3. Minimal input validation

Demo trusts user input. Production requires:

  • Prompt injection defense
  • Input sanitization
  • Rate limiting per user
  • Cost controls (token usage caps)

4. No PII/PHI handling

Contoso documents contain no real sensitive data. Production with actual PII/PHI requires:

  • Data classification and labeling
  • Encryption at rest and in transit
  • DLP (Data Loss Prevention) policies
  • Compliance controls (GDPR, HIPAA, etc.)

5. Limited error handling

Demo shows happy path. Production needs:

  • Graceful degradation when services unavailable
  • Retry logic with exponential backoff
  • Circuit breakers for failing dependencies
  • User-friendly error messages (not stack traces)
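
For the retry piece, a common approach is tenacity around the OpenAI call (a sketch, not the demo's code; the exception types are from the openai v1 SDK, and the client is the one from the generation sketch earlier):

from openai import APIConnectionError, RateLimitError
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)

@retry(retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
       wait=wait_exponential(multiplier=1, min=1, max=30),  # 1s, 2s, 4s... capped at 30s
       stop=stop_after_attempt(5))
def generate(messages: list[dict]) -> str:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content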

Microsoft's guidance:

Use the chat-with-your-data-solution-accelerator instead. It includes:

  • Production security controls
  • Multi-tenant isolation
  • Advanced monitoring and observability
  • Enterprise-grade deployment patterns
  • Best practices for compliance

Or:

Follow the Azure OpenAI Landing Zone reference architecture for:

  • Network isolation (VNets, private endpoints)
  • WAF and API Management for API security
  • Key Vault for secrets management
  • RBAC and managed identities throughout
  • DR and backup strategies

Real-world customization challenges

Document diversity problem

Demo assumption: Clean PDFs and text files

Reality:

Enterprises have:

  • Scanned images requiring OCR
  • Tables and charts requiring layout understanding
  • Multi-language documents requiring translation
  • Legacy formats (WordPerfect, Lotus Notes)
  • Email threads with attachments
  • SharePoint sites with permissions inheritance

Solution complexity:

Azure Document Intelligence helps, but requires:

  • Custom preprocessing pipelines
  • Format-specific handling
  • Quality validation (OCR errors)
  • Metadata extraction and preservation

Retrieval quality tuning

Demo uses defaults:

Default chunk size, default embedding model, default search parameters.

Production requires experimentation:

Chunk size optimization:

  • Too small: Fragments lack context
  • Too large: Dilutes semantic meaning
  • Domain-specific: Legal contracts need different chunking than chat logs

Embedding model selection:

  • text-embedding-ada-002 vs. text-embedding-3-large
  • Domain-specific fine-tuning for specialized vocabulary
  • Multilingual embedding models for global enterprises
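
Swapping embedding models is mechanically trivial with the openai SDK (deployment name is an assumption; the hard part is re-indexing the corpus and re-evaluating retrieval quality afterward):

emb = client.embeddings.create(
    model="text-embedding-3-large",   # your embedding deployment name
    input=["What is the company policy on remote work?"])
vector = emb.data[0].embedding        # list[float], one entry per input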

Search parameter tuning:

  • Hybrid search weighting (vector vs. keyword)
  • Semantic reranking thresholds
  • Top K value (how many chunks to retrieve)
  • Relevance scoring adjustments

Measurement:

You need a golden question set with known correct answers. Measure precision, recall, and NDCG. Iterate on configuration. This is ongoing work, not a one-time setup; a minimal sketch follows.
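
A sketch of recall@k over a golden set (run_search is the hypothetical retrieval helper from earlier; the document IDs are made up):

golden = [
    {"q": "How much vacation time do I get?",
     "relevant": {"benefits-handbook-12", "pto-policy-03"}},
]

def recall_at_k(golden: list[dict], k: int = 5) -> float:
    # Fraction of questions where at least one known-relevant chunk
    # appears in the top-k retrieved results.
    hits = sum(
        1 for item in golden
        if {doc["id"] for doc in run_search(item["q"])[:k]} & item["relevant"])
    return hits / len(golden)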

Cost management at scale

Demo cost: Negligible (sample documents, low query volume)

Production cost drivers:

1. Azure AI Search:

  • Index size (storage cost scales with document volume)
  • Query volume (pay per search query)
  • Replica count for availability

2. Azure OpenAI:

  • Prompt tokens (retrieved chunks add significant context)
  • Completion tokens (generated responses)
  • Model tier (GPT-4 vs. GPT-3.5 pricing difference)

3. Azure Document Intelligence:

  • Page processing charges for document ingestion
  • Volume scales with document corpus size

Cost optimization strategies:

Caching:

  • Cache frequent queries and responses
  • TTL based on document update frequency
  • Reduces redundant LLM calls
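
An in-memory sketch of the idea (production would typically use Redis; the TTL value is an assumption to tune against document update frequency):

import hashlib
import time

CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600

def _key(query: str) -> str:
    # Normalize so trivially different phrasings of the same query collide.
    return hashlib.sha256(query.lower().strip().encode()).hexdigest()

def get_cached(query: str) -> str | None:
    entry = CACHE.get(_key(query))
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]
    return None

def put_cached(query: str, answer: str) -> None:
    CACHE[_key(query)] = (time.time(), answer)

Exact-match keys only catch verbatim repeats; semantic caching (keying on query embeddings) catches paraphrases, at the cost of more machinery.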

Query routing:

  • Simple questions → GPT-3.5
  • Complex reasoning → GPT-4
  • Threshold-based routing
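
A deliberately crude routing heuristic to illustrate the shape (deployment names and thresholds are assumptions; production routing might use a small classifier instead):

def pick_model(question: str) -> str:
    complex_markers = ("compare", "why", "explain", "difference", "versus")
    if (len(question.split()) > 25
            or any(m in question.lower() for m in complex_markers)):
        return "gpt-4o"         # complex reasoning, higher cost
    return "gpt-35-turbo"       # simple lookups, lower cost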

Chunk deduplication:

  • Don't retrieve duplicate chunks
  • Remove redundant context before LLM call
  • Reduces prompt token costs
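
Exact-duplicate removal is a few lines; near-duplicate detection (e.g. by embedding similarity) is more involved. A sketch:

import hashlib

def dedupe(chunks: list[dict]) -> list[dict]:
    seen: set[str] = set()
    out = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["content"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(chunk)
    return out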

Monitoring:

  • Cost per query tracking
  • Anomaly detection for runaway costs
  • Alerting when thresholds exceeded

Advanced patterns not in demo

Session management and personalization

Demo:

Stateless conversation. Each query independent.

Production needs:

Conversation history:

  • Store chat history per user
  • Inject relevant prior context into prompts
  • Manage context window limits (can't include infinite history)
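
A sketch of token-budget pruning, keeping the most recent turns that fit (the budget and tokenizer are assumptions):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def prune_history(history: list[dict], budget: int = 3000) -> list[dict]:
    kept: list[dict] = []
    used = 0
    # Walk newest-to-oldest; stop once the token budget is exhausted.
    for msg in reversed(history):
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))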

Personalization:

  • User preferences (response style, verbosity)
  • Role-based content filtering
  • Previous interaction learning

Implementation:

Azure Cosmos DB or Azure SQL for conversation state. Redis for session caching. Logic to prune old history based on relevance.

Hybrid cloud-local deployment

Demo:

Cloud-only deployment.

Enterprise reality:

Some data cannot leave premises (regulatory, contractual).

Hybrid pattern:

On-premises:

  • Sensitive document storage
  • Document processing and chunking
  • Embedding generation

Cloud:

  • Azure OpenAI for generation
  • Orchestration logic
  • UI hosting

Challenge:

Network latency between on-prem retrieval and cloud generation. Bandwidth costs for large context transmission.

Multi-tenant isolation

Demo:

Single tenant (one Contoso company).

SaaS reality:

Thousands of customers, each with own document corpus.

Isolation options:

1. Index-per-tenant:

  • Separate Azure AI Search index per customer
  • Complete data isolation
  • Scales poorly (Azure limits on index count)

2. Shared index with filtering:

  • Single index, documents tagged with tenant ID
  • Filter queries by tenant ID
  • Risk of filter bypass vulnerabilities
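
The filter itself is one line; the discipline is deriving the tenant ID from the authenticated session server-side so a client can never supply it. A sketch, reusing the search_client from earlier (the field name and helper are assumptions):

tenant_id = get_authenticated_tenant_id()    # hypothetical: from server-side auth context

results = search_client.search(
    search_text=user_query,
    filter=f"tenant_id eq '{tenant_id}'",    # assumed tag field on every chunk
    top=5)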

3. Search service per tenant:

  • Complete service isolation
  • Expensive at scale
  • Operational complexity

Tradeoff:

Cost vs. isolation vs. operational complexity. No perfect answer.


What developers actually do with this demo

Common customization paths

1. Replace Contoso documents with own data

Most immediate step. Upload the company's actual PDFs, re-run indexing, test retrieval quality.

Lesson learned: Retrieval quality often poor on first try. Leads to chunking experiments.

2. Add authentication

Integrate Azure AD. Restrict access to authenticated users.

Lesson learned: Row-level security harder than expected. Demo doesn't show document ACL patterns.

3. Customize UI

Replace generic ChatGPT interface with company branding, specific workflows.

Lesson learned: Frontend is React. Requires JavaScript skills beyond Python backend.

4. Integrate with enterprise systems

Connect to SharePoint, Confluence, internal wikis as document sources.

Lesson learned: Each system has different API, permissions model, update patterns. Significant integration work.

Where projects get stuck

Retrieval quality plateau:

Developers tune parameters but hit ceiling. Need domain experts to evaluate answer quality, identify failure patterns. Requires systematic evaluation framework.

Security implementation:

Adding authentication is easy. Implementing proper authorization (who sees what) is complex. Requires understanding Azure RBAC and custom security trimming logic.

Cost runaway:

Initial testing cheap. Production query volume reveals costs. Scramble to implement caching, optimize prompts, reduce token usage.

Production deployment:

Demo uses Container Apps. Enterprise might require App Service, AKS, or on-prem Kubernetes. Adapting infrastructure-as-code non-trivial.


Chat-with-your-data solution accelerator

Difference from demo:

Production-focused from start. Includes:

  • Security controls
  • Multi-source connectors (SharePoint, blob, SQL)
  • Admin interface for configuration
  • Advanced telemetry

Tradeoff:

More complex, harder to understand internals. Less educational, more operational.

When to use:

When goal is production deployment, not learning RAG patterns.

Semantic Kernel integration

Pattern:

Use demo's retrieval logic, but Semantic Kernel for orchestration and agent capabilities.

Advantage:

Extends RAG with function calling, plugins, multi-agent patterns.

Complexity:

Adds orchestration layer. Useful for complex workflows beyond simple Q&A.

LangChain/LlamaIndex alternatives

Community preference:

Some developers prefer LangChain or LlamaIndex over Microsoft-specific patterns.

Compatibility:

Azure AI Search integrates with both frameworks. Can use demo's indexing strategy with different orchestration.

Consideration:

Vendor lock-in vs. ecosystem flexibility tradeoff.


The honest assessment

What the demo accomplishes

Lowers barrier to RAG experimentation:

azd up eliminating setup friction is a genuine achievement. Developers go from zero to working RAG in minutes, not days.

Teaches core pattern clearly:

Retrieval → Context → Generation flow well-demonstrated. Code structure clean, educational.

Azure service integration showcase:

Shows how AI Search, OpenAI, Container Apps work together. Infrastructure-as-code valuable reference.

Multi-language implementations:

Python, JavaScript, .NET, Java versions help developers in their preferred ecosystem.

What the demo doesn't prepare you for

Production security requirements:

Gap between "no auth" demo and enterprise security substantial. Demo doesn't show the hard parts.

Retrieval quality optimization:

Demo uses defaults. Real-world retrieval tuning is iterative, domain-specific, requires evaluation framework.

Cost management at scale:

Demo cost negligible. Production cost optimization requires architecture changes, not configuration tweaks.

Document processing complexity:

Clean PDF assumption breaks on real enterprise documents. Preprocessing becomes significant project.

Operational concerns:

Monitoring, alerting, incident response, DR/backup—not addressed. Production requires operational maturity.


Production readiness checklist

Before deploying RAG system built on this demo to production:

Security

  • [ ] User authentication implemented (Azure AD, OAuth)
  • [ ] Document-level authorization (who can see what)
  • [ ] Prompt injection defenses
  • [ ] Input validation and sanitization
  • [ ] Rate limiting per user
  • [ ] Audit logging of all access
  • [ ] PII/PHI handling controls
  • [ ] Data encryption at rest and in transit

Reliability

  • [ ] Error handling and graceful degradation
  • [ ] Retry logic with exponential backoff
  • [ ] Circuit breakers for dependencies
  • [ ] Health check endpoints
  • [ ] Disaster recovery plan
  • [ ] Backup and restore procedures
  • [ ] Load testing completed

Observability

  • [ ] Application Insights integration
  • [ ] Custom metrics for retrieval quality
  • [ ] Cost tracking per query
  • [ ] Alerting for anomalies
  • [ ] Dashboard for operational metrics
  • [ ] Log aggregation and search

Performance

  • [ ] Caching strategy implemented
  • [ ] Query optimization based on load testing
  • [ ] CDN for static assets
  • [ ] Database query optimization
  • [ ] Auto-scaling configured

Compliance

  • [ ] GDPR/CCPA compliance controls
  • [ ] Data retention policies
  • [ ] Right to be forgotten implementation
  • [ ] Data residency requirements met
  • [ ] Compliance audit trail
  • [ ] Legal review completed

Cost Management

  • [ ] Cost per query tracking
  • [ ] Budget alerts configured
  • [ ] Query routing optimization
  • [ ] Token usage optimization
  • [ ] Reserved capacity vs. pay-as-you-go analysis

Learn more

Official repository:

https://github.com/Azure-Samples/azure-search-openai-demo

Production-ready alternatives:

https://github.com/microsoft/chat-with-your-data-solution-accelerator

Microsoft Learn:

https://learn.microsoft.com/azure/search/retrieval-augmented-generation-overview
Related architectures:

  • Azure AI Document Intelligence for document processing
  • Azure Semantic Kernel for orchestration
  • Azure API Management for API governance
