Azure AI Platform
Azure Search OpenAI Demo: The RAG reference implementation everyone copies
The azure-search-openai-demo repository is Microsoft's canonical example of building ChatGPT-like experiences over private enterprise data using Retrieval-Augmented Generation (RAG). It's also a masterclass in the gap between sample code and production readiness.
This Python-based demo showcases Azure AI Search for document retrieval and Azure OpenAI for generation, deployed via Azure Container Apps. Using the fictitious company "Contoso," it demonstrates employees asking questions about benefits and policies. Microsoft warns explicitly: "strongly advise customers not to make demo code part of production environments." Worth understanding why.
What the demo provides
Core RAG pattern implementation
Retrieval-Augmented Generation architecture:
User query → Azure AI Search (retrieval) → Azure OpenAI GPT (generation) → Response with citations
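The flow above can be sketched end to end with stubs standing in for the Azure services. All names here, and the toy keyword scorer, are illustrative, not the demo's actual code:

```python
# Minimal sketch of the classic RAG flow with stubbed retrieval and
# generation, so the control flow is visible without any Azure services.

def retrieve(query: str, corpus: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Toy keyword retrieval: rank documents by query-term overlap."""
    terms = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(terms & set(kv[1].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(query: str, sources: list[tuple[str, str]]) -> str:
    """Stand-in for the Azure OpenAI call: echo an answer with citations."""
    citations = ", ".join(f"[{doc_id}]" for doc_id, _ in sources)
    return f"Answer to {query!r} grounded in {citations}"

corpus = {
    "benefits.pdf": "vacation time and health benefits policy",
    "remote.pdf": "remote work policy effective 2020",
}
response = generate("remote work policy", retrieve("remote work policy", corpus))
```

The real implementation swaps `retrieve` for an Azure AI Search call and `generate` for a chat completion, but the shape of the pipeline is the same.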
Not just chat:
- Multi-turn conversational interface
- Q&A mode for single questions
- Citation rendering showing source documents
- Thought process visibility (chain-of-thought reasoning)
Multimodal capabilities:
- Speech input (Azure Speech Service)
- Speech output (text-to-speech)
- GPT-4 Vision for image analysis in documents
Technical stack
Backend:
- Python (primary implementation)
- Alternative implementations: JavaScript, .NET, Java
Azure services:
- Azure AI Search (document indexing, vector search, hybrid search)
- Azure OpenAI Service (GPT models for generation)
- Azure Container Apps or Azure App Service (hosting)
- Optional: Azure Speech Service, Azure Document Intelligence
Developer tooling:
- Azure Developer CLI (azd) for one-command deployment
- Bicep for infrastructure-as-code
- Dev containers for consistent development environments
Deployment options:
- GitHub Codespaces (instant cloud development)
- VS Code Dev Containers (local development with container)
- Local environment (Python 3.10-3.14, Node.js 20+)
The "azd up" magic
Single command deployment:
azd auth login
azd env new
azd up
What happens:
- Provisions Azure resources (AI Search, OpenAI, Container Apps)
- Deploys application code
- Builds search index from sample documents
- Returns URL to running application
Time to working demo: ~10-15 minutes
Why this matters:
Cuts the path from "interested in RAG" to "working RAG application" from days to minutes. Critical for developer adoption.
The fictitious company pattern
Contoso employee benefits scenario
Use case:
Employees ask questions about:
- Benefits and compensation
- Internal policies
- Job descriptions and roles
Sample documents:
Benefits handbooks, policy PDFs, role descriptions—realistic enterprise content types.
Example queries:
- "What is the company policy on remote work?"
- "How much vacation time do I get?"
- "What are the qualifications for Senior Engineer role?"
Why this scenario:
HR and policy documents are:
- Common enterprise use case
- Manageable document corpus for demo
- Relatable to anyone who's worked at a mid-size or large company
- Low risk (fictional data, no real PII/PHI)
What Contoso hides
Real enterprise complexity not demonstrated:
Document heterogeneity: Contoso docs are clean PDFs and text. Real enterprises have scanned images, handwritten notes, legacy formats, inconsistent structures.
Access control: All Contoso employees see all documents. Real enterprises need row-level security, role-based access, data classification.
Compliance: No GDPR, HIPAA, SOC 2 considerations. No audit trails, data residency requirements, or retention policies.
Scale: Contoso has hundreds of documents. Real enterprises have millions, with daily updates.
Multi-tenancy: Single tenant demo. Real SaaS providers need customer data isolation.
Architecture deep dive
Classic RAG flow
Step 1: Document ingestion
Documents → Azure Document Intelligence → Chunks with embeddings → Azure AI Search index
Chunking strategy:
- Split documents into manageable segments (default: 1000 tokens with 100 token overlap)
- Preserve semantic boundaries (paragraphs, sections)
- Generate embeddings for each chunk
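A minimal version of the fixed-size-with-overlap splitter. The demo's real splitter also respects sentence and section boundaries, which this sketch omits:

```python
def chunk_tokens(tokens: list[str], size: int = 1000, overlap: int = 100) -> list[list[str]]:
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` tokens of its predecessor so context isn't cut mid-thought."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
# Chunks cover tokens 0-999, 900-1899, and 1800-2499
```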
Step 2: Query processing
User query → Generate embedding → Vector search + keyword search (hybrid)
Hybrid search:
- Vector search: Semantic similarity using embeddings
- Keyword search: Traditional full-text search
- Combined ranking: Best of both approaches
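Azure AI Search merges the vector and keyword result lists with Reciprocal Rank Fusion (RRF). A toy version of that fusion step:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per
    document; documents ranked well by both lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc2", "doc1", "doc3"]
keyword_hits = ["doc1", "doc2", "doc4"]
fused = rrf_fuse([vector_hits, keyword_hits])
# doc1 and doc2 appear in both lists, so both outrank doc3 and doc4
```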
Step 3: Context assembly
Top K retrieved chunks → Formatted as context → Injected into LLM prompt
Step 4: Generation
Azure OpenAI GPT receives:
- User query
- Retrieved document context
- System prompt defining behavior
Generates response grounded in provided context.
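Steps 3 and 4 together: assembling retrieved chunks into the chat payload. The source numbering and message layout here are illustrative, not the demo's exact prompt:

```python
def build_messages(query: str, chunks: list[dict], system_prompt: str) -> list[dict]:
    """Assemble the chat payload: system prompt, then the user query
    with retrieved chunks appended as numbered sources the model can cite."""
    sources = "\n".join(
        f"[{i}] {c['title']}: {c['content']}" for i, c in enumerate(chunks, start=1)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{query}\n\nSources:\n{sources}"},
    ]

messages = build_messages(
    "How much vacation time do I get?",
    [{"title": "benefits.pdf", "content": "Employees accrue 20 days per year."}],
    "Answer only from the provided sources and cite them like [1].",
)
```

This `messages` list is what gets passed to the chat completion call; the citation markers the model emits map back to the numbered sources.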
Step 5: Citation rendering
Response includes references to source chunks, rendered as clickable citations in UI.
Modern RAG (agentic retrieval)
What changed:
The LLM acts as a query planner, breaking complex questions into subqueries.
Example:
User: "Compare remote work policies before and after 2020"
Classic RAG: Single search for "remote work policies 2020"
Agentic RAG:
- LLM generates subqueries: "remote work policy before 2020", "remote work policy after 2020"
- Executes searches in parallel
- LLM synthesizes comparison from multiple result sets
Advantage:
Better handling of multi-faceted questions requiring information synthesis across document corpus.
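The decomposition step can be sketched with a rule-based planner standing in for the LLM (in the real agentic pattern, the model itself produces the subqueries):

```python
def plan_subqueries(query: str) -> list[str]:
    """Toy rule-based planner standing in for an LLM query planner:
    split a 'before and after' comparison into two targeted subqueries."""
    if " before and after " in query:
        topic, _, pivot = query.partition(" before and after ")
        return [f"{topic} before {pivot}", f"{topic} after {pivot}"]
    return [query]

def agentic_search(query: str, search) -> dict[str, list[str]]:
    """Run each subquery against the search function and keep results per subquery."""
    return {sub: search(sub) for sub in plan_subqueries(query)}

fake_index = lambda q: [f"hit for {q!r}"]  # stand-in for Azure AI Search
results = agentic_search("remote work policies before and after 2020", fake_index)
```

A final synthesis step (not shown) would hand all result sets back to the LLM to write the comparison.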
What the demo does well
Developer experience optimization
One-command deployment works:
azd up genuinely provisions everything and results in a working application. Not marketing; actually functional.
Local development simplified:
Dev containers ensure consistent Python/Node versions, dependencies, environment configuration. No "works on my machine" problems.
Clear documentation:
README walks through deployment, configuration, troubleshooting. Code comments explain RAG pattern implementation.
RAG pattern education
Visibility into reasoning:
UI shows retrieved documents, chunk text, citations. Developers see how retrieval affects generation quality.
Experimentation-friendly:
Easy to swap models, adjust chunk sizes, tune search parameters. Learn by modifying and observing results.
Multi-language implementations:
Python, JavaScript, .NET, Java versions teach same patterns in different ecosystems.
Azure integration showcase
Service orchestration:
Demonstrates how Azure AI Search, OpenAI, Container Apps work together. Infrastructure-as-code (Bicep) shows production provisioning patterns.
Managed identity:
Uses Azure AD authentication between services. No hardcoded keys in code (critical security pattern).
Monitoring integration:
Application Insights traces requests, errors, performance. Shows telemetry integration from start.
The production gap
What Microsoft explicitly warns against
From the repository:
"This sample is designed to be a starting point only. We strongly advise customers not to make demo code part of production environments without implementing additional security features."
Why this warning exists:
1. Authentication is optional
The demo ships with no authentication. Anyone with the URL can access it. Production requires:
- User authentication (Azure AD, OAuth)
- Authorization (who can see which documents)
- Audit logging (who accessed what, when)
2. No document-level security
All indexed documents accessible to all users. Production needs:
- Row-level security (users see only documents they're authorized for)
- Dynamic filtering based on user identity
- Security trimming in search results
3. Minimal input validation
Demo trusts user input. Production requires:
- Prompt injection defense
- Input sanitization
- Rate limiting per user
- Cost controls (token usage caps)
4. No PII/PHI handling
Contoso documents contain no real sensitive data. Production with actual PII/PHI requires:
- Data classification and labeling
- Encryption at rest and in transit
- DLP (Data Loss Prevention) policies
- Compliance controls (GDPR, HIPAA, etc.)
5. Limited error handling
Demo shows happy path. Production needs:
- Graceful degradation when services unavailable
- Retry logic with exponential backoff
- Circuit breakers for failing dependencies
- User-friendly error messages (not stack traces)
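A generic sketch of retry-with-backoff around a flaky dependency. The Azure SDKs ship their own retry policies, so in practice you'd configure those; production code would also distinguish retryable from fatal errors:

```python
import random
import time

def with_retries(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff plus jitter; re-raise
    the last error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; jitter avoids thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

attempts = {"n": 0}
def flaky():
    """Fails twice, then succeeds - simulates a transient outage."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = with_retries(flaky, base_delay=0.0)  # zero delay keeps the demo fast
```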
Recommended production path
Microsoft's guidance:
Use chat-with-your-data-solution-accelerator instead. It includes:
- Production security controls
- Multi-tenant isolation
- Advanced monitoring and observability
- Enterprise-grade deployment patterns
- Best practices for compliance
Or:
Follow Azure OpenAI Landing Zone reference architecture for:
- Network isolation (VNets, private endpoints)
- WAF and API Management for API security
- Key Vault for secrets management
- RBAC and managed identities throughout
- DR and backup strategies
Real-world customization challenges
Document diversity problem
Demo assumption: Clean PDFs and text files
Reality:
Enterprises have:
- Scanned images requiring OCR
- Tables and charts requiring layout understanding
- Multi-language documents requiring translation
- Legacy formats (WordPerfect, Lotus Notes)
- Email threads with attachments
- SharePoint sites with permissions inheritance
Solution complexity:
Azure Document Intelligence helps, but requires:
- Custom preprocessing pipelines
- Format-specific handling
- Quality validation (OCR errors)
- Metadata extraction and preservation
Retrieval quality tuning
Demo uses defaults:
Default chunk size, default embedding model, default search parameters.
Production requires experimentation:
Chunk size optimization:
- Too small: Fragments lack context
- Too large: Dilutes semantic meaning
- Domain-specific: Legal contracts need different chunking than chat logs
Embedding model selection:
- text-embedding-ada-002 vs. text-embedding-3-large
- Domain-specific fine-tuning for specialized vocabulary
- Multilingual embedding models for global enterprises
Search parameter tuning:
- Hybrid search weighting (vector vs. keyword)
- Semantic reranking thresholds
- Top K value (how many chunks to retrieve)
- Relevance scoring adjustments
Measurement:
You need a golden question set with known correct answers. Measure precision, recall, and NDCG. Iterate on configuration. This is ongoing work, not one-time setup.
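The per-question metrics are simple to compute; the hard part is building the golden set. A sketch for one question (average these over the whole set to track quality across configuration changes):

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@k: fraction of the top-k results that are relevant.
    Recall@k: fraction of all relevant documents found in the top k."""
    top = retrieved[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# One golden question: the judged-relevant docs are d1 and d2.
p, r = precision_recall_at_k(["d1", "d3", "d2"], {"d1", "d2"}, k=2)
# p = 0.5 (one of the top-2 is relevant), r = 0.5 (one of two relevant found)
```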
Cost management at scale
Demo cost: Negligible (sample documents, low query volume)
Production cost drivers:
1. Azure AI Search:
- Index size (storage cost scales with document volume)
- Query volume (pay per search query)
- Replica count for availability
2. Azure OpenAI:
- Prompt tokens (retrieved chunks add significant context)
- Completion tokens (generated responses)
- Model tier (GPT-4 vs. GPT-3.5 pricing difference)
3. Azure Document Intelligence:
- Page processing charges for document ingestion
- Volume scales with document corpus size
Cost optimization strategies:
Caching:
- Cache frequent queries and responses
- TTL based on document update frequency
- Reduces redundant LLM calls
Query routing:
- Simple questions → GPT-3.5
- Complex reasoning → GPT-4
- Threshold-based routing
Chunk deduplication:
- Don't retrieve duplicate chunks
- Remove redundant context before LLM call
- Reduces prompt token costs
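Exact-match deduplication is the easy version, sketched below; near-duplicate detection via shingling or embedding similarity is the production upgrade:

```python
def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Drop chunks whose normalized text already appeared, preserving order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        key = chunk["content"].strip().lower()  # crude normalization
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

chunks = [
    {"id": "a", "content": "Remote work requires manager approval."},
    {"id": "b", "content": "remote work requires manager approval."},
    {"id": "c", "content": "Vacation accrues monthly."},
]
unique = dedupe_chunks(chunks)
# Only "a" and "c" survive; "b" duplicates "a" after normalization
```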
Monitoring:
- Cost per query tracking
- Anomaly detection for runaway costs
- Alerting when thresholds exceeded
Advanced patterns not in demo
Session management and personalization
Demo:
Stateless conversation. Each query independent.
Production needs:
Conversation history:
- Store chat history per user
- Inject relevant prior context into prompts
- Manage context window limits (can't include infinite history)
Personalization:
- User preferences (response style, verbosity)
- Role-based content filtering
- Previous interaction learning
Implementation:
Azure Cosmos DB or Azure SQL for conversation state. Redis for session caching. Logic to prune old history based on relevance.
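The pruning logic can be sketched as a token-budget cut from the oldest turn forward; relevance-based pruning is the smarter production option. The word-count token estimate here is a deliberate simplification (a real system would use the model's tokenizer):

```python
def prune_history(history: list[dict], budget: int, count_tokens) -> list[dict]:
    """Keep the most recent turns that fit the token budget, dropping oldest first."""
    kept: list[dict] = []
    used = 0
    for turn in reversed(history):  # walk newest to oldest
        cost = count_tokens(turn["content"])
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order

rough_count = lambda text: len(text.split())  # crude stand-in for a tokenizer
history = [
    {"role": "user", "content": "one two three four five"},
    {"role": "assistant", "content": "six seven eight"},
    {"role": "user", "content": "nine ten"},
]
pruned = prune_history(history, budget=6, count_tokens=rough_count)
# The oldest turn (5 tokens) doesn't fit; the newer two (3 + 2 tokens) do
```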
Hybrid cloud-local deployment
Demo:
Cloud-only deployment.
Enterprise reality:
Some data cannot leave premises (regulatory, contractual).
Hybrid pattern:
On-premises:
- Sensitive document storage
- Document processing and chunking
- Embedding generation
Cloud:
- Azure OpenAI for generation
- Orchestration logic
- UI hosting
Challenge:
Network latency between on-prem retrieval and cloud generation. Bandwidth costs for large context transmission.
Multi-tenant isolation
Demo:
Single tenant (one Contoso company).
SaaS reality:
Thousands of customers, each with own document corpus.
Isolation options:
1. Index-per-tenant:
- Separate Azure AI Search index per customer
- Complete data isolation
- Scales poorly (Azure limits on index count)
2. Shared index with filtering:
- Single index, documents tagged with tenant ID
- Filter queries by tenant ID
- Risk of filter bypass vulnerabilities
3. Search service per tenant:
- Complete service isolation
- Expensive at scale
- Operational complexity
Tradeoff:
Cost vs. isolation vs. operational complexity. No perfect answer.
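For option 2, the tenant scoping reduces to an OData filter applied server-side on every query path; forgetting it anywhere is the bypass risk. The `tenant_id` field name is an assumed index field, not something the demo defines:

```python
def tenant_filter(tenant_id: str) -> str:
    """Build the OData filter that scopes a shared Azure AI Search index
    to one tenant; reject unexpected id formats as defense in depth."""
    if not tenant_id.isalnum():
        raise ValueError("unexpected tenant id format")
    return f"tenant_id eq '{tenant_id}'"

# The returned string is passed as the `filter` argument of the search call,
# alongside the user's query, on every request without exception.
odata = tenant_filter("contoso123")
```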
What developers actually do with this demo
Common customization paths
1. Replace Contoso documents with own data
Most immediate step. Upload company's actual PDFs, re-run indexing, test retrieval quality.
Lesson learned: Retrieval quality often poor on first try. Leads to chunking experiments.
2. Add authentication
Integrate Azure AD. Restrict access to authenticated users.
Lesson learned: Row-level security harder than expected. Demo doesn't show document ACL patterns.
3. Customize UI
Replace generic ChatGPT interface with company branding, specific workflows.
Lesson learned: Frontend is React. Requires JavaScript skills beyond Python backend.
4. Integrate with enterprise systems
Connect to SharePoint, Confluence, internal wikis as document sources.
Lesson learned: Each system has different API, permissions model, update patterns. Significant integration work.
Where projects get stuck
Retrieval quality plateau:
Developers tune parameters but hit ceiling. Need domain experts to evaluate answer quality, identify failure patterns. Requires systematic evaluation framework.
Security implementation:
Adding authentication easy. Implementing proper authorization (who sees what) complex. Requires understanding Azure RBAC, custom security trimming logic.
Cost runaway:
Initial testing cheap. Production query volume reveals costs. Scramble to implement caching, optimize prompts, reduce token usage.
Production deployment:
Demo uses Container Apps. Enterprise might require App Service, AKS, or on-prem Kubernetes. Adapting infrastructure-as-code non-trivial.
Alternatives and related patterns
Chat-with-your-data solution accelerator
Difference from demo:
Production-focused from start. Includes:
- Security controls
- Multi-source connectors (SharePoint, blob, SQL)
- Admin interface for configuration
- Advanced telemetry
Tradeoff:
More complex, harder to understand internals. Less educational, more operational.
When to use:
When goal is production deployment, not learning RAG patterns.
Semantic Kernel integration
Pattern:
Use demo's retrieval logic, but Semantic Kernel for orchestration and agent capabilities.
Advantage:
Extends RAG with function calling, plugins, multi-agent patterns.
Complexity:
Adds orchestration layer. Useful for complex workflows beyond simple Q&A.
LangChain/LlamaIndex alternatives
Community preference:
Some developers prefer LangChain or LlamaIndex over Microsoft-specific patterns.
Compatibility:
Azure AI Search integrates with both frameworks. Can use demo's indexing strategy with different orchestration.
Consideration:
Vendor lock-in vs. ecosystem flexibility tradeoff.
The honest assessment
What the demo accomplishes
Lowers barrier to RAG experimentation:
azd up eliminating setup friction is a genuine achievement. Developers go from zero to working RAG in minutes, not days.
Teaches core pattern clearly:
Retrieval → Context → Generation flow well-demonstrated. Code structure clean, educational.
Azure service integration showcase:
Shows how AI Search, OpenAI, Container Apps work together. Infrastructure-as-code valuable reference.
Multi-language implementations:
Python, JavaScript, .NET, Java versions help developers in their preferred ecosystem.
What the demo doesn't prepare you for
Production security requirements:
Gap between "no auth" demo and enterprise security substantial. Demo doesn't show the hard parts.
Retrieval quality optimization:
Demo uses defaults. Real-world retrieval tuning is iterative, domain-specific, requires evaluation framework.
Cost management at scale:
Demo cost negligible. Production cost optimization requires architecture changes, not configuration tweaks.
Document processing complexity:
Clean PDF assumption breaks on real enterprise documents. Preprocessing becomes significant project.
Operational concerns:
Monitoring, alerting, incident response, DR/backup—not addressed. Production requires operational maturity.
Production readiness checklist
Before deploying RAG system built on this demo to production:
Security
- [ ] User authentication implemented (Azure AD, OAuth)
- [ ] Document-level authorization (who can see what)
- [ ] Prompt injection defenses
- [ ] Input validation and sanitization
- [ ] Rate limiting per user
- [ ] Audit logging of all access
- [ ] PII/PHI handling controls
- [ ] Data encryption at rest and in transit
Reliability
- [ ] Error handling and graceful degradation
- [ ] Retry logic with exponential backoff
- [ ] Circuit breakers for dependencies
- [ ] Health check endpoints
- [ ] Disaster recovery plan
- [ ] Backup and restore procedures
- [ ] Load testing completed
Observability
- [ ] Application Insights integration
- [ ] Custom metrics for retrieval quality
- [ ] Cost tracking per query
- [ ] Alerting for anomalies
- [ ] Dashboard for operational metrics
- [ ] Log aggregation and search
Performance
- [ ] Caching strategy implemented
- [ ] Query optimization based on load testing
- [ ] CDN for static assets
- [ ] Database query optimization
- [ ] Auto-scaling configured
Compliance
- [ ] GDPR/CCPA compliance controls
- [ ] Data retention policies
- [ ] Right to be forgotten implementation
- [ ] Data residency requirements met
- [ ] Compliance audit trail
- [ ] Legal review completed
Cost Management
- [ ] Cost per query tracking
- [ ] Budget alerts configured
- [ ] Query routing optimization
- [ ] Token usage optimization
- [ ] Reserved capacity vs. pay-as-you-go analysis
Learn more
Official repository:
- azure-search-openai-demo - Primary Python implementation
- azure-search-openai-javascript - JavaScript version
- azure-search-openai-demo-csharp - .NET version
- azure-search-openai-demo-java - Java version
Production-ready alternatives:
- chat-with-your-data-solution-accelerator - Enterprise solution accelerator with production features
Microsoft Learn:
- RAG and generative AI - Azure AI Search
- Quickstart: Generative Search (RAG)
- Azure OpenAI Landing Zone - Production architecture reference
Related architectures:
- Azure AI Document Intelligence for document processing
- Azure Semantic Kernel for orchestration
- Azure API Management for API governance
Related Ignite coverage: