// AIINFRA 301 · Semester 3

Production RAG & LLMOps — Observability and Evaluation

Building, Evaluating, and Operating Retrieval-Augmented LLM Systems in Production

This course prepares learners to design, evaluate, and operate retrieval-augmented generation (RAG) systems using modern vector databases, hybrid search, and reranking techniques. Students build RAG evaluation pipelines with RAGAS and DeepEval as automated quality gates, then instrument production systems with tracing, prompt versioning, and drift monitoring using industry-standard LLMOps tooling. The course emphasizes hands-on labs that mirror real workplace tasks, from indexing strategy selection through cost-aware, observable deployment. Learners finish ready to support or own the retrieval and evaluation layer of an enterprise AI application.

Contact hours: 54 hrs

Credit equivalent: 3 units

Prerequisite: AIINFRA 300

Length: 16 weeks

01 / outcomes

Outcomes

Course objectives

Select and configure a vector database (Qdrant, Milvus, Weaviate, or pgvector) based on scale, filtering, and indexing tradeoffs
Design chunking and hybrid retrieval pipelines combining dense embeddings, BM25, and cross-encoder reranking
Build automated RAG evaluation suites with RAGAS and DeepEval to gate retrieval and generation quality in CI/CD
Instrument LLM applications with tracing and observability platforms to diagnose latency, cost, and quality regressions
Implement production monitoring, prompt versioning, and feedback loops to detect and respond to model and data drift

Student learning outcomes

Select and configure a vector database (Qdrant, Milvus, Weaviate, or pgvector) based on scale, filtering, and indexing tradeoffs.
Design chunking and hybrid retrieval pipelines combining dense embeddings, BM25, and cross-encoder reranking.
Build automated RAG evaluation suites with RAGAS and DeepEval to gate retrieval and generation quality in CI/CD.
Instrument LLM applications with tracing and observability platforms to diagnose latency, cost, and quality regressions.
Implement production monitoring, prompt versioning, and feedback loops to detect and respond to model and data drift.

02 / schedule

16-week schedule

Wk 01

Introduction to Production RAG Architecture and the LLMOps Lifecycle

Explains why production RAG systems exist and surveys the LLMOps lifecycle that governs how they are built and operated.

Wk 02

Embeddings and Vector Representations for Retrieval

Covers embeddings and vector representations, including the indexing pipeline that turns documents and queries into comparable vectors.
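
As a rough illustration of what "comparable vectors" means in practice, the sketch below ranks documents by cosine similarity to a query. It assumes documents and queries have already been embedded; the 3-dimensional vectors and the names `cosine_similarity` and `rank_documents` are hypothetical and stand in for real model output and library calls.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): real embeddings have hundreds of dimensions.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]
ranking = rank_documents(query, docs)
```

A vector database performs the same comparison, but uses an approximate index instead of scoring every document.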

Wk 03

Vector Database Selection: Qdrant, Milvus, Weaviate, and pgvector

Compares Qdrant, Milvus, Weaviate, and pgvector to select a vector database based on scale and filtering needs.

Wk 04

Indexing Deep Dive: HNSW vs IVF and Metadata Filtering

Covers HNSW versus IVF indexing algorithms and metadata filtering strategies for vector search.

Wk 05

Chunking Strategies for Real-World Documents

Covers fixed-size, recursive, and semantic chunking strategies for splitting real-world documents.
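
Of the three strategies, fixed-size chunking is the simplest to sketch. The example below splits text into character windows with overlap; the function name `chunk_fixed` and the parameter values are illustrative assumptions, and production pipelines often count tokens rather than characters.

```python
def chunk_fixed(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Tiny sizes (assumption) so the overlap is easy to see.
example_chunks = chunk_fixed("abcdefghij", chunk_size=4, overlap=1)
```

Recursive and semantic chunking replace the fixed window with boundaries chosen from document structure or embedding similarity.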

Wk 06

Hybrid Search: Combining Dense Retrieval with BM25

Covers hybrid search techniques that combine dense embedding retrieval with sparse BM25 keyword search.
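
One common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The syllabus does not name a specific fusion method, so treat this as one possible sketch; the document IDs are made up and `k=60` is a conventional default rather than a course requirement.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores 1 / (k + rank) per list it appears in, so
    documents ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for stability.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical retriever outputs, best match first.
dense_ranking = ["doc_2", "doc_1", "doc_3"]
bm25_ranking = ["doc_1", "doc_4", "doc_2"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores and cosine similarities onto a shared scale.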

Wk 07

Cross-Encoder Rerankers and Query Rewriting/Expansion

Covers cross-encoder reranking and query rewriting/expansion techniques such as HyDE to improve retrieval quality.

Wk 08

Advanced Retrieval Patterns: GraphRAG and Agentic RAG

Midterm week: covers advanced retrieval patterns including GraphRAG and agentic RAG alongside the course midterm assessment.

Midterm · covers Wks 1–7

Wk 09

RAG Evaluation Foundations with RAGAS

Introduces RAG evaluation foundations using the RAGAS framework to score retrieval and generation quality.

Wk 10

Deep Evaluation and Failure Analysis with DeepEval

Covers deep evaluation and failure analysis of LLM outputs using DeepEval's pytest-style testing approach.

Wk 11

Wiring Evaluation into CI/CD Quality Gates

Covers wiring automated RAG evaluation suites into CI/CD pipelines as quality gates.
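
A quality gate can be as simple as comparing evaluation scores against minimum thresholds and failing the build when any metric falls short. The sketch below assumes the scores were already produced by an evaluation run and exported as a dictionary; the threshold values and the helper `check_quality_gate` are hypothetical, and no RAGAS or DeepEval API is called here.

```python
def check_quality_gate(scores, thresholds):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from evaluation results")
        elif value < minimum:
            failures.append(
                f"{metric}: {value:.2f} is below the minimum {minimum:.2f}"
            )
    return failures

# Illustrative thresholds and scores (assumptions, not recommended values).
thresholds = {"faithfulness": 0.80, "context_precision": 0.70}
scores = {"faithfulness": 0.86, "context_precision": 0.64}

failures = check_quality_gate(scores, thresholds)
# In a CI job, pass this value to sys.exit() so a non-zero code blocks the merge.
exit_code = 1 if failures else 0
```

Treating a missing metric as a failure keeps a broken evaluation step from silently passing the gate.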

Wk 12

Tracing and Observability: Langfuse and Arize Phoenix

Covers instrumenting LLM applications with tracing and observability using Langfuse and Arize Phoenix.

Wk 13

Observability at Scale: LangSmith, OpenLLMetry, and MLflow

Covers scaling observability practices using LangSmith, OpenLLMetry, and MLflow.

Wk 14

Prompt Management, Versioning, and Release Workflows

Covers prompt registries, versioning, immutability, and environment-based release workflows for prompts.
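
The ideas of immutable versions and environment-based release can be sketched with a small in-memory registry. The class `PromptRegistry`, its method names, and the prompt text are hypothetical; hosted prompt-management tools expose their own APIs, which this example does not attempt to reproduce.

```python
import hashlib

class PromptRegistry:
    """Toy registry: versions are append-only, environments are movable labels."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of (content hash, template)
        self._aliases = {}   # (prompt name, environment) -> version number

    def register(self, name, template):
        """Store a template and return its 1-based version number."""
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        history = self._versions.setdefault(name, [])
        for number, (existing, _) in enumerate(history, start=1):
            if existing == digest:
                return number  # identical content reuses the existing version
        history.append((digest, template))
        return len(history)

    def promote(self, name, version, environment):
        """Point an environment label at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._aliases[(name, environment)] = version

    def get(self, name, environment):
        """Return (version, template) currently released to an environment."""
        version = self._aliases[(name, environment)]
        return version, self._versions[name][version - 1][1]

registry = PromptRegistry()
v1 = registry.register("answer", "Answer using only the context:\n{context}\n\nQ: {question}")
v2 = registry.register("answer", "Answer concisely using only the context:\n{context}\n\nQ: {question}")
registry.promote("answer", v1, "production")
registry.promote("answer", v2, "staging")
```

Because stored versions never change, rolling back production is a matter of promoting an earlier version number again.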

Wk 15

Production Monitoring: Drift Detection, Feedback Loops, Cost, and Latency

Covers production monitoring for model and data drift, feedback loops, cost, and latency.
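
One simple drift signal is the population stability index (PSI), which compares the distribution of a monitored value between a baseline window and a recent window. The sketch below applies it to top-1 retrieval similarity scores; the sample values, the bin count, and the 0.2 alert threshold are assumptions (0.2 is a widely quoted rule of thumb, not a course-mandated value).

```python
import math

def population_stability_index(baseline, current, bins=5, lo=0.0, hi=1.0, eps=1e-6):
    """PSI between two samples of values expected to lie in [lo, hi]."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical top-1 similarity scores from two monitoring windows.
baseline_scores = [0.82, 0.85, 0.90, 0.88, 0.79, 0.91]
recent_scores = [0.45, 0.50, 0.55, 0.60, 0.52, 0.48]

psi = population_stability_index(baseline_scores, recent_scores)
drift_detected = psi > 0.2  # assumed alert threshold
```

The same function can be pointed at other monitored values, such as response length or per-request cost, as long as `lo` and `hi` are set to cover their range.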

Wk 16

Capstone Project & Course Review

Final capstone week: students design, build, and present a production-grade RAG/LLMOps system and review the course.

Capstone

03 / tools

Tools & frameworks

Vector Databases

Qdrant, Milvus, Weaviate, pgvector

Embedding & Retrieval Libraries

Sentence Transformers, OpenAI Embeddings API, rank_bm25, LangChain retrievers

Rerankers & Query Enhancement

Cohere Rerank, cross-encoder models (Hugging Face), HyDE query expansion

RAG Evaluation Frameworks

RAGAS, DeepEval

LLM Observability & Tracing

Langfuse, Arize Phoenix, LangSmith, OpenLLMetry

MLOps & Experiment Tracking

MLflow, Weights & Biases

Low-Code Prototyping On-Ramps

n8n, Langflow, Flowise
