// AIINFRA 201 · Semester 2
Production Inference Serving & GPU Orchestration
Deploying, Scaling, and Operating LLMs in Production on GPUs
This course trains learners to move beyond local, single-user model runners and into production-grade LLM inference serving. Students build a lean Linux and Docker foundation, then learn high-throughput serving engines such as vLLM, TGI, TensorRT-LLM, and SGLang alongside serving platforms like Triton, Ray Serve, and KServe. The back half of the course covers GPU orchestration on Kubernetes, including the GPU Operator, Dynamic Resource Allocation, MIG partitioning, multi-GPU parallelism, and disaggregated prefill/decode. Learners finish able to benchmark, monitor, autoscale, and right-size real inference workloads for cost and performance.
01 / outcomes
Outcomes
Course objectives
- Deploy containerized Linux services and distinguish dev/prototyping tools from production-grade inference stacks
- Configure and tune production inference servers (vLLM, TGI, TensorRT-LLM, SGLang) for high-throughput, low-latency serving
- Orchestrate GPU resources on Kubernetes using the GPU Operator, Dynamic Resource Allocation, MIG, and the KAI Scheduler
- Implement multi-GPU tensor and pipeline parallelism with autoscaling and load balancing for large models
- Benchmark, monitor, and cost-optimize inference deployments using Prometheus, Grafana, and disaggregated serving architectures
02 / schedule
16-week schedule
Wk 01
Linux Server Foundations for AI Workloads
Builds the Linux server and systemd foundation needed to host AI workloads ahead of containerized and GPU-backed inference work.
Wk 02
Docker, GPU Runtime, and Dev-Only Tools (Ollama, LM Studio)
Covers Docker and the NVIDIA Container Toolkit, and frames Ollama and LM Studio as dev-only tools rather than production serving stacks.
Wk 03
Production Inference Fundamentals and the vLLM Engine
Introduces the fundamentals of production inference serving and covers configuring the vLLM engine for high throughput.
Wk 04
PagedAttention and Continuous Batching Deep Dive
Examines PagedAttention and continuous batching, the core techniques behind high-throughput LLM serving engines.
Wk 05
Hugging Face TGI and SGLang Serving Engines
Covers configuring and tuning the Hugging Face TGI and SGLang serving engines as alternatives to vLLM.
Wk 06
TensorRT-LLM Compilation and Optimization
Covers compiling and optimizing models for serving with NVIDIA TensorRT-LLM.
Wk 07
NVIDIA Triton Inference Server and Model Repositories
Introduces NVIDIA Triton Inference Server and structuring model repositories for production serving.
Wk 08
Serving Engine Benchmarking Lab
Midterm week: students benchmark the serving engines covered so far in a hands-on lab.
Midterm · covers Wks 1–7
Wk 09
Ray Serve and KServe for Scalable Model Serving
Covers Ray Serve and KServe as scalable serving platforms for production model deployment.
Wk 10
GPU Orchestration on Kubernetes with the NVIDIA GPU Operator
Introduces GPU orchestration on Kubernetes using the NVIDIA GPU Operator to manage GPU resources for inference workloads.
Wk 11
Dynamic Resource Allocation and MIG Partitioning
Covers Kubernetes Dynamic Resource Allocation and NVIDIA MIG partitioning for fine-grained GPU resource sharing.
Wk 12
Multi-GPU Tensor and Pipeline Parallelism
Covers implementing tensor and pipeline parallelism to serve large models across multiple GPUs.
Wk 13
Autoscaling, Load Balancing, and the KAI Scheduler
Covers autoscaling, load balancing, and the KAI Scheduler for managing inference workload demand on Kubernetes.
Wk 14
Disaggregated Prefill/Decode with NVIDIA Dynamo
Introduces disaggregated prefill/decode serving architectures using NVIDIA Dynamo.
Wk 15
Observability, Cost Analysis, and Right-Sizing
Covers observability, cost analysis, and right-sizing techniques for production inference deployments.
Wk 16
Capstone Project & Course Review
Final capstone week: students design, deploy, and present a production inference serving and GPU orchestration project.
Capstone
03 / tools
Tools & frameworks
OS & Containers
Dev/Prototyping Tools
Inference Engines
Serving Platforms
Kubernetes & GPU Orchestration
Observability & Benchmarking
Cloud & Hardware