// AIINFRA 201 · Semester 2

Production Inference Serving & GPU Orchestration

Deploying, Scaling, and Operating LLMs in Production on GPUs

This course trains learners to move beyond local, single-user model runners and into production-grade LLM inference serving. Students build a lean Linux and Docker foundation, then learn high-throughput serving engines such as vLLM, TGI, TensorRT-LLM, and SGLang alongside serving platforms like Triton, Ray Serve, and KServe. The back half of the course covers GPU orchestration on Kubernetes, including the GPU Operator, Dynamic Resource Allocation, MIG partitioning, multi-GPU parallelism, and disaggregated prefill/decode. Learners finish able to benchmark, monitor, autoscale, and right-size real inference workloads for cost and performance.

Contact hours: 54 hrs
Credit equivalent: 3 units
Prerequisite: AIINFRA 200
Length: 16 weeks
01 / outcomes

Outcomes

Course objectives and student learning outcomes

  1. Deploy containerized Linux services and distinguish dev/prototyping tools from production-grade inference stacks
  2. Configure and tune production inference servers (vLLM, TGI, TensorRT-LLM, SGLang) for high-throughput, low-latency serving
  3. Orchestrate GPU resources on Kubernetes using the GPU Operator, Dynamic Resource Allocation, MIG, and the KAI Scheduler
  4. Implement multi-GPU tensor and pipeline parallelism with autoscaling and load balancing for large models
  5. Benchmark, monitor, and cost-optimize inference deployments using Prometheus, Grafana, and disaggregated serving architectures
02 / schedule

16-week schedule

Wk 01
Linux Server Foundations for AI Workloads
Builds the Linux server and systemd foundation needed to host AI workloads ahead of containerized and GPU-backed inference work.
Wk 02
Docker, GPU Runtime, and Dev-Only Tools (Ollama, LM Studio)
Covers Docker and the NVIDIA Container Toolkit while framing Ollama and LM Studio as dev-only tools, not production serving.
Wk 03
Production Inference Fundamentals and the vLLM Engine
Introduces the fundamentals of production inference serving and shows how to configure the vLLM engine for high-throughput serving.
Wk 04
PagedAttention and Continuous Batching Deep Dive
Examines PagedAttention and continuous batching, the core techniques behind high-throughput LLM serving engines.
Wk 05
Hugging Face TGI and SGLang Serving Engines
Covers configuring and tuning the Hugging Face TGI and SGLang serving engines as alternatives to vLLM.
Wk 06
TensorRT-LLM Compilation and Optimization
Covers compiling and optimizing models for serving with NVIDIA TensorRT-LLM.
Wk 07
NVIDIA Triton Inference Server and Model Repositories
Introduces NVIDIA Triton Inference Server and structuring model repositories for production serving.
Wk 08
Serving Engine Benchmarking Lab
Midterm week: students benchmark serving engines covered so far in a hands-on lab.
Midterm · covers Wks 1–7
Wk 09
Ray Serve and KServe for Scalable Model Serving
Covers Ray Serve and KServe as scalable serving platforms for production model deployment.
Wk 10
GPU Orchestration on Kubernetes with the NVIDIA GPU Operator
Introduces GPU orchestration on Kubernetes using the NVIDIA GPU Operator to manage GPU resources for inference workloads.
Wk 11
Dynamic Resource Allocation and MIG Partitioning
Covers Kubernetes Dynamic Resource Allocation and NVIDIA MIG partitioning for fine-grained GPU resource sharing.
Wk 12
Multi-GPU Tensor and Pipeline Parallelism
Covers implementing tensor and pipeline parallelism to serve large models across multiple GPUs.
Wk 13
Autoscaling, Load Balancing, and the KAI Scheduler
Covers autoscaling, load balancing, and the KAI Scheduler for managing inference workload demand on Kubernetes.
Wk 14
Disaggregated Prefill/Decode with NVIDIA Dynamo
Introduces disaggregated prefill/decode serving architectures using NVIDIA Dynamo.
Wk 15
Observability, Cost Analysis, and Right-Sizing
Covers observability, cost analysis, and right-sizing techniques for production inference deployments.
Wk 16
Capstone Project & Course Review
Final capstone week: students design, deploy, and present a production inference serving and GPU orchestration project.
Capstone
03 / tools

Tools & frameworks

OS & Containers
Ubuntu Server, systemd, Docker, NVIDIA Container Toolkit
Dev/Prototyping Tools
Ollama, LM Studio
Inference Engines
vLLM, Hugging Face TGI, TensorRT-LLM, SGLang
Serving Platforms
NVIDIA Triton Inference Server, Ray Serve, KServe
Kubernetes & GPU Orchestration
Kubernetes, NVIDIA GPU Operator, Dynamic Resource Allocation, MIG, KAI Scheduler, NVIDIA Dynamo
Observability & Benchmarking
Prometheus, Grafana, k6, NVIDIA Nsight Systems
Cloud & Hardware
NVIDIA A100/H100 GPUs, Hetzner/DigitalOcean GPU instances, NCCL

What this course trains you for

Software Developers: $179,292 median
Computer Occupations, All Other: $138,203 median

CA median wages, 2024–34 projections (EDD/OEWS). See the full labor-market dashboard on the program overview.