// AIINFRA 201 · Semester 2

Production Inference Serving & GPU Orchestration

Deploying, Scaling, and Operating LLMs in Production on GPUs

This course trains learners to move beyond local, single-user model runners and into production-grade LLM inference serving. Students build a lean Linux and Docker foundation, then learn high-throughput serving engines such as vLLM, TGI, TensorRT-LLM, and SGLang alongside serving platforms like Triton, Ray Serve, and KServe. The back half of the course covers GPU orchestration on Kubernetes, including the GPU Operator, Dynamic Resource Allocation, MIG partitioning, multi-GPU parallelism, and disaggregated prefill/decode. Learners finish able to benchmark, monitor, autoscale, and right-size real inference workloads for cost and performance.

Contact hours: 54 hrs

Credit equivalent: 3 units

Prerequisite: AIINFRA 200

Length: 16 weeks

01 / outcomes

Outcomes

Course objectives

Deploy containerized Linux services and distinguish dev/prototyping tools from production-grade inference stacks
Configure and tune production inference servers (vLLM, TGI, TensorRT-LLM, SGLang) for high-throughput, low-latency serving
Orchestrate GPU resources on Kubernetes using the GPU Operator, Dynamic Resource Allocation, MIG, and the KAI Scheduler
Implement multi-GPU tensor and pipeline parallelism with autoscaling and load balancing for large models
Benchmark, monitor, and cost-optimize inference deployments using Prometheus, Grafana, and disaggregated serving architectures

02 / schedule

16-week schedule

Wk 01

Linux Server Foundations for AI Workloads

Builds the Linux server and systemd foundation needed to host AI workloads ahead of containerized and GPU-backed inference work.

Wk 02

Docker, GPU Runtime, and Dev-Only Tools (Ollama, LM Studio)

Covers Docker and the NVIDIA Container Toolkit, and frames Ollama and LM Studio as development-only tools rather than production serving stacks.

Wk 03

Production Inference Fundamentals and the vLLM Engine

Introduces production inference serving fundamentals and configuring the vLLM engine for high-throughput serving.

Wk 04

PagedAttention and Continuous Batching Deep Dive

Examines PagedAttention and continuous batching, the core techniques behind high-throughput LLM serving engines.
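
As a rough illustration of the paging idea behind this week's topic, the sketch below tracks KV-cache memory in fixed-size blocks and gives each sequence a block table, so memory is reserved as a sequence grows rather than up front. The class name, method names, and block sizes are hypothetical and chosen for illustration; this is not the vLLM implementation.

```python
import math


class PagedKVAllocator:
    """Toy allocator (hypothetical): maps sequences to fixed-size KV blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # seq id -> block ids
        self.seq_lens: dict[str, int] = {}            # seq id -> token count

    def append_tokens(self, seq_id: str, n_tokens: int) -> bool:
        """Reserve room for n_tokens more tokens; False if blocks run out."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        # Only allocate the blocks that the grown sequence does not yet have.
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            return False  # a real scheduler would queue or preempt here
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len
        return True

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
alloc.append_tokens("a", 20)   # takes 2 blocks
alloc.append_tokens("b", 40)   # needs 3 blocks, only 2 free: rejected
alloc.free("a")                # blocks return to the pool
alloc.append_tokens("b", 40)   # now fits
```

Freeing blocks as soon as a sequence finishes is what lets a continuous-batching scheduler admit waiting requests between decode steps instead of waiting for the whole batch to drain.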

Wk 05

Hugging Face TGI and SGLang Serving Engines

Covers configuring and tuning the Hugging Face TGI and SGLang serving engines as alternatives to vLLM.

Wk 06

TensorRT-LLM Compilation and Optimization

Covers compiling and optimizing models for serving with NVIDIA TensorRT-LLM.

Wk 07

NVIDIA Triton Inference Server and Model Repositories

Introduces NVIDIA Triton Inference Server and structuring model repositories for production serving.

Wk 08

Serving Engine Benchmarking Lab

Midterm week: students benchmark serving engines covered so far in a hands-on lab.

Midterm · covers Wks 1–7
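
A benchmarking lab of this kind typically reports tail latency and token throughput. The sketch below shows one way to summarize raw measurements, assuming per-request latencies and output token counts have already been collected; the function names and the nearest-rank percentile method are illustrative choices, not a prescribed lab format.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], output_tokens: list[int],
              wall_clock_s: float) -> dict[str, float]:
    """Summarize one benchmark run (hypothetical report shape)."""
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "tokens_per_s": sum(output_tokens) / wall_clock_s,
    }


# Made-up measurements for five requests that completed within 2 seconds.
report = summarize([0.2, 0.4, 0.3, 1.0, 0.5], [100] * 5, wall_clock_s=2.0)
print(report)
```

Reporting p95 alongside p50 matters because a single slow request can dominate user-perceived latency even when the median looks healthy.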

Wk 09

Ray Serve and KServe for Scalable Model Serving

Covers Ray Serve and KServe as scalable serving platforms for production model deployment.

Wk 10

GPU Orchestration on Kubernetes with the NVIDIA GPU Operator

Introduces GPU orchestration on Kubernetes using the NVIDIA GPU Operator to manage GPU resources for inference workloads.

Wk 11

Dynamic Resource Allocation and MIG Partitioning

Covers Kubernetes Dynamic Resource Allocation and NVIDIA MIG partitioning for fine-grained GPU resource sharing.

Wk 12

Multi-GPU Tensor and Pipeline Parallelism

Covers implementing multi-GPU tensor and pipeline parallelism to serve large models across multiple GPUs.
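
To make the tensor-parallel idea concrete, the sketch below splits a weight matrix by columns across simulated devices, multiplies each shard independently, and concatenates the partial outputs (the role an all-gather plays on real hardware). It uses plain Python lists so it runs anywhere; the helper names are hypothetical, and real engines perform this with GPU kernels and NCCL collectives.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: (rows x k) @ (k x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w: Matrix, parts: int) -> list[Matrix]:
    """Split w into `parts` equal column shards, one per simulated device."""
    n_cols = len(w[0])
    assert n_cols % parts == 0, "columns must divide evenly across devices"
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each device computes x @ its shard; outputs are joined column-wise."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded result matches the single-device result.
assert column_parallel_matmul(x, w, parts=2) == matmul(x, w)
```

Each simulated device holds only a fraction of the weights, which is why this scheme lets a model that exceeds one GPU's memory be served across several. Pipeline parallelism, by contrast, assigns whole layers to different devices.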

Wk 13

Autoscaling, Load Balancing, and the KAI Scheduler

Covers autoscaling, load balancing, and the KAI Scheduler for managing inference workload demand on Kubernetes.

Wk 14

Disaggregated Prefill/Decode with NVIDIA Dynamo

Introduces disaggregated prefill/decode serving architectures using NVIDIA Dynamo.

Wk 15

Observability, Cost Analysis, and Right-Sizing

Covers observability, cost analysis, and right-sizing techniques for production inference deployments.
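
Cost analysis and right-sizing often reduce to two pieces of arithmetic: the cost of generating a million tokens and the number of replicas needed at peak load. The sketch below shows both under stated assumptions; the prices, throughput figures, and 80% utilization target are made-up inputs, not measurements or vendor quotes.

```python
import math


def cost_per_million_tokens(gpu_hourly_usd: float, gpus: int,
                            tokens_per_second: float) -> float:
    """USD per 1M generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * gpus / tokens_per_hour * 1_000_000


def replicas_needed(peak_tokens_per_second: float,
                    per_replica_tokens_per_second: float,
                    target_utilization: float = 0.8) -> int:
    """Replicas required to serve peak load while keeping some headroom."""
    usable = per_replica_tokens_per_second * target_utilization
    return math.ceil(peak_tokens_per_second / usable)


# Hypothetical inputs: one GPU at $2.00/hr sustaining 2,500 tokens/s.
print(round(cost_per_million_tokens(2.00, gpus=1, tokens_per_second=2500), 4))
print(replicas_needed(peak_tokens_per_second=9000,
                      per_replica_tokens_per_second=2500))
```

In practice the throughput input would come from a benchmark run at the latency target that matters for the workload, since peak throughput figures tend to understate real cost.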

Wk 16

Capstone Project & Course Review

Final capstone week: students design, deploy, and present a production inference serving and GPU orchestration project.

Capstone

03 / tools

Tools & frameworks

OS & Containers

Ubuntu Server, systemd, Docker, NVIDIA Container Toolkit

Dev/Prototyping Tools

Ollama, LM Studio

Inference Engines

vLLM, Hugging Face TGI, TensorRT-LLM, SGLang

Serving Platforms

NVIDIA Triton Inference Server, Ray Serve, KServe

Kubernetes & GPU Orchestration

Kubernetes, NVIDIA GPU Operator, Dynamic Resource Allocation, MIG, KAI Scheduler, NVIDIA Dynamo

Observability & Benchmarking

Prometheus, Grafana, k6, NVIDIA Nsight Systems

Cloud & Hardware

NVIDIA A100/H100 GPUs, Hetzner/DigitalOcean GPU instances, NCCL