// AIINFRA 102 · Semester 1
Containerization & GPU-Aware Orchestration
Package, deploy, and scale AI workloads with Docker and Kubernetes
This course teaches students to package, run, and orchestrate AI and Python applications using industry-standard containerization and orchestration tools. Students progress from building Docker images and GPU-enabled containers to deploying, scaling, and monitoring multi-container AI workloads on Kubernetes, including GPU-aware scheduling with the NVIDIA GPU Operator. By the end of the course, students can package a containerized AI application with Helm, provision its infrastructure with Terraform, and monitor it in production with Prometheus and Grafana.
01 / outcomes
Learning outcomes
- Explain containerization and build optimized Docker images for AI/Python applications.
- Compose and run multi-container AI stacks locally, including GPU-enabled containers via the NVIDIA Container Toolkit.
- Deploy and manage AI workloads on Kubernetes using pods, deployments, services, config, and persistent storage.
- Configure resource limits, autoscaling, and GPU-aware scheduling with the NVIDIA GPU Operator.
- Package, provision, monitor, and troubleshoot containerized AI systems using Helm, Terraform, and Prometheus/Grafana.
02 / schedule
16-week schedule
Wk 01
Why Containers for AI: VMs, Images, and the Docker Model
Introduces containerization concepts, contrasts VMs with container images, and presents the Docker model for packaging AI apps.
Wk 02
Docker Fundamentals: Running and Managing Containers
Covers the container lifecycle: running, inspecting, executing into, and managing Docker containers with resource and port controls.
Wk 03
Building AI Container Images with Dockerfiles
Teaches Dockerfile fundamentals, layer caching, multi-stage builds, and security practices for containerizing AI inference apps.
Wk 04
Docker Compose and Multi-Container AI Applications
Uses Docker Compose to define and run multi-container AI application stacks locally.
Wk 05
GPU Containers and the NVIDIA Container Toolkit
Covers running GPU-enabled containers using the NVIDIA Container Toolkit for AI workloads.
Wk 06
Container Registries, Tagging, and Image Security
Covers pushing images to registries, tagging them with a semantic-versioning strategy, and scanning them for vulnerabilities with Trivy.
Wk 07
Introduction to Kubernetes: Architecture and kubectl
Introduces Kubernetes architecture and the kubectl command-line tool for interacting with a cluster.
Wk 08
Kubernetes Workloads: Deployments, Services, and Config
Covers Kubernetes Deployments, Services, and configuration objects. This week includes the course midterm.
Midterm · covers Wks 1–7
Wk 09
Deploying an AI Inference App to Kubernetes
Applies Kubernetes fundamentals to deploy a real AI inference application onto a cluster.
Wk 10
Scaling and Resource Management with the Horizontal Pod Autoscaler
Covers resource requests/limits and autoscaling AI workloads with the Kubernetes Horizontal Pod Autoscaler.
Wk 11
GPU-Aware Kubernetes with the NVIDIA GPU Operator
Covers GPU-aware scheduling in Kubernetes using the NVIDIA GPU Operator and device plugin.
Wk 12
Storage, Volumes, and Persistence for AI Models and Data
Covers Kubernetes storage, volumes, and persistence strategies for AI models and datasets.
Wk 13
Packaging Kubernetes Apps with Helm
Teaches packaging and deploying Kubernetes applications as reusable charts with Helm.
Wk 14
Infrastructure as Code with Terraform
Introduces Terraform's declarative model (providers, resources, and state) and uses it to provision a local Kubernetes cluster and application as code.
Wk 15
Monitoring, Logging, and Troubleshooting Containerized AI
Covers installing the kube-prometheus-stack, reading GPU metrics via DCGM Exporter, and troubleshooting failing pods with kubectl.
Wk 16
Capstone Project & Course Review
Students design, build, and present a final capstone project demonstrating mastery of the course's containerization and orchestration outcomes.
Capstone
03 / tools
Tools & frameworks
Containers
Orchestration
GPU Scheduling
Packaging & Infrastructure as Code
Registries & Security
Monitoring