INF-401

Inference at Scale

The other half of an AI system: serving models at production cost and latency without sacrificing the quality that made them worth serving.

ABOUT THIS COURSE

What you will learn.

The half of LLM systems that almost no one teaches is inference. Once you have decided that you are running an open model in production — or that you have outgrown a hosted API for cost reasons — you have entered a different discipline, one that owes more to systems engineering than to ML.

This six-week course is for the engineer who has been told, “Just make it faster and cheaper.” We cover model formats and quantization, KV-cache layouts and management, batching (including continuous batching), speculative decoding, separating prefill from decode, multi-tenancy, autoscaling under bursty load, and the operational disciplines required to keep a fleet of GPUs honest. We work with two production-grade inference servers and write enough kernel code to understand what they are doing on our behalf.

This is the only Solutech course with a CUDA homework set. It is also the course alumni cite most often when their AI bill gets cut in half.

WHAT YOU’LL BUILD

Four substantial projects.

Project 01

A quantization study

Quantize a model across three precisions and report the cost, latency, and quality tradeoff with evidence.
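
To give a flavor of the quality side of that tradeoff, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not the course's assignment code: the function names, the sample weights, and the choice of 8, 4, and 2 bits are assumptions made for illustration only.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers using one shared scale (per-tensor, symmetric)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]


def mean_abs_error(values, bits):
    """Average round-trip error introduced by quantizing to `bits` bits."""
    q, scale = quantize_symmetric(values, bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical weights, chosen only to make the effect visible.
weights = [0.02, -0.51, 0.33, 0.9, -0.07, 0.15]

if __name__ == "__main__":
    for bits in (8, 4, 2):
        print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.4f}")
```

Round-trip error on weights is only a proxy. A real study would measure quality on the model's outputs and pair it with measured cost and latency.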

Project 02

A batched inference benchmark

Build a benchmark that distinguishes batching strategies for your real traffic mix.

Project 03

A KV-cache optimization

Profile and optimize KV-cache memory in a continuous-batching server and write up the tradeoffs.
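
As a rough guide to why KV-cache memory is worth profiling, the sketch below estimates cache size from model shape. The formula is the standard one (keys plus values, per layer, per KV head, per token), but the function name and the example model configuration are hypothetical and do not describe any particular server or model used in the course.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16 or bf16 cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16 cache.
    per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
    total = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)
    print(f"per token: {per_token / 1024:.0f} KiB")
    print(f"16 sequences x 4096 tokens: {total / 1024**3:.0f} GiB")
```

Under these assumptions each token costs 128 KiB, so sixteen concurrent 4096-token sequences need 8 GiB of cache before any weights or activations are counted.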

Project 04

A multi-tenant SLA design

Design SLAs and prioritization for a fleet serving multiple internal products.

CURRICULUM

Week by week.

FIT

Who this is for — and who it is not.

For you if

You run self-hosted models in production and need to reduce cost.
You are on a platform team supporting AI workloads at an organization large enough to care.
You want to understand what your inference server is actually doing.

Probably not for you if

You use only hosted APIs and do not plan to change that.
You are a researcher focused on model training.
You are not comfortable with systems work; this course is dense.

YOUR INSTRUCTOR

Taught by an operator.

ML Systems Engineer

Tomasz Kowalski

Tomasz spent the first ten years of his career writing compilers — first at Intel, then at a chip startup that no longer exists — before pivoting into ML systems. He maintains an inference framework used by a handful of well-known products and has strong opinions about KV-cache layouts. His course is the only one in the Solutech catalog with a homework set involving CUDA.

FAQ

Questions we’re asked often.

INF-401 · Next cohort starts soon