What you will learn.
The half of LLM systems that almost no one teaches is inference. Once you have decided to run an open model in production, or have outgrown a hosted API for cost reasons, you have entered a different discipline, one that owes more to systems engineering than to ML.
This six-week course is for the engineer who has been told, “Just make it faster and cheaper.” We cover model formats and quantization, KV-cache layouts and management, batching and continuous batching, speculative decoding, prefill/decode separation, multi-tenancy, autoscaling under bursty load, and the operational disciplines required to keep a fleet of GPUs honest. We work with two production-grade inference servers and write enough kernel code to understand what they are doing on our behalf.
This is the only Solutech course with a CUDA homework set. It is also the course alumni cite most often when their AI bill gets cut in half.
