AI-301

Evals & Observability for LLMs

Stop guessing whether your model changes helped. Build the eval and telemetry stack your team will quietly come to rely on.

ABOUT THIS COURSE

What you will learn.

Evals are the single most-discussed and least-built piece of infrastructure in modern AI engineering. Everyone agrees they matter; almost no one wants to do the work of building them well. The result, across our industry, is a quiet epidemic of teams shipping model and prompt changes with no idea whether they helped.

This course is a focused, five-week treatment of evals and observability as engineering disciplines. We cover golden-set construction, adversarial set generation, pairwise model-graded evaluation, online judges, regression gating, drift detection, structured tracing across model boundaries, and the kinds of dashboards that survive contact with a skeptical executive. The capstone is a real, running eval suite plus an observability surface, delivered against a system of your choice.

This is not a course about LLM theory. It is a course about the unromantic but high-leverage work that distinguishes teams who improve from teams who churn.

WHAT YOU’LL BUILD

Four substantial projects.

Project 01

A multi-layer eval suite

Build offline goldens, adversarial sets, and an online judge — and make them disagree usefully.

Project 02

Regression gating in CI

Wire the suite into CI so prompt and model changes fail loudly when they regress quality.
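
As a rough illustration of what "fail loudly" can mean, here is a minimal sketch of a gate that compares a candidate's eval scores against a baseline. The function names, the 0.02 tolerance, and the use of a plain mean are all assumptions made for this example, not the course's actual implementation.

```python
import sys
from statistics import mean


def regression_gate(baseline, candidate, max_drop=0.02):
    """Return (passed, drop).

    `drop` is how far the candidate's mean score fell below the baseline's.
    The gate passes while that drop stays within `max_drop` (a hypothetical
    tolerance chosen for illustration).
    """
    drop = mean(baseline) - mean(candidate)
    return drop <= max_drop, drop


def main(baseline, candidate):
    """Return a process exit code: 0 when the gate passes, 1 when it fails."""
    passed, drop = regression_gate(baseline, candidate)
    if not passed:
        print(f"FAIL: mean eval score dropped by {drop:.3f}", file=sys.stderr)
        return 1
    print(f"OK: mean eval score change {-drop:+.3f}")
    return 0
```

A CI step could call `sys.exit(main(baseline_scores, candidate_scores))` so that a non-zero exit code blocks the merge. A real gate would likely also account for noise between runs rather than comparing raw means.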

Project 03

Production tracing across the model boundary

Trace requests across your code, the model, and downstream tools with a single trace ID.

Project 04

An honest dashboard

Design a one-page dashboard that an executive can read and that your on-call can act on.

CURRICULUM

Week by week.

FIT

Who this is for — and who it is not.

For you if

You own an LLM feature in production and want to stop flying blind.
You are a tech lead who needs to give your team a credible story about quality.
You are moving into AI work from an SRE or platform background.

Probably not for you if

You do not yet have a system in production to measure. Take AI-101 or AI-201 first.
You want a one-week intro. Evals are deeper than they look.
You are a researcher focused on benchmark construction in academic settings.

YOUR INSTRUCTOR

Taught by an operator.

Founding Engineer

Rahul Mehta

Rahul was the first engineering hire at a well-known applied-LLM company you have probably heard of and signed an NDA about. He owns the evals stack, the cost-monitoring pipeline, and the on-call rotation. He teaches the operational side of LLM systems — the part of the job that is mostly invisible to people who only read launch posts on social media.

FAQ

Questions we’re asked often.

AI-301 · Next cohort starts soon