LLM-302

Multimodal Systems

Vision, audio, and document-grounded models in production — the part of the field that has matured fastest in the last twelve months.

ABOUT THIS COURSE

What you will learn.

Multimodal systems have crossed a quiet threshold. The image and document understanding capabilities of frontier models are now strong enough to underpin real products — and weak enough, in specific ways, that engineers who do not understand the failure modes will ship things that embarrass their teams.

This five-week course is a focused engineering treatment of multimodal systems: vision, document understanding, OCR-grounded extraction, and audio (briefly). We cover the input pipeline, prompting for multimodal models, retrieval that includes non-text content, evaluating extraction quality, and a working set of failure modes you should expect. The capstone is a document-extraction service with full evaluation.

This course pairs naturally with RAG-301 for teams shipping document-grounded products.

WHAT YOU’LL BUILD

Four substantial projects.

Project 01

A document extraction service

Build an end-to-end service that turns PDFs into structured data with confidence scores.

Project 02

An image-grounded QA system

Answer questions over a corpus of images, with citations and a refusal policy when confidence is low.

Project 03

A multimodal eval harness

Construct an eval suite that catches extraction errors a text-only eval would miss.

Project 04

An audio transcription + structuring pipeline

Take meeting audio to structured action items with appropriate guardrails.

CURRICULUM

Week by week.

FIT

Who this is for — and who it is not.

For you if

You ship document-grounded products: extraction, summarization, search.
Your users upload images and expect the product to understand them.
You come from the text-only LLM world and are expanding into vision.

Probably not for you if

You have never used an LLM API (start at AI-101).
Your focus is generative image models (image synthesis); this course covers understanding, not generation.
You are a researcher studying multimodal foundation model training.

YOUR INSTRUCTOR

Taught by an operator.

Research Lead

Yuki Tanaka

Yuki led the retrieval-quality group at DeepMind before joining Solutech full time. She co-authored four widely cited papers on dense passage retrieval and hybrid lexical-semantic search, and her course-week on reranking has become required reading inside two well-known AI startups. Her teaching style is direct: she will tell you, on day one, which of your assumptions about RAG are wrong, and she will be right.

FAQ

Questions we’re asked often.
