We didn't just write a course about AI. We used AI to research what actually works when teaching people AI — then built the course around those findings.
We built an AI tutor that teaches practical AI literacy, then gave a second AI (Claude Code) one job: make the tutor better. It could rewrite the teaching strategy, reorder the steps, add or remove tools — anything. Then it tested the result against 5 simulated learners and kept what worked.
Each loop: ~20 minutes, ~$2, ~150 API calls. 13 runs total.
Each simulated learner had a real personality, real skepticism, and real resistance. If the tutor couldn't teach all five, the strategy wasn't universal.
After each teaching session, every learner was tested on 5 skills. These became the 5 modules in Part 1 and Part 2 of this course.
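The loop above can be sketched in a few lines. This is a hedged, minimal sketch, not the real harness: `run_session`, the persona names, and the pass threshold are all hypothetical stand-ins (the stub derives deterministic pseudo-scores from a hash so the sketch runs at all; the real system drives two LLMs against each other).

```python
# Minimal sketch of the evaluation loop, assuming a hypothetical harness.
# run_session, PERSONAS, and PASS_THRESHOLD are illustrative stand-ins.
import hashlib
from statistics import mean

PERSONAS = ["skeptic", "rusher", "overthinker", "literalist", "pleaser"]
SKILLS = 5            # each learner is graded on 5 skills per session
PASS_THRESHOLD = 0.7  # hypothetical cutoff for "passed this skill"

def run_session(prompt: str, persona: str) -> list[float]:
    """Stand-in for one LLM tutor/learner session: returns one score
    per skill in [0, 1]. Deterministic hash-based fake, for illustration."""
    digest = hashlib.sha256(f"{prompt}:{persona}".encode()).digest()
    return [byte / 255 for byte in digest[:SKILLS]]

def evaluate(prompt: str) -> tuple[float, float]:
    """Score a candidate teaching prompt against all five learners:
    mean skill score, plus the fraction of skill checks passed."""
    per_skill = [s for p in PERSONAS for s in run_session(prompt, p)]
    score = mean(per_skill)
    pass_rate = sum(s >= PASS_THRESHOLD for s in per_skill) / len(per_skill)
    return score, pass_rate
```

A variant is then kept only if its score beats the current baseline — the keep/discard column in the table below is exactly that comparison.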
Most ideas made things worse. Only 3 out of 11 experiments actually improved the tutor.
| Experiment | Score | Pass rate | Verdict | What we tried |
|---|---|---|---|---|
| base | 0.757 | 80% | keep | Baseline — original teaching prompt |
| exp1 | 0.817 | 88% | keep | Explicit tool vs prompt distinction |
| exp2 | 0.781 | 84% | discard | Tighter pacing, combine steps |
| exp3 | 0.862 | 92% | keep | Faster opening + explain AI mechanism during failures |
| exp4 | 0.897 | 96% | keep | Concrete worked example in build step |
| exp5 | 0.753 | 80% | discard | Active recall — learner explains back |
| exp6 | 0.471 | 52% | discard | Prescriptive 4-step framework |
| exp7 | 0.836 | 88% | discard | Faster pacing + combine steps 6&7 |
| exp8 | 0.710 | 76% | discard | Vivid autocomplete analogy |
| exp9 | 0.671 | 72% | discard | Remove constraints section |
| exp4r | 0.595 | 64% | variance | Re-run of exp4 — confirmed ±0.15 variance |
| exp11 | 0.713 | 76% | discard | Debug-via-contrast + payoff-first |
Every kept improvement added specificity. "Build a tool" failed. "Every week you [X], let's build..." worked. Abstract frameworks always regressed.
Removing the "under 150 words" and "never skip the why" rules cratered the score to 0.671. Constraints aren't decoration — they shape behavior.
Every experiment that made the prompt longer scored worse. Analogies, frameworks, active recall — all added words, all regressed.
Three experiments tried compressing the "build a tool" step. All failed. This is why Module 5 has three full examples before asking you to build.
Learners consistently confused "a good prompt" with "a tool." The input→output framing in Module 5 came directly from the experiment that scored 0.897.
The same prompt scored 0.897 and 0.595 on consecutive runs. Most of our "failures" may have been noise. Reliable testing needs 3+ runs per variant.
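The arithmetic behind that rule of thumb: averaging n runs shrinks run-to-run noise by roughly √n, so a single-run difference smaller than the noise floor proves nothing. A minimal sketch, assuming the ±0.15 variance observed above; `evaluate_once` and `likely_better` are hypothetical names, not part of any real harness.

```python
# Sketch: separate real improvements from noise by averaging repeat runs.
# Assumes per-run noise of ~±0.15, as measured by the exp4 re-run above.
from statistics import mean, stdev

def evaluate_many(evaluate_once, prompt: str, runs: int = 3):
    """Run the same variant several times; report mean and spread."""
    scores = [evaluate_once(prompt) for _ in range(runs)]
    return mean(scores), stdev(scores)

def likely_better(cand_mean: float, base_mean: float,
                  noise: float = 0.15, runs: int = 3) -> bool:
    """Treat a difference as real only if it clears the noise floor
    of the averaged estimate (~2 * noise / sqrt(runs))."""
    margin = 2 * noise / runs ** 0.5
    return cand_mean - base_mean > margin
```

By this yardstick, a single-run jump from 0.757 to 0.836 (exp7) is indistinguishable from luck, while a 3-run average holding near 0.89 would not be.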
This course was built on evidence, not opinion. Start with Module 1.
Start the course →