SaySynth: A Brief History of Speaking Machines

19.12.25

music / code

These are expanded notes from a talk I gave at composition.codes on December 21, 2025. Slides here. Video here.

SaySynth is a synthesizer I built on top of macOS’s text-to-speech framework, which most people know through the say command. But to explain why I built it and why I think it matters, I want to take a detour through the history of speaking machines more broadly.

A Typology of Speaking Machines

Historically, speaking machines have come in roughly four kinds:

Mechanical — Literally physical: bellows forcing air through a reed, with different knobs, valves, and whistles shaping different formants and phonemes. The human operator is part of the instrument.

Formant/Rule-Based — More like a synthesizer: an oscillator feeding a set of resonant filters that simulate the resonances (formants) of the vocal tract. The system models the acoustics of speech without recording any actual speech.

Sample-Based (Concatenative) — From something as crude as a toy with a phonograph inside, all the way to sophisticated “diphone” synthesizers that splice together recordings of every possible phoneme transition. GPS voices and automated customer service phone lines of the ’90s and 2000s were built this way.

Generative (Neural/AI) — What most people think of today. These are basically sample-based systems taken to an extreme: instead of splicing recordings of phoneme pairs, a neural network predicts the audio waveform itself, in some architectures one digital sample at a time.
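
To make the formant/rule-based category a little more concrete, here is a minimal sketch in Python of the source-filter idea: a buzzy impulse train stands in for the vocal folds, and a chain of two-pole resonators stands in for the vocal tract. This is not how SaySynth or macOS's voices work internally. The function names, the 16 kHz sample rate, and the formant frequencies and bandwidths for an "ah"-like vowel are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 16_000  # Hz; an arbitrary choice for this sketch


def impulse_train(f0, duration, sample_rate=SAMPLE_RATE):
    """Glottal-like source: one unit impulse per pitch period."""
    period = round(sample_rate / f0)
    n = int(duration * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq, bandwidth, sample_rate=SAMPLE_RATE):
    """Two-pole resonant filter centered on a single formant frequency."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # normalizes the filter's gain to 1 at 0 Hz
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, f0=110.0, duration=0.5):
    """Run the source through one resonator per formant, then normalize."""
    signal = impulse_train(f0, duration)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


# Approximate (frequency, bandwidth) pairs in Hz for an "ah"-like vowel.
# Illustrative values only, not taken from any particular synthesizer.
AH = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
samples = synthesize_vowel(AH)
```

Changing the formant list changes the vowel while the pitch stays the same, and changing f0 changes the pitch while the vowel stays the same. That independence is the point of the approach: no recorded speech is involved, only a source and a set of resonances. The resulting list of floats in the range -1 to 1 could be scaled to 16-bit integers and written out with Python's standard wave module to listen to it.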

Links in this post:

"composition.codes": https://composition.codes/
"Slides here": https://brian.abelson.live/slides/saysynth.html
"Video here": https://www.youtube.com/watch?v=tX3nEPt0fKk
"SaySynth": https://gitlab.com/abelsonlive/saysynth