SaySynth: A Brief History of Speaking Machines

These are expanded notes from a talk I gave at composition.codes on December 21, 2025. Slides here. Video here.

SaySynth is a synthesizer I built on top of macOS’s text-to-speech framework, more popularly known as the say command. But to explain why I built it and why I think it matters, I want to take a detour through the history of speaking machines more broadly.

A Typology of Speaking Machines

There are roughly four kinds of speaking machines that have existed over time:

Mechanical: Literally physical. Bellows force air through a reed, with different knobs, valves, and whistles shaping different formants and phonemes. The human operator is part of the instrument.

Formant/Rule-Based: More like a synthesizer, with an oscillator and a comb filter simulating the resonant shape of the vocal tract. The system models the acoustics of speech without recording any actual speech.

Sample-Based (Concatenative): From something as crude as a toy with a phonograph inside, all the way to sophisticated “diphone” synthesizers that splice together recordings of every possible phoneme transition. GPS voices and automated customer service phone lines of the ’90s and 2000s were built this way.

Generative (Neural/AI): What most people think of today. These are basically sample-based systems taken to an extreme: instead of recordings of phoneme pairs, you’re dealing with individual digital samples predicted one at a time by a neural network.
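To make the formant/rule-based category a little more concrete, here is a minimal Python sketch of the source-plus-filter idea. It is not how SaySynth or the say command works. It is only an illustration, and it makes several assumptions: the source is a bare impulse train, the filters are two-pole resonators in series (swapped in here instead of a comb filter), the formant frequencies are textbook averages for an "ah" vowel, and the bandwidths are rough guesses. The names resonator, synthesize_vowel, and AH_FORMANTS are hypothetical.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonant filter: y[n] = x[n] + b1*y[n-1] + b2*y[n-2]."""
    # Pole radius sets the bandwidth; pole angle sets the center frequency.
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    b2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, f0=110.0, duration=0.5, sample_rate=16000):
    """Pass an impulse train (the 'glottis') through one resonator per formant."""
    n = int(duration * sample_rate)
    period = int(round(sample_rate / f0))  # samples between glottal pulses
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalize to the range [-1, 1]


# Assumed values: approximate (frequency, bandwidth) pairs in Hz for "ah".
AH_FORMANTS = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]

samples = synthesize_vowel(AH_FORMANTS)
```

The resulting samples list holds floats between -1 and 1. Scaled to 16-bit integers, they could be written out with Python's wave module to hear the result. No recorded speech is involved at any point, which is the defining trait of this category.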

