A Neural Network example: English pronunciation

Steve Crossan
3 min read · Feb 4, 2021

Summarising a previous post: What is a Neural Network?

  • A computer program takes input (for example, a set of numbers) and produces an output (for example, the average of those numbers).
  • Classically, a programmer writes down the steps required (add up the numbers, then divide by how many there are) in a computer language. This series of steps is what we mean by an algorithm.
  • A neural network is a different kind of program. It also takes input and produces output, but it is programmed in a different way. We set up a set of nodes which can be thought of as simple mini-programs, or program building blocks. These nodes are connected together from input to output with strengths or weights on each connection.
  • Once set up, the network initially produces incorrect (random) output. But we “train” it with a series of examples of input and the correct output. With each example, we adjust each weight slightly in the direction that would make the output less wrong.
  • Over many examples, this can converge so that the network gives approximately correct answers to examples it hasn’t seen in training.
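The loop the bullets describe can be sketched in a few lines of Python. This toy “network” is a single linear node with two weights, trained to output the average of its two inputs; it is purely illustrative, not code from the original post:

```python
import random

random.seed(1)

# Toy "network": one linear node with two weighted inputs, trained to
# output the average of its inputs. The weights start random and are
# nudged toward "less wrong" after every example.
w = [random.random(), random.random()]
lr = 0.05  # learning rate: the size of each nudge

for _ in range(5000):
    a, b = random.random(), random.random()
    target = (a + b) / 2
    guess = w[0] * a + w[1] * b
    err = guess - target
    w[0] -= lr * err * a  # adjust each weight slightly in the
    w[1] -= lr * err * b  # direction that reduces the error

# Over many examples, both weights converge toward 0.5 —
# exactly the averaging formula.
```

Note that nothing in the loop “knows” the averaging formula; the weights simply drift toward it because that is the direction that makes each guess less wrong.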

In the last post we worked through an example of a neural network to estimate the average of 2 numbers. That’s a good way to get an idea of how it works, but it’s not a very good use of a neural network.

A better use: English pronunciation

Suppose you want to write a program that will take words in English and accurately tell you how to pronounce them.

If you were writing a classical program to solve this problem, you’d need a lot of rules, and there would be a lot of exceptions. A rule for pronouncing the letter ‘c’ might be:

def pronounce_c(word, i):
    # A 'c' before 'i', 'e' or 'y' is soft (like 's'); otherwise hard (like 'k').
    nxt = word[i + 1] if i + 1 < len(word) else ""
    return "s" if nxt in ("i", "e", "y") else "k"

# exception: 'celtic'

A neural network approach would be different. First, you’d set up your network — capable of taking an input word and producing an output of the set of phonemes corresponding to that word.

[Figure: the network, with input at the bottom and output at the top]

[Phonemes, by the way, are just a standardised way of writing down pronunciation. For example, the set of phonemes for ‘celtic’ with the hard ‘c’ at the beginning is “/kɛltɪk/”.]

As before, the network is set up with random weights connecting its nodes, and we train it by giving it examples for which we know the answer, and asking it to guess. We then adjust the weights each time in the direction that makes the guess a little less wrong.
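Zooming in on one guess-and-adjust cycle: the sketch below trains a single logistic node to decide soft vs hard ‘c’ from the letter that follows it. This is a deliberately tiny, invented example to show the cycle, not part of the real system:

```python
import math
import random

random.seed(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Examples we know the answer to: the letter after 'c', and whether
# that makes the 'c' soft (1, like 's') or hard (0, like 'k').
examples = [("i", 1), ("e", 1), ("y", 1),
            ("a", 0), ("o", 0), ("u", 0), ("l", 0), ("r", 0)]

# One weight per possible following letter, starting near zero (random).
w = [random.gauss(0, 0.1) for _ in ALPHABET]

def guess(ch):
    """The network's guess that 'c' before ch is soft, between 0 and 1."""
    return 1 / (1 + math.exp(-w[ALPHABET.index(ch)]))

for _ in range(2000):
    ch, answer = random.choice(examples)
    err = guess(ch) - answer
    w[ALPHABET.index(ch)] -= 0.5 * err  # nudge the weight toward less wrong

# After training, guess("i") is close to 1 and guess("a") close to 0.
```

The network is never told the soft-c rule; the repeated small nudges recover it from the examples alone.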

This is a real example: it’s called NETtalk, and it was published in 1986 by Sejnowski and Rosenberg as an early use of a multi-layer neural network.

It was trained on 16,000 words and their correct phonemes from the dictionary, and was then able to pronounce 4,000 unseen test words with 90% accuracy, using a network with 80 hidden nodes between input and output.
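The shape of that computation can be sketched as a forward pass: the network reads a 7-letter window of text and predicts the phoneme for the centre letter. The unit counts below (a 29-symbol alphabet, 80 hidden nodes, 26 output units) follow common descriptions of the original design, but the weights here are random and untrained, so this only illustrates the architecture:

```python
import math
import random

random.seed(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyz ._"  # 29 symbols: letters, space, punctuation
WINDOW = 7                                   # letters the network sees at once
N_IN = WINDOW * len(ALPHABET)                # one-hot encoding of the window
N_HIDDEN = 80                                # hidden nodes, as in the post
N_OUT = 26                                   # output units describing the phoneme

# Random (untrained) weights between the layers.
W1 = [[random.gauss(0, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_IN)]
W2 = [[random.gauss(0, 0.1) for _ in range(N_OUT)] for _ in range(N_HIDDEN)]

def encode(window):
    """One-hot encode a 7-character window of text."""
    x = [0.0] * N_IN
    for pos, ch in enumerate(window):
        x[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return x

def forward(window):
    """Predict phoneme features for the centre letter of the window."""
    x = encode(window)
    h = [math.tanh(sum(x[i] * W1[i][j] for i in range(N_IN)))
         for j in range(N_HIDDEN)]
    return [1 / (1 + math.exp(-sum(h[j] * W2[j][k] for j in range(N_HIDDEN))))
            for k in range(N_OUT)]

out = forward("celtic ")  # 26 numbers between 0 and 1
```

Training adjusts W1 and W2 over many (window, phoneme) examples, exactly as in the averaging toy above, just with far more weights.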

[A side note: 90% accuracy might sound impressive given how this was done, but in practice it means getting one word in ten wrong. However, by experimenting with different network architectures and ways of presenting the training data, other researchers were subsequently able to improve accuracy significantly.]


