A time-delay neural network architecture for isolated word recognition

tags
keywords: Constrained links, Isolated word recognition, Multiresolution learning, Multispeaker speech recognition, Network architecture, Neural networks, Time delays
author: Lang, K. J., Waibel, A. H., & Hinton, G. E.
url: [Science Direct] [PDF]
date: 1990

Summary
Motivation
Key Idea
Method
Dataset
Appendix
- Feature Extraction
- Viterbi Alignment

Summary

In the authors’ own words, this paper documents, in the context of a “difficult” word recognition task, the success of

a network that extracted features by repeatedly convolving a set of narrow weight patterns with the contents of a sliding window into the input.

This widely referenced early work laid the foundation for more recent 1-D convolution-based approaches, such as Jasper and its derivatives.

See also: Dataset, for a better understanding of the problem.

Motivation

[…] the Viterbi-aligned speech fragments contained enough alignment errors to motivate a shift-invariant [model]

Ref: Viterbi Alignment

Key Idea

Neural networks are universal approximators;
Convolution controls model size; and
Pooling leads to positional invariance.
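
The last two points can be illustrated with a toy sketch. It is not the authors' network: the kernel width, the number of output units, and the random weights are illustrative assumptions, and biases and nonlinearities are left out. It only shows that a narrow weight pattern slid over time needs a number of weights that does not depend on \(T\), and that pooling the squared responses over time gives the same score wherever the event occurs.

```python
import numpy as np

def conv1d_valid(x, w):
    """Slide the kernels w (n_out, C, width) along the time axis of x (C, T)."""
    n_out, _, width = w.shape
    t_out = x.shape[1] - width + 1
    out = np.empty((n_out, t_out))
    for t in range(t_out):
        # Every output frame reuses the same weights on a narrow input window.
        out[:, t] = np.tensordot(w, x[:, t:t + width], axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)
C, T, width = 16, 12, 3            # width and n_out=4 are illustrative choices
w = rng.normal(size=(4, C, width))

# The same short event placed at two time offsets in an otherwise silent input.
event = rng.normal(size=(C, width))
x_early = np.zeros((C, T))
x_early[:, 2:2 + width] = event
x_late = np.zeros((C, T))
x_late[:, 6:6 + width] = event

# Sum-of-squares pooling over time discards where the event occurred.
pooled_early = (conv1d_valid(x_early, w) ** 2).sum(axis=1)
pooled_late = (conv1d_valid(x_late, w) ** 2).sum(axis=1)

print(np.allclose(pooled_early, pooled_late))  # True
print(w.size)  # 192 weights, independent of T
```

The invariance holds here because the event stays fully inside the input at both offsets; an event clipped by the edge of the window would change the pooled score.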

Method

Let

\(\text{data}\)	be a set of \((y,\mathbf{x})\) pairs; \(\mathbf{x}\) being the input and \(y\in\mathbb{Z}_{\geqslant0}\) being its label.
\(\mathbf{x}\in\mathbb{X}\subseteq\mathbb{R}^{C\times T}\)	be the input; \(C=16\) being the number of channels and \(T\) being the temporal resolution.
\(\Phi(\;\cdot;\theta):\mathbb{X}\to\mathbb{R}^{K\times T'}\)	be a 1-D convolutional neural network whose output is a \(T'\)-long sequence of \(K\)-dimensional vectors representing word probabilities; \(K\) being the vocabulary size and \(\theta\) being the network parameters.
\(\boldsymbol{1}_{k}:\mathbb{Z}_{\geqslant0}\to\mathbb{R}^{k}\)	be the converter from a whole number to a \(k\)-dimensional one-hot vector.
\(\parallel\,\cdot\,\parallel_{p,\mathrm{row}}\)	be an operator that computes the \(L_{p}\) norm of each row vector in the tensor, defined here for mathematical convenience.
\(\mathbf{x}^{\otimes n}\)	be the element-wise exponentiation operation, defined here for mathematical convenience.
\(\Delta_{E}\)	be the Euclidean distance between two vectors.

\begin{align} \notag \theta_* &= \arg \min_{\theta} \underset {y,\mathbf{x} \sim\text{data}} {\mathbb{E}} \left[ \Delta(y, f(\mathbf{x};\theta)) \right] \\ \notag f(\mathbf{x};\theta) &= \left\| \Phi(\mathbf{x}; \theta) \right\|_{2,\mathrm{row}}^2 \\ \notag \Delta(y, \widetilde{\mathbf{y}}) &= \Delta_E \left(\boldsymbol{1}_K(y), \widetilde{\mathbf{y}} \right) \end{align}
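
As a minimal sketch of the objective above, the per-sample loss can be written in NumPy. The array `z` stands in for \(\Phi(\mathbf{x};\theta)\) and its values are made up for illustration; the function names are hypothetical and no actual network is evaluated.

```python
import numpy as np

def one_hot(y, k):
    """Map a whole-number label y to a k-dimensional one-hot vector."""
    v = np.zeros(k)
    v[y] = 1.0
    return v

def pooled_score(z):
    """Squared L2 norm of each row of z (K, T'): sum of squares over time."""
    return (z ** 2).sum(axis=1)

def sample_loss(y, z):
    """Euclidean distance between the one-hot target and the pooled output."""
    return float(np.linalg.norm(one_hot(y, z.shape[0]) - pooled_score(z)))

# Stand-in for Phi(x; theta): K=4 words, T'=5 output frames (hypothetical values).
z = np.zeros((4, 5))
z[2, 1] = 0.6
z[2, 3] = 0.8   # 0.36 + 0.64, so word index 2 pools to about 1.0

print(pooled_score(z))
print(sample_loss(2, z))   # close to 0 for the matching label
print(sample_loss(0, z))   # close to sqrt(2) for a wrong label
```

Averaging `sample_loss` over the \((y,\mathbf{x})\) pairs approximates the expectation that is minimized over \(\theta\).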

Update 2024-08-28¹.

Dataset

400 samples

the four words, “bee,” “dee,” “ee,” and “vee” [B, D, E, and V] were used; earlier IBM research had shown that these four words were the most confusable members of the E-set of the alphabet.

Appendix

Feature Extraction

The initial spectrogram features were extracted from a \(150\) ms waveform as log energy values of shape \(128\times48\): \(128\) frequencies \(\times\,48\) frames, each lasting \(3\) ms. In an experiment on input feature size, however, the 2-layer hidden model taking a \(16\times12\) input exhibited the best performance: \(16\) frequency bands on a linear scale \(\times\,12\) frames, each lasting \(12\) ms.

[…] the program converted our 150 ms waveform samples into spectrograms containing 128 log energies ranging up to 8 kHz, and 49 time frames of 3 ms each. The first frame of each spectrogram was then discarded so that there would be 48 time steps (a highly factorizable number), and the DC bias component of each frame was set to zero. Because each of the 48 time frames represented 3 ms, the final duration of the spectrograms was 144 ms.
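
The frame arithmetic above can be checked with a short sketch. The spectrogram values are placeholders, and reducing \(128\times48\) to \(16\times12\) by averaging equal \(8\times4\) blocks is an assumption made for illustration; these notes do not record how the coarser input was actually computed.

```python
import numpy as np

# Placeholder spectrogram: 128 log energies x 49 frames of 3 ms each.
spec = np.arange(128 * 49, dtype=float).reshape(128, 49)

# Discard the first frame so that 48 time steps remain.
spec = spec[:, 1:]
duration_ms = spec.shape[1] * 3          # 48 frames * 3 ms

def block_mean(s, n_bands, n_frames):
    """Average equal-sized blocks (assumed reduction, see lead-in)."""
    f, t = s.shape
    blocks = s.reshape(n_bands, f // n_bands, n_frames, t // n_frames)
    return blocks.mean(axis=(1, 3))

coarse = block_mean(spec, 16, 12)        # 8 frequencies x 4 frames per cell
coarse_frame_ms = (spec.shape[1] // 12) * 3

print(spec.shape, duration_ms)           # (128, 48) 144
print(coarse.shape, coarse_frame_ms)     # (16, 12) 12
```

Both resolutions cover the same 144 ms: \(48\times3\) ms at the fine scale and \(12\times12\) ms at the coarse one.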

Viterbi Alignment

In prior work at IBM, a hidden Markov model (HMM) was used to model the distribution of labels and spoken words. With the word identity known, the Viterbi search produced the most likely sequence of labels, one for each frame of the utterance.
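
For readers unfamiliar with the Viterbi search, a generic sketch follows. It is not the IBM system: the two states (consonant, vowel), the transition and emission probabilities, and the function name are all invented for illustration. It only shows how a most likely per-frame label sequence is recovered from an HMM by dynamic programming.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence for an HMM, in log space.

    log_init: (S,) initial log-probabilities
    log_trans: (S, S) log-probabilities, indexed [from, to]
    log_emit: (T, S) per-frame log-likelihood of each state
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        # cand[i, j]: best path ending in state i at t-1, then moving i -> j.
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best final state back to the first frame.
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy left-to-right model: state 0 = consonant, state 1 = vowel (made-up numbers).
with np.errstate(divide="ignore"):       # log(0) = -inf marks forbidden moves
    log_init = np.log(np.array([1.0, 0.0]))
    log_trans = np.log(np.array([[0.5, 0.5],
                                 [0.0, 1.0]]))
    log_emit = np.log(np.array([[0.9, 0.1],
                                [0.8, 0.2],
                                [0.2, 0.8],
                                [0.1, 0.9],
                                [0.1, 0.9]]))

labels = viterbi(log_init, log_trans, log_emit)
print(labels)  # [0, 0, 1, 1, 1]
```

In the alignment setting the word is known, so the model is fixed in advance and the search only decides where each label begins and ends.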

These labels were used to extract a 150 ms salient section of each utterance which included 100 ms before the first frame that was labeled “E” (this region should contain the consonant), plus 50 ms of the vowel.
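
The extraction rule quoted above can be sketched as follows. The label names and the 10 ms frame duration are hypothetical, since these notes do not state the frame rate of the alignment; only the 100 ms and 50 ms figures come from the quote.

```python
def salient_window(frame_labels, frame_ms, before_ms=100, vowel_ms=50):
    """Return (start_ms, end_ms) of the salient section of one utterance.

    The window starts before_ms ahead of the first frame labeled "E"
    and ends vowel_ms after the start of that frame.
    """
    first_e = frame_labels.index("E")    # raises ValueError if no "E" frame
    onset_ms = first_e * frame_ms
    start_ms = onset_ms - before_ms
    if start_ms < 0:
        raise ValueError("less than before_ms of signal precedes the vowel")
    return start_ms, onset_ms + vowel_ms

# Hypothetical per-frame labels for the word "B", one label per 10 ms frame.
labels = ["noise"] * 12 + ["onset"] * 3 + ["B"] * 4 + ["E"] * 20 + ["noise"] * 5

start, end = salient_window(labels, frame_ms=10)
print(start, end, end - start)  # 90 240 150
```

Any error in where the first "E" label falls shifts the whole 150 ms window, which is the kind of misalignment that motivates a shift-invariant model.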

HMM Model Details:

[…] the words B, D, and V are modelled by a concatenation of the state machines for noise, voiced consonant onset, {B,D,V}, E, E trail-off, and noise. The word E is modelled by a concatenation of the state machines for noise, E onset, E, E trail-off, and noise. The state machines contain 3 main states with associated transitions to model the beginning, middle, and end of each phone. The consonant and vowel machines include self-loops to model steady-state portions of the acoustic signal, and all of the machines include null transitions to model short durations.

Footnotes:

¹ (Update 2024-08-28) Removed the redundant average pooling operator (earlier \(f(\mathbf{x};\theta) = \mathrm{avg} \circ \left\| \Phi(\mathbf{x}; \theta) \right\|_{2,\mathrm{row}}^2\)). The Frobenius norm at row level effectively defines the sum-of-squares pooling used in the TDNN paper, so the vague average pooling operator is eliminated.