Connectionist temporal classification:
Labelling unsegmented sequence data with recurrent neural networks
- author
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J.
- url
- [ACM], [PDF from toronto.edu], [PDF from tum.de], [Pytorch], [Tensorflow], [Julia]
- date
- 2006
- booktitle
- Proceedings of the 23rd International Conference on Machine Learning
Summary
- a type of neural network output and associated scoring function (for RNNs);
- designed to tackle sequence problems where the timing is variable;
- CTC refers to the outputs and scoring, and is independent of the underlying neural network structure.
For example, in speech audio there can be multiple time slices which correspond to a single phoneme. Since we don't know the alignment of the observed sequence with the target labels we predict a probability distribution at each time step. — Wikipedia
See also: https://distill.pub/2017/ctc/
Key Observation
An input waveform for the word HELLO may vary in the following ways:
- A quick or slow speaker may stretch it to varying lengths, e.g. HELLLOO vs HEELLLOOOO; the same extends to syllable stresses and intonations in finer detail;
- The start points and blanks may vary, e.g. ---HEE-LLOO- vs -HELLLOO-.
We call this an alignment problem: given an alphabet, say \(\{H,E,L,O\}\), and a sequence \([x_1,\ldots,x_T]\), find the correspondence between them.
Inherent to this formulation, there is no way to distinguish between HELO and HELLO; this problem is akin to mode collapse. To this end, the authors introduce a special character called the CTC blank, \(\epsilon\), which suppresses this collapse: LLLL \(\to\) L in post-processing, but LL-LL \(\to\) LL. Hence, the alphabet becomes \(\{H,E,L,O,\epsilon\}\).
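For concreteness, a minimal sketch of this collapsing rule in Python (using `-` for the blank; the function name is illustrative):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a raw alignment: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:  # a new character that is not a blank
            out.append(ch)
        prev = ch
    return "".join(out)

assert ctc_collapse("LLLL") == "L"      # repeats merge into a single L
assert ctc_collapse("LL-LL") == "LL"    # the blank preserves the double L
assert ctc_collapse("-HEE-LLOO-") == "HELO"
assert ctc_collapse("HHE-LL-LO") == "HELLO"
```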
The number of possible alignments grows polynomially in the alphabet size and exponentially in the sequence length.
The Problem
Minimise transcription mistakes in speech-to-text or handwriting-to-text, where the natural measure is the label error rate \(\mathrm{LER}\) of a temporal classifier \(h\), defined as follows.
\begin{align} \notag \mathrm{LER}(h,\mathbb{S}^{\prime}) &= \mathop{\mathbb{E}}_{(\mathbf{x},\mathbf{z}) \sim \mathbb{S}^{\prime}} \left[\frac{\mathrm{ED}(h(\mathbf{x}), \mathbf{z})}{|\mathbf{z}|} \right] \end{align}where,
- \(\mathbb{S}^{\prime}\subset \mathcal{D}_{\mathcal{X} \times \mathcal{Z}}\) is the test sample;
- \(\mathrm{ED}\) is the edit distance.
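A plain-Python sketch of this measure, assuming \(h\) returns a decoded string and using a standard dynamic-programming edit distance (names are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b, O(|a||b|)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def label_error_rate(h, test_set):
    """Mean edit distance of h(x) to z, normalised by |z|."""
    return sum(edit_distance(h(x), z) / len(z) for x, z in test_set) / len(test_set)

assert edit_distance("HELO", "HELLO") == 1
```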
Key Contribution
The authors define the probability of an output sequence as the marginal over all alignments that collapse to it, and train by maximising its likelihood:
\begin{align} \notag p(Y\mid X) &= \sum_{A\,\in\,\mathcal{A}(X,Y)} \prod_{t=1}^{T} p_t(\mathbf{a}_t\mid X) \end{align}where,
- \(X\equiv[\mathbf{x}_1,\ldots,\mathbf{x}_n]\) be the input sequence;
- \(Y\equiv[\mathbf{y}_1,\ldots,\mathbf{y}_m]\) be the output sequence;
- \(A\equiv[\mathbf{a}_1,\ldots,\mathbf{a}_T]\) be an alignment between \(X\) and \(Y\) (also the network output); and
- \(\mathcal{A} (X,Y)\) be the space of all such alignments.
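To make the marginalisation concrete, a brute-force sketch that enumerates every length-\(T\) path and sums those that collapse to the target (feasible only for tiny \(T\); `probs[t, k]` is an assumed per-step softmax output):

```python
import itertools
import numpy as np

def ctc_prob_bruteforce(probs, target, blank=0):
    """p(Y|X) by summing prod_t p_t(a_t|X) over all paths that collapse
    to `target`; O(|L'|^T), for intuition only."""
    T, num_labels = probs.shape  # probs[t] is a distribution over L'
    total = 0.0
    for path in itertools.product(range(num_labels), repeat=T):
        # collapse: merge consecutive repeats, then drop blanks
        collapsed = [k for i, k in enumerate(path)
                     if k != blank and (i == 0 or k != path[i - 1])]
        if collapsed == list(target):
            total += np.prod([probs[t, k] for t, k in enumerate(path)])
    return total
```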
Implementation Details
Inputs: \(X\in\mathbb{R}^{(\cdot)\times n}\) is a spectrogram-like audio input, e.g. MFCC features, providing an \(n\)-step sequence of input frames.
Network Output (or RNN output) is the alignment \(A\in\mathbb{R}^{|L'|\times T}\). Here \(L'\equiv L\cup\{\epsilon\}\) is the alphabet augmented with the CTC blank.
Targets \(Y\in\mathbb{R}^{m}\) (also known as \(Y_{\text{mask}}\) in some implementations) is the output sequence augmented by blanks, e.g. -HEL-LO- or HEL-LO for HELLO, given as a sequence of indices or, rarely, as tokens, depending on the implementation.
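As a usage sketch with the linked PyTorch implementation (shapes per torch.nn.CTCLoss; note it takes raw label indices and inserts the blanks internally, with index 0 reserved for \(\epsilon\); all values below are illustrative):

```python
import torch
import torch.nn as nn

T, N, C, m = 50, 16, 5, 10  # time steps, batch, |L'| with blank at index 0, target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # stand-in for RNN output
targets = torch.randint(1, C, (N, m))                 # label indices, blank excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), m, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradient flows through the forward-backward recursion below
```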
Backpropagation with Forward-Backward Algorithm
The alignment space \(\mathcal{A} (X,Y)\) grows exponentially in the sequence length, i.e. \(\mathcal{O}(m^{T})\).
But, similar to HMMs, a recursive definition enables us to compute the loss efficiently, in \(\mathcal{O}(m^{2}T)\).
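A sketch of the forward (\(\alpha\)) half of that recursion in NumPy, assuming a \((T\times|L'|)\) matrix of per-step log-probabilities with the blank at index 0; the backward (\(\beta\)) pass is symmetric, and together they yield the gradient of \(-\ln p(Y\mid X)\):

```python
import numpy as np

def ctc_forward(log_probs, target, blank=0):
    """CTC forward (alpha) recursion over the blank-augmented target;
    log_probs is (T, |L'|), returns log p(Y|X)."""
    T = log_probs.shape[0]
    ext = [blank]
    for k in target:          # interleave blanks: -H-E-L-L-O-
        ext += [k, blank]
    S = len(ext)              # 2m + 1 states
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, blank]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]                 # stay in the same state
            if s > 0:
                terms.append(alpha[t - 1, s - 1])     # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])     # skip a blank between distinct labels
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

Each state has at most three predecessors, so the table fills in \(\mathcal{O}(mT)\) time, comfortably within the \(\mathcal{O}(m^{2}T)\) bound above.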