Jasper
An End-to-End Convolutional Neural Acoustic Model
- tags
- Interspeech 2019
- keywords
- Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- author
- Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., …
- url
- [arXiv], [Papers With Code]
Exercise
If the following equation describes the Jasper model,
\begin{align} \notag \theta_* &= \arg\min_{\theta} \underset {(X,Y) \sim \mathcal{D}_{\mathcal{X}\times\mathcal{Y}}} {\mathbb{E}} \left[ \Delta(Y, \mathcal{F}(X;\theta)) \right] \end{align}
define \(X, Y, \mathcal{F}, \theta, \theta_*\).
- Formally define \(\mathrm{MSE} (\mathbf{y}, \widetilde{\mathbf{y}})\) as the mean-squared-error measure between the target and prediction vectors.
- What are the conditions under which \(\mathrm{MSE} (\mathbf{y}, \widetilde{\mathbf{y}}) \equiv \Delta_E (\mathbf{y}, \widetilde{\mathbf{y}}) \equiv \|\mathbf{y} - \widetilde{\mathbf{y}}\|_F^2\)?
- How is CTC loss different from MSE loss as a training objective?
- To learn a model for a sequence-to-sequence mapping such as speech to text, would you recommend the Softmax Cross-Entropy classification loss or the Connectionist Temporal Classification (CTC) loss? Justify your recommendation. (A usage sketch of these losses follows this list.)
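For concreteness, a minimal usage sketch of the three losses mentioned above, assuming PyTorch (the tensor shapes and alphabet size are illustrative only):

```python
import torch
import torch.nn as nn

# MSE: prediction and target must have identical shapes.
pred, target = torch.randn(4, 10), torch.randn(4, 10)
mse = nn.MSELoss()(pred, target)

# Softmax Cross-Entropy: one class label per prediction.
logits = torch.randn(4, 28)                     # (batch, classes)
labels = torch.randint(0, 28, (4,))
ce = nn.CrossEntropyLoss()(logits, labels)

# CTC: frame-wise log-probabilities vs. a shorter label sequence;
# no frame-level alignment between input and target is required.
T, N, C = 50, 4, 28                             # frames, batch, alphabet (blank = 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 12))          # grapheme indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)
ctc = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```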
Prior Art
Datasets
- LibriSpeech (ICASSP ’15)
- WSJ: LDC93S6A (WSJ0), LDC94S13A (WSJ1)
- 2000hr Fisher+Switchboard (F+S): LDC2004S13, LDC2005S13, LDC97S62
Ideas and Strategy
Technology
- Normalisation
- Batch Norm; See also: [arXiv];
- Weight Norm; See also: [arXiv];
- Layer Norm; See also: [arXiv];
- Activation
- Gated Units
Contribution
- Jasper Model; and Dense Residual Topology;
- Evidence-based insights on convergence and non-convergence;
- NovoGrad Solver (like Adam);
- WER improvement on LibriSpeech test-clean.
Jasper Model
All figures and tables here are reproduced from the paper [LLGL+19].
Model search across
- 3 types of normalisation
- Batch Norm, Weight Norm, Layer Norm;
- 3 types of activations
- ReLU, clipped ReLU, Leaky ReLU;
- 2 types of Gates
- Gated Linear Units, Gated Activation Units;
- Architecture
- \(B\times R\) parameterisation, Dense Residual.
Jasper \(B\times R\) Architecture
Figure 1: Jasper \(B\times R\) model: \(B\): number of blocks; \(R\): number of sub-blocks.
# Blocks | # Sub-blocks | Block | Kernel | # Output Channels | Dropout |
---|---|---|---|---|---|
1 | 1 | Conv1 | 11 (stride 2) | 256 | 0.2 |
2 | 5 | B1 | 11 | 256 | 0.2 |
2 | 5 | B2 | 13 | 384 | 0.2 |
2 | 5 | B3 | 17 | 512 | 0.2 |
2 | 5 | B4 | 21 | 640 | 0.3 |
2 | 5 | B5 | 25 | 768 | 0.3 |
1 | 1 | Conv2 | 29 (dilation 2) | 896 | 0.4 |
1 | 1 | Conv3 | 1 | 1024 | 0.4 |
1 | 1 | Conv4 | 1 | # graphemes | 0 |
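A minimal sketch of one sub-block, assuming PyTorch (the `JasperSubBlock` class and its hyper-parameters are illustrative, not the authors' implementation): each sub-block applies a 1-D convolution, batch norm, ReLU and dropout, with a block's residual input added before the activation of its final sub-block.

```python
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """Conv1d -> BatchNorm -> (optional residual) -> ReLU -> Dropout."""
    def __init__(self, in_ch, out_ch, kernel, dropout=0.2, stride=1, dilation=1):
        super().__init__()
        pad = (dilation * (kernel - 1)) // 2          # "same"-style padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride,
                              dilation=dilation, padding=pad)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x, residual=None):
        y = self.bn(self.conv(x))
        if residual is not None:                      # last sub-block of a block
            y = y + residual
        return self.drop(self.act(y))

# e.g. the first sub-block of B2 in the table above: 256 -> 384 channels, kernel 13
x = torch.randn(8, 256, 200)                          # (batch, channels, time)
y = JasperSubBlock(256, 384, 13, dropout=0.2)(x)
```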
Jasper Dense Residual Architecture
Figure 2: Jasper Dense Residual Model
Appendix
Covariate Shift
One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer especially if these outputs change in a highly correlated way.
— Layer Normalisation Paper
Batch Norm
Figure 3: Courtesy: Batch Norm Paper
Weight Norm
Each neuron in an artificial neural network may be represented as,
\begin{align} \notag y &= \phi(\mathbf{w}\cdot\mathbf{x}+b) \end{align}
where,
- \(\mathbf{x}\) is a \(k\)-dimensional vector of input features,
- \(\mathbf{w}\) is a \(k\)-dimensional vector of learnable weights,
- \(b\) is a (learnable) scalar bias term, and
- \(\phi\) denotes an element-wise non-linearity.
The key idea in weight normalisation is to re-parameterise the weight vector, as
\begin{align} \notag \mathbf{w} &= \frac{g}{\|\mathbf{v}\|} \mathbf{v} \end{align}
so that,
- \(\mathbf{v}\) is a \(k\)-dimensional vector of learnable weights,
- \(g\) is a learnable scalar parameter, and
- \(\|\mathbf{w}\|=g\), independent of \(\mathbf{v}\).
Instead of working with \(g\) directly, we may also use an exponential parameterisation for the scale,
\begin{align} \notag g &= e^s \end{align}
where \(s\) is a log-scale learnable scalar parameter.
For more details, please see \(\S 2.1\) and \(\S 2.2\) of the weight norm paper.
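As a concrete sketch of the re-parameterisation, assuming PyTorch (a manual version alongside the `torch.nn.utils.weight_norm` helper):

```python
import torch
import torch.nn as nn

# Manual re-parameterisation: w = g * v / ||v||
k = 16
v = torch.randn(k, requires_grad=True)      # direction (learnable)
g = torch.ones(1, requires_grad=True)       # scale (learnable); or g = exp(s)
w = g * v / v.norm()                        # ||w|| == g, independent of v

# Library helper: stores each output unit's weight vector as (g, v) and
# recomputes w = g * v / ||v|| on every forward pass.
layer = nn.utils.weight_norm(nn.Linear(k, 4))
print([name for name, _ in layer.named_parameters()])   # bias, weight_g, weight_v
```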
Layer Norm
The \(l^{\text{th}}\) layer of a feed-forward neural network, with inputs \(\mathbf{h}^l\), weight matrix \(W^{l}\) and non-linear activation \(f\), may be written as,
\begin{align} \notag a_i^l &= {\mathbf{w}_{:,i}^l}^\top\mathbf{h}^l \qquad h_i^{l+1} = f(a_i^l+b_i^l) \end{align}
Batch Norm may then be summarised as,
\begin{align} \notag h_i^{l+1} = f(\hat{a}_i^l+b_i^l) &\qquad \hat{a}_i^l = \frac{g_i^l}{\sigma_i^l} (a_i^l - \mu_i^l) \\ \notag \mu_i^l = \underset{\mathbf{x}\sim P(\mathbf{x})} {\mathbb{E}} \left[a_i^l\right] &\qquad \sigma_i^l = \sqrt{\underset{\mathbf{x}\sim P(\mathbf{x})} {\mathbb{E}} \left[\left(a_i^l - \mu_i^l\right)^2\right]} \end{align}
It is typically impractical to compute these expectations exactly, since that would require forward passes through the whole training dataset with the current set of weights. Instead, \(\mu\) and \(\sigma\) are estimated from the empirical samples in the current mini-batch.
Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot.
We thus compute the layer normalisation statistics over all the hidden units in the same layer as follows:
\begin{align} \notag \mu_i^l = \mu^l &= \frac1H\sum_{i=1}^{H}a_i^l \\ \notag \sigma_i^l = \sigma^l &= \sqrt{ \frac1H \sum_{i=1}^H \left( a_i^l - \mu^l \right)^2 } \end{align}
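The difference between the two thus reduces to the axis over which the statistics are estimated; a minimal numerical sketch, assuming PyTorch and a batch of pre-activations of shape (batch, hidden):

```python
import torch

a = torch.randn(32, 128)      # pre-activations: (batch size, hidden units H)

# Batch Norm: one (mu, sigma) per hidden unit, estimated over the mini-batch.
bn_mu = a.mean(dim=0)                                   # shape (128,)
bn_sigma = a.std(dim=0, unbiased=False)

# Layer Norm: one (mu, sigma) per example, computed over the H hidden units.
ln_mu = a.mean(dim=1, keepdim=True)                     # shape (32, 1)
ln_sigma = a.std(dim=1, unbiased=False, keepdim=True)

a_bn = (a - bn_mu) / bn_sigma   # depends on the other examples in the batch
a_ln = (a - ln_mu) / ln_sigma   # independent of the rest of the batch
```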
Figure 4: Courtesy: Layer Norm Paper
Sigmoid Activation
Error Function
Figure 5: Image Courtesy: Wikipedia
Sigmoid (Logistic Function)
Figure 6: Image Courtesy: Wikipedia
Other Sigmoidal Functions
Figure 7: Image Courtesy: Wikipedia
- Hyperbolic Tangent
- \begin{align} \notag \mathrm{tanh}\;x &= \frac {e^x - e^{-x}}{e^x + e^{-x}} \end{align}
- Arc Tangent
- \begin{align} \notag y &= \mathrm{arctan}\;x \iff x = \tan y; \quad y \in \left(-\frac\pi2,\frac\pi2\right) \end{align}
- Gudermannian Function
- \begin{align} \notag \mathrm{gd}(x) &=\int_0^x \frac{\mathrm{d}t}{\mathrm{cosh}\;t} = 2\;\mathrm{arctan}\left(\mathrm{tanh}\left(\frac x2 \right) \right) \end{align}
- Algebraic Functions
- \begin{align} \notag f(x) &= \frac{x}{\left(1+|x|^k\right)^{1/k}} \\ \notag &= \frac{x}{\left(1+|x|\right)}; \qquad k=1 \\ \notag &= \frac{x}{\sqrt{1+x^2}}; \qquad k=2 \end{align}
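A small numerical sketch of these sigmoidal functions, assuming NumPy (values only, no plotting):

```python
import numpy as np
from math import erf

x = np.linspace(-5, 5, 11)

logistic = 1.0 / (1.0 + np.exp(-x))               # sigmoid (logistic function)
tanh_x   = np.tanh(x)                             # hyperbolic tangent
arctan_x = np.arctan(x)                           # arc tangent
gd_x     = 2.0 * np.arctan(np.tanh(x / 2.0))      # Gudermannian function
alg_k1   = x / (1.0 + np.abs(x))                  # algebraic, k = 1
alg_k2   = x / np.sqrt(1.0 + x ** 2)              # algebraic, k = 2
erf_x    = np.array([erf(v) for v in x])          # error function
```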
Rectifier Activation
Figure 8: Image Courtesy: Wikipedia
- ReLU (Rectified Linear Unit)
- \begin{align} \notag \mathrm{ReLU}(x) &= x^+ = \max(0,x) = \frac{x+|x|}2 = \begin{cases} x;&\text{if } x>0, \\ 0;&\text{otherwise.} \end{cases} \end{align}
- Clipped ReLU
- \begin{align} \notag \mathrm{cReLU}(x;a) &= \max(0,\min(a,x)) \end{align}
e.g. ReLU6 in [PyTorch], [Keras]
- Parametric and Leaky ReLU
- \begin{align} \notag \mathrm{PReLU}(x; a) &= \begin{cases} x;&\text{if } x>0, \\ ax;&\text{otherwise.} \end{cases} \\ \notag \mathrm{LeakyReLU}(x) &= \mathrm{PReLU}(x, 0.01) \end{align}
- GELU (Gaussian Error Linear Unit)
- \begin{align} \notag \mathrm{GELU}(x) &= x\cdot\Phi(x) \\ \notag \frac{\partial}{\partial x} \mathrm{GELU}(x) &= x\cdot\Phi'(x) + \Phi(x) \end{align}
where \(\Phi(x) = \Pr(X\leqslant x)\) is the cumulative distribution function of the standard Gaussian.
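A compact sketch of the rectifier family above, assuming PyTorch (the helper functions are illustrative; the GELU uses the exact error-function form of \(\Phi\)):

```python
import math
import torch

def relu(x):            return torch.clamp(x, min=0.0)
def clipped_relu(x, a): return torch.clamp(x, min=0.0, max=a)   # a = 6 -> ReLU6
def prelu(x, a):        return torch.where(x > 0, x, a * x)
def leaky_relu(x):      return prelu(x, 0.01)

def gelu(x):
    # Phi(x): CDF of the standard Gaussian, via the error function
    phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    return x * phi

x = torch.linspace(-3, 3, 7)
print(relu(x), clipped_relu(x, 6.0), leaky_relu(x), gelu(x), sep="\n")
```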
Vanishing/Exploding Gradient Problem
Hochreiter’s work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients. With standard activation functions (Sec. 1), cumulative backpropagated error signals (Sec. 5.5.1) either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers or CAP depth (Sec. 3), or they explode. This is also known as the long time lag problem.
See also: Deep Learning by Jürgen Schmidhuber
Gating History
Gating was introduced in the LSTM paper in ’97 in order to address the vanishing/exploding gradient problem. Simply put, the gating mechanism is an element-wise multiplication of an input vector with a gate-activation vector; the gate, in turn, is computed from the input vector itself. For example, a basic gate may be formulated as,
\begin{align} \notag \mathbf{y} &= \mathbf{g} \otimes \mathbf{x} \\ \notag \mathbf{g} &= \sigma_{\otimes}(W\mathbf{x} + \mathbf{b}) \end{align}
where,
- \(\sigma_{\otimes}(\mathbf{x})\) is the element-wise sigmoid activation of the input vector \(\mathbf{x}\); and
- \(\otimes\) represents element-wise multiplication.
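A minimal sketch of this basic gate, assuming PyTorch (the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class BasicGate(nn.Module):
    """y = sigmoid(W x + b) * x  -- the gate is computed from x itself."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.linear(x))   # gate-activation vector in (0, 1)
        return g * x                        # element-wise multiplication

y = BasicGate(64)(torch.randn(8, 64))
```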
For a more involved use-case, let an RNN be defined for \(T\) time steps, with
- Given inputs as \(\{\mathbf{z}_1,\ldots,\mathbf{z}_T\}\);
- Cell States, \(\{\mathbf{c}_1,\ldots,\mathbf{c}_T\}\);
- Hidden States, \(\{\mathbf{h}_1,\ldots,\mathbf{h}_T\}\);
- Given initial states as \(\mathbf{c}_{0},\mathbf{h}_{0}\);
- Neural network \(\Phi(\mathbf{z},\mathbf{c},\mathbf{h})\) to compute the pre-gate activations.
LSTM
\(\forall t\in\{1,\ldots,T\}\),
GRU
\(\forall t\in\{1,\ldots,T\}\),
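The standard LSTM and GRU cell updates are available as built-in modules; a minimal sketch of unrolling them over \(T\) steps, assuming PyTorch (sizes are illustrative):

```python
import torch
import torch.nn as nn

T, batch, in_dim, hid = 10, 4, 32, 64
z = torch.randn(T, batch, in_dim)            # inputs z_1 .. z_T

lstm_cell = nn.LSTMCell(in_dim, hid)
gru_cell = nn.GRUCell(in_dim, hid)

# LSTM: carries both a cell state c_t and a hidden state h_t.
h, c = torch.zeros(batch, hid), torch.zeros(batch, hid)   # h_0, c_0
for t in range(T):
    h, c = lstm_cell(z[t], (h, c))

# GRU: merges the gates and keeps only a hidden state h_t.
h = torch.zeros(batch, hid)                                # h_0
for t in range(T):
    h = gru_cell(z[t], h)
```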
See also:
Gated Linear Unit
In the context of speech processing, let \(\tilde{X}=W*X; \tilde{X}\in\mathbb{R}^{n\times(\cdot)}, W\in\mathbb{R}^{n\times m\times k}, X\in\mathbb{R}^{m\times(\cdot)}\) represent a 1-D convolution operation with kernel size \(k\), input filters \(m\) and output filters \(n\). A gated linear unit (GLU) wraps a convolution layer with a linear activation and sigmoid gate as follows,
\begin{align} \notag h_l(X) &= (W*X+B) \otimes \sigma_{\otimes} (V*X+C) \end{align}
Since element-wise multiplication is a symmetric operation, this may equally well be interpreted as a linear gate over a sigmoid activation.
With hardware acceleration, this may be implemented as a single parallelised convolution with twice the number of output filters, namely \(W\in\mathbb{R}^{2n\times m\times k}\) and bias \(B\in\mathbb{R}^{2n\times(\cdot)}\), as follows,
\begin{align} \notag \tilde{X} &= W*X+B \\ \notag h_l(X) &= \tilde{X}_{:n} \otimes \sigma_{\otimes} (\tilde{X}_{n:}) \end{align}
See also: Gated Conv-Net Paper [arXiv]
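A sketch of this doubled-filter implementation, assuming PyTorch: `F.glu` performs exactly the channel split and sigmoid gating above, and applying `tanh` to the first half instead of leaving it linear gives the gated activation unit of the next section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

m, n, k = 80, 128, 5                 # input filters, output filters, kernel size
conv = nn.Conv1d(m, 2 * n, k)        # single convolution with doubled filters

X = torch.randn(8, m, 200)           # (batch, input filters, time)
Xt = conv(X)                         # \tilde{X}: (batch, 2n, time')

# GLU: first n channels are the linear part, last n channels feed the gate.
h_manual = Xt[:, :n] * torch.sigmoid(Xt[:, n:])
h_glu = F.glu(Xt, dim=1)             # identical to the manual version
assert torch.allclose(h_manual, h_glu)
```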
Gated Activation Unit
A gated activation unit (GAU) wraps a convolution layer with a hyperbolic tangent activation and a sigmoid gate as follows,
\begin{align} \notag \tilde{X} &= W*X+B \\ \notag h_l(X) &= \tanh_{\otimes} (\tilde{X}_{:n}) \otimes \sigma_{\otimes} (\tilde{X}_{n:}) \end{align}
Since element-wise multiplication is a symmetric operation, this may equally well be interpreted as a hyperbolic tangent gate over a sigmoid activation.
Word Error Rate
Word Error Rate is inspired by “word recognition” accuracy measure in cognitive psychology, which is “the ability of a reader to recognize written words correctly and virtually effortlessly.”
The experiments generally test the ability to recognise “isolated words,” without additional contextual information. (Trivia: testing whose ability, the reader’s or the model’s?)
WER is a special type of normalised edit distance, computed as the normalised number of operations required to transform the reference (target) into the hypothesis (prediction). The set of operations consists of substitutions, deletions and insertions.
Formally, if \(Y\) is the reference set and \(Y^{\prime}\) is the hypothesis,
\begin{align} \notag \mathrm{WER}(Y\to Y^\prime) &= \frac {|Y^\prime \setminus Y| + \left[ |Y|-|Y^\prime| \right]_+} {|Y|} \end{align}
where \(\setminus\) is the set difference operator.
Intuitively, we resolve two cases: either \(Y^{\prime}\) is larger than \(Y\), or it is not. In the former case, \(Y^\prime \setminus Y\) includes the substitutions as well as the insertions. In the latter case, \(Y^\prime \setminus Y\) includes only the substitutions; hence, the difference in size is added to account for the number of deletions.
The denominator is a normalisation factor.
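In practice, WER is usually computed from the word-level Levenshtein alignment rather than from set differences; a minimal sketch in plain Python (no external dependencies):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))   # 2 deletions / 6 words
```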