Jasper
An End-to-End Convolutional Neural Acoustic Model
- tags
- Interspeech 2019
- keywords
- Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- author
- Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., …
- url
- [arXiv], [Papers With Code]
Exercise
If the following equation describes the Jasper model,
\begin{align} \notag \theta_* &= \arg\min_{\theta} \underset {(X,Y) \sim \mathcal{D}_{\mathcal{X}\times\mathcal{Y}}} {\mathbb{E}} \left[ \Delta(Y, \mathcal{F}(X;\theta)) \right] \end{align}
define \(X, Y, \mathcal{F}, \theta, \theta_*\).
- Formally define \(\mathrm{MSE} (\mathbf{y}, \widetilde{\mathbf{y}})\) as the mean-squared-error measure between the target and prediction vectors.
- What are the conditions under which \(\mathrm{MSE} (\mathbf{y}, \widetilde{\mathbf{y}}) \equiv \Delta_E (\mathbf{y}, \widetilde{\mathbf{y}}) \equiv \|\mathbf{y} - \widetilde{\mathbf{y}}\|_F^2\)?
- How is CTC loss different from MSE loss as a training objective?
- To learn a model for a sequence-to-sequence mapping such as speech to text, would you recommend the Softmax Cross-Entropy classification loss or the Connectionist Temporal Classification (CTC) loss? Justify your recommendation. (A usage sketch of these losses follows this list.)
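For concreteness, a minimal usage sketch of the three losses mentioned above, assuming PyTorch (the tensor shapes and alphabet size are illustrative only):

```python
import torch
import torch.nn as nn

# MSE: prediction and target must have identical shapes.
pred, target = torch.randn(4, 10), torch.randn(4, 10)
mse = nn.MSELoss()(pred, target)

# Softmax Cross-Entropy: one class label per prediction.
logits = torch.randn(4, 28)                     # (batch, classes)
labels = torch.randint(0, 28, (4,))
ce = nn.CrossEntropyLoss()(logits, labels)

# CTC: frame-wise log-probabilities vs. a shorter label sequence;
# no frame-level alignment between input and target is required.
T, N, C = 50, 4, 28                             # frames, batch, alphabet (blank = 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 12))          # grapheme indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)
ctc = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```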
Prior Art
Datasets
- LibriSpeech (ICASSP ’15)
- WSJ: LDC93S6A (WSJ0), LDC94S13A (WSJ1)
- 2000hr Fisher+Switchboard (F+S): LDC2004S13, LDC2005S13, LDC97S62
Ideas and Strategy
Technology
- Normalisation
- Batch Norm; See also: [arXiv];
- Weight Norm; See also: [arXiv];
- Layer Norm; See also: [arXiv];
- Activation
- Gated Units
Contribution
- Jasper Model; and Dense Residual Topology;
- Evidence-based insights on convergence and non-convergence;
- NovoGrad Solver (like Adam);
- WER improvement on LibriSpeech test-clean.
Jasper Model
All figures and tables here are reproduced from the paper [LLGL+19].
Model search across
- 3 types of normalisation
- Batch Norm, Weight Norm, Layer Norm;
- 3 types of activations
- ReLU, clipped ReLU, Leaky ReLU;
- 2 types of Gates
- Gated Linear Units, Gated Activation Units;
- Architecture
- \(B\times R\) parameterisation, Dense Residual.
Jasper \(B\times R\) Architecture
Figure 1: Jasper \(B\times R\) model: \(B\): number of blocks; \(R\): number of sub-blocks.
# Blocks | # Sub-blocks | Block | Kernel | # Output Channels | Dropout |
---|---|---|---|---|---|
1 | 1 | Conv1 | 11 (stride 2) | 256 | 0.2 |
2 | 5 | B1 | 11 | 256 | 0.2 |
2 | 5 | B2 | 13 | 384 | 0.2 |
2 | 5 | B3 | 17 | 512 | 0.2 |
2 | 5 | B4 | 21 | 640 | 0.3 |
2 | 5 | B5 | 25 | 768 | 0.3 |
1 | 1 | Conv2 | 29 (dilation 2) | 896 | 0.4 |
1 | 1 | Conv3 | 1 | 1024 | 0.4 |
1 | 1 | Conv4 | 1 | # graphemes | 0 |
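A minimal sketch of one sub-block, assuming PyTorch (the `JasperSubBlock` class and its hyper-parameters are illustrative, not the authors' implementation): each sub-block applies a 1-D convolution, batch norm, ReLU and dropout, with a block's residual input added before the activation of its final sub-block.

```python
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """Conv1d -> BatchNorm -> (optional residual) -> ReLU -> Dropout."""
    def __init__(self, in_ch, out_ch, kernel, dropout=0.2, stride=1, dilation=1):
        super().__init__()
        pad = (dilation * (kernel - 1)) // 2          # "same"-style padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride,
                              dilation=dilation, padding=pad)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x, residual=None):
        y = self.bn(self.conv(x))
        if residual is not None:                      # last sub-block of a block
            y = y + residual
        return self.drop(self.act(y))

# e.g. the first sub-block of B2 in the table above: 256 -> 384 channels, kernel 13
x = torch.randn(8, 256, 200)                          # (batch, channels, time)
y = JasperSubBlock(256, 384, 13, dropout=0.2)(x)
```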
Jasper Dense Residual Architecture
Figure 2: Jasper Dense Residual Model
Appendix
Covariate Shift
One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer especially if these outputs change in a highly correlated way.
— Layer Normalisation Paper
Batch Norm
Figure 3: Courtesy: Batch Norm Paper
Weight Norm
Each neuron in an artificial neural network may be represented as,
\begin{align} \notag y &= \phi(\mathbf{w}\cdot\mathbf{x}+b) \end{align}
where,
- \(\mathbf{x}\) is a \(k\)-dimensional vector of input features,
- \(\mathbf{w}\) is a \(k\)-dimensional vector of learnable weights,
- \(b\) is a (learnable) scalar bias term, and
- \(\phi\) denotes an element-wise non-linearity.
The key idea in weight normalisation is to re-parameterise the weight vector, as
\begin{align} \notag \mathbf{w} &= \frac{g}{\|\mathbf{v}\|} \mathbf{v} \end{align}
so that,
- \(\mathbf{v}\) is a \(k\)-dimensional vector of learnable weights,
- \(g\) is a learnable scalar parameter, and
- \(\|\mathbf{w}\|=g\), independent of \(\mathbf{v}\).
Instead of working with \(g\) directly, we may also use an exponential parameterisation for the scale,
\begin{align} \notag g &= e^s \end{align}
where \(s\) is a log-scale learnable scalar parameter.
For more details, please see \(\S 2.1\) and \(\S 2.2\) of the weight norm paper.
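As a concrete sketch of the re-parameterisation, assuming PyTorch (a manual version alongside the `torch.nn.utils.weight_norm` helper):

```python
import torch
import torch.nn as nn

# Manual re-parameterisation: w = g * v / ||v||
k = 16
v = torch.randn(k, requires_grad=True)      # direction (learnable)
g = torch.ones(1, requires_grad=True)       # scale (learnable); or g = exp(s)
w = g * v / v.norm()                        # ||w|| == g, independent of v

# Library helper: stores each output unit's weight vector as (g, v) and
# recomputes w = g * v / ||v|| on every forward pass.
layer = nn.utils.weight_norm(nn.Linear(k, 4))
print([name for name, _ in layer.named_parameters()])   # bias, weight_g, weight_v
```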
Layer Norm
The \(l^{\text{th}}\) layer of a feed-forward neural network, with inputs \(\mathbf{h}^l\), weight matrix \(W^{l}\) and non-linear activation \(f\), may be written as,
\begin{align} \notag a_i^l &= {\mathbf{w}_{:,i}^l}^\top\mathbf{h}^l \qquad h_i^{l+1} = f(a_i^l+b_i^l) \end{align}
Batch Norm may then be summarised as,
\begin{align} \notag h_i^{l+1} = f(\hat{a}_i^l+b_i^l) &\qquad \hat{a}_i^l = \frac{g_i^l}{\sigma_i^l} (a_i^l - \mu_i^l) \\ \notag \mu_i^l = \underset{\mathbf{x}\sim P(\mathbf{x})} {\mathbb{E}} \left[a_i^l\right] &\qquad \sigma_i^l = \sqrt{\underset{\mathbf{x}\sim P(\mathbf{x})} {\mathbb{E}} \left[\left(a_i^l - \mu_i^l\right)^2\right]} \end{align}
It is typically impractical to compute these expectations exactly, since that would require forward passes through the whole training dataset with the current set of weights. Instead, \(\mu\) and \(\sigma\) are estimated from the empirical samples in the current mini-batch.
Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot.
We thus compute the layer normalisation statistics over all the hidden units in the same layer as follows:
\begin{align} \notag \mu_i^l = \mu^l &= \frac1H\sum_{i=1}^{H}a_i^l \\ \notag \sigma_i^l = \sigma^l &= \sqrt{ \frac1H \sum_{i=1}^H \left( a_i^l - \mu^l \right)^2 } \end{align}
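The difference between the two thus reduces to the axis over which the statistics are estimated; a minimal numerical sketch, assuming PyTorch and a batch of pre-activations of shape (batch, hidden):

```python
import torch

a = torch.randn(32, 128)      # pre-activations: (batch size, hidden units H)

# Batch Norm: one (mu, sigma) per hidden unit, estimated over the mini-batch.
bn_mu = a.mean(dim=0)                                   # shape (128,)
bn_sigma = a.std(dim=0, unbiased=False)

# Layer Norm: one (mu, sigma) per example, computed over the H hidden units.
ln_mu = a.mean(dim=1, keepdim=True)                     # shape (32, 1)
ln_sigma = a.std(dim=1, unbiased=False, keepdim=True)

a_bn = (a - bn_mu) / bn_sigma   # depends on the other examples in the batch
a_ln = (a - ln_mu) / ln_sigma   # independent of the rest of the batch
```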
Figure 4: Courtesy: Layer Norm Paper
Sigmoid Activation
Error Function
Figure 5: Image Courtesy: Wikipedia
Sigmoid (Logistic Function)
Figure 6: Image Courtesy: Wikipedia
Other Sigmoidal Functions
Figure 7: Image Courtesy: Wikipedia
- Hyperbolic Tangent
- \begin{align} \notag \mathrm{tanh}\;x &= \frac {e^x - e^{-x}}{e^x + e^{-x}} \end{align}
- Arc Tangent
- \begin{align} \notag y &= \mathrm{arctan}\;x \iff x = \tan y; \quad y \in \left(-\frac\pi2,\frac\pi2\right) \end{align}
- Gudermannian Function
- \begin{align} \notag \mathrm{gd}(x) &=\int_0^x \frac{\mathrm{d}t}{\mathrm{cosh}\;t} = 2\;\mathrm{arctan}\left(\mathrm{tanh}\left(\frac x2 \right) \right) \end{align}
- Algebraic Functions
- \begin{align} \notag f(x) &= \frac{x}{\left(1+|x|^k\right)^{1/k}} \\ \notag &= \frac{x}{\left(1+|x|\right)}; \qquad k=1 \\ \notag &= \frac{x}{\sqrt{1+x^2}}; \qquad k=2 \end{align}
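A small numerical sketch of these sigmoidal functions, assuming NumPy (values only, no plotting):

```python
import numpy as np
from math import erf

x = np.linspace(-5, 5, 11)

logistic = 1.0 / (1.0 + np.exp(-x))               # sigmoid (logistic function)
tanh_x   = np.tanh(x)                             # hyperbolic tangent
arctan_x = np.arctan(x)                           # arc tangent
gd_x     = 2.0 * np.arctan(np.tanh(x / 2.0))      # Gudermannian function
alg_k1   = x / (1.0 + np.abs(x))                  # algebraic, k = 1
alg_k2   = x / np.sqrt(1.0 + x ** 2)              # algebraic, k = 2
erf_x    = np.array([erf(v) for v in x])          # error function
```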
Rectifier Activation
Figure 8: Image Courtesy: Wikipedia
- ReLU (Rectified Linear Unit)
- \begin{align} \notag \mathrm{ReLU}(x) &= x^+ = \max(0,x) = \frac{x+|x|}2 = \begin{cases} x;&\text{if } x>0, \\ 0;&\text{otherwise.} \end{cases} \end{align}
- Clipped ReLU
- \begin{align} \notag \mathrm{cReLU}(x;a) &= \max(0,\min(a,x)) \end{align}
e.g. ReLU6 in [PyTorch], [Keras]
- Parametric and Leaky ReLU
- \begin{align} \notag \mathrm{PReLU}(x; a) &= \begin{cases} x;&\text{if } x>0, \\ ax;&\text{otherwise.} \end{cases} \\ \notag \mathrm{LeakyReLU}(x) &= \mathrm{PReLU}(x, 0.01) \end{align}
- GELU (Gaussian Error Linear Unit)
- \begin{align} \notag \mathrm{GELU}(x) &= x\cdot\Phi(x) \\ \notag \frac{\partial}{\partial x} \mathrm{GELU}(x) &= x\cdot\Phi'(x) + \Phi(x) \end{align}
where \(\Phi(x) = \Pr(X\leqslant x)\) is the cumulative distribution function of the standard Gaussian.
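A compact sketch of the rectifier family above, assuming PyTorch (the helper functions are illustrative; the GELU uses the exact error-function form of \(\Phi\)):

```python
import math
import torch

def relu(x):            return torch.clamp(x, min=0.0)
def clipped_relu(x, a): return torch.clamp(x, min=0.0, max=a)   # a = 6 -> ReLU6
def prelu(x, a):        return torch.where(x > 0, x, a * x)
def leaky_relu(x):      return prelu(x, 0.01)

def gelu(x):
    # Phi(x): CDF of the standard Gaussian, via the error function
    phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    return x * phi

x = torch.linspace(-3, 3, 7)
print(relu(x), clipped_relu(x, 6.0), leaky_relu(x), gelu(x), sep="\n")
```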
Vanishing/Exploding Gradient Problem
Hochreiter’s work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients. With standard activation functions (Sec. 1), cumulative backpropagated error signals (Sec. 5.5.1) either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers or CAP depth (Sec. 3), or they explode. This is also known as the long time lag problem.
See also: Deep Learning by Jürgen Schmidhuber
Gating History
Gating was introduced in the LSTM paper in ’97 in order to address the vanishing/exploding gradient problem. Simply put, the gating mechanism is an element-wise multiplication of an input vector with a gate-activation vector; the gate, in turn, is computed from the input vector itself. For example, a basic gate may be formulated as,
\begin{align} \notag \mathbf{y} &= \mathbf{g} \otimes \mathbf{x} \\ \notag \mathbf{g} &= \sigma_{\otimes}(W\mathbf{x} + \mathbf{b}) \end{align}
where,
- \(\sigma_{\otimes}(\mathbf{x})\) is the element-wise sigmoid activation of the input vector \(\mathbf{x}\); and
- \(\otimes\) represents element-wise multiplication.
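A minimal sketch of this basic gate, assuming PyTorch (the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class BasicGate(nn.Module):
    """y = sigmoid(W x + b) * x  -- the gate is computed from x itself."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.linear(x))   # gate-activation vector in (0, 1)
        return g * x                        # element-wise multiplication

y = BasicGate(64)(torch.randn(8, 64))
```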
For a more involved use-case, let an RNN be defined for \(T\) time steps, with
- Given inputs as \(\{\mathbf{z}_1,\ldots,\mathbf{z}_T\}\);
- Cell States, \(\{\mathbf{c}_1,\ldots,\mathbf{c}_T\}\);
- Hidden States, \(\{\mathbf{h}_1,\ldots,\mathbf{h}_T\}\);
- Given initial states as \(\mathbf{c}_{0},\mathbf{h}_{0}\);
- Neural network \(\Phi(\mathbf{z},\mathbf{c},\mathbf{h})\) to compute the pre-gate activations.
LSTM
\(\forall t\in\{1,\ldots,T\}\),
GRU
\(\forall t\in\{1,\ldots,T\}\),
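The standard LSTM and GRU cell updates are available as built-in modules; a minimal sketch of unrolling them over \(T\) steps, assuming PyTorch (sizes are illustrative):

```python
import torch
import torch.nn as nn

T, batch, in_dim, hid = 10, 4, 32, 64
z = torch.randn(T, batch, in_dim)            # inputs z_1 .. z_T

lstm_cell = nn.LSTMCell(in_dim, hid)
gru_cell = nn.GRUCell(in_dim, hid)

# LSTM: carries both a cell state c_t and a hidden state h_t.
h, c = torch.zeros(batch, hid), torch.zeros(batch, hid)   # h_0, c_0
for t in range(T):
    h, c = lstm_cell(z[t], (h, c))

# GRU: merges the gates and keeps only a hidden state h_t.
h = torch.zeros(batch, hid)                                # h_0
for t in range(T):
    h = gru_cell(z[t], h)
```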
See also:
Gated Linear Unit
In the context of speech processing, let \(\tilde{X}=W*X; \tilde{X}\in\mathbb{R}^{n\times(\cdot)}, W\in\mathbb{R}^{n\times m\times k}, X\in\mathbb{R}^{m\times(\cdot)}\) represent a 1-D convolution operation with kernel size \(k\), input filters \(m\) and output filters \(n\). A gated linear unit (GLU) wraps a convolution layer with a linear activation and sigmoid gate as follows,
\begin{align} \notag h_l(X) &= (W*X+B) \otimes \sigma_{\otimes} (V*X+C) \end{align}
Since element-wise multiplication is a symmetric operation, this may equally well be interpreted as a linear gate over a sigmoid activation.
With hardware acceleration, this may be implemented as a single parallelised convolution with twice the number of output filters, namely \(W\in\mathbb{R}^{2n\times m\times k}\) and bias \(B\in\mathbb{R}^{2n\times(\cdot)}\), as follows,
\begin{align} \notag \tilde{X} &= W*X+B \\ \notag h_l(X) &= \tilde{X}_{:n} \otimes \sigma_{\otimes} (\tilde{X}_{n:}) \end{align}
See also: Gated Conv-Net Paper [arXiv]
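A sketch of this doubled-filter implementation, assuming PyTorch: `F.glu` performs exactly the channel split and sigmoid gating above, and applying `tanh` to the first half instead of leaving it linear gives the gated activation unit of the next section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

m, n, k = 80, 128, 5                 # input filters, output filters, kernel size
conv = nn.Conv1d(m, 2 * n, k)        # single convolution with doubled filters

X = torch.randn(8, m, 200)           # (batch, input filters, time)
Xt = conv(X)                         # \tilde{X}: (batch, 2n, time')

# GLU: first n channels are the linear part, last n channels feed the gate.
h_manual = Xt[:, :n] * torch.sigmoid(Xt[:, n:])
h_glu = F.glu(Xt, dim=1)             # identical to the manual version
assert torch.allclose(h_manual, h_glu)
```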
Gated Activation Unit
A gated activation unit (GAU) wraps a convolution layer with a hyperbolic tangent activation and a sigmoid gate as follows,
\begin{align} \notag \tilde{X} &= W*X+B \\ \notag h_l(X) &= \tanh_{\otimes} (\tilde{X}_{:n}) \otimes \sigma_{\otimes} (\tilde{X}_{n:}) \end{align}
Since element-wise multiplication is a symmetric operation, this may equally well be interpreted as a hyperbolic tangent gate over a sigmoid activation.
Word Error Rate
Word Error Rate is inspired by “word recognition” accuracy measure in cognitive psychology, which is “the ability of a reader to recognize written words correctly and virtually effortlessly.”
The experiments generally test the ability to recognise “isolated words,” without additional contextual information. (Trivia: testing whose ability, the reader’s or the model’s?)
WER is a special type of normalised edit distance, computed as the normalised number of operations required to transform the reference (target) into the hypothesis (prediction). The set of operations consists of substitutions, deletions and insertions.
Formally, if \(Y\) is the reference set and \(Y^{\prime}\) is the hypothesis,
\begin{align} \notag \mathrm{WER}(Y\to Y^\prime) &= \frac {|Y^\prime \setminus Y| + \left[ |Y|-|Y^\prime| \right]_+} {|Y|} \end{align}
where \(\setminus\) is the set difference operator.
Intuitively, we resolve two cases: either \(Y^{\prime}\) is larger than \(Y\), or it is not. In the former case, \(Y^\prime \setminus Y\) includes the substitutions as well as the insertions. In the latter case, \(Y^\prime \setminus Y\) includes only the substitutions; hence, the difference in size is added to account for the number of deletions.
The denominator is a normalisation factor.
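In practice, WER is usually computed from the word-level Levenshtein alignment rather than from set differences; a minimal sketch in plain Python (no external dependencies):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))   # 2 deletions / 6 words
```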