
Jasper
End-to-End Convolutional Neural Acoustic Model


tags
Interspeech 2019
keywords
Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
author
Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., …
url
[arXiv], [Papers With Code]


Exercise

  1. If the following equation describes the Jasper model,

    \begin{align} \notag \theta_* &= \arg\min_{\theta} \underset {(X,Y) \sim \mathcal{D}_{\mathcal{X}\times\mathcal{Y}}} {\mathbb{E}} \left[ \Delta(Y, \mathcal{F}(X;\theta)) \right] \end{align}

    Define \(X, Y, \mathcal{F}, \theta, \theta_*\).

  2. Formally define \(\mathrm{MSE} (\mathbf{y}, \widetilde{\mathbf{y}})\) as the mean-squared-error measure between the target and prediction vectors.
  3. What are the conditions under which \(\mathrm{MSE} (\mathbf{y}, \widetilde{\mathbf{y}}) \equiv \Delta_E (\mathbf{y}, \widetilde{\mathbf{y}}) \equiv \|\mathbf{y} - \widetilde{\mathbf{y}}\|_F^2\)?
  4. How is a CTC Loss different from MSE Loss as a training objective?
  5. To learn a model for a sequence-to-sequence mapping such as speech-to-text, would you recommend the Softmax-Cross-Entropy classification loss or the Connectionist Temporal Classification loss? Justify your recommendation.

Prior Art

Datasets

Technology

Normalisation
Activation
Gated Units

Contribution

  • The Jasper model and the Dense Residual topology;
  • Evidence-based insights on convergence and non-convergence;
  • The NovoGrad solver (an Adam-like optimiser);
  • WER improvement on LibriSpeech test-clean.

Jasper Model

All figures and tables are reproduced here from the paper [LLGL+19].

Model search across

3 types of normalisation
Batch Norm, Weight Norm, Layer Norm;
3 types of activations
ReLU, clipped ReLU, Leaky ReLU;
2 types of Gates
Gated Linear Units, Gated Activation Units;
Architecture
\(B\times R\) parameterisation, Dense Residual.

Jasper \(B\times R\) Architecture

2024-08-28_07-44-17_screenshot.png

Figure 1: Jasper \(B\times R\) model: \(B\): number of blocks; \(R\): number of sub-blocks.

Table 1: Jasper \(10\times 5\)
# Blocks   # Sub-blocks   Block   Kernel            # Output Channels   Dropout
1          1              Conv1   11 (stride 2)     256                 0.2
2          5              B1      11                256                 0.2
2          5              B2      13                384                 0.2
2          5              B3      17                512                 0.2
2          5              B4      21                640                 0.3
2          5              B5      25                768                 0.3
1          1              Conv2   29 (dilation 2)   896                 0.4
1          1              Conv3   1                 1024                0.4
1          1              Conv4   1                 # graphemes         0
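
To make the \(B\times R\) parameterisation concrete, here is a minimal PyTorch sketch of a single block \(B_i\): \(R\) sub-blocks of (1-D convolution, batch norm, ReLU, dropout), with a projected residual connection added before the final activation. Names and sizes are illustrative, not the reference implementation.

  import torch
  import torch.nn as nn

  class JasperBlock(nn.Module):
      """One block B_i: R sub-blocks of (Conv1d -> BatchNorm -> ReLU -> Dropout);
      the residual branch is added before the last sub-block's activation."""
      def __init__(self, in_ch, out_ch, kernel, repeat=5, dropout=0.2):
          super().__init__()
          pad = kernel // 2
          layers, ch = [], in_ch
          for r in range(repeat):
              layers += [nn.Conv1d(ch, out_ch, kernel, padding=pad),
                         nn.BatchNorm1d(out_ch)]
              if r < repeat - 1:  # final ReLU/Dropout are applied after the residual sum
                  layers += [nn.ReLU(), nn.Dropout(dropout)]
              ch = out_ch
          self.body = nn.Sequential(*layers)
          self.res = nn.Sequential(nn.Conv1d(in_ch, out_ch, 1),
                                   nn.BatchNorm1d(out_ch))
          self.out = nn.Sequential(nn.ReLU(), nn.Dropout(dropout))

      def forward(self, x):  # x: (batch, channels, time)
          return self.out(self.body(x) + self.res(x))

  # e.g. block B2 of Jasper 10x5: 256 -> 384 channels, kernel 13, 5 sub-blocks
  b2 = JasperBlock(256, 384, kernel=13, repeat=5, dropout=0.2)
  y = b2(torch.randn(8, 256, 100))  # -> (8, 384, 100)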

Jasper Dense Residual Architecture

2024-08-28_07-46-36_screenshot.png

Figure 2: Jasper Dense Residual Model

Appendix

Covariate Shift

One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer, especially if these outputs change in a highly correlated way.
Layer Normalisation Paper

Batch Norm

2024-08-20_00-59-14_screenshot.png

Figure 3: Courtesy: Batch Norm Paper

Weight Norm

Each neuron in an artificial neural network may be represented as,

\begin{align} \notag y &= \phi(\mathbf{w}\cdot\mathbf{x}+b) \end{align}

where,
\(\mathbf{x}\) is a \(k\)-dimensional vector of input features,
\(\mathbf{w}\) is a \(k\)-dimensional vector of learnable weights,
\(b\) is a (learnable) scalar bias term, and
\(\phi\) denotes an element-wise non-linearity.

The key idea in weight normalisation is to re-parameterise the weight vector, as

\begin{align} \notag \mathbf{w} &= \frac{g}{\|\mathbf{v}\|} \mathbf{v} \end{align}

where,
\(\mathbf{v}\) is a \(k\)-dimensional vector of learnable weights, and
\(g\) is a learnable scalar parameter, so that \(\|\mathbf{w}\|=g\), independent of \(\mathbf{v}\).

Instead of working with \(g\) directly, we may also use an exponential parameterisation for the scale,

\begin{align} \notag g &= e^s \end{align}

where, \(s\) is a log-scale learnable scalar parameter.

For more details, please see \(\S 2.1\) and \(\S 2.2\) of the weight norm paper.
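
As a quick sketch of the re-parameterisation (illustrative only; PyTorch ships a ready-made version as torch.nn.utils.weight_norm), a single weight-normalised neuron can be computed directly from \(\mathbf{v}\) and the log-scale \(s\):

  import numpy as np

  def weight_norm_neuron(x, v, s, b, phi=np.tanh):
      """y = phi(w.x + b) with w = (g / ||v||) v and g = exp(s)."""
      g = np.exp(s)                    # scale; exponential parameterisation keeps g > 0
      w = (g / np.linalg.norm(v)) * v  # direction comes from v, norm is exactly g
      return phi(np.dot(w, x) + b)

  x = np.random.randn(4)               # k = 4 input features
  v = np.random.randn(4)               # learnable direction parameters
  y = weight_norm_neuron(x, v, s=0.0, b=0.1)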

Layer Norm

The \(l^{\text{th}}\) layer in a feed-forward neural network, with inputs \(\mathbf{h}^l\), weight matrix \(W^{l}\) and non-linear activation \(f\), may be written as,

\begin{align} \notag a_i^l &= {\mathbf{w}_{:,i}^l}^\top\mathbf{h}^l \qquad h_i^{l+1} = f(a_i^l+b_i^l) \end{align}

A Batch Norm may be summarised as,

\begin{align} \notag h_i^{l+1} = f(\hat{a}_i^l+b_i^l) &\qquad \hat{a}_i^l = \frac{g_i^l}{\sigma_i^l} (a_i^l - \mu_i^l) \\ \notag \mu_i^l = \underset{\mathbf{x}\sim P(\mathbf{x})} {\mathbb{E}} \left[a_i^l\right] &\qquad \sigma_i^l = \sqrt{\underset{\mathbf{x}\sim P(\mathbf{x})} {\mathbb{E}} \left[\left(a_i^l - \mu_i^l\right)^2\right]} \end{align}

It is typically impractical to [exactly] compute the expectations in [the equation above,] since it would require forward passes through the whole training dataset with the current set of weights. Instead, \(\mu\) and \(\sigma\) are estimated using the empirical samples from the current mini-batch.

Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot.

We, thus, compute the layer normalization statistics over all the hidden units in the same layer as follows:

\begin{align} \notag \mu_i^l = \mu^l &= \frac1H\sum_{i=1}^{H}a_i^l \\ \notag \sigma_i^l = \sigma^l &= \sqrt{ \frac1H \sum_{i=1}^H \left( a_i^l - \mu^l \right)^2} \end{align}
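
The only difference from Batch Norm is the axis over which the statistics are taken, which a small numpy sketch makes explicit (assuming the pre-activations of one layer are stacked into a (batch, H) matrix a):

  import numpy as np

  def batch_norm(a, eps=1e-5):
      # per hidden unit, estimated over the mini-batch axis
      mu, sigma = a.mean(axis=0, keepdims=True), a.std(axis=0, keepdims=True)
      return (a - mu) / (sigma + eps)

  def layer_norm(a, eps=1e-5):
      # per example, computed over the H hidden units of the layer
      mu, sigma = a.mean(axis=1, keepdims=True), a.std(axis=1, keepdims=True)
      return (a - mu) / (sigma + eps)

  a = np.random.randn(32, 128)   # (mini-batch, hidden units)

Unlike Batch Norm, the Layer Norm statistics do not depend on the mini-batch, so they are the same at training and inference time and remain well defined for a batch of one.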

2024-08-20_03-20-24_screenshot.png

Figure 4: Courtesy: Layer Norm Paper

Sigmoid Activation

Error Function

\begin{align} \notag \mathrm{erf}\;z &= \frac2{\sqrt\pi} \int_0^z e^{-t^2} \mathrm{d}t \end{align}

Error_Function.svg

Figure 5: Image Courtesy: Wikipedia

Sigmoid (Logistic Function)

\begin{align} \notag \sigma(x) &= \frac1{1+e^{-x} } = \frac{e^x}{1+e^x} = 1 - \sigma(-x) \end{align}

Logistic-curve.svg

Figure 6: Image Courtesy: Wikipedia

Other Sigmoidal Functions

sigmoid-comparison.svg

Figure 7: Image Courtesy: Wikipedia

Hyperbolic Tangent
\begin{align} \notag \mathrm{tanh}\;x &= \frac {e^x - e^{-x}}{e^x + e^{-x}} \end{align}
Arc Tangent
\begin{align} \notag y &= \mathrm{arctan}\;x \iff x = \tan y; \quad y \in \left[-\frac\pi2,\frac\pi2\right] \end{align}
Gudermannian Function
\begin{align} \notag \mathrm{gd}(x) &=\int_0^x \frac{\mathrm{d}t}{\mathrm{cosh}\;t} = 2\;\mathrm{arctan}\left(\mathrm{tanh}\left(\frac x2 \right) \right) \end{align}
Algebraic Functions
\begin{align} \notag f(x) &= \frac{x}{\left(1+|x|^k\right)^{1/k}} \\ \notag &= \frac{x}{\left(1+|x|\right)}; \qquad k=1 \\ \notag &= \frac{x}{\sqrt{1+x^2}}; \qquad k=2 \end{align}
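
For reference, the sigmoidal family above is straightforward to write down numerically (a small sketch; scipy is assumed only for the error function):

  import numpy as np
  from scipy.special import erf              # error function

  def logistic(x):                           # sigma(x) = 1 / (1 + e^{-x})
      return 1.0 / (1.0 + np.exp(-x))

  def gudermannian(x):                       # gd(x) = 2 arctan(tanh(x / 2))
      return 2.0 * np.arctan(np.tanh(x / 2.0))

  def algebraic(x, k=2):                     # x / (1 + |x|^k)^(1/k)
      return x / (1.0 + np.abs(x) ** k) ** (1.0 / k)

  x = np.linspace(-5.0, 5.0, 11)
  curves = {"erf": erf(x), "logistic": logistic(x), "tanh": np.tanh(x),
            "arctan": np.arctan(x), "gd": gudermannian(x), "algebraic": algebraic(x)}

All of these squash the real line into a bounded interval; they differ in the output range and in how quickly they saturate.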

Rectifier Activation

rectifier.svg

Figure 8: Image Courtesy: Wikipedia

ReLU (Rectified Linear Unit)
\begin{align} \notag \mathrm{ReLU}(x) &= x^+ = \max(0,x) = \frac{x+|x|}2 = \begin{cases} x;&\text{if } x>0, \\ 0;&\text{otherwise.} \end{cases} \end{align}
Clipped ReLU
\begin{align} \notag \mathrm{cReLU}(x;a) &= \max(0,\min(a,x)) \end{align}

e.g. ReLU6 in [PyTorch], [Keras]

Parametric and Leaky ReLU
\begin{align} \notag \mathrm{PReLU}(x; a) &= \begin{cases} x;&\text{if } x>0, \\ ax;&\text{otherwise.} \end{cases} \\ \notag \mathrm{LeakyReLU}(x) &= \mathrm{PReLU}(x, 0.01) \end{align}
GELU (Gaussian Error Linear Unit)
\begin{align} \notag \mathrm{GELU}(x) &= x\cdot\Phi(x) \\ \notag \frac\partial{\partial x} \mathrm{GELU}(x) &= x\cdot\Phi'(x) + \Phi(x) \end{align}

where \(\Phi(x) = \Pr(X\leqslant x)\) is the cumulative distribution function of the standard Gaussian.
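
The rectifier family above, written out explicitly (a sketch; scipy's standard-normal CDF stands in for \(\Phi\)):

  import numpy as np
  from scipy.stats import norm               # standard Gaussian CDF, Phi(x)

  def relu(x):
      return np.maximum(0.0, x)

  def clipped_relu(x, a=6.0):                # a = 6 gives ReLU6
      return np.maximum(0.0, np.minimum(a, x))

  def prelu(x, a=0.01):                      # a = 0.01 gives LeakyReLU
      return np.where(x > 0, x, a * x)

  def gelu(x):                               # x * Phi(x), exact (non-approximate) form
      return x * norm.cdf(x)

  x = np.linspace(-8.0, 8.0, 17)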

Vanishing/Exploding Gradient Problem

Hochreiter’s work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients. With standard activation functions (Sec. 1), cumulative backpropagated error signals (Sec. 5.5.1) either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers or CAP depth (Sec. 3), or they explode. This is also known as the long time lag problem.

See also: Deep Learning by Jürgen Schmidhuber

Gating History

Gating was introduced in the LSTM paper in ’97 to address the vanishing/exploding gradient problem. Simply put, a gating mechanism is an element-wise multiplication of an input vector with a gate-activation vector. The gate, in turn, is computed from the input vector itself. For example, a basic gate may be formulated as,

\begin{align} \notag \mathbf{y} &= \mathbf{g} \otimes \mathbf{x} \\ \notag \mathbf{g} &= \sigma_{\otimes}(W\mathbf{x} + \mathbf{b}) \end{align}

where,
\(\sigma_{\otimes}(\mathbf{x})\) is the element-wise sigmoid activation of input vector \(\mathbf{x}\); and
\(\otimes\) represents element-wise multiplication.
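
In code, the basic gate is a one-liner once the projection is given (a sketch with an arbitrary \(W\) and \(\mathbf{b}\)):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def basic_gate(x, W, b):
      g = sigmoid(W @ x + b)   # gate activation, element-wise in (0, 1)
      return g * x             # y = g (*) x, element-wise product

  x = np.random.randn(8)
  W, b = 0.1 * np.random.randn(8, 8), np.zeros(8)
  y = basic_gate(x, W, b)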

For a more involved use-case, let an RNN be defined for \(T\) time steps, with

  • Given inputs as \(\{\mathbf{z}_1,\ldots,\mathbf{z}_T\}\);
  • Cell States, \(\{\mathbf{c}_1,\ldots,\mathbf{c}_T\}\);
  • Hidden States, \(\{\mathbf{h}_1,\ldots,\mathbf{h}_T\}\);
  • Given initial states as \(\mathbf{c}_{0},\mathbf{h}_{0}\);
  • A neural network \(\Phi(\mathbf{z},\mathbf{c},\mathbf{h})\) to compute the pre-gate activation.

LSTM
\(\forall t\in\{1,\ldots,T\}\),

\begin{align} \notag \mathbf{x} &\gets \Phi(\mathbf{z}_t, \mathbf{c}_{t-1}, \mathbf{h}_{t-1}) \\ \notag \mathbf{i} &\gets \sigma_{\otimes}(W_i\mathbf{x}+U_i\mathbf{h}_{t-1} + \mathbf{b}_i) \\ \notag \mathbf{f} &\gets \sigma_{\otimes} (W_f\mathbf{x} + U_f\mathbf{h}_{t-1} + \mathbf{b}_f) \\ \notag \mathbf{o} &\gets \sigma_{\otimes} (W_o\mathbf{x} + U_o\mathbf{h}_{t-1} + \mathbf{b}_o) \\ \notag \mathbf{g} &\gets \tanh_{\otimes} (W_g\mathbf{x} + U_g\mathbf{h}_{t-1} + \mathbf{b}_g) \\ \notag \mathbf{c}_t &\gets \mathbf{f}\otimes\mathbf{c}_{t-1} + \mathbf{i}\otimes\mathbf{g} \\ \notag \mathbf{h}_t &\gets \mathbf{o}\otimes\tanh_{\otimes} \mathbf{c}_t \end{align}
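
A direct transcription of the LSTM update (a sketch: the pre-gate network \(\Phi\) is taken to be the identity on \(\mathbf{z}_t\), and the weight matrices and biases are assumed to be provided in a dict p):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def lstm_step(x, c_prev, h_prev, p):
      """One LSTM time step following the equations above; p maps names to W_*, U_*, b_*."""
      i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])    # input gate
      f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])    # forget gate
      o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])    # output gate
      g = np.tanh(p["Wg"] @ x + p["Ug"] @ h_prev + p["bg"])    # candidate cell update
      c = f * c_prev + i * g                                   # new cell state
      h = o * np.tanh(c)                                       # new hidden state
      return c, h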

GRU
\(\forall t\in\{1,\ldots,T\}\),

\begin{align} \notag \mathbf{x} &\gets \Phi(\mathbf{z}_t, \mathbf{c}_{t-1}, \mathbf{h}_{t-1}) \\ \notag \mathbf{r} &\gets \sigma_{\otimes}(W_r\mathbf{x}+U_r\mathbf{h}_{t-1} + \mathbf{b}_r) \\ \notag \tilde{\mathbf{h}} &\gets \tanh_{\otimes}(W_h\mathbf{x} + U_h(\mathbf{r}\otimes\mathbf{h}_{t-1}) + \mathbf{b}_h) \\ \notag \mathbf{c}_t &\gets \sigma_{\otimes}(W_c\mathbf{x}+U_c\mathbf{h}_{t-1} + \mathbf{b}_c) \\ \notag \mathbf{h}_t &\gets \mathbf{c}_t\otimes \mathbf{h}_{t-1} + (1-\mathbf{c}_t) \otimes \tilde{\mathbf{h}} \end{align}
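
And the corresponding GRU step in the same notation (same assumptions; note that \(\mathbf{c}_t\) plays the role of the update gate here):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def gru_step(x, h_prev, p):
      """One GRU time step following the equations above; p maps names to W_*, U_*, b_*."""
      r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])              # reset gate
      h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
      c = sigmoid(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])              # update gate c_t
      return c * h_prev + (1.0 - c) * h_tilde                            # new hidden state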


Gated Linear Unit

In the context of speech processing, let \(\tilde{X}=W*X; \tilde{X}\in\mathbb{R}^{n\times(\cdot)}, W\in\mathbb{R}^{n\times m\times k}, X\in\mathbb{R}^{m\times(\cdot)}\) represent a 1-D convolution operation with kernel size \(k\), input filters \(m\) and output filters \(n\). A gated linear unit (GLU) wraps a convolution layer with a linear activation and sigmoid gate as follows,

\begin{align} \notag h_l(X) &= (W*X+B) \otimes \sigma_{\otimes} (V*X+C) \end{align}

Since element-wise multiplication is a symmetric operation, this may equally well be interpreted as a linear gate over a sigmoid activation.

With hardware acceleration, this operation may be implemented as a single parallelised convolution with twice the number of output filters, namely \(W\in\mathbb{R}^{2n\times m\times k}\) and bias \(B\in\mathbb{R}^{2n\times(\cdot)}\), as follows,

\begin{align} \notag \tilde{X} &= W*X+B \\ \notag h_l(X) &= \tilde{X}_{:n} \otimes \sigma_{\otimes} (\tilde{X}_{n:}) \end{align}
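
A minimal PyTorch sketch of this trick: one Conv1d produces \(2n\) channels, which are split in half, with the second half gating the first (PyTorch also exposes the split-and-gate step directly as torch.nn.functional.glu):

  import torch
  import torch.nn as nn

  class GLU1d(nn.Module):
      """Gated linear unit over a 1-D convolution: a single conv emits 2n channels;
      the first n are the linear term, the last n pass through the sigmoid gate."""
      def __init__(self, in_ch, out_ch, kernel):
          super().__init__()
          self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel, padding=kernel // 2)

      def forward(self, x):                    # x: (batch, m, time)
          a, b = self.conv(x).chunk(2, dim=1)  # each: (batch, n, time)
          return a * torch.sigmoid(b)

  glu = GLU1d(in_ch=80, out_ch=256, kernel=3)
  y = glu(torch.randn(4, 80, 200))             # -> (4, 256, 200)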

See also: Gated Conv-Net Paper [arXiv]

Gated Activation Unit

A gated activation unit (GAU) wraps a convolution layer with a hyperbolic tangent activation and sigmoid gate as follows,

\begin{align} \notag \tilde{X} &= W*X+B \\ \notag h_l(X) &= \tanh_{\otimes} (\tilde{X}_{:n}) \otimes \sigma_{\otimes} (\tilde{X}_{n:}) \end{align}

Since element-wise multiplication is a symmetric operation, this may equally well be interpreted as a hyperbolic tangent gate over a sigmoid activation.
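
Relative to the GLU sketch above, only the activation on the first half changes (a sketch operating on a pre-computed \(\tilde{X}\) with \(2n\) channels):

  import torch

  def gated_activation(x_tilde, n):
      """Gated activation unit: tanh on the first n channels, sigmoid gate on the rest."""
      a, b = x_tilde[:, :n], x_tilde[:, n:]
      return torch.tanh(a) * torch.sigmoid(b)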

See also: Conditional PixelCNN Paper [NeurIPS '16]

Word Error Rate

Word Error Rate is inspired by “word recognition” accuracy measure in cognitive psychology, which is “the ability of a reader to recognize written words correctly and virtually effortlessly.”

The experiments generally test the ability to recognise “isolated words,” without additional contextual information. (Trivia: testing whose ability, the reader’s or the model’s?)

WER is a special type of normalised edit distance, computed as the normalised number of operations required to transform the reference (target) into the hypothesis (prediction). The set of operations consists of substitutions, deletions and insertions.

Formally, if \(Y\) is the reference set and \(Y^{\prime}\) is the hypothesis,

\begin{align} \notag \mathrm{WER}(Y\to Y^\prime) &= \frac {|Y^\prime \setminus Y| + \left[ |Y|-|Y^\prime| \right]_+} {|Y|} \end{align}

where \(\setminus\) is the set difference operator.

Intuitively, we resolve two cases: either \(Y^{\prime}\) is larger than \(Y\), or it is not. In the former case, \(Y^\prime \setminus Y\) includes substitutions as well as insertions. In the latter case, \(Y^\prime \setminus Y\) includes only substitutions; hence, the additional term, the difference in sizes, is added to account for the number of deletions.

The denominator is a normalisation factor.
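
In practice WER is computed from the word-level Levenshtein (edit) distance, as described above, rather than from the set expression; a small sketch of the standard dynamic-programming computation:

  def wer(reference, hypothesis):
      """Word error rate: (substitutions + deletions + insertions) / number of reference words."""
      ref, hyp = reference.split(), hypothesis.split()
      # d[i][j] = edit distance between ref[:i] and hyp[:j]
      d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          d[i][0] = i                                   # i deletions
      for j in range(len(hyp) + 1):
          d[0][j] = j                                   # j insertions
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              sub = 0 if ref[i - 1] == hyp[j - 1] else 1
              d[i][j] = min(d[i - 1][j] + 1,            # deletion
                            d[i][j - 1] + 1,            # insertion
                            d[i - 1][j - 1] + sub)      # substitution or match
      return d[len(ref)][len(hyp)] / len(ref)

  print(wer("the cat sat on the mat", "the cat sit on mat"))   # 2 / 6 = 0.33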

Updated 2024-08-28 Wed 12:29
