UCS749 Speech Processing and Synthesis

L	T	P	Cr
2	0	2	3

Overview

Figure 1: Academic Calendar

	Weeks	Lectures	Practicals
Prior to MST	8	16	7/8
MST – Diwali	3	6	2/3
Diwali – EST	4	8	4

Evaluation Schedule

	Date	Max. Marks
MST	TBA	30
EST	TBA	40
Quiz 1	12-Sep 05:30pm	5
Quiz 2	21-Nov 05:30pm	5
Lab Eval 1	9-Sep–13-Sep	10
Lab Eval 2	18-Nov–22-Nov	10
Total		100

About Lab Eval

All exercises shall be solved in (Colab) Python notebooks and committed to GitHub using the @thapar.edu account. Only a GitHub repo link and commit ID shall be submitted using the Google Form. Attachments are not allowed. [Read more…]

Introduction

[Slides]

NLP: Lexeme/ Grapheme
Speech: Phoneme
Statistical Models: Noise/ Pattern/ Characterisation
Language Model: N-Grams/ TFIDF/ Word2Vec/ BERT
Speech Models: Wav2Vec/ HuBERT
Pre-requisites:
1. Linear Algebra: Vector Spaces/ Linear Maps/ Singularity/ Matrix Decomposition/ Null Space/ Span/ Markov Chains…
2. Probability and Statistics: Central Limit Theorem/ Conditionals & Marginals/ Bayes Theorem/ Markov Assumption/ Stochastic Process…
3. Information Theory: Cross Entropy
4. Neural Network: Perceptron Model/ Hidden Layers/ Convolution/ Activation/ Pooling/ Atrous/ Padding/ Backpropagation…
5. Optimisation: Stochastic Gradient Descent/ Momentum/ Dropout/ RMSProp/ Adam…
6. Deep Learning: Sequential Model/ Residual Model/ Adversarial Model/ Attention Model/ Encoder-Decoder Model…

Recognition

Hidden Markov Model

Notes [PDF],
Further reading: Rabiner’s Tutorial; Google; DuckDuckGo.

Time Delay DNN (TDNN)

[Slides] Time-delay Networks (TDNN),
[Slides] Connectionist Temporal Classification (CTC),
[Slides] Jasper,
[Slides] QuartzNet; Further reading: [Papers with Code],
[Slides] Citrinet; Further reading: [Papers with code].

Speech Command Recognition

MatchboxNet: [Slides]; Further reading: [Papers with code]; Implementation: [Colab]. (The implementations here and here use AvgPool after blocks.)

Synthesis (Text-to-Speech; TTS)

Overview: [Google Slides]
Spectrogram Generators: Tacotron: [Google Slides].
Audio Generators: Wavenet: [Google Slides].
Further Reading: Tacotron2: [Papers with code]; WaveGlow: [Papers with code]; SqueezeWave: [Papers with code]; GlowTTS: [Papers with code].

List of Slides/Notes

Schedule of Practicals

Lab 1: Getting familiar with speech processing
Lab 2: Hidden Markov Model
Lab 3: ASR in English
Lab 4: ASR in Indic Language
Lab 5: Speech Commands
Lab 6: TTS with Tacotron 2
Lab 7: TTS in Indic Language

Lab 1: Getting familiar with speech processing

Getting familiar with the pipeline of Speech Recognition:
Speech Recognition with Wav2Vec2 (PyTorch)
Perform a simple command classification task with a sequential model:
- (TensorFlow) Simple Audio Recognition: Recognising keywords; or, if you prefer,
- (PyTorch) Speech Command Classification with M5.

Lab 2: Hidden Markov Model

Using MFCCs as features from this example:
MFCC Example [Colab] by Raghav B. Venkataramaiyer;
along with the following dataset:
Free Spoken Digit Dataset (10 digits x 6 speakers x 50 repeats) [Github];
and using hmmlearn as in this tutorial to fit the model
HMM Learn [ReadTheDocs]

Compute the probability of occurrence of a given sequence, say \(\{3,2,5,4,0\}\). (Implement the Forward algorithm.)
Predict the most likely sequence for a given audio sequence. (Implement the Viterbi algorithm.)
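
The two tasks above can be sketched on a toy discrete HMM as follows. This is only an illustration under stated assumptions: the function names and the parameters START, TRANS and EMIT are made up for the example and are not taken from the course notes. In the lab, the emission model would come from MFCC features (for instance, a model fitted with hmmlearn) rather than a hand-written table.

```python
def forward(obs, start, trans, emit):
    """Return P(obs) under a discrete HMM using the Forward algorithm."""
    n_states = len(start)
    # alpha[i] = P(o_1..o_t, state_t = i)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][o]
            for j in range(n_states)
        ]
    return sum(alpha)


def viterbi(obs, start, trans, emit):
    """Return (most likely state path, joint probability of path and obs)."""
    n_states = len(start)
    # delta[i] = probability of the best path that ends in state i at time t
    delta = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    backptr = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] * trans[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * trans[best_i][j] * emit[j][o])
        backptr.append(step)
        delta = new_delta
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical toy model: 2 hidden states, 3 observation symbols.
START = [0.6, 0.4]                          # initial state probabilities
TRANS = [[0.7, 0.3], [0.4, 0.6]]            # TRANS[i][j] = P(state j | state i)
EMIT = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # EMIT[i][k] = P(symbol k | state i)

if __name__ == "__main__":
    observations = [0, 1, 2]
    print("P(obs) =", forward(observations, START, TRANS, EMIT))
    print("best path, prob =", viterbi(observations, START, TRANS, EMIT))
```

For long sequences the products underflow, so a practical implementation would typically work with log-probabilities or rescale alpha at each step.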

Theory

PDF (concise); more literature via Google, DuckDuckGo; Rabiner’s Tutorial.

More Datasets

hmm-speech-recognition [Google Code]

More Feature Descriptors

CMVN, i-vectors
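
As a rough illustration of CMVN (cepstral mean and variance normalisation), the sketch below normalises each coefficient of an utterance to zero mean and unit variance across frames. The function name cmvn, the eps guard and the per-utterance scope are assumptions made for this example; it is written without third-party dependencies for clarity, whereas a notebook would more likely use NumPy arrays.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-8):
    """Per-utterance CMVN.

    frames: list of feature frames, each a list of coefficients (e.g. MFCCs).
    Returns frames with every coefficient shifted to zero mean and scaled to
    unit (population) standard deviation across time.
    """
    columns = list(zip(*frames))            # one tuple per coefficient
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [
        # eps avoids division by zero for constant coefficients
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


if __name__ == "__main__":
    toy = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]   # 3 frames, 2 coefficients
    for row in cmvn(toy):
        print(row)
```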