UCS749 Speech Processing and Synthesis
Course Page
L | T | P | Cr |
---|---|---|---|
2 | 0 | 2 | 3 |
Overview
Figure 1: Academic Calendar
Period | W | L | P |
---|---|---|---|
Prior to MST | 8 | 16 | 7/8 |
MST – Diwali | 3 | 6 | 2/3 |
Diwali – EST | 4 | 8 | 4 |
Evaluation Schedule
Component | Date | MM |
---|---|---|
MST | TBA | 30 |
EST | TBA | 40 |
Quiz 1 | 12-Sep 05:30pm | 5 |
Quiz 2 | 21-Nov 05:30pm | 5 |
Lab Eval 1 | 9-Sep–13-Sep | 10 |
Lab Eval 2 | 18-Nov–22-Nov | 10 |
Total | | 100 |
About Lab Eval
All exercises shall be solved in (Colab) Python notebooks and committed to GitHub using the @thapar.edu account. Only a GitHub repo link and commit id shall be submitted using the Google Form. Attachments are not allowed. [Read more…]
Schedule of topics
Introduction
- NLP: Lexeme/ Grapheme
- Speech: Phoneme
- Statistical Models: Noise/ Pattern/ Characterisation
- Language Model: N-Grams/ TFIDF/ Word2Vec/ BERT
- Speech Models: Wav2Vec/ HuBERT
- Pre-requisites:
  - Linear Algebra: Vector Spaces/ Linear Maps/ Singularity/ Matrix Decomposition/ Null Space/ Span/ Markov Chains…
  - Probability and Statistics: Central Limit Theorem/ Conditionals & Marginals/ Bayes Theorem/ Markov Assumption/ Stochastic Process…
  - Information Theory: Cross Entropy
  - Neural Network: Perceptron Model/ Hidden Layers/ Convolution/ Activation/ Pooling/ Atrous/ Padding/ Backpropagation…
  - Optimisation: Stochastic Gradient Descent/ Momentum/ Dropout/ RMSProp/ Adam…
  - Deep Learning: Sequential Model/ Residual Model/ Adversarial Model/ Attention Model/ Encoder-Decoder Model…
Recognition
- Hidden Markov Model
  - Notes [PDF]; Further reading: Rabiner’s Tutorial; Google, DuckDuckGo.
- Time Delay DNN (TDNN)
  - [Slides] Time-delay Networks (TDNN),
  - [Slides] Connectionist Temporal Classification (CTC),
  - [Slides] Jasper,
  - [Slides] QuartzNet; Further reading: [Papers with code],
  - [Slides] Citrinet; Further reading: [Papers with code].
- Speech Command Recognition
  - MatchboxNet: [Slides]; Further reading: [Papers with code]; Implementation: [Colab]; (the implementations here and here use AvgPool after blocks)
Synthesis (Text-to-Speech; TTS)
- Overview
  - [Google Slides]
- Spectrogram Generators
  - Tacotron2: Further reading: [Papers with code]; GlowTTS: Further reading: [Papers with code].
- Audio Generators
  - Wavenet: [Google Slides]; WaveGlow: Further reading: [Papers with code]; SqueezeWave: Further reading: [Papers with code].
List of Slides/Notes
Schedule of Practicals
Lab 1: Getting familiar with speech processing
- Get familiar with the speech recognition pipeline: Speech Recognition with Wav2Vec2 (PyTorch)
- Perform a simple command classification task with a sequential model:
  - (TensorFlow) Simple Audio Recognition: Recognising keywords; or if you prefer
  - (PyTorch) Speech Command Classification with M5.
Lab 2: Hidden Markov Model
Use MFCCs as features, following this example: MFCC Example [Colab] by Raghav B. Venkataramaiyer; along with the following dataset: Free Spoken Digit Dataset (10 digits x 6 speakers x 50 repeats) [GitHub]; and fit the model using hmmlearn as in this tutorial: HMM Learn [ReadTheDocs].
- Compute the probability of occurrence of a given sequence, say \(\{3,2,5,4,0\}\). (Implement the Forward algorithm.)
- Predict the most likely sequence, given an audio sequence. (Implement the Viterbi algorithm.)
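Both tasks can be tried out first on a toy discrete HMM. The following is a minimal pure-Python sketch, not the lab solution: the two hidden states, the six observation symbols, and every probability in `start`, `trans` and `emit` are made-up assumptions, and the function names `forward_probability` and `viterbi_path` are illustrative only. In the lab itself the observations would be MFCC frames and the model would be fitted with hmmlearn.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def forward_probability(obs: Sequence[int], start: Sequence[float],
                        trans: Matrix, emit: Matrix) -> float:
    """Return P(obs | model), summing over all hidden state paths."""
    n_states = len(start)
    # alpha[j] = P(o_1 .. o_t, state at time t is j)
    alpha = [start[j] * emit[j][obs[0]] for j in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


def viterbi_path(obs: Sequence[int], start: Sequence[float],
                 trans: Matrix, emit: Matrix) -> Tuple[List[int], float]:
    """Return the most likely hidden state path and its joint probability."""
    n_states = len(start)
    # delta[j] = probability of the best path that ends in state j at time t
    delta = [start[j] * emit[j][obs[0]] for j in range(n_states)]
    backpointers: List[List[int]] = []
    for symbol in obs[1:]:
        prev = delta
        best_prev = [
            max(range(n_states), key=lambda i: prev[i] * trans[i][j])
            for j in range(n_states)
        ]
        delta = [
            prev[best_prev[j]] * trans[best_prev[j]][j] * emit[j][symbol]
            for j in range(n_states)
        ]
        backpointers.append(best_prev)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical toy model: 2 hidden states, 6 observation symbols (0..5).
start = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.30, 0.30, 0.20, 0.10, 0.05, 0.05],
        [0.05, 0.05, 0.10, 0.20, 0.30, 0.30]]

observations = [3, 2, 5, 4, 0]
print("P(observations):", forward_probability(observations, start, trans, emit))
print("Most likely states:", viterbi_path(observations, start, trans, emit)[0])
```

For long sequences the raw products underflow, so practical implementations work with log-probabilities or rescale the forward variables at every step.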
Theory
PDF (Concise); more literature from Google, DuckDuckGo; Rabiner’s Tutorial.
More Datasets
hmm-speech-recognition [Google Code]
More Feature Descriptors
See Also
HMM Tutorial [Colab] by BAMB School 2023
Bean-Machine based Tutorial [Colab]
HMM Predicting Gold Prices [Medium]
Single Speaker Word Recognition with HMM [Colab]
ASR using HMM from scratch [Colab]
Lab 3: ASR in English
Additional references:
- amp_level="O1": the argument used in a PytorchLightning.Trainer instance; but Apex support was removed from PL in v2.0.
For starters: NeMo Installation and Getting Started Guide with Citrinet ASR Evaluation
Lab 4: ASR in Indic Language
Use the method from Lab 3, but with an Indic dataset.
Lab 5: Speech Commands
Lab 6: TTS with Tacotron 2
Lab 7: TTS in Indic Language
Use the method from Lab 6, but with an Indic dataset for TTS.
Resources
- Speech
- Linear Algebra
- Probability and Statistics
- Bertsekas & Tsitsiklis: Introduction To Probability; Probabilistic Systems Analysis And Applied Probability
- 3B1B
- Neural Network Concepts
- Information Theory & Learning
- Datasets
- Code