
UCS749 Speech Processing and Synthesis
Course Page

L T P Cr
2 0 2 3

Overview

Link to Syllabus [PDF]​

Figure 1: Academic Calendar

Period         Weeks (W)  Lectures (L)  Practicals (P)
Prior to MST   8          16            7/8
MST – Diwali   3          6             2/3
Diwali – EST   4          8             4

Evaluation Schedule

Component    Date             Max. Marks (MM)
MST          TBA              30
EST          TBA              40
Quiz 1       12-Sep 05:30pm   5
Quiz 2       21-Nov 05:30pm   5
Lab Eval 1   9-Sep–13-Sep     10
Lab Eval 2   18-Nov–22-Nov    10
Total                         100

About Lab Eval

All exercises shall be solved in (Colab) Python notebooks and committed to GitHub using the @thapar.edu account. Only a GitHub repo link and commit ID shall be submitted using the Google Form. Attachments are not allowed. [Read more…]

Schedule of topics

Introduction

[Slides]​

  1. NLP: Lexeme/ Grapheme
  2. Speech: Phoneme
  3. Statistical Models: Noise/ Pattern/ Characterisation
  4. Language Model: N-Grams/ TFIDF/ Word2Vec/ BERT
  5. Speech Models: Wav2Vec/ HuBERT
  6. Pre-requisites:
    1. Linear Algebra: Vector Spaces/ Linear Maps/ Singularity/ Matrix Decomposition/ Null Space/ Span/ Markov Chains…
    2. Probability and Statistics: Central Limit Theorem/ Conditionals & Marginals/ Bayes Theorem/ Markov Assumption/ Stochastic Process…
    3. Information Theory: Cross Entropy
    4. Neural Network: Perceptron Model/ Hidden Layers/ Convolution/ Activation/ Pooling/ Atrous/ Padding/ Backpropagation…
    5. Optimisation: Stochastic Gradient Descent/ Momentum/ Dropout/ RMSProp/ Adam…
    6. Deep Learning: Sequential Model/ Residual Model/ Adversarial Model/ Attention Model/ Encoder-Decoder Model…

Recognition

Hidden Markov Model
Notes [PDF]​,
Further reading: Rabiner’s Tutorial; Google, DuckDuckGo.
Time Delay DNN (TDNN)
Speech Command Recognition
MatchboxNet: [Slides]; Further reading: [Papers with code]; Implementation: [Colab]; (the implementations here and here use AvgPool after the blocks)

Synthesis (Text-to-Speech; TTS)

Overview
[Google Slides]​
Spectrogram Generators
Tacotron: [Google Slides]​.
Audio Generators
Wavenet: [Google Slides]​.
Further Reading
Tacotron2: [Papers with code]​; WaveGlow: [Papers with code]​; SqueezeWave: [Papers with code]​; GlowTTS: [Papers with code]​.

List of Slides/Notes

Schedule of Practicals

Lab 1: Getting familiar with speech processing

  1. Getting familiar with the pipeline of Speech Recognition:
    Speech Recognition with Wav2Vec2 (Pytorch)
  2. Perform a simple command classification task with a sequential model:

Lab 2: Hidden Markov Model

Using MFCCs as features from this example:
MFCC Example [Colab]​ by Raghav B. Venkataramaiyer;
along with the following dataset:
Free Spoken Digit Dataset (10 digits x 6 speakers x 50 repeats) [Github]​;
and using hmmlearn as in this tutorial to fit the model
HMM Learn [ReadTheDocs]​

  1. Compute the probability of occurrence of a given sequence, say \(\{3,2,5,4,0\}\). (Implement the Forward algorithm)
  2. Predict the most likely sequence, given an audio sequence. (Implement the Viterbi algorithm)
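
The two exercises above can be sketched on a toy model before moving to real audio. The following is a minimal illustration, not a reference solution: it assumes a discrete-emission HMM with two hidden states and six observation symbols (0 to 5), and the parameters `START`, `TRANS` and `EMIT` are made-up values chosen only so that the sequence {3,2,5,4,0} can be scored. In the lab itself the model would be fitted on MFCC features, where emissions are continuous rather than discrete.

```python
# Toy discrete HMM (all numbers below are hypothetical, for illustration only).
START = [0.6, 0.4]                      # P(state at t=0)
TRANS = [[0.7, 0.3],                    # TRANS[i][j] = P(next=j | current=i)
         [0.4, 0.6]]
EMIT = [[0.30, 0.30, 0.20, 0.10, 0.05, 0.05],   # EMIT[i][o] = P(symbol=o | state=i)
        [0.05, 0.05, 0.10, 0.20, 0.30, 0.30]]


def forward(obs, start, trans, emit):
    """Return P(obs | model) using the forward algorithm."""
    n = len(start)
    # alpha[i] = P(o_1..o_t, state_t = i)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def viterbi(obs, start, trans, emit):
    """Return (most likely hidden-state path, joint probability of that path)."""
    n = len(start)
    # delta[i] = probability of the best path ending in state i at time t
    delta = [start[i] * emit[i][obs[0]] for i in range(n)]
    backptr = []
    for o in obs[1:]:
        prev, delta, step_ptr = delta, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * trans[i][j])
            step_ptr.append(best_i)
            delta.append(prev[best_i] * trans[best_i][j] * emit[j][o])
        backptr.append(step_ptr)
    # Trace the best path backwards from the best final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step_ptr in reversed(backptr):
        path.append(step_ptr[path[-1]])
    path.reverse()
    return path, delta[last]


sequence = [3, 2, 5, 4, 0]
print("P(sequence):", forward(sequence, START, TRANS, EMIT))
print("Best path:  ", viterbi(sequence, START, TRANS, EMIT))
```

For long sequences the products underflow, so a practical version would work with log-probabilities or rescale `alpha` at every step; this sketch omits that for readability.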

Theory

PDF (Concise); more literature from Google, DuckDuckGo; Rabiner’s Tutorial.

More Datasets

hmm-speech-recognition [Google Code]​

More Feature Descriptors

CMVN, i-vectors
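
As a rough illustration of CMVN (cepstral mean and variance normalisation), the sketch below normalises each coefficient of one utterance to zero mean and unit variance across its frames. It is a minimal per-utterance version under stated assumptions: `frames` is a hypothetical list of feature vectors (for example, one MFCC vector per frame), and the `eps` guard against division by zero is an illustrative choice.

```python
import math


def cmvn(frames, eps=1e-8):
    """Per-utterance CMVN: zero mean and unit variance for each coefficient."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean and (population) standard deviation of every coefficient over time.
    means = [sum(f[c] for f in frames) / n_frames for c in range(n_coeffs)]
    stds = [
        math.sqrt(sum((f[c] - means[c]) ** 2 for f in frames) / n_frames)
        for c in range(n_coeffs)
    ]
    return [
        [(f[c] - means[c]) / (stds[c] + eps) for c in range(n_coeffs)]
        for f in frames
    ]


# Hypothetical utterance: 4 frames x 3 coefficients.
toy_frames = [
    [1.0, 10.0, -2.0],
    [2.0, 12.0, -1.0],
    [3.0, 14.0, 0.0],
    [4.0, 16.0, 1.0],
]
normalised = cmvn(toy_frames)
print(normalised[0])
```

With the features held in a NumPy array of shape (frames, coefficients), the same operation is typically written as `(X - X.mean(axis=0)) / X.std(axis=0)`.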

See Also

HMM Tutorial [Colab]​ by BAMB School 2023
Bean-Machine based Tutorial [Colab]​
HMM Predicting Gold Prices [Medium]​
Single Speaker Word Recognition with HMM [Colab]​
ASR using HMM from scratch [Colab]​

Lab 4: ASR in Indic Language

Use the method from Lab 3, but with an Indic dataset.

Lab 6: TTS with Tacotron 2

Lab 7: TTS in Indic Language

Use the method from Lab 6, but with an Indic dataset for TTS.

Resources

Updated 2024-11-27 Wed 10:25
