MFCCs¶

How to extract meaningful features from audio¶

(from "A tutoral on Text-Independent Speaker Verification" by Bimbot et al., 2004)¶

Loading and Preprocessing Audio Data¶

This code cell demonstrates how to load and plot audio data using the librosa library. It starts by importing necessary modules, including librosa for audio processing, numpy for numerical operations, and matplotlib for data visualization.

Next, it sets the path to an audio file and uses the librosa.load() function to read the audio data into a numpy array. If the file cannot be found, a clear message is printed before the exception is re-raised.

Finally, the code generates a plot of the audio waveform using matplotlib's plot() function, and sets the title and axis labels using title(), xlabel(), and ylabel(). The resulting plot is displayed using show().

In [1]:
# Import necessary modules
import librosa
import numpy as np
import matplotlib.pyplot as plt

# Set file path and load audio data
wavsFolder = "../../Wavs/NormalizedAudio/"
fileName = "usc_timit_033.wav"
filePath = wavsFolder + fileName

try:
    audio, sr = librosa.load(filePath, sr=None, mono=True)
except FileNotFoundError:
    print(f"File not found: {filePath}")
    raise

# Build the time axis (one point per sample)
time = np.arange(len(audio)) / sr

# Plot audio waveform
plt.figure(figsize=(14, 5))
plt.plot(time, audio)
plt.title(f"Audio waveform: {fileName}")
plt.xlabel("Time [s]")
plt.ylabel("Amplitude")
plt.show()
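
As an aside, librosa ships a plotting helper that produces the same kind of waveform plot with the time axis handled automatically. A minimal sketch, assuming a librosa version that provides librosa.display.waveshow (0.9 or later):

In [ ]:
# Alternative waveform plot using librosa's helper (assumes librosa >= 0.9,
# where librosa.display.waveshow is available)
import librosa.display

plt.figure(figsize=(14, 5))
librosa.display.waveshow(audio, sr=sr)
plt.title(f"Audio waveform: {fileName}")
plt.xlabel("Time [s]")
plt.ylabel("Amplitude")
plt.show()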

Speech Parametrization: Pre-emphasis (Optional)¶

This code cell defines a function called preEmphasis() that applies pre-emphasis to an audio signal using a pre-emphasis coefficient a. Pre-emphasis is a filtering step used in speech processing to balance the frequency spectrum of a signal, making the higher frequencies more pronounced in the subsequent analysis.

The function takes the audio data as input and creates a new numpy array called audio_preEmphasis with the same shape as the input, using np.zeros_like(). The for loop applies the pre-emphasis to each sample of the audio signal using the difference equation y[t] = x[t] - a*x[t-1], where x is the input audio and y is the pre-emphasized audio; the first sample is copied unchanged. The resulting pre-emphasized audio is returned as output.

To make the code more robust, the function checks that the input audio is a numpy array and that the pre-emphasis coefficient a lies in the valid range [0, 1]. A docstring documents the purpose, parameters, and return value of the function.

In [2]:
def preEmphasis(audio, a=0.95):
    """
    Applies pre-emphasis to an audio signal.

    Parameters:
    audio (numpy.ndarray): Audio signal data.
    a (float): Pre-emphasis coefficient.

    Returns:
    numpy.ndarray: Pre-emphasized audio signal data.
    """
    if not isinstance(audio, np.ndarray):
        raise TypeError("Input audio must be a numpy array.")

    if a < 0 or a > 1:
        raise ValueError("Pre-emphasis coefficient must be between 0 and 1.")

    audio_preEmphasis = np.zeros_like(audio)
    audio_preEmphasis[0] = audio[0]
    for i in range(1, len(audio)):
        audio_preEmphasis[i] = audio[i] - a*audio[i-1]
    return audio_preEmphasis
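
The loop above is easy to read but slow in Python for long signals. The same filter can be written with vectorized numpy operations; librosa also provides librosa.effects.preemphasis for this purpose (its handling of the first sample may differ slightly). A minimal sketch of the vectorized version:

In [ ]:
def preEmphasisVectorized(audio, a=0.95):
    # y[0] = x[0], y[t] = x[t] - a*x[t-1] for t >= 1, in a single numpy expression
    return np.concatenate(([audio[0]], audio[1:] - a * audio[:-1]))

# Quick check against the loop-based implementation
assert np.allclose(preEmphasisVectorized(audio), preEmphasis(audio))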

This code cell plots the original audio signal and its amplitude spectrum, together with the pre-emphasized signal and its amplitude spectrum. Each signal is windowed with a Hamming window and its amplitude spectrum is computed with the Fast Fourier Transform (FFT). The preEmphasis function acts as a first-order high-pass filter, boosting the high frequencies relative to the low ones; comparing the two spectra makes this effect visible.

In [3]:
from scipy.signal.windows import hamming  # scipy.signal.hamming was moved to scipy.signal.windows in recent SciPy versions

winLen = len(audio)
win = hamming(winLen)
n = 2 ** int(np.ceil(np.log2(len(audio))))

# Plot of the original audio file
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(16,8))
axs[0][0].plot(time, audio, color='#008000', label='original audio')
axs[0][0].legend()
axs[0][0].set_xlabel("Time [s]")
axs[0][0].set_ylabel("Audio amplitude")
axs[0][0].set_title("Original audio")

# Amplitude Spectrum of the original audio file
audio_spectrum = np.fft.fft(audio*win,n) # computes the n-point DFT
freqAxis = np.fft.fftfreq(n, 1.0/sr)[0:n//2+1] # returns the n-point frequency axis, given a certain sample spacing
nyqFreq = sr/2
freqAxis[-1] = nyqFreq
axs[1][0].plot(freqAxis, 2*np.abs(audio_spectrum)[0:n//2+1], color='#008000', label='original audio')
axs[1][0].set_xlabel("Freq [Hz]")
axs[1][0].set_ylabel("Amplitude Spectrum")

# Pre-emphasis
audio_preEmphasis = preEmphasis(audio)
axs[0][1].plot(time, audio_preEmphasis, color='#FFA500', label='pre-emphasized audio')
axs[0][1].legend()
axs[0][1].set_xlabel("Time [s]")
axs[0][1].set_ylabel("Audio amplitude")
axs[0][1].set_title("Pre-emphasized audio")
ylim = np.max(np.abs(np.concatenate([audio, audio_preEmphasis])))
axs[0][0].set_ylim([-ylim,ylim])
axs[0][1].set_ylim([-ylim,ylim])

# Amplitude Spectrum of the pre-emphasized audio
audio_preEmphasis_spectrum = np.fft.fft(audio_preEmphasis*win,n) # computes the n-point DFT
axs[1][1].plot(freqAxis, 2*np.abs(audio_preEmphasis_spectrum)[0:n//2+1], color='#FFA500', label='pre-emphasized audio')
axs[1][1].set_xlabel("Freq [Hz]")
axs[1][1].set_ylabel("Amplitude Spectrum")
plt.show()

The purpose of pre-emphasis¶

Voiced speech has a natural downward spectral tilt of roughly -6 dB per octave, caused by the glottal source and lip radiation. Pre-emphasis compensates for this tilt so that the higher formants carry comparable weight in the subsequent spectral analysis.
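
The filter y[t] = x[t] - a*x[t-1] has transfer function H(z) = 1 - a*z^-1. A minimal sketch using scipy.signal.freqz that plots its magnitude response, making the low-frequency attenuation and high-frequency boost explicit (the fs keyword of freqz, available in recent SciPy versions, puts the axis in Hz):

In [ ]:
# Magnitude response of the pre-emphasis filter H(z) = 1 - a*z^{-1} (a sketch)
from scipy.signal import freqz

coef = 0.95
w, h = freqz(b=[1.0, -coef], a=[1.0], worN=2048, fs=sr)  # fs maps the axis to Hz
plt.figure(figsize=(8, 4))
plt.plot(w, 20 * np.log10(np.abs(h) + 1e-12))
plt.title("Pre-emphasis filter magnitude response (a = 0.95)")
plt.xlabel("Freq [Hz]")
plt.ylabel("Gain [dB]")
plt.show()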

Speech Parametrization: Computing Spectrogram¶

In order to analyze the spectral content of the audio signal over time, we compute its spectrogram using a Hamming window. The compute_spectrogram function defined below takes as input the audio signal, the sampling rate, the length of the Hamming window and the window overlap (both in seconds), and the FFT length (optional; it defaults to the next power of two greater than or equal to the window length). It returns the spectrogram as a 2D array of amplitude values, along with the FFT length and the corresponding time and frequency axes.

We will use this function to compute spectrograms for our audio data; these spectrograms are the starting point for the MFCC features extracted later.

In [4]:
# Note: any trailing samples that do not fill a complete window are discarded
def compute_spectrogram(audio, sr, winLen_seconds=0.02, winOverlap_seconds=0.01, n=None):
    """
    Computes the spectrogram of an audio signal.

    Args:
        audio (array): The input audio signal.
        sr (int): The sampling rate in Hz.
        winLen_seconds (float): The length of the analysis window in seconds.
        winOverlap_seconds (float): The overlap between adjacent windows in seconds.
        n (int): The FFT length. If None, the next power of two greater than or equal to the window length is used.

    Returns:
        The spectrogram, the FFT length, the time axis, and the frequency axis of the audio signal.
    """
    # Determine the window length and overlap in samples
    winLen = int(winLen_seconds * sr)
    winOverlap = int(winOverlap_seconds * sr)
    
    # Use default value for n if it is not specified
    if n is None:
        n = 2 ** int(np.ceil(np.log2(winLen)))
    
    # Create a Hamming window of the appropriate length
    win = hamming(winLen)
    nWin = (len(audio) - winLen) // (winLen - winOverlap) + 1  # number of complete analysis windows
    
    # Initialize the spectrogram array
    spectrogram = np.zeros((nWin, n // 2 + 1))    

    # Compute the spectrogram for each window
    for i in range(nWin):
        # Extract the windowed segment from the audio signal
        start = (winLen - winOverlap)*i
        stop = start + winLen
        audio_segment = audio[start:stop]

        # Apply the Hamming window to the audio segment
        audio_segment_windowed = audio_segment * win
        # Compute the fft of the windowed segment and take the single-sided spectrum
        audio_segment_windowed_spectrum = np.fft.fft(audio_segment_windowed,n)[0:n // 2 + 1]

        # Store the Amplitude Spectrum in the spectrogram
        spectrogram[i,:] = np.abs(audio_segment_windowed_spectrum)
    
    # Compute the frequency and time axes (time points are the window centers)
    audio_segment_windowed_freqAxis = np.fft.fftfreq(n, 1.0/sr)[0:n // 2 + 1]
    audio_segment_windowed_freqAxis[-1] = sr / 2  # fftfreq gives -Nyquist here; use +Nyquist instead
    audio_segment_windowed_timeAxis = (np.arange(nWin) * (winLen - winOverlap) + winLen / 2) / sr
    return spectrogram, n, audio_segment_windowed_timeAxis, audio_segment_windowed_freqAxis
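
As a rough cross-check, scipy.signal.spectrogram computes a comparable short-time magnitude spectrogram. The values will not match compute_spectrogram exactly (scipy applies its own scaling and detrending, and defaults the FFT length to the window length), but the shapes and overall structure should agree. A minimal sketch:

In [ ]:
# Cross-check with scipy.signal.spectrogram (a sketch; values differ by scaling)
from scipy.signal import spectrogram as scipy_spectrogram

f_check, t_check, S_check = scipy_spectrogram(audio, fs=sr, window='hamming',
                                              nperseg=int(0.02 * sr),
                                              noverlap=int(0.01 * sr),
                                              mode='magnitude')
print(S_check.shape)  # (n_freq_bins, n_frames): transposed w.r.t. compute_spectrogram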

In this code cell, we create spectrograms for the original and pre-emphasized audio signals using the compute_spectrogram function with a window length of 20 ms and an overlap of 10 ms. The returned n_fft value is the number of FFT points, chosen by the function as the next power of two greater than or equal to the window length.

We plot the spectrograms side-by-side using the subplots function to create a figure with two subplots. We use the imshow function to display the spectrograms and set the appropriate labels and titles for each subplot. We also set the extent of the plots to match the time and frequency ranges of the spectrograms. Finally, we call tight_layout to avoid overlapping between the subplots.

In [5]:
audio_spectrogram, n_fft, audio_windowed_time, audio_windowed_freq = compute_spectrogram(audio, sr)
audio_preEmphasis_spectrogram, _,  _, _ = compute_spectrogram(audio_preEmphasis, sr)

# Create a figure with two subplots side-by-side
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))

# Original audio
axs[0].set_title("Audio-Original Spectrogram")
axs[0].set_xlabel("Time [s]")
axs[0].set_ylabel("Frequency [KHz]")
audioSpecrogramImage = axs[0].imshow(20*np.log10(audio_spectrogram.T),origin='lower',aspect="auto",
                                     extent=[audio_windowed_time[0],audio_windowed_time[-1], audio_windowed_freq[0]/1000, audio_windowed_freq.max()/1000]) # log for dB

# Emphasized audio
axs[1].set_title("Audio-Emphasized Spectrogram")
axs[1].set_xlabel("Time [s]")
axs[1].set_ylabel("Frequency [KHz]")
audioEmphasizedSpectrogramImage = axs[1].imshow(20*np.log10(audio_preEmphasis_spectrogram.T),origin="lower",aspect="auto",
                                                extent=[audio_windowed_time[0],audio_windowed_time[-1], audio_windowed_freq[0]/1000, audio_windowed_freq.max()/1000]) # log for dB

# Set a tight layout to avoid overlapping
fig.tight_layout()

Speech Parametrization: Mel-Scale Filterbank Design¶

This code cell defines the functions hertzToMel, melToHertz, and designFilterbank, which convert frequencies between Hertz and the Mel scale and design a Mel-scale filterbank for use in speech analysis.

The hertzToMel and melToHertz functions implement the approximation mel(f) = 1000 * log2(1 + f/1000) and its inverse, so that 1000 Hz maps to 1000 mel.

The designFilterbank function takes the sampling frequency, the number of FFT points, the optional lower and upper frequency limits, and the number of Mel bands; it returns a matrix of nMel overlapping triangular filters (one per row, defined over the single-sided FFT bins) together with the Mel-scale band edges.

In [6]:
def hertzToMel(f):
    """
    Convert frequency in Hertz to Mel scale
    
    Parameters:
    -----------
    f: float
        Frequency in Hertz
    
    Returns:
    --------
    float
        Frequency in Mel scale
    """
    return (np.log10(1+f/1000) / np.log10(2)) * 1000

def melToHertz(m):
    """
    Convert frequency in Mel scale to Hertz
    
    Parameters:
    -----------
    m: float
        Frequency in Mel scale
    
    Returns:
    --------
    float
        Frequency in Hertz
    """
    return (10**((m/1000)*(np.log10(2))) - 1) * 1000

def designFilterbank(sr, n_fft, fMax=None, fMin=0, nMel=24):
    """
    Design a filterbank in the Mel domain
    
    Parameters:
    -----------
    sr: int
        Sampling frequency in Hz
    n_fft: int
        Number of points in the FFT
    fMax: int, optional
        Upper limit frequency in Hz; default is None, meaning the Nyquist frequency (half the sampling frequency)
    fMin: int, optional
        Lower limit frequency in Hz, default is 0
    nMel: int, optional
        Number of Mel bands to use, default is 24
        
    Returns:
    --------
    filterbank: ndarray
        A matrix of shape (nMel, n_fft // 2 + 1) representing the filterbank
    melPoints: ndarray
        The nMel + 2 band-edge frequencies on the Mel scale
    """
    if fMax is None:
        fMax = sr/2
    melMin = hertzToMel(fMin)
    melMax = hertzToMel(fMax)
    melPoints = np.linspace(melMin, melMax, nMel+2) # I want nMel overlapped triangular filters, so I need nMel + 2 points on the x-axis
    # Filterbank design in the Hertz domain
    hertzPoints = melToHertz(melPoints)
    # FFT bins per Hertz on the single-sided spectrum (used to map band edges to bin indices)
    pointsPerHertz = (n_fft // 2) / (sr / 2)
    fftBins = np.floor(hertzPoints * pointsPerHertz).astype(int)
    filterbank = np.zeros((nMel, n_fft // 2 + 1))
    
    for i in range(1,nMel+1):
        left = fftBins[i-1]
        center = fftBins[i]
        right = fftBins[i+1]

        # Computation of triangular filter
        for j in range(left, center):  # left and center are already integer bin indices
            filterbank[i-1,j] = (center - j) / (center - left)
        for j in range(center, right):
            filterbank[i-1,j] = (right - j) / (right - center)
    
    return filterbank, melPoints
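
It is useful to look at the filters themselves before applying them. A minimal sketch that plots each row of the filterbank against the single-sided FFT frequency axis (reusing the n_fft value returned by compute_spectrogram above):

In [ ]:
# Visualize the triangular mel-scale filters (a sketch)
fb_check, _ = designFilterbank(sr, n_fft, nMel=24)
freqBins = np.fft.fftfreq(n_fft, 1.0/sr)[0:n_fft // 2 + 1]
freqBins[-1] = sr / 2  # replace -Nyquist with +Nyquist

plt.figure(figsize=(14, 4))
for filt in fb_check:
    plt.plot(freqBins, filt)
plt.title("Mel-scale triangular filterbank (24 filters)")
plt.xlabel("Freq [Hz]")
plt.ylabel("Filter gain")
plt.show()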

This code cell applies the Mel-scale filterbank to the spectrograms computed earlier. The filterbank designed with designFilterbank is applied (as a matrix product) to both the original and the pre-emphasized spectrograms, producing mel spectrograms: each FFT frame is reduced to nMel band energies, i.e. the frequency axis is warped from Hz to the Mel scale and pooled by the triangular filters. The mel spectrograms are then converted to decibels (dB) with 20*np.log10 (a small constant is added to avoid taking the log of zero) and plotted with matplotlib's imshow function.

In [7]:
# Number of filters in the filterbank: 24 roughly matches the number of critical bands
# of the cochlea over the speech bandwidth; nMel is a hyperparameter
nMel = 24

# Design the filterbank
filterbank, melPoints = designFilterbank(sr, n_fft, nMel=nMel)

# Apply the filterbank to the spectrogram
audio_melSpectrogram = audio_spectrogram@filterbank.T
audio_preEmphasis_melSpectrogram = audio_preEmphasis_spectrogram@filterbank.T

# Convert in dB
audio_melSpectrogram = 20*np.log10(audio_melSpectrogram + 1e-9)
audio_preEmphasis_melSpectrogram = 20*np.log10(audio_preEmphasis_melSpectrogram + 1e-9)

# Plot spectrogram
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,30))
aspect = audio_windowed_time[-1]/melPoints[-1]

axs[0].set_title("Audio Mel Spectrogram")
axs[0].set_xlabel("Time [s]")
axs[0].set_ylabel("Mel Scale")
axs[0].imshow(audio_melSpectrogram.T, aspect=aspect, origin='lower', extent=[audio_windowed_time[0],audio_windowed_time[-1], melPoints[0], melPoints[-1]])

axs[1].set_title("Audio Emphasized Mel Spectrogram")
axs[1].set_xlabel("Time [s]")
axs[1].set_ylabel("Mel Scale")
axs[1].imshow(audio_preEmphasis_melSpectrogram.T, aspect=aspect, origin='lower', extent=[audio_windowed_time[0],audio_windowed_time[-1], melPoints[0], melPoints[-1]])

# Set a tight layout to avoid overlapping
fig.tight_layout()
plt.show()
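
For comparison, librosa provides an off-the-shelf mel spectrogram. Note that librosa uses a different mel-scale formula and filter normalization than the designFilterbank function above, so the values will not match exactly; this sketch only shows the equivalent one-line API, with the hop and window lengths mirroring the 20 ms / 10 ms framing used here:

In [ ]:
# Off-the-shelf mel spectrogram with librosa (a sketch; librosa uses a different
# mel scale and filter normalization, so values differ from the manual pipeline)
S_librosa = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                           win_length=int(0.02 * sr),
                                           hop_length=int(0.01 * sr),
                                           window='hamming', n_mels=24, power=1.0)
S_librosa_db = 20 * np.log10(S_librosa + 1e-9)
print(S_librosa_db.shape)  # (n_mels, n_frames)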

Speech Parametrization: Extracting MFCCs from Mel Spectrogram¶

Now that we have computed the Mel spectrogram, we can extract the Mel-Frequency Cepstral Coefficients (MFCCs) which are commonly used in speech recognition tasks. MFCCs capture the spectral envelope of a speech signal and are obtained by applying a Discrete Cosine Transform (DCT) to the log-amplitude Mel spectrogram. The resulting coefficients are usually normalized and only the first few coefficients are retained as they contain the most relevant information for speech recognition.

In [8]:
def dctBasis(n, m):
    # Build an n x m matrix of DCT basis functions: row i contains the cosine arguments
    # (i+1) * (j + 0.5) * pi / m for j = 0..m-1, so coefficient 0 (the energy term) is skipped
    basis = np.zeros((n,m))
    for i in range(n):
        basis[i,:] = (i+1) * ((np.arange(m) + 0.5) * (np.pi/m))
    return np.cos(basis)

def computeMFCC(audio_melSpectrogram, n_mfcc=13):
    # Number of mel filters
    nMel = audio_melSpectrogram.shape[1]
    # Generate the DCT basis
    dctBasisFunctions = dctBasis(n_mfcc, nMel)
    # Compute the DCT of the log-amplitude mel spectrogram
    audio_mfccs = dctBasisFunctions@audio_melSpectrogram.T
    # Cepstral mean and variance normalization (CMVN): normalize each coefficient to zero
    # mean and unit variance across frames; this reduces channel/session effects, although
    # it can also remove some speaker-specific information
    for i in range(audio_mfccs.shape[0]):
        audio_mfccs[i,:] = (audio_mfccs[i,:] - np.mean(audio_mfccs[i,:])) / np.sqrt(np.var(audio_mfccs[i,:]))
    return audio_mfccs

In [9]:
audio_mfccs = computeMFCC(audio_melSpectrogram)
audio_preEmphasis_mfccs = computeMFCC(audio_preEmphasis_melSpectrogram)

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,30))

axs[0].set_title("Audio MFCCs")
axs[0].set_xlabel("Frame index")
axs[0].set_ylabel("MFCC index")
axs[0].imshow(audio_mfccs, origin="lower", aspect=20)

axs[1].set_title("Audio Emphasized MFCCs")
axs[1].set_xlabel("Frame index")
axs[1].set_ylabel("MFCC index")
axs[1].imshow(audio_preEmphasis_mfccs, origin="lower", aspect=20)
# Set a tight layout to avoid overlapping
fig.tight_layout()
plt.show()
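
As a sanity check, the same projection can be computed with scipy's type-II DCT. The basis built by dctBasis matches the DCT-II cosines for coefficient indices 1..13 up to a constant factor of 2 per coefficient, and that factor is removed by the mean/variance normalization, so after CMVN the two versions should coincide. A minimal sketch, assuming a SciPy version that provides scipy.fft:

In [ ]:
# MFCCs via scipy's DCT-II (a sketch); keep coefficients 1..13, skipping c0
from scipy.fft import dct

audio_mfccs_scipy = dct(audio_melSpectrogram, type=2, axis=1)[:, 1:14].T  # (13, nFrames)
# Apply the same per-coefficient mean/variance normalization as computeMFCC
audio_mfccs_scipy = (audio_mfccs_scipy - audio_mfccs_scipy.mean(axis=1, keepdims=True)) \
                    / audio_mfccs_scipy.std(axis=1, keepdims=True)
print(np.allclose(audio_mfccs_scipy, audio_mfccs))  # should print True (up to float error)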

Bi-Gaussian Model of Feature Vectors¶

We can use a bi-Gaussian model to fit the distribution of feature vectors, and then use this model to discard feature vectors that correspond to silence or background noise. A bi-Gaussian model assumes that the distribution of feature vectors can be described by a combination of two Gaussian distributions, one for the background noise and one for the speech signal.

To fit the bi-Gaussian model we use the Expectation-Maximization (EM) algorithm. The means and covariances of the two Gaussians are initialized randomly; the E-step computes, for every frame, the posterior probability of belonging to the first component, and the M-step re-estimates the means and covariances from those posteriors. The steps are repeated until the log-likelihood of the data stops improving. Once the model has converged, each feature vector can be compared against the two components, and the frames that are more likely to belong to the background component can be discarded.

Below is the code to estimate the bi-Gaussian model of the 13-dimensional MFCC feature vectors (one vector per frame):

In [ ]:
from scipy.stats import multivariate_normal

# Arrange the MFCCs as one feature vector per row: shape (nFrames, nDims)
mfccs = audio_mfccs.T
nDims = mfccs.shape[1]

# Initialize the means and covariances of the Gaussians randomly
mu1 = np.random.randn(nDims)
cov1 = np.eye(nDims)
mu2 = np.random.randn(nDims)
cov2 = np.eye(nDims)

# Run the EM algorithm
logLikelihood = -np.inf
tolerance = 1e-4
converged = False
while not converged:
    # E-step: compute the posterior probabilities (equal mixture weights are assumed)
    p1 = multivariate_normal.pdf(mfccs, mean=mu1, cov=cov1)
    p2 = multivariate_normal.pdf(mfccs, mean=mu2, cov=cov2)
    gamma = p1 / (p1 + p2)
    
    # M-step: update the parameters of the Gaussians
    mu1 = np.sum(gamma[:, np.newaxis] * mfccs, axis=0) / np.sum(gamma)
    cov1 = np.dot((gamma[:, np.newaxis] * (mfccs - mu1)).T, (mfccs - mu1)) / np.sum(gamma)
    mu2 = np.sum((1 - gamma)[:, np.newaxis] * mfccs, axis=0) / np.sum(1 - gamma)
    cov2 = np.dot(((1 - gamma)[:, np.newaxis] * (mfccs - mu2)).T, (mfccs - mu2)) / np.sum(1 - gamma)
    
    # Compute the log-likelihood of the data
    newLogLikelihood = np.sum(np.log(p1 + p2))
    
    # Check for convergence
    if np.abs(newLogLikelihood - logLikelihood) < tolerance:
        converged = True
    else:
        logLikelihood = newLogLikelihood

# Keep only the frames assigned to the first Gaussian component; which of the two
# components corresponds to speech and which to silence/noise still has to be checked
# (e.g. by comparing the component means against the frame energies)
threshold = 0.5
p1 = multivariate_normal.pdf(mfccs, mean=mu1, cov=cov1)
p2 = multivariate_normal.pdf(mfccs, mean=mu2, cov=cov2)
gamma = p1 / (p1 + p2)
mfccsFiltered = mfccs[gamma >= threshold, :]
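
The hand-rolled EM loop above is useful for understanding the algorithm, but in practice a library implementation is more robust, since it handles mixture weights, covariance regularization, and multiple restarts. A minimal sketch with scikit-learn's GaussianMixture, assuming scikit-learn is installed (it is not used elsewhere in this notebook):

In [ ]:
# Two-component GMM with scikit-learn (a sketch; assumes scikit-learn is installed)
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(mfccs)                            # mfccs: (nFrames, nDims), as above
posteriors = gmm.predict_proba(mfccs)     # (nFrames, 2) posterior probabilities
# Which component corresponds to speech still has to be decided (e.g. by comparing
# the components against the frame energies); here we simply keep component 0.
mfccsFiltered_sklearn = mfccs[posteriors[:, 0] >= 0.5, :]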