How to sharpen word recognition with DTW + kNN

  dtw, knn, librosa, python, speech-recognition

I have been trying to distinguish between words by extracting MFCCs from audio recordings (with the librosa library), then applying Dynamic Time Warping (DTW) to the MFCC sequences and classifying the recordings with kNN.

For example, I am trying to distinguish the words "cat" and "anything".
My problem is that DTW does not find two different pronunciations of the word "anything" any more similar to each other than either is to the word "cat".
All three recordings seem roughly equidistant from one another according to DTW. I have tried lowering and raising the number of MFCC coefficients, and preprocessing the MFCCs (normalizing, removing the mean), but nothing seems to work.
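For reference, the standard form of that preprocessing is cepstral mean (and variance) normalization, applied per coefficient across time frames. A minimal sketch, assuming MFCCs in librosa's (n_coefficients, n_frames) layout (the helper name cmvn is mine):

```python
import numpy as np

def cmvn(mfcc):
    # mfcc has shape (n_coefficients, n_frames), as librosa returns it;
    # normalize each coefficient to zero mean and unit variance over time
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True) + 1e-8  # guard against constant rows
    return (mfcc - mean) / std
```

Applying this to both sequences before DTW keeps channel and loudness offsets from dominating the frame distances.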

I'm using the DTW function from the dtw package:

dist, cost, acc_cost, path = dtw(mfcc3.T, mfcc2.T, dist=lambda x, y: norm(x - y, ord=1))
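In case it helps to see what that call computes, here is a minimal standalone DTW (a sketch of the algorithm, not the dtw package's implementation) using the same L1 frame distance, with the accumulated cost normalized by a path-length bound so recordings of different durations stay comparable:

```python
import numpy as np

def dtw_distance(a, b):
    # a, b: sequences of feature vectors, frames along axis 0
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.abs(a[i - 1] - b[j - 1]).sum()  # L1 distance between frames
            # best way to reach (i, j): insertion, deletion, or match
            acc[i, j] = d + min(acc[i, j - 1], acc[i - 1, j], acc[i - 1, j - 1])
    # normalize by a path-length bound so longer recordings are not penalized
    return acc[n, m] / (n + m)
```

Whether the distance your library returns is already length-normalized is worth checking; an unnormalized DTW cost grows with recording length, which can easily swamp the word-identity signal.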

My question is: why do you think I can't classify this data?

— Did I insufficiently preprocess the data before comparing it with DTW?

— Do I need to tune DTW more carefully so that the computed distance effectively differentiates between words?

— Are kNN or DTW inadequate for my case? How can I fix this?
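One thing worth knowing on the kNN question: DTW + kNN is usually set up with DTW itself as the metric, i.e. the predicted word is the label of the training recording with the smallest DTW distance to the test clip, rather than feeding a row of distances into a Euclidean kNN as if it were a feature vector. A minimal 1-NN sketch (the function name is mine):

```python
import numpy as np

def nearest_neighbor_label(test_distances, train_labels):
    # test_distances[i] = DTW distance from the test clip to training clip i;
    # predict the label of the closest training recording
    return train_labels[int(np.argmin(test_distances))]
```

With only a handful of reference recordings per word, 1-NN on the raw DTW distances is usually the first thing to try.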

Here are the main lines of the code:

# Build the pairwise DTW distance matrix between all training recordings
for i in range(len(mots)):
    y1, sr1 = librosa.load(dirname + "/" + mots[i])
    mfcc1 = librosa.feature.mfcc(y=y1, sr=sr1)
    for j in range(len(mots)):
        y2, sr2 = librosa.load(dirname + "/" + mots[j])
        mfcc2 = librosa.feature.mfcc(y=y2, sr=sr2)
        dist, _, _, _ = dtw(mfcc1.T, mfcc2.T, dist=lambda x, y: norm(x - y, ord=1))
        distances[i, j] = dist  # DTW distance between spoken words i and j


label = ['cat', 'anything']

# Train a kNN classifier to decide whether a recording is "cat" or "anything"

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
classifier.fit(distances, y)  # y must hold one word label per recording in mots


# Compare a test sample against the training recordings to find the closest word
y_test, sr = librosa.load(dst)
mfcc = librosa.feature.mfcc(y=y_test, sr=sr)
distanceTest = []
for i in range(len(mots)):
    y1, sr1 = librosa.load(dirname + "/" + mots[i])
    mfcc1 = librosa.feature.mfcc(y=y1, sr=sr1)
    dist, _, _, _ = dtw(mfcc.T, mfcc1.T, dist=lambda x, y: norm(x - y, ord=1))
    distanceTest.append(dist)

# result
pre = classifier.predict([distanceTest])[0]
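If you want to keep scikit-learn in the loop, KNeighborsClassifier also accepts metric='precomputed': fit on the square DTW-distance matrix, then predict from the test clip's distances to each training recording. A toy sketch with made-up distances standing in for the real DTW matrix:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy symmetric DTW-distance matrix for 3 training recordings
distances = np.array([[0.0, 0.4, 2.0],
                      [0.4, 0.0, 2.1],
                      [2.0, 2.1, 0.0]])
labels = ['cat', 'cat', 'anything']

clf = KNeighborsClassifier(n_neighbors=1, metric='precomputed')
clf.fit(distances, labels)

# DTW distances from a new clip to each of the 3 training recordings
test_row = np.array([[1.9, 2.0, 0.1]])
print(clf.predict(test_row)[0])  # prints 'anything'
```

This keeps kNN operating on DTW as the actual metric instead of treating a distance row as a Euclidean feature vector.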

The code I’ve been using is mostly taken from https://github.com/aishoot/DTWSpeech/blob/master/DTW_MFCC_KNN.ipynb and applied to words instead of phonemes.

Source: Python Questions
