Trying to code the nearest neighbours algorithm – euclidean distance function only calculates the distances for one row of the test set – why?

  algorithm, euclidean-distance, indexing, python

I am trying to code the Nearest Neighbours algorithm from scratch and have come across a problem – my algorithm only gives the index/classification of the nearest neighbour for one row/point of the test set. I went through every part of my code and realised that the problem is my Euclidean distance function. It only gives the result for one row.

This is the code I have written for the Euclidean distance:

def euclidean_dist(r1, r2):
    dist = 0
    for j in range(0, len(r2)-1):
        dist = dist + (r2[j] - r1[j])**2
    return dist**0.5
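
One thing worth checking in the function above: `range(0, len(r2)-1)` stops one element early, so the last entry of each row is left out of the sum. That is only what you want if the last column holds the class label. A minimal sketch, assuming both rows contain features only (the name `euclidean_dist_all` and the two sample rows are made up for illustration):

```python
import math

def euclidean_dist_all(r1, r2):
    """Euclidean distance over every element of two equal-length rows."""
    dist = 0.0
    for a, b in zip(r1, r2):
        dist += (b - a) ** 2
    return dist ** 0.5

# Illustrative rows, not taken from the iris data
p = [0.0, 0.0, 0.0, 0.0]
q = [1.0, 2.0, 2.0, 4.0]

print(euclidean_dist_all(p, q))  # 5.0
print(math.dist(p, q))           # 5.0, standard-library check (Python 3.8+)
```

With the original `len(r2)-1` bound, the same two rows would give 3.0, because the last difference (4.0) is never added.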

Within my Nearest Neighbours algorithm, this is how the Euclidean distance function is used:

for i in range(len(x_test)):
    dist1 = []
    dist2 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
        dist2.append(distances)
    dist1 = np.array(dist1)
    # Separate sorting function to sort the distances from lowest to highest.
    # The aim was to get one array, dist1, with the Euclidean distances for each row sorted,
    # and one array, dist2, with the unsorted Euclidean distances
    # (to be able to search for the index later in the code).
    sorting(dist1)
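
As an aside on the sorted/unsorted pair: if the goal is to recover which training row each sorted distance came from, `np.argsort` returns those indices directly, so a second copy of the list may not be needed. A small sketch with made-up distances (the values and the names `order` and `sorted_dist` are illustrative only, and this swaps in NumPy for the separate `sorting` function):

```python
import numpy as np

# Illustrative distances from one test point to five training points
dist = np.array([2.5, 0.4, 3.1, 0.9, 1.7])

order = np.argsort(dist)     # training-row indices, nearest first
sorted_dist = dist[order]    # the same distances in ascending order

print(order)        # [1 3 4 0 2]
print(sorted_dist)  # [0.4 0.9 1.7 2.5 3.1]
```

Here `order[0]` is the index of the nearest training row, which can be used to look up its label.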

I noticed the problem when trying out this part of the function on the iris dataset. I split the data set into test and training sets (X_test, X_train and y_test).

When this was run on the data set, I got the following array for dist2:

[0.3741657386773946,
 1.643167672515499,
 3.389690251335658,
 2.085665361461421,
 1.284523257866513,
 3.9572717874818752,
 0.9539392014169458,
 3.5805027579936315,
 0.7211102550927979,
      ...
 0.8062257748298555,
 0.4242640687119287,
 0.5196152422706631]

Its length is 112, which is the same length as X_train, but these are only the Euclidean distances for the first row or point of the X_test set. The dist1 array is the same except that it is sorted.

Why am I not getting the Euclidean distances for every row/point of the test set? I thought I iterated through correctly with the for loops, but clearly something is not quite right. Any advice or help would be appreciated.
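
A likely cause, going only by the code shown: `dist1` and `dist2` are re-created as empty lists at the top of every pass of the outer loop, so at any moment they hold the distances for a single test row, and that is all that can be inspected. The sketch below keeps the same loop structure but appends each row's list to an outer container. The names `all_dists` and `row_dists` and the small arrays are made up for illustration and are not the original iris split; the distance function here also sums over all features, which is an assumption.

```python
import numpy as np

def euclidean_dist(r1, r2):
    # Euclidean distance over all features of two rows
    return float(np.sqrt(np.sum((r2 - r1) ** 2)))

# Small made-up arrays standing in for the iris split
x_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
x_test = np.array([[0.0, 0.0], [3.0, 4.0]])

all_dists = []                     # created once, outside both loops
for i in range(len(x_test)):
    row_dists = []                 # reset for each test row, as in the question
    for j in range(len(x_train)):
        row_dists.append(euclidean_dist(x_test[i], x_train[j, :]))
    all_dists.append(row_dists)    # store this row's result before the reset

all_dists = np.array(all_dists)
print(all_dists.shape)             # (2, 3): one row per test point

nearest = np.argmin(all_dists, axis=1)
print(nearest)                     # [0 1]: nearest training row per test point
```

With the real split, `all_dists` would have one row per test point and 112 columns, one per training row.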

Source: Python Questions
