Logistic Regression Cost Function Fluctuates With Any Step Size

Thank you for taking the time to read this 🙂

I finished Andrew Ng’s ML course, except that I just couldn’t understand backpropagation. So I’m following the tutorial in Neural Networks and Deep Learning to create my own neural network with NumPy. But my logistic regression training cost fluctuates a lot no matter what step size I use in gradient descent (the step size is the alpha variable in the equations at the end).

I was graphing the training cost (the blue line, which is the one that fluctuates) and the test cost to create a learning curve. I wanted to figure out whether the model had high bias or high variance. It’s weird that my test cost doesn’t fluctuate, even though I use the same logistic regression cost function for both the test and training costs. (I don’t use regularisation in either case.)
[Image: test cost vs training cost learning curve]

You can run my code yourself on Kaggle here

This is what I’ve tried so far to debug my code:

  • I tried alpha values from 0.1 down to 0.000000000001 (1e-12), decreasing by a factor of 10 each time.
  • I tried increasing the batch size (from 20 to 70 in increments of 10) in case the spikes in the training cost were due to inaccurate parameter updates from a small batch size.
  • I double-checked my implementation of backpropagation against the MUCH CLEANER implementation in Neural Networks and Deep Learning 😀 The only difference I can see is that he doesn’t sum his bias partial derivatives over all examples? See his lines 104 and 116 and my #Gradients to Return section.
  • I looked up other people’s issues with implementing neural networks while following Andrew Ng’s ML course. This question and this question and this question all turned out to be coding issues in the implementation. It’s entirely possible that I also have a coding issue in my implementation 🙁 This is my second time trying to implement backpropagation (I gave up last time). But I did check that I don’t have the same issues as those questions.
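On the bias-gradient point, here is a minimal sketch of the comparison. As far as I can tell, the book's backprop handles one example at a time and its mini-batch update then adds the per-example gradients together, so a vectorised version that sums over the batch axis should give the same bias gradient. The array `delta` and its shape (20 examples, 3 output units) are made up for illustration and are not taken from the Kaggle notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output-layer errors: one row per example, one column per unit.
delta = rng.normal(size=(20, 3))

# Per-example style: accumulate each example's bias gradient in a loop.
nabla_b_loop = np.zeros(3)
for example_delta in delta:
    nabla_b_loop += example_delta

# Vectorised style: sum over the batch axis in one call.
nabla_b_vec = delta.sum(axis=0)

print(np.allclose(nabla_b_loop, nabla_b_vec))
```

Either way, the summed gradient still has to be scaled consistently (for example divided by the batch size) before it is multiplied by alpha, otherwise the effective step size changes with the batch size.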

These are the theoretical equations I’m using:
[Image: logistic regression cost function]
C is for the Cost, m is the number of training examples, K is the number of output classes, y is a m by K matrix with true/false labels for all classes for each example, h is a m by K matrix with predictions for all classes for each example, and theta is a vector with all parameters (weights and biases).
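In case the image does not load, the cost can be written out from the definitions above. This is a reconstruction that assumes the standard unregularised cross-entropy form from Andrew Ng's course; the indices i (example) and k (class) are introduced here for illustration.

```latex
C(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \left[ y_{ik} \log h_{ik} + (1 - y_{ik}) \log\left(1 - h_{ik}\right) \right]
```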

[Image: gradient descent update equation]

Alpha is the ‘step size’ that I’m making smaller to try to minimise the fluctuations in cost. The nabla notation denotes the vector of partial derivatives of the cost with respect to every element of theta. I compute this with the equations shown in Neural Networks and Deep Learning.
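To make the update rule concrete, here is a minimal NumPy sketch of the cost and of one step theta := theta - alpha * (gradient of C). It is not the code from the Kaggle notebook: it assumes a single sigmoid layer rather than a full network, full-batch updates, and made-up toy data, and the names `cost`, `gradient_step`, `W` and `b` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(h, y, eps=1e-12):
    """Unregularised logistic cost; h and y are m-by-K arrays."""
    m = y.shape[0]
    h = np.clip(h, eps, 1.0 - eps)  # keep log() away from 0
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)) / m

def gradient_step(W, b, X, y, alpha):
    """One update theta := theta - alpha * grad C for a single sigmoid layer."""
    m = X.shape[0]
    h = sigmoid(X @ W + b)
    delta = h - y                   # dC/dz for sigmoid outputs with this cost
    grad_W = X.T @ delta / m        # averaged over the m examples
    grad_b = delta.sum(axis=0) / m  # bias gradient: sum over examples, then average
    return W - alpha * grad_W, b - alpha * grad_b

# Toy data (assumption): 200 examples, 5 features, K = 3 independent labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_W = rng.normal(size=(5, 3))
y = (sigmoid(X @ true_W) > 0.5).astype(float)

W = np.zeros((5, 3))
b = np.zeros(3)
costs = []
for _ in range(100):
    costs.append(cost(sigmoid(X @ W + b), y))  # cost on the full training set
    W, b = gradient_step(W, b, X, y, alpha=0.1)

print(costs[0], costs[-1])
```

Because every cost value here is computed on the same full training set, the curve from this sketch gives a baseline to compare against a cost that is recorded per mini-batch.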

Source: Python Questions