What’s the best scientific computing/statistical package for my regression analysis/neural network?

  python, tensorflow

I’m trying to build something like Salesforce’s Einstein in Python. Einstein takes a big dataset (all of the customer records and how much $ each one is worth) and crunches it into one best-guess number, a "score." They call it an AI in their marketing literature, but it’s hardly that. It’s been a few years since I spent much time in SPSS/R, but I think this is just a multiple regression analysis.

I’ll be crunching a dataset that’s a couple thousand rows and max 20 columns (though I’d probably hand-pick the columns and it would be about 8). About thirty rows are added to the dataset every day. I could afford to wait maybe fifteen minutes for those new rows to return a resulting score, and I could simplify further by making that resulting score boolean instead of a granular rating, if that matters.

So, I’m putting it to the Stack Overflow community because you guys have kept up with the tech more than I have.

  • Do I need to go with something like TensorFlow and build a neural network? My understanding is that this path would give me a trained model that I could feed a new row into and get a result back very quickly. But I have a feeling that training a neural network on 2000 rows/16000 data points is tantamount to just generating random output.

  • On the other end of the spectrum is something like NumPy. My worry here is that it would need to re-crunch the whole dataset every time it scored any new data. I have no idea how long it would take to do a multiple regression analysis on 2000 rows. If you told me it was thirty seconds, I’d believe you. If you told me it was six hours, I’d believe you. My hard limit on this calculation would be ~15 minutes on an EC2 instance of any (reasonable) size.

  • Another package?
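
For a sense of scale on the NumPy option, here is a minimal sketch, assuming synthetic data in place of the real customer records (the coefficients, noise level, and variable names are all made up for illustration). It fits an ordinary least-squares multiple regression on 2000 rows by 8 columns with `np.linalg.lstsq`, times the fit, and then scores 30 new rows using the fitted coefficients without refitting.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 2000 rows, 8 columns.
n_rows, n_cols = 2000, 8
X = rng.normal(size=(n_rows, n_cols))
true_coef = rng.normal(size=n_cols)
y = X @ true_coef + 5.0 + rng.normal(scale=0.1, size=n_rows)

# Prepend an intercept column, then solve ordinary least squares.
X_design = np.column_stack([np.ones(n_rows), X])
start = time.perf_counter()
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
fit_seconds = time.perf_counter() - start

# Scoring the ~30 new daily rows is one matrix-vector product;
# a refit is only needed if the coefficients should be updated.
new_rows = rng.normal(size=(30, n_cols))
scores = np.column_stack([np.ones(30), new_rows]) @ coef

print(f"fit took {fit_seconds:.4f} s, produced {scores.shape[0]} scores")
```

Running this on the target machine should show directly whether the fit sits anywhere near the 15-minute limit. If the score ends up boolean, logistic regression would probably be a more natural fit than least squares, but that swap is not shown here.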

Source: Python Questions
