I have 2 years of user data (2019 and 2020) for a growing company, and have identified a clear correlation with the weather. I have clustered some of the data as shown below. For each week, I took the average of the daily maximum temperature and paired it with the total number of users for that week, like so:
df4 (below) is for 2020; I have the same for 2019 in df3:
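from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

kmeans = KMeans(n_clusters=3).fit(df4)
centroids = kmeans.cluster_centers_
plt.scatter(df4['tmax'], df4['Entries'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
# centroid columns follow df4's column order (this assumes Entries first, then tmax)
plt.scatter(centroids[:, 1], centroids[:, 0], c='red', s=50)
plt.show()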
I also have the weather for the first 3 months of 2021. How can I begin to use the weather data I have, together with the information from the past 2 years, to estimate the number of users in a given week, so I can compare it with the actual figure? I don’t expect to get extremely close; I just want a rough estimate that follows the trend of the previous years. The idea is that I can then use a week’s weather forecast to ‘guess’ the number of users.
I also have the problem that I only have 2 years of data, as it’s a new company, and the growth rate is a massive factor (user numbers roughly doubled from a given month in 2020 to the same month in 2021).
How do I begin using K-means to form a coherent prediction? Will I have to use some form of growth rate too?
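To make the growth-rate idea concrete, here is a minimal sketch of one possible starting point. It is not K-means: it swaps in a plain least-squares line of weekly entries against tmax, then scales the result by a year-over-year growth factor. The data values and the helper names (fit_line, growth_factor, predict_entries) are hypothetical and chosen only for illustration, and a single constant growth factor is itself an assumption.

```python
# Hypothetical toy data: (weekly mean tmax, weekly total entries).
# These numbers are made up for illustration only.
weeks_2019 = [(5.0, 100), (10.0, 150), (15.0, 200), (20.0, 250)]
weeks_2020 = [(5.0, 200), (10.0, 300), (15.0, 400), (20.0, 500)]


def fit_line(points):
    """Ordinary least squares fit of entries = slope * tmax + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def growth_factor(earlier, later):
    """Ratio of total entries between two years (assumes the same weeks are covered)."""
    return sum(y for _, y in later) / sum(y for _, y in earlier)


def predict_entries(tmax, points, growth):
    """Estimate entries for a week with the given tmax, scaled by the growth factor."""
    slope, intercept = fit_line(points)
    return growth * (slope * tmax + intercept)


# Assumption: growth from 2020 to 2021 matches growth from 2019 to 2020.
growth = growth_factor(weeks_2019, weeks_2020)
estimate = predict_entries(12.0, weeks_2020, growth)
print(growth, estimate)
```

With real data, the tuples would come from the tmax and Entries columns of df3 and df4, and the 12.0 would be replaced by the forecast weekly mean of the daily maximum temperature.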
Source: Python Questions