Category: statistics

I have a set of data points that I have used to generate my empirical CDF, which looks like this (to simplify things I have reduced the number of points for this question, but it shouldn't matter). Given this data and plot, I need to somehow generate random values which follow this distribution. I admit ..

I have a data set: Google Sheet Data. While performing a two-sample t-test assuming unequal variances, the Excel output is this. I am trying to replicate the same in Python using: T_test = ttest_ind(df.dropna()['PRE'], rest.dropna()['POST'], equal_var=False, alternative="less"); result = T_test. The p-value from SciPy is 0.004689, whereas in Excel ..
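For the first question above, one standard approach is inverse-transform sampling against the empirical quantile function: draw uniform values in [0, 1) and map them through the data's quantiles. A minimal sketch, using hypothetical exponential data in place of the asker's points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the asker's data points.
data = np.sort(rng.exponential(scale=2.0, size=1000))

def sample_from_ecdf(data, n):
    """Draw n values following the empirical distribution of `data`
    by inverse-transform sampling: uniform quantiles mapped through
    the empirical quantile function."""
    u = rng.uniform(0.0, 1.0, size=n)
    # np.quantile interpolates linearly between order statistics,
    # which slightly smooths the step-function ECDF.
    return np.quantile(data, u)

samples = sample_from_ecdf(data, 10_000)
```

The generated samples always stay within the observed data range; if values beyond the observed minimum/maximum are needed, a fitted parametric distribution or kernel density estimate would be required instead.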

If I want to keep my bin-to-bin fluctuations at or below 5%, how do I find the maximum number of bins I should use? I am using a list of 10,000 random numbers to analyze its histogram distribution. Here is an example: I am using the same data but with different bin sizes. Bin ..

I am trying to create a box plot with the matplotlib library in Python. The code is given below. fig, ax = plt.subplots(figsize=(8, 6)) bp = ax.boxplot([corr_df['bi'], corr_df['ndsi'], corr_df['dbsi'], corr_df['mbi']], patch_artist=True, notch=True, vert=True) ax.set_title("Spearman's correlation coefficient for Soil indices", fontsize=14) ax.set_xlabel("Indices", fontsize=14) ax.set_ylabel("Spearman's correlation coefficient", fontsize=14) colors = ['#088A08', '#FFFF00', '#01DFD7', '#FF00FF', ..
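A runnable version of that boxplot code, with the common pitfalls fixed (notch should be the boolean True, not the string 'True') and a hypothetical corr_df standing in for the asker's DataFrame:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical stand-in for the asker's corr_df.
rng = np.random.default_rng(1)
corr_df = pd.DataFrame({
    "bi": rng.uniform(-1, 1, 50),
    "ndsi": rng.uniform(-1, 1, 50),
    "dbsi": rng.uniform(-1, 1, 50),
    "mbi": rng.uniform(-1, 1, 50),
})

fig, ax = plt.subplots(figsize=(8, 6))
bp = ax.boxplot(
    [corr_df["bi"], corr_df["ndsi"], corr_df["dbsi"], corr_df["mbi"]],
    patch_artist=True,
    notch=True,   # a bool, not the string 'True'
    vert=True,
)
ax.set_title("Spearman's correlation coefficient for Soil indices", fontsize=14)
ax.set_xlabel("Indices", fontsize=14)
ax.set_ylabel("Spearman's correlation coefficient", fontsize=14)

# patch_artist=True makes the boxes fillable patches.
colors = ["#088A08", "#FFFF00", "#01DFD7", "#FF00FF"]
for patch, color in zip(bp["boxes"], colors):
    patch.set_facecolor(color)

fig.savefig("boxplot.png")
```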

So I have a data science interview at Google, and I'm trying to prepare. One of the questions I see a lot (on Glassdoor) from people who have interviewed there before is: "Write code to generate a random normal distribution." While this is easy to do using numpy, I know sometimes Google asks the candidate ..

I have a dataset where every data sample consists of 10-20 2D coordinate points. The data is mostly clean, but occasionally there are falsely annotated points. For illustration, the cleanly annotated data would look like these: either clustered in a small area or spread across a larger area. The outliers I'm trying to filter out ..
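For the interview question, the classic numpy-free answer is the Box-Muller transform: two independent uniform draws map to two independent standard normals. A minimal sketch using only the standard library:

```python
import math
import random

def box_muller(n, mu=0.0, sigma=1.0, seed=42):
    """Generate n normally distributed values without numpy, via the
    Box-Muller transform: uniforms (u1, u2) give two independent
    standard normals r*cos(2*pi*u2) and r*sin(2*pi*u2),
    where r = sqrt(-2 * ln(u1))."""
    rnd = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = rnd.random()
        u2 = rnd.random()
        r = math.sqrt(-2.0 * math.log(1.0 - u1))  # 1 - u1 avoids log(0)
        out.append(mu + sigma * r * math.cos(2.0 * math.pi * u2))
        if len(out) < n:
            out.append(mu + sigma * r * math.sin(2.0 * math.pi * u2))
    return out

samples = box_muller(100_000)
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
```

With 100,000 draws the sample mean and variance should land very close to 0 and 1, which is a quick sanity check an interviewer is likely to ask for.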

This is my first time writing to a board like this, so I have no idea how many things I'm doing wrong; sorry in advance. I'm trying to perform a Tukey-Kramer test on a DataFrame I have constructed from a dictionary, but I keep getting "AttributeError: 'str' object has no attribute 'dropna'" ..
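That AttributeError usually means a string (typically a column *name*) ended up where a Series was expected; iterating over a DataFrame yields column names, not columns. A hedged sketch using scipy.stats.tukey_hsd (available in recent SciPy versions) with hypothetical dictionary data in place of the asker's:

```python
import pandas as pd
from scipy.stats import tukey_hsd  # requires a recent SciPy

# Hypothetical dictionary standing in for the asker's data.
data = {
    "A": [24.5, 23.5, 26.4, 27.1, 29.9],
    "B": [28.4, 34.2, 29.5, 32.2, 30.1],
    "C": [26.1, 28.3, 24.3, 26.2, 27.8],
}
df = pd.DataFrame(data)

# Select each column explicitly (df[col] is a Series with .dropna();
# plain iteration over df would yield the string column names).
groups = [df[col].dropna() for col in df.columns]

res = tukey_hsd(*groups)  # pairwise comparisons across all groups
```

res.pvalue is a k-by-k matrix of pairwise p-values (here 3x3). With unequal group sizes this is the Tukey-Kramer variant; statsmodels' pairwise_tukeyhsd is an alternative if long-format data is preferred.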