I have a large data set (df) with 4M+ rows, showing the logon behavior of approximately 14k unique user ids from February to May this year. Only part of this large df is relevant: I need to look at roughly half of those 14k user ids. That other list of user ids lives in a separate data frame, and I want to use it to filter the large df, deleting the user ids I do not want to look at. How can I do that?
To exemplify what I am getting at, view this:
I need to import a list of user ids from another data frame and use it to filter the contents of the column 'User_Name' below. The imported list contains many of the same user ids already listed in 'User_Name', so in practical terms the task is to remove the user ids in 'User_Name' that are NOT relevant. The column 'User_Name' contains many user ids multiple times (although the example below does not directly show that).
Logon_Time        User_Name
01-03-2020 01:00  146996       1
                  192184       1
                  192357       1
01-03-2020 01:01  200141       1
01-03-2020 01:02  190235       1
                              ..
31-05-2020 23:58  161871       1
                  182574       1
                  192903       1
31-05-2020 23:59  193814       1
                  195437       1
Length: 825559, dtype: int64
So how do I tell pandas to match a list of 6.6k unique user ids against a list of 14k user ids (each listed multiple times because of the many logons over time), and then throw away the rows for user ids that are not in the 6.6k list?
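A minimal sketch of this kind of filtering with `Series.isin` (the DataFrame contents and the `relevant_ids` list here are illustrative stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-in for the large logon DataFrame (4M+ rows, ~14k unique users)
df = pd.DataFrame({
    "Logon_Time": ["01-03-2020 01:00", "01-03-2020 01:00", "01-03-2020 01:01"],
    "User_Name": [146996, 192184, 200141],
})

# Toy stand-in for the ~6.6k relevant user ids
relevant_ids = [146996, 200141]

# Keep only the rows whose User_Name appears in the relevant list
df_small = df[df["User_Name"].isin(relevant_ids)]
print(df_small)
```

Because the 14k-user DataFrame lists each user many times, `isin` does the matching row by row and keeps every logon of a relevant user, which is usually what is wanted here.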
To create the list with 6.6k user ids, I did the following (notice the df3 = pd.merge(dataset, all_data, how='inner', ...) line at the end of the code). I started with 25k rows and, after removing duplicates, ended up with 6.6k rows:
import pandas as pd

dataset = pd.read_csv("...listing_unique_students_all.csv", sep=";",
                      encoding="utf-8", low_memory=False).dropna()
all_data = pd.read_csv("...Approved_S_numbers_Cand_Dipl_Bach.csv", sep=";",
                       encoding="ISO-8859-1")

dataset['User Name'] = pd.to_numeric(dataset['User Name'], errors='coerce')

# Rows containing duplicate data
print("\nRows containing duplicate data")
duplicate_rows_dataset = dataset[dataset.duplicated()]
print("Number of duplicate rows: ", duplicate_rows_dataset.shape)

# Dropping the duplicates
print("\nDropping the duplicates")
drop_duplicates = dataset.drop_duplicates()
print("\ndrop_duplicates.head(25)")
print(drop_duplicates.head(25))
print("\ndrop_duplicates.shape")
print(drop_duplicates.shape)

dataset = drop_duplicates
dataset.columns = ['User Name']
print("\nall_data.columns")
print(all_data.columns)

df3 = pd.merge(dataset, all_data, how='inner',
               left_on='User Name', right_on='studienr')
print("\ndf3")
print(df3.head())
print("\ndf3.shape")
print(df3.shape)
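Once df3 holds the 6.6k approved ids, the large logon DataFrame can be reduced in one step with the same `isin` idea. A sketch, assuming the big frame is called `df` with a `User_Name` column and that df3's id column is named `User Name` as in the code above (toy data used in place of the real frames):

```python
import pandas as pd

# Toy stand-ins for the question's two DataFrames
df = pd.DataFrame({
    "Logon_Time": ["01-03-2020 01:00", "01-03-2020 01:01", "01-03-2020 01:02"],
    "User_Name": [146996, 999999, 190235],
})
df3 = pd.DataFrame({"User Name": [146996, 190235]})

# Drop every logon row whose user id is not in df3's approved list
df_relevant = df[df["User_Name"].isin(df3["User Name"])].copy()
print(df_relevant.shape)
```

An inner pd.merge of df on df3 would achieve the same filtering, but it would also duplicate rows if df3 ever contained a user id twice, so `isin` is the safer choice after the de-duplication step.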
Source: Python-3x Questions