Category: pyspark

I have a question similar to https://stackoverflow.com/posts/68805893/, but with the addition that extra columns need to be carried along, and I need to know which element was the last one of the list where the sliding window was applied. I'll give an example. Given a df:

input_df = spark.createDataFrame([
    (2, [1, 2, 3, 4, 5], ["a", "b", "c", "c", "b"], ["a", "a", "c", "c", "d"]),
], ("id", "target", "feature1", "feature2"))

..
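The excerpt is cut off before the expected output, but sliding windows over array columns are commonly built by exploding the possible start positions and slicing every array column with the same offsets, which keeps the extra columns aligned and makes the last element of each window easy to extract. A minimal sketch, assuming a window size of 3 (the original question does not state one):

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    input_df = spark.createDataFrame([
        (2, [1, 2, 3, 4, 5], ["a", "b", "c", "c", "b"], ["a", "a", "c", "c", "d"]),
    ], ("id", "target", "feature1", "feature2"))

    w = 3  # assumed window size

    result = (
        input_df
        # one row per window start position: 1 .. size(target) - w + 1
        .withColumn("start", F.expr(f"explode(sequence(1, size(target) - {w} + 1))"))
        # slice all array columns with the same (start, length) so they stay aligned
        .select(
            "id",
            F.expr(f"slice(target, start, {w})").alias("target_window"),
            F.expr(f"slice(feature1, start, {w})").alias("feature1_window"),
            F.expr(f"slice(feature2, start, {w})").alias("feature2_window"),
        )
        # the element each window ended on
        .withColumn("last_target", F.element_at("target_window", -1))
    )
    result.show(truncate=False)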

Read more

I have a Date and an Hour column in a PySpark dataframe. How do I merge these together to get the Desired_Calculated_Result column?

df1 = sqlContext.createDataFrame(
    [
        ('2021-10-20', '1300', '2021-10-20 13:00:00.000+0000'),
        ('2021-10-20', '1400', '2021-10-20 14:00:00.000+0000'),
        ('2021-10-20', '1500', '2021-10-20 15:00:00.000+0000'),
    ],
    ['dt', 'tm', 'Desired_Calculated_Result'],
)

..
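The question is truncated here, but dt looks like a plain date string and tm a four-digit HHmm hour, so one sketch (the HHmm assumption and the output column name "calculated" are mine) is to concatenate the two and parse the result:

    import pyspark.sql.functions as F

    result = df1.withColumn(
        "calculated",  # hypothetical output column name
        F.to_timestamp(F.concat_ws(" ", "dt", "tm"), "yyyy-MM-dd HHmm"),
    )
    result.show(truncate=False)

This yields a proper timestamp column; if the literal '+0000'-suffixed string shown in Desired_Calculated_Result is needed, date_format can render it back out, keeping in mind that the session time zone affects the rendered offset.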

Read more

I can't find information about the case when the function is invoked over broadcast objects in Spark like this, without arguments (everything is on the executors already):

df_b = spark.sparkContext.broadcast(pandas_df)

def function_running_on_each_executor():
    data = df_b.value
    # some processing
    data.groupby('id').sum()  # and so on
    # ...
    # submit it to the ML model and get predictions

I ..
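The excerpt breaks off mid-sentence, but as a general pattern the function still has to be shipped to the executors by some RDD or DataFrame operation; Spark never calls it "without arguments" on its own. A minimal sketch of reading a broadcast pandas DataFrame inside mapPartitions (the partition payload and the model step are assumptions):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    pandas_df = pd.DataFrame({"id": [1, 1, 2], "x": [1.0, 2.0, 3.0]})
    df_b = spark.sparkContext.broadcast(pandas_df)

    def run_on_partition(rows):
        # the broadcast value is deserialized locally on the executor
        data = df_b.value
        summed = data.groupby("id").sum()  # some processing on the broadcast copy
        # ... feed `summed` plus this partition's rows to a model here (assumed)
        yield summed.shape

    # the function is distributed by attaching it to an RDD operation
    print(spark.sparkContext.parallelize(range(4), 2).mapPartitions(run_on_partition).collect())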

Read more