Category: pyspark

I have multiple data frames with values that have been computed on different source data. For simplicity I'll give an example with two dataframes, but I'm looking for a solution with n dataframes.

data_1
+------+-----------+
|person|first_value|
+------+-----------+
|     1|        1.0|
|     2|        0.9|
|     3|        0.8|
|     4|        0.7|
+------+-----------+

data_2
+------+------------+
|person|second_value|
+------+------------+
| ..
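A minimal sketch of one way to combine n such dataframes, assuming they all share a person key; the data_2 values below are made up, since the teaser cuts off before showing them:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data_1 = spark.createDataFrame([(1, 1.0), (2, 0.9), (3, 0.8), (4, 0.7)], ["person", "first_value"])
data_2 = spark.createDataFrame([(1, 0.5), (2, 0.6)], ["person", "second_value"])  # hypothetical rows

# With n dataframes, outer-join them pairwise on the shared key so every
# person appears once, picking up one value column per source dataframe.
dfs = [data_1, data_2]
merged = reduce(lambda left, right: left.join(right, on="person", how="outer"), dfs)
merged.show()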

Read more

Aim: To update the values of column actv_ind to '0' when another column in the same pyspark dataframe is == 'delete', and otherwise update the value to '1'. All of the columns in the dataframe are string type.

Current approach:

df.withColumn("actv_ind",
    F.when((F.col('actv_ind').isNull()) & (F.col('cdc_status') == 'delete'), F.lit('0'))
     .otherwise(F.lit('1')))

This is not updating the actv_ind ..
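A sketch of the likely fix, assuming actv_ind should be recomputed for every row: the isNull() guard restricts the '0' branch to rows where actv_ind is already null, so dropping it makes the condition match the stated aim:

from pyspark.sql import functions as F

# Set actv_ind to '0' on deleted rows and '1' everywhere else,
# without the extra isNull() condition on actv_ind itself.
df = df.withColumn(
    "actv_ind",
    F.when(F.col("cdc_status") == "delete", F.lit("0")).otherwise(F.lit("1")),
)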

Read more

I am a newbie to Python/Apache Spark, looking for the right IDE/environment on Windows to start learning. Here are my requirements:

- Interactive shell-like editor for writing Python code for data analysis
- Auto-complete feature
- Apache Spark integration (like PySpark)

Any recommendation will be helpful, thanks. Source: Python-3x..

Read more

I have a csv like this:

id,date,newdate
1,19930101,1011993
2,19930101,1011993
3,19930101,1011993
4,19930101,1011993
5,19930102,1011993

And I also have a predefined schema:

schema = StructType(List(StructField(account_id,LongType,true),StructField(date,TimestampType,true),StructField(newdate,TimestampType,true)))

I need to find a way to read this csv and get a dataframe with formatted timestamp columns. When I set a timestamp format for only one of the columns it returns null for me, and I have ..
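A sketch of one workaround: the csv reader's single timestampFormat option applies to every timestamp column at once, which is why parsing with one format nulls out the other column. Reading both columns as strings and parsing each with its own format avoids that, assuming date is yyyyMMdd, newdate is dMMyyyy, and the file path shown is hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the date columns as plain strings first, then parse each one
# with the format that actually matches its layout.
raw = spark.read.csv("data.csv", header=True, inferSchema=False)  # hypothetical path

df = (raw
      .withColumn("date", F.to_timestamp("date", "yyyyMMdd"))
      .withColumn("newdate", F.to_timestamp("newdate", "dMMyyyy")))
df.show()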

Read more

pandas.cut() is used to bin values into discrete intervals. For instance,

pd.cut(
    np.array([0.2, 0.25, 0.36, 0.55, 0.67, 0.78]),
    3, include_lowest=True, right=False
)
Out[9]:
[[0.2, 0.393), [0.2, 0.393), [0.2, 0.393), [0.393, 0.587), [0.587, 0.781), [0.587, 0.781)]
Categories (3, interval[float64]): [[0.2, 0.393) < [0.393, 0.587) < [0.587, 0.781)]

How can I achieve the same in PySpark? I ..
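A sketch of one equivalent using Bucketizer from pyspark.ml, which assigns a bucket index rather than an interval label; the splits below are the bin edges taken from the pd.cut output above, including its widened right bound:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(v,) for v in [0.2, 0.25, 0.36, 0.55, 0.67, 0.78]], ["value"])

# Bucket edges matching the three left-closed intervals pd.cut produced.
splits = [0.2, 0.393, 0.587, 0.781]
binned = Bucketizer(splits=splits, inputCol="value", outputCol="bin").transform(df)
binned.show()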

Read more

Aim: To basically implement this query:

select *,
  case when new_x != x or new_y != y then 'some_status_change'
       else cdc_status
  end as cdc_status
from dataframe
where cdc_status = 'noUpdateRequired'

I am trying to implement this logic using pyspark (3.0.0) and spark (2.4.4), and I currently have this:

df = df.withColumn("cdc_status",
    F.when(((F.col('cdc_status') == 'noUpdateRequired') & (F.col('new_autoapproveind') != ..
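A sketch of the full expression, using the x/y column names from the query since the teaser's own column list is cut off; the SQL's where clause is folded into the when condition, so rows that don't match simply keep their existing cdc_status instead of being filtered out:

from pyspark.sql import functions as F

df = df.withColumn(
    "cdc_status",
    F.when(
        (F.col("cdc_status") == "noUpdateRequired")
        & ((F.col("new_x") != F.col("x")) | (F.col("new_y") != F.col("y"))),
        F.lit("some_status_change"),
    ).otherwise(F.col("cdc_status")),
)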

Read more