Category: apache-spark

Aim: To update the values of column actv_ind to '0' when another column in the same PySpark dataframe is == 'delete', and to '1' otherwise. All of the columns in the dataframe are string type. Current approach: df.withColumn("actv_ind", F.when((F.col('actv_ind').isNull()) & (F.col('cdc_status') == 'delete'), F.lit('0')).otherwise(F.lit('1'))) This is not updating the actv_ind ..

Read more
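A minimal sketch of the intended logic (an assumption on my part: if every row should get '0' or '1' based solely on cdc_status, the extra isNull() check on actv_ind is what keeps the '0' branch from firing on rows where actv_ind already holds a value):

```python
from pyspark.sql import functions as F

# Drop the isNull() guard so cdc_status alone drives the flag.
df = df.withColumn(
    "actv_ind",
    F.when(F.col("cdc_status") == "delete", F.lit("0")).otherwise(F.lit("1")),
)
```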

I am a newbie to Python/Apache Spark, looking for the right IDE/environment on Windows to start learning. Here are my requirements: an interactive shell-like editor for writing Python code for data analysis, an auto-complete feature, and Apache Spark integration (like PySpark). Any recommendation will be helpful, thanks. Source: Python-3x..

Read more
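Whatever editor one picks, a pip-installed PySpark gives a local session that interactive shells such as Jupyter or IPython can autocomplete against; a minimal sketch (the app name is arbitrary):

```python
# Assumes `pip install pyspark` plus a local JDK on Windows.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")           # run Spark in-process, using all cores
    .appName("learning-pyspark")  # arbitrary name for the UI/logs
    .getOrCreate()
)

spark.range(5).show()  # quick smoke test; tab completion works on `spark.`
```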

pandas.cut() is used to bin values into discrete intervals. For instance:

pd.cut(np.array([0.2, 0.25, 0.36, 0.55, 0.67, 0.78]), 3, include_lowest=True, right=False)
Out[9]:
[[0.2, 0.393), [0.2, 0.393), [0.2, 0.393), [0.393, 0.587), [0.587, 0.781), [0.587, 0.781)]
Categories (3, interval[float64]): [[0.2, 0.393) < [0.393, 0.587) < [0.587, 0.781)]

How can I achieve the same in PySpark? I ..

Read more
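A rough PySpark equivalent is pyspark.ml.feature.Bucketizer, which maps values into half-open intervals [a, b) given explicit split points (the column names here are made up; the splits reuse the edges pandas.cut derived above):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.2,), (0.25,), (0.36,), (0.55,), (0.67,), (0.78,)], ["value"]
)

# Buckets are [a, b); only the last interval includes its upper bound.
bucketizer = Bucketizer(
    splits=[0.2, 0.393, 0.587, 0.781],
    inputCol="value",
    outputCol="bin",
)
bucketizer.transform(df).show()  # bins 0.0, 0.0, 0.0, 1.0, 2.0, 2.0
```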

Aim: To basically implement this query:

select *, case when new_x != x or new_y != y then 'some_status_change' else cdc_status end as cdc_status from dataframe where cdc_status = 'noUpdateRequired'

I am trying to implement this logic using PySpark (3.0.0) and Spark (2.4.4), and I currently have this: df = df.withColumn("cdc_status", F.when(((F.col('cdc_status') == 'noUpdateRequired') & (F.col('new_autoapproveind') != ..

Read more
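A sketch of that CASE expression as a DataFrame transformation, using the placeholder column names from the SQL (x/new_x, y/new_y; the real column list is truncated in the excerpt):

```python
from pyspark.sql import functions as F

df = (
    df.filter(F.col("cdc_status") == "noUpdateRequired")  # the WHERE clause
    .withColumn(
        "cdc_status",
        F.when(
            (F.col("new_x") != F.col("x")) | (F.col("new_y") != F.col("y")),
            F.lit("some_status_change"),
        ).otherwise(F.col("cdc_status")),
    )
)
```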

I'm currently working with pyspark.ml.classification.RandomForestClassifier and pyspark.ml.tuning.CrossValidator. I can obviously use a RandomForestClassifier as the CrossValidator's "estimator" param. However, RandomForestClassifier doesn't seem to be inheriting from pyspark.ml.base.Estimator. On the other hand, looking at the source code of RandomForestClassifier (https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/classification.html#RandomForestClassifier), I can't figure out where RandomForestClassifier implements its fit method (which in my opinion should be ..

Read more
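For what it's worth, the class's MRO answers both halves: RandomForestClassifier does subclass Estimator (through JavaEstimator), and the concrete fit() lives on Estimator itself, which delegates to the _fit() that JavaEstimator provides. A quick sketch to verify:

```python
import inspect
from pyspark.ml.base import Estimator
from pyspark.ml.classification import RandomForestClassifier

print(issubclass(RandomForestClassifier, Estimator))  # True

# Find which ancestor actually defines fit().
for cls in inspect.getmro(RandomForestClassifier):
    if "fit" in vars(cls):
        print(cls)  # expect pyspark.ml.base.Estimator
```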

I have a df below:

| year | id  | area | visitor |
| 2007 | 001 | GFD  | [{'id':'AA1', 'age':20}, {'id':'AA2', 'age':24}, {'id':'AA3', 'age':4}] |
| 2009 | 045 | TGH  | [{'id':'AA1', 'age':20}, {'id':'AA2', 'age':24}, {'id':'AA3', 'age':5}] |
| 2009 | 019 | GFD  | [{'id':'AA1', 'age':14}, {'id':'AA2', 'age':24}, {'id':'AA3', 'age':55}] |
| 2007 | 002 | GFD ..

Read more
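The excerpt cuts off before the actual question, but data shaped like this is usually modelled as an array of structs, which explode() turns into one row per visitor; a sketch with the schema reconstructed from the table (the names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Schema guessed from the table above.
df = spark.createDataFrame(
    [
        (2007, "001", "GFD", [{"id": "AA1", "age": 20}, {"id": "AA2", "age": 24}, {"id": "AA3", "age": 4}]),
        (2009, "045", "TGH", [{"id": "AA1", "age": 20}, {"id": "AA2", "age": 24}, {"id": "AA3", "age": 5}]),
    ],
    "year INT, id STRING, area STRING, visitor ARRAY<STRUCT<id: STRING, age: INT>>",
)

# One row per element of the visitor array.
df.select("year", "id", "area", F.explode("visitor").alias("v")).select(
    "year", "id", "area",
    F.col("v.id").alias("visitor_id"),
    F.col("v.age").alias("visitor_age"),
).show()
```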

I am updating Seahorse code to Spark version 2.3.2 and was testing a Python transformation. The transformation is working fine, but the following error is being logged:

2021-05-05 18:03:38,779 ERROR org.apache.spark.scheduler.DAGScheduler:91 Failed to update accumulators for task 0
java.lang.NullPointerException
  at org.apache.spark.api.python.PythonAccumulatorV2.openSocket(PythonRDD.scala:599)
  at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:618)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1131)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1123)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1123)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207) ..

Read more