Category : pyspark

So I have a given PySpark DataFrame, say df, looking like below:

df.show()
+------------+-------+
|      series|  value|
+------------+-------+
|   XXXX-AAAA|      1|
|     XXXX-BB|      2|
|  XXXX-CCCCC|      3|
+------------+-------+

In the series column, I would like to get rid of the XXXX- substring (i.e. a length of 5 characters), which ..
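A minimal sketch of one way to do this, assuming the prefix is always the first 5 characters (regexp_replace is shown as an alternative in case the prefix only sometimes appears):

from pyspark.sql import functions as F

# Keep everything after the first 5 characters ("XXXX-");
# SQL substring() is 1-indexed, so the retained part starts at position 6.
df = df.withColumn("series", F.expr("substring(series, 6, length(series) - 5)"))

# Or strip a literal "XXXX-" prefix only where it actually occurs:
df = df.withColumn("series", F.regexp_replace("series", "^XXXX-", ""))
df.show()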

Read more

I have a dataset which contains multiple columns and rows. Currently it is in String type, and I want to convert it to a date-time format for a further task. I tried the code below, which returns null:

df = df.withColumn('Date_Time', df['Date_Time'].cast(TimestampType()))
df.show()

I tried some of the solutions from here, but none of them is working at all, in ..
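A bare cast() silently returns null when the string layout does not match Spark's default timestamp pattern. A minimal sketch using to_timestamp with an explicit format; the "MM/dd/yyyy hh:mm:ss a" pattern is an assumption and should be replaced with the actual layout of the Date_Time strings:

from pyspark.sql import functions as F

# Parse the string column with an explicit pattern instead of a bare cast.
# NOTE: "MM/dd/yyyy hh:mm:ss a" is an assumed format - adjust it to your data.
df = df.withColumn("Date_Time", F.to_timestamp("Date_Time", "MM/dd/yyyy hh:mm:ss a"))
df.show(truncate=False)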

Read more

Hi, I am new to Spark and I am trying to write my first program. When I run

from pyspark.mllib.linalg import Vectors

I get the following. I am using Anaconda and a Jupyter notebook. I have installed Spark and I am able to execute it from the terminal.

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-da13502db94b> in ..
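One hedged possibility when PySpark works from the terminal but fails inside an Anaconda/Jupyter kernel is that the notebook cannot see the Spark installation or picks up a mismatched Python. A sketch using the third-party findspark package to wire the kernel to Spark before importing pyspark; the install path is a placeholder, not taken from the post:

# pip install findspark
import findspark
findspark.init()   # or findspark.init("/path/to/spark") if SPARK_HOME is not set

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors

sc = SparkContext.getOrCreate()
print(Vectors.dense([1.0, 2.0, 3.0]))   # quick check that the import works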

Read more

rdd = sc.textFile('count.csv')
x = rdd.partitionBy(10, 'Hourly_Counts')
x.getNumPartitions()
x.take(1)

Sample RDD (before removing headers):

[('ID', ['Date_Time', 'Year', 'Month', 'Mdate', 'Day', 'Time', 'Sensor_ID', 'Sensor_Name', 'Hourly_Counts']),
 ('2887628', ['11/01/2019 05:00:00 PM', '2019', 'November', '1', 'Friday', '17', '34', 'Flinders St-Spark La', '300']),
 ('2887629', ['11/01/2019 05:00:00 PM', '2019', 'November', '1', 'Friday', '17', '39', 'Alfred Place', '604']),
 ('2887630', ['11/01/2019 05:00:00 PM', ..
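partitionBy only applies to a pair RDD of (key, value) tuples, and its second argument is a partitioning function rather than a column name, which is likely why the call above misbehaves. A minimal sketch, assuming count.csv is comma-separated with the header shown and that rows should be keyed on Hourly_Counts (the last field):

rdd = sc.textFile('count.csv')

# Drop the header row, then split each CSV line into fields.
header = rdd.first()
rows = rdd.filter(lambda line: line != header).map(lambda line: line.split(','))

# Key each record by Hourly_Counts so partitionBy has something to hash on.
pairs = rows.map(lambda fields: (fields[-1], fields))

x = pairs.partitionBy(10)   # hash-partition by key into 10 partitions
x.getNumPartitions()
x.take(1)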

Read more

Is there a way to add '****' at the end of each row of my data frame? For example,

detail orientated
marketing and events
community development consultant
leader medical assistant
instructor
android developerator project
natural hair care specialist

should turn into

detail orientated ****
marketing and events ****
community development consultant ****
leader medical assistant ..
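A minimal sketch, assuming the DataFrame has a single string column named "title" (a hypothetical column name, substitute the real one):

from pyspark.sql import functions as F

# Append " ****" to every value in the column.
# NOTE: "title" is an assumed column name - replace it with the actual one.
df = df.withColumn("title", F.concat(F.col("title"), F.lit(" ****")))
df.show(truncate=False)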

Read more