I am loading a 5 GB compressed file into memory on AWS, creating a DataFrame in Spark, and trying to split it into smaller DataFrames based on the values of 2 columns. Eventually I want to write each of these subsets to its own file.
I have just started experimenting with Spark and am still getting used to its data structures. The approach I was trying to follow is something like this:
1. Read the file.
2. Sort it by the 2 columns (I am still not familiar with repartitioning and do not know if it will help).
3. Identify the unique list of all value pairs for those 2 columns.
4. Iterate through this list, create a smaller DataFrame for each pair by filtering on its values, and write each one to a file.
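df = df.sort("DEVICE_TYPE", "PARTNER_POS")  # sort() returns a new DataFrame, so keep the result
df.registerTempTable("temp")
grp_col = sqlContext.sql("SELECT DEVICE_TYPE, PARTNER_POS FROM temp GROUP BY DEVICE_TYPE, PARTNER_POS")
grp_col.show()  # print(grp_col) would only display the schema, not the rows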
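For reference, the remaining steps (iterate through the pairs, filter, write) might look roughly like the sketch below. It assumes Spark 2.0 or later (for the built-in CSV writer) and CSV output; `subset_path`, `write_subsets`, and the `s3://my-bucket/out` location are hypothetical names made up for illustration, not anything from my actual job.

```python
def subset_path(base_dir, device_type, partner_pos):
    """Build the output directory for one (DEVICE_TYPE, PARTNER_POS) pair.

    Hypothetical helper: the directory layout is an assumption.
    """
    return "{}/DEVICE_TYPE={}/PARTNER_POS={}".format(
        base_dir.rstrip("/"), device_type, partner_pos
    )


def write_subsets(df, base_dir):
    """Write one CSV directory per distinct (DEVICE_TYPE, PARTNER_POS) pair."""
    # Imported here so the path helper above can be used without PySpark.
    from pyspark.sql import functions as F

    # Collect the distinct pairs to the driver; assumes this list is small.
    pairs = df.select("DEVICE_TYPE", "PARTNER_POS").distinct().collect()

    for row in pairs:
        # Note: a plain == comparison never matches rows where a key is NULL.
        subset = df.filter(
            (F.col("DEVICE_TYPE") == row["DEVICE_TYPE"])
            & (F.col("PARTNER_POS") == row["PARTNER_POS"])
        )
        out_dir = subset_path(base_dir, row["DEVICE_TYPE"], row["PARTNER_POS"])
        # Each write is a separate Spark job over the source DataFrame.
        subset.write.mode("overwrite").csv(out_dir, header=True)


# Hypothetical usage:
# write_subsets(df, "s3://my-bucket/out")
```

Since every filter-and-write pass is its own job, this rescans the input once per pair unless the DataFrame is cached first, which is part of why I suspect the approach is inefficient.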
I believe there are cleaner and more efficient ways of doing this. I need to write the subsets to files because there are ETLs that get kicked off in parallel based on the output. Any recommendations?