I am loading a 5 GB compressed file into memory on AWS, creating a DataFrame in Spark, and trying to split it into smaller DataFrames based on the values of two columns. Eventually I want to write each of these subsets to its own file. I have only just started experimenting with Spark and am still getting used to its data structures.

The approach I was trying to follow was something like this:

read the file
sort it by the two columns (I am not yet familiar with repartitioning and do not know whether it would help)
identify the unique list of all value combinations of those two columns
iterate through this list
-- create smaller DataFrames by filtering on the values in the list
-- write them to files

df = df.sort("DEVICE_TYPE", "PARTNER_POS")
df.registerTempTable("temp")
grp_col = sqlContext.sql("SELECT DEVICE_TYPE, PARTNER_POS FROM temp GROUP BY DEVICE_TYPE, PARTNER_POS")
print(grp_col)

I believe there are cleaner and more efficient ways of doing this. I need to write the output to files because there are ETLs that get kicked off in parallel based on it. Any recommendations?

1 Answer

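If it's okay for the subsets to be nested in a directory hierarchy, then you should consider using Spark's built-in partitioning:

df.write.partitionBy("device_type", "partner_pos").json("/path/to/root/output/dir")

This writes one subdirectory per distinct combination of the two columns (device_type=<value>/partner_pos=<value>), so there is no need to sort the data or to loop over the distinct values and filter by hand.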
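Since the downstream ETLs need to find their inputs, here is a minimal sketch of how the partitioned output could be discovered afterwards. It uses only the Python standard library and relies on Spark's key=value directory naming. The function name list_partitions and its defaults are hypothetical, not part of Spark. Column names are compared case-insensitively because the question uses upper-case names and the answer lower-case ones. The sketch also assumes the column values contain no characters that Spark escapes in directory names.

```python
import os


def list_partitions(root, keys=("device_type", "partner_pos")):
    """Map each (device_type, partner_pos) value pair to its data files.

    Hypothetical helper: expects the layout root/<key1>=<v1>/<key2>=<v2>/part-*.
    """
    wanted = tuple(k.lower() for k in keys)
    partitions = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # the root itself only holds markers such as _SUCCESS
        parts = rel.split(os.sep)
        if len(parts) != len(wanted):
            continue  # not a leaf partition directory
        pairs = [p.split("=", 1) for p in parts]
        if any(len(kv) != 2 for kv in pairs):
            continue  # directory name is not of the form key=value
        if tuple(k.lower() for k, _ in pairs) != wanted:
            continue  # different partition columns than expected
        # Keep only data files; skip hidden checksum files and markers.
        files = sorted(
            os.path.join(dirpath, name)
            for name in filenames
            if name.startswith("part-")
        )
        if files:
            partitions[tuple(v for _, v in pairs)] = files
    return partitions
```

Calling list_partitions("/path/to/root/output/dir") would then return a dictionary keyed by (device_type, partner_pos) tuples, and each ETL could be handed the file list for its own key.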
