Splitting a huge dataframe into smaller dataframes and writing them to files using Spark (Python)

Question

asked Jun 12, 2022 in Education by JackTerrance

I am loading a 5 GB compressed file into memory (on AWS), creating a Spark dataframe from it, and trying to split it into smaller dataframes based on the values of 2 columns. Eventually I want to write each of these subsets to its own file. I have just started experimenting with Spark and am still getting used to its data structures.

The approach I was trying to follow was something like this:

1. Read the file.
2. Sort it by the 2 columns (I am still not familiar with repartitioning and do not know if it will help).
3. Identify the unique list of all values of those 2 columns.
4. Iterate through this list:
   - create smaller dataframes by filtering on the values in the list
   - write them to files

    df.sort("DEVICE_TYPE", "PARTNER_POS")
    df.registerTempTable("temp")
    grp_col = sqlContext.sql("SELECT DEVICE_TYPE, PARTNER_POS FROM temp GROUP BY DEVICE_TYPE, PARTNER_POS")
    print(grp_col)

I believe there are cleaner and more efficient ways of doing this. I need to write the output to files because there are ETLs that get kicked off in parallel based on it. Any recommendations?

1 Answer

JackTerrance · Answer 1 · 2022-06-12T17:13:58+0000

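If it's okay for the subsets to be nested in a directory hierarchy, then you should consider using Spark's built-in partitioning:

    df.write.partitionBy("device_type", "partner_pos") \
        .json("/path/to/root/output/dir")

Each distinct (device_type, partner_pos) pair is written to its own subdirectory of the form device_type=<value>/partner_pos=<value>, so there is no need to sort the dataframe, collect the unique values, or filter in a loop.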
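Since the downstream ETLs are started from the written output, it may help to see how a consumer could locate each subset. The following is only a sketch using the Python standard library: the helper name iter_partitions is hypothetical, and it assumes the output root is on a locally mounted filesystem (not an object store such as S3), that the directory names use the column names exactly as passed in, and that the partition values contain no characters that Spark would escape in a path.

```python
from pathlib import Path
from typing import Dict, Iterator, List, Sequence, Tuple


def iter_partitions(
    root: str,
    columns: Sequence[str] = ("device_type", "partner_pos"),
) -> Iterator[Tuple[Dict[str, str], List[Path]]]:
    """Yield (partition values, data files) for each leaf partition directory.

    Hypothetical helper: assumes the layout root/<col1>=<v1>/<col2>=<v2>/part-*
    that partitionBy produces, with unescaped partition values.
    """
    # Build a glob such as "device_type=*/partner_pos=*".
    pattern = "/".join(f"{col}=*" for col in columns)
    for leaf in sorted(Path(root).glob(pattern)):
        if not leaf.is_dir():
            continue
        # The trailing path components each look like "column=value".
        values = dict(part.split("=", 1) for part in leaf.parts[-len(columns):])
        # Keep only data files; skip markers such as _SUCCESS and .crc files.
        files = sorted(
            p for p in leaf.iterdir()
            if p.is_file() and p.name.startswith("part-")
        )
        yield values, files


if __name__ == "__main__":
    for values, files in iter_partitions("/path/to/root/output/dir"):
        print(values, [f.name for f in files])
```

Note that the partition columns are encoded in the directory names rather than stored in the data files, so a consumer that needs device_type or partner_pos has to take them from the path, as the helper does.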