Spark Question
Shuffling is the process of redistributing data across partitions, which may move data between executors. The shuffle operation is implemented differently in Spark than in Hadoop.
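To make the idea of redistribution concrete, here is a minimal pure-Python sketch of a hash-based shuffle followed by a per-key sum. It is a toy model, not Spark's actual implementation: the function names (`partition_for`, `shuffle`, `reduce_by_key`), the CRC32-based partitioner, and the sample records are all assumptions chosen for illustration, and "moved" only counts records whose partition index changes.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash, so the result does not depend on Python's per-process hash seed.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(input_partitions, num_partitions):
    """Redistribute (key, value) records so equal keys share one output partition."""
    output = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, records in enumerate(input_partitions):
        for key, value in records:
            target = partition_for(key, num_partitions)
            if target != source_index:
                moved += 1  # this record would have to travel to another partition
            output[target].append((key, value))
    return output, moved


def reduce_by_key(partitions):
    """Sum values per key inside each partition (safe once keys are co-located)."""
    result = []
    for records in partitions:
        totals = defaultdict(int)
        for key, value in records:
            totals[key] += value
        result.append(dict(totals))
    return result


# Hypothetical input: key "a" starts out in two different partitions.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]
shuffled, moved = shuffle(input_partitions, num_partitions=2)
totals = reduce_by_key(shuffled)
```

After the shuffle, every key lives in exactly one partition, which is what lets a per-key aggregation run independently on each partition.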
Shuffling has two important compression parameters:
spark.shuffle.compress – controls whether the engine compresses shuffle output files (default: true)
spark.shuffle.spill.compress – controls whether intermediate shuffle spill files are compressed (default: true)
Shuffling occurs when joining two tables or when performing byKey operations such as groupByKey or reduceByKey.
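One way to set the two compression parameters is through `--conf` flags on `spark-submit`. The sketch below only builds that command line as a list of arguments and does not launch anything; the helper name `to_submit_args` and the script name `job.py` are hypothetical placeholders.

```python
# Shuffle compression settings named above; both are enabled here.
shuffle_settings = {
    "spark.shuffle.compress": True,
    "spark.shuffle.spill.compress": True,
}


def to_submit_args(settings):
    """Turn a {property: value} mapping into spark-submit --conf arguments."""
    args = []
    for name, value in settings.items():
        # Render booleans as lowercase true/false, the form used in Spark's docs.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{name}={text}"]
    return args


submit_args = to_submit_args(shuffle_settings)
# "job.py" is a placeholder for the application being submitted.
command = ["spark-submit", *submit_args, "job.py"]
print(" ".join(command))
```

The same property names can also be set in code on the session or context configuration before it is created, which may be more convenient for notebooks or tests.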