How does partitioning work in Spark

0 votes
I'm trying to understand how partitioning is done in Apache Spark. Can anyone help please?

Here is the scenario:

a master and two nodes with 1 core each
a file sample.txt of 10 MB in size
How many partitions does the following create?

rdd = sc.textFile("sample.txt")
Does the size of the file have any impact on the number of partitions?
May 31, 2018 in Apache Spark by coldcode

1 answer to this question.

0 votes
By default, Spark creates one partition for each HDFS block of the file (from the Spark Programming Guide). The default block size was 64 MB in Hadoop 1.x and is 128 MB in Hadoop 2.x and later.
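To make the block-based default concrete, here is a minimal sketch in plain Python (no Spark required). The helper name hdfs_block_count is hypothetical, and the block size is a parameter because it depends on the Hadoop version and cluster configuration. It only counts blocks; the minimum-partitions setting of textFile can raise the actual partition count above this number.

```python
import math

MB = 1024 * 1024

def hdfs_block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of HDFS blocks a file of the given size occupies (hypothetical helper)."""
    # A non-empty file always occupies at least one block.
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

print(hdfs_block_count(10 * MB))             # 1: a 10 MB file fits in one block
print(hdfs_block_count(200 * MB))            # 4: with 64 MB blocks
print(hdfs_block_count(200 * MB, 128 * MB))  # 2: with 128 MB blocks
```

So the file size only affects the block-based count once the file is larger than a single block.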

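textFile also accepts an optional second parameter, minPartitions, which sets the minimum number of partitions that Spark will create. If you don't pass it, Spark uses sc.defaultMinPartitions, which is the smaller of spark.default.parallelism and 2.

On a cluster, spark.default.parallelism defaults to the total number of cores across all executor nodes (or 2, whichever is larger). The master does not run tasks, so your two single-core workers give a default parallelism of 2, and I believe a 10 MB file would be read into 2 partitions in your case. The file size starts to matter once the file spans more than one block, because Spark cannot create fewer partitions than there are blocks.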
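The effect of the minimum-partitions setting can be approximated with a simplified model of the split calculation in Hadoop's FileInputFormat, which textFile relies on. This is a sketch under assumptions, not Spark's actual code: the function name estimate_splits and its parameters are hypothetical, and it covers a single non-empty, uncompressed file only.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size

def estimate_splits(file_size, min_partitions=2, block_size=128 * MB, min_split_size=1):
    """Estimate the number of input splits (partitions) for one file.

    Simplified model: single non-empty, splittable file; sizes in bytes.
    """
    if file_size <= 0:
        raise ValueError("this sketch assumes a non-empty file")
    # Target size per split if the file were divided evenly.
    goal_size = file_size // max(min_partitions, 1)
    # A split is never larger than a block and never smaller than the minimum.
    split_size = max(min_split_size, min(goal_size, block_size))
    splits, remaining = 0, file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining:
        splits += 1  # the tail becomes the final split
    return splits

print(estimate_splits(10 * MB, min_partitions=2))   # 2
print(estimate_splits(10 * MB, min_partitions=3))   # 3
print(estimate_splits(300 * MB, min_partitions=2))  # 3, bounded by the 128 MB block size
```

For a small file the minimum-partitions value drives the result; for a large file the block size does.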

You can also repartition or coalesce an RDD to change its number of partitions, which in turn determines how much parallelism is available.
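To check these numbers on a real cluster, something like the following PySpark sketch could be used. The helper name partition_report and the values 8, 4 and 1 are illustrative assumptions; the calls themselves (textFile, getNumPartitions, repartition, coalesce, defaultParallelism, defaultMinPartitions) are standard SparkContext and RDD members.

```python
def partition_report(sc, path):
    """Collect partition counts for a text file under a few settings.

    `sc` is expected to be a live pyspark.SparkContext (assumption).
    """
    rdd = sc.textFile(path)                     # falls back to sc.defaultMinPartitions
    wider = sc.textFile(path, minPartitions=8)  # ask for more input splits
    return {
        "defaultParallelism": sc.defaultParallelism,
        "defaultMinPartitions": sc.defaultMinPartitions,
        "textFile": rdd.getNumPartitions(),
        "textFile_min8": wider.getNumPartitions(),
        "repartition_4": rdd.repartition(4).getNumPartitions(),  # full shuffle
        "coalesce_1": rdd.coalesce(1).getNumPartitions(),        # merges without a shuffle
    }

# Typical usage (requires a Spark installation):
#   from pyspark import SparkContext
#   sc = SparkContext(appName="partition-demo")
#   print(partition_report(sc, "sample.txt"))
#   sc.stop()
```

repartition can increase or decrease the partition count but always shuffles the data, while coalesce is the cheaper choice when you only want fewer partitions.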
Hope it will answer your query to some extent.
answered May 31, 2018 by nitinrawat895
