Most viewed questions in Apache Spark

0 votes
3 answers

Filtering a row in Spark DataFrame based on matching values from a list

Use the function as follows: var notFollowingList=List(9.8,7,6,3 ...READ MORE

Jun 5, 2018 in Apache Spark by Shubham
• 13,480 points
77,637 views
+2 votes
14 answers

How to create new column with function in Spark Dataframe?

val coder: (Int => String) = v ...READ MORE

Apr 4, 2019 in Apache Spark by anonymous

edited Apr 5, 2019 by Omkar
68,056 views
+5 votes
11 answers

Concatenate columns in apache spark dataframe

It's late, but this is how you can ...READ MORE

Mar 21, 2019 in Apache Spark by anonymous
62,277 views
+1 vote
8 answers

How to replace null values in Spark DataFrame?

Hi, in Spark, the fill() function of the DataFrameNaFunctions class is used to replace ...READ MORE

Dec 15, 2020 in Apache Spark by MD
• 95,140 points
56,361 views
0 votes
5 answers

How to change the SparkSession configuration in PySpark?

You aren't actually overwriting anything with this ...READ MORE

Dec 13, 2020 in Apache Spark by Gitika
• 65,870 points
54,331 views
+1 vote
6 answers

groupByKey vs reduceByKey in Apache Spark.

ReduceByKey is the best for production. READ MORE

Mar 3, 2019 in Apache Spark by anonymous
43,795 views
+1 vote
8 answers

How to print the contents of RDD in Apache Spark?

Save it to a text file: line.saveAsTextFile("alicia.txt") Print contains ...READ MORE

Dec 10, 2018 in Apache Spark by Akshay
42,770 views
+1 vote
3 answers

What is the difference between RDDs and DataFrames in Apache Spark?

Comparison between Spark RDD vs DataFrame 1. Release ...READ MORE

Aug 27, 2018 in Apache Spark by shams
• 3,660 points
35,276 views
+2 votes
4 answers

Use the length function in substring in Spark

You can use the function expr: val data ...READ MORE

May 3, 2018 in Apache Spark by kurt_cobain
• 9,390 points
33,289 views
+1 vote
2 answers

Spark: Dataframe vs Dataset

Recently, there are two new data abstractions ...READ MORE

Jul 29, 2019 in Apache Spark by Jackie
28,482 views
+1 vote
3 answers

map() vs flatMap() in Spark

Spark map function expresses a one-to-one transformation. ...READ MORE

Jun 17, 2019 in Apache Spark by vishal
• 180 points
25,799 views
+1 vote
1 answer

Reading a text file through a Spark DataFrame

Try this: val df = sc.textFile("HDFS://nameservice1/user/edureka_168049/Structure_IT/samplefile.txt") df.collect() val df = ...READ MORE

Jul 24, 2019 in Apache Spark by Suri
20,305 views
0 votes
1 answer

org.apache.spark.sql.AnalysisException: cannot resolve given input columns

The string Productivity has to be enclosed between single ...READ MORE

Jul 10, 2019 in Apache Spark by Tina
19,766 views
0 votes
1 answer

What's the difference between 'filter' and 'where' in Spark SQL?

Both 'filter' and 'where' in Spark SQL ...READ MORE

May 23, 2018 in Apache Spark by nitinrawat895
• 11,380 points
19,579 views
0 votes
1 answer

How to find the number of elements present in the array in a Spark DataFrame column?

You can select the column and apply ...READ MORE

Jun 5, 2018 in Apache Spark by Shubham
• 13,480 points
15,112 views
0 votes
3 answers

How to transpose Spark DataFrame?

Please check the below mentioned links for ...READ MORE

Dec 31, 2018 in Apache Spark by anonymous
14,854 views
0 votes
3 answers

Sorting rows in descending order in Spark SQL

df.orderBy($"col".desc) - this works as well READ MORE

Jul 5, 2020 in Apache Spark by Sai
• 160 points
14,317 views
0 votes
1 answer

org.apache.spark.sql.AnalysisException: cannot resolve "`id`" given input columns

I have used a header-less csv file ...READ MORE

Jul 13, 2019 in Apache Spark by Puneet
13,982 views
0 votes
1 answer

How can I remove headers from a DataFrame?

You can use filter to do this. ...READ MORE

Feb 14, 2019 in Apache Spark by Aryan
13,163 views
+1 vote
2 answers

How can I convert Spark Dataframe to Spark RDD?

Assuming your RDD[Row] is called rdd, you ...READ MORE

Jul 9, 2018 in Apache Spark by zombie
• 3,790 points
12,788 views
+1 vote
1 answer

Is there an efficient way of dealing with null values in the concat function of pyspark.sql version 2.3.4?

When you concatenate any string with a ...READ MORE

Nov 6, 2019 in Apache Spark by Rishi
11,404 views
0 votes
1 answer

Changing column position in a Spark DataFrame

Yes, you can reorder the dataframe elements. You need ...READ MORE

Apr 19, 2018 in Apache Spark by Ashish
• 2,650 points
10,105 views
+1 vote
1 answer

map vs mapValues in Spark

There is a difference between the two: mapValues ...READ MORE

Jun 29, 2018 in Apache Spark by nitinrawat895
• 11,380 points
9,994 views
0 votes
1 answer

Why do we use sc.parallelize?

Spark revolves around the concept of a ...READ MORE

Jul 11, 2019 in Apache Spark by Suman
9,722 views
0 votes
1 answer

Cache() vs persist() in Spark

cache() uses only the default storage level ...READ MORE

Mar 8, 2019 in Apache Spark by Raj
8,909 views
0 votes
1 answer

Is it better to have one large parquet file or lots of smaller parquet files?

Ideally, you would use snappy compression (default) ...READ MORE

May 23, 2018 in Apache Spark by nitinrawat895
• 11,380 points
8,774 views
0 votes
3 answers

Can anyone explain fold() operation in Spark?

Fold in Spark: fold is a very powerful ...READ MORE

Aug 22, 2018 in Apache Spark by samarth295
• 2,220 points
8,264 views
0 votes
2 answers

Difference between createOrReplaceTempView and registerTempTable

I am pretty sure createOrReplaceTempView just replaced ...READ MORE

Sep 18, 2020 in Apache Spark by Nathan Mott
7,811 views
0 votes
1 answer

How to save and retrieve the Spark RDD from HDFS?

You can save the RDD using saveAsObjectFile and saveAsTextFile method. ...READ MORE

May 29, 2018 in Apache Spark by Shubham
• 13,480 points
7,691 views
+2 votes
2 answers

py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM

Using findspark is expected to solve the ...READ MORE

Jun 20, 2020 in Apache Spark by suvasish
7,513 views
0 votes
1 answer

Spark - repartition() vs coalesce()

It avoids a full shuffle. If it's ...READ MORE

Oct 11, 2018 in Apache Spark by nitinrawat895
• 11,380 points
6,960 views
+1 vote
2 answers

Apache Spark vs Apache Spark 2

Spark 2 doesn't differ much architecture-wise from ...READ MORE

Apr 24, 2018 in Apache Spark by kurt_cobain
• 9,390 points
6,877 views
0 votes
3 answers

Lineage Graph in Spark

Whenever a series of transformations are performed ...READ MORE

Aug 27, 2018 in Apache Spark by shams
• 3,660 points
6,525 views
0 votes
3 answers

I don't understand the reason behind Spark RDD being immutable.

There are a few reasons for keeping RDD ...READ MORE

Apr 18, 2019 in Apache Spark by santlal561987@gmail.com
6,489 views
0 votes
1 answer

Spark: java.sql.SQLException: No suitable driver

The missing driver is the JDBC one ...READ MORE

Jul 24, 2019 in Apache Spark by John
6,488 views
0 votes
1 answer

What are the Spark job, Spark task, and Spark stage?

In a Spark application, when you invoke ...READ MORE

Mar 18, 2019 in Apache Spark by Pavan
6,452 views
0 votes
1 answer

Cannot load file to spark: "org.apache.spark.sql.AnalysisException: Path does not exist"

Since the file is in HDFS, ...READ MORE

Jul 31, 2019 in Apache Spark by Tina
6,435 views
0 votes
1 answer

Difference between cogroup and full outer join in spark

Please go through the explanation below: Full ...READ MORE

Jul 13, 2019 in Apache Spark by Kiran
6,356 views
+1 vote
1 answer

is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [51, 53, 10, 10]

Hi @akhtar, here you are trying to read a ...READ MORE

Feb 3, 2020 in Apache Spark by MD
• 95,140 points
5,712 views
0 votes
1 answer

How to find max value in pair RDD?

Use Array.maxBy method: val a = Array(("a",1), ("b",2), ...READ MORE

May 25, 2018 in Apache Spark by nitinrawat895
• 11,380 points
5,616 views
0 votes
1 answer

How to add third party java jars for use in PySpark?

You can add external jars as arguments ...READ MORE

Jul 4, 2018 in Apache Spark by nitinrawat895
• 11,380 points
5,606 views
0 votes
2 answers

How to use RDD filter with other function?

val x = sc.parallelize(1 to 10, 2)   // ...READ MORE

Aug 16, 2018 in Apache Spark by zombie
• 3,790 points
5,525 views
0 votes
1 answer

When not to use foreachPartition and mapPartitions?

With mapPartitions() or foreachPartition(), you can only ...READ MORE

Apr 30, 2018 in Apache Spark by Data_Nerd
• 2,390 points
5,240 views
0 votes
1 answer

Spark Error: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

There seems to be a problem with ...READ MORE

May 24, 2019 in Apache Spark by Jishan
5,226 views
0 votes
1 answer

PySpark DataFrame with random values

Hey @Esha, you can use this code. ...READ MORE

Aug 1, 2019 in Apache Spark by Zed
5,149 views
+1 vote
1 answer

How can I write a text file in HDFS not from an RDD, in Spark program?

Yes, you can go ahead and write ...READ MORE

May 29, 2018 in Apache Spark by Shubham
• 13,480 points
5,080 views
0 votes
1 answer

Passing condition dynamically to Spark application.

You can try this: d.filter(col("value").isin(desiredThings: _*)) and if you ...READ MORE

Feb 19, 2019 in Apache Spark by Omkar
• 69,110 points
5,048 views
+1 vote
1 answer

How to convert a PySpark DataFrame to a pandas DataFrame?

Hi @akhtar, to convert a PySpark DataFrame into a pandas DataFrame, ...READ MORE

May 7, 2020 in Apache Spark by MD
• 95,140 points
4,935 views
0 votes
1 answer

ImportError: No module named 'pyspark'

Hi @akhtar, by default pyspark is not present in ...READ MORE

May 6, 2020 in Apache Spark by MD
• 95,140 points
4,872 views
0 votes
1 answer

Spark, Scala: Load custom delimited file

You can load a DAT file into ...READ MORE

Jul 16, 2019 in Apache Spark by Shri
4,841 views