Most viewed questions in Apache Spark

0 votes
1 answer

Pyspark dataframe with random values

Hey @Esha, you can use this code. ...READ MORE

Aug 1, 2019 in Apache Spark by Zed
8,506 views
0 votes
1 answer

How to increase worker timeout in Spark application?

By default, the timeout is set to ...READ MORE

Mar 25, 2019 in Apache Spark by Hari
8,417 views
0 votes
1 answer

Passing condition dynamically to Spark application.

You can try this: d.filter(col("value").isin(desiredThings: _*)) and if you ...READ MORE

Feb 19, 2019 in Apache Spark by Omkar
• 69,210 points
8,388 views
0 votes
1 answer

How to add third party java jars for use in PySpark?

You can add external jars as arguments ...READ MORE

Jul 4, 2018 in Apache Spark by nitinrawat895
• 11,380 points

edited Nov 19, 2021 by Sarfaraz 8,340 views
0 votes
1 answer

How fault tolerance is achieved in Apache Spark?

Hey, In Apache Spark, the data storage model is ...READ MORE

Jul 22, 2019 in Apache Spark by Gitika
• 65,910 points
8,167 views
0 votes
1 answer

Is there any way to check the Spark version?

There are 2 ways to check the ...READ MORE

Apr 19, 2018 in Apache Spark by nitinrawat895
• 11,380 points
8,030 views
+1 vote
1 answer

How to convert pyspark Dataframe to pandas Dataframe?

Hi@akhtar, To convert pyspark dataframe into pandas dataframe, ...READ MORE

May 7, 2020 in Apache Spark by MD
• 95,440 points
7,948 views
+1 vote
1 answer

How can I write a text file in HDFS not from an RDD, in Spark program?

Yes, you can go ahead and write ...READ MORE

May 29, 2018 in Apache Spark by Shubham
• 13,490 points
7,946 views
0 votes
1 answer

Join in RDD using keys

Suppose you have two dataset results( id, ...READ MORE

Aug 2, 2019 in Apache Spark by Trisha
7,945 views
0 votes
1 answer

What does the command df.registerTempTable() do?

df.registerTempTable(“airports”) This command is used to register ...READ MORE

Jul 14, 2019 in Apache Spark by James
7,854 views
0 votes
1 answer

Spark - repartition() vs coalesce()

It avoids a full shuffle. If it's ...READ MORE

Oct 11, 2018 in Apache Spark by nitinrawat895
• 11,380 points
7,826 views
0 votes
1 answer

Spark: Error while instantiating "org.apache.spark.sql.hive.HiveSessionState"

Seems like you have not started the ...READ MORE

Jul 25, 2019 in Apache Spark by Rohit
7,763 views
0 votes
0 answers

Why doesn't my Spark Yarn client runs on all available worker machines?

I am running an application on Spark ...READ MORE

Feb 22, 2019 in Apache Spark by Uzair Ahmad

edited Feb 22, 2019 by Omkar 7,696 views
0 votes
1 answer

How to find max value in pair RDD?

Use Array.maxBy method: val a = Array(("a",1), ("b",2), ...READ MORE

May 26, 2018 in Apache Spark by nitinrawat895
• 11,380 points
7,663 views
0 votes
1 answer

Efficient way to read specific columns from parquet file in spark

As parquet is a column based storage ...READ MORE

Apr 20, 2018 in Apache Spark by kurt_cobain
• 9,390 points
7,319 views
0 votes
1 answer

How to work with Matrix Multiplication in Apache Spark?

Hey, You can follow this below solution for ...READ MORE

Jul 31, 2019 in Apache Spark by Gitika
• 65,910 points
7,298 views
0 votes
1 answer

Spark to check if a particular string exists in a file

You can use this: lines = sc.textFile(“hdfs://path/to/file/filename.txt”); def isFound(line): if ...READ MORE

Mar 15, 2019 in Apache Spark by Raj
7,056 views
0 votes
1 answer

Spark error: Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

Give  read-write permissions to  C:\tmp\hive folder Cd to winutils bin folder ...READ MORE

Jul 11, 2019 in Apache Spark by Rajiv
7,016 views
0 votes
2 answers

java.lang.StringIndexOutOfBoundsException: String index out of range: 1

When using the Java substring() method, a ...READ MORE

Mar 13, 2020 in Apache Spark by evanbung
• 180 points
6,993 views
0 votes
1 answer

Py4JJavaError: An error occurred while calling o310.csv. : java.net.ConnectException: Call From master/192.168.56.101 to master:9000

Hi@akhtar, I think your HDFS cluster is not ...READ MORE

May 7, 2020 in Apache Spark by MD
• 95,440 points
6,962 views
0 votes
1 answer

When not to use foreachPartition and mapPartition?

With mapPartion() or foreachPartition(), you can only ...READ MORE

Apr 30, 2018 in Apache Spark by Data_Nerd
• 2,390 points
6,683 views
0 votes
1 answer

Spark comparing two big data files using scala

Try this and see if this does ...READ MORE

Apr 2, 2019 in Apache Spark by Omkar
• 69,210 points
6,645 views
0 votes
1 answer

How to increase Spark listener bus event queue capacity?

The default capacity of listener bus is ...READ MORE

Mar 12, 2019 in Apache Spark by Raj
6,516 views
0 votes
1 answer

Languages supported by Apache Spark?

Apache Spark supports the following four languages:  Scala, ...READ MORE

Sep 3, 2018 in Apache Spark by nitinrawat895
• 11,380 points
6,507 views
0 votes
1 answer

What is Executor Memory in a Spark application?

Every spark application has same fixed heap ...READ MORE

Jan 5, 2019 in Apache Spark by Frankie
• 9,830 points
6,197 views
0 votes
1 answer

Removing the header of a text file in SparkRDD

1) First we loaded the data to ...READ MORE

Jul 31, 2019 in Apache Spark by Namitha
6,182 views
0 votes
1 answer

Difference between map() and mapPartitions() function in Spark.

Hi@ akhtar, Both map() and mapPartitions() are the ...READ MORE

Jan 29, 2020 in Apache Spark by MD
• 95,440 points
6,094 views
0 votes
1 answer

How Foreach Operation works in Apache Spark?

Hi, foreach() operation is an action. It does not ...READ MORE

Aug 2, 2019 in Apache Spark by Gitika
• 65,910 points
6,012 views
0 votes
1 answer

env: ‘python’: No such file or directory in pyspark.

Hi@akhtar, This error occurs because your python version ...READ MORE

Apr 7, 2020 in Apache Spark by MD
• 95,440 points
5,927 views
0 votes
1 answer

"main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

1. We will check whether master and ...READ MORE

Jul 29, 2019 in Apache Spark by Yogi
5,866 views
0 votes
1 answer

How SortBykey() operation works in Spark?

Hey, sortByKey() is a transformation. It returns an RDD sorted ...READ MORE

Aug 2, 2019 in Apache Spark by Gitika
• 65,910 points
5,812 views
0 votes
1 answer

Which File System is supported by Apache Spark?

Hi, Apache Spark is an advanced data processing ...READ MORE

Jul 5, 2019 in Apache Spark by Gitika
• 65,910 points
5,758 views
0 votes
2 answers

In a Spark DataFrame how can I flatten the struct?

// Collect data from input avro file ...READ MORE

Jul 4, 2019 in Apache Spark by Dhara dhruve
5,721 views
0 votes
1 answer

Date formats : how to cast string to date?

Try this, it should work: > from pyspark.sql.functions ...READ MORE

Jul 29, 2019 in Apache Spark by Niall
5,713 views
0 votes
1 answer

Spark Null Pointer Exception.

I used Spark 1.5.2 with Hadoop 2.6 ...READ MORE

Jul 19, 2019 in Apache Spark by ravikiran
• 4,620 points
5,713 views
0 votes
1 answer

When running Spark on Yarn, do I need to install Spark on all nodes of Yarn Cluster?

No, it is not necessary to install ...READ MORE

Jun 14, 2018 in Apache Spark by nitinrawat895
• 11,380 points
5,713 views
+1 vote
1 answer

Primary keys in Apache Spark

import sqlContext.implicits._ import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructType, StructField, LongType} val df ...READ MORE

Aug 9, 2019 in Apache Spark by ravikiran
• 4,620 points
5,682 views
+1 vote
1 answer

How do I turn off INFO Logging in Spark?

Hi, You need to edit one property in ...READ MORE

Jul 12, 2019 in Apache Spark by ravikiran
• 4,620 points

edited Dec 20, 2020 by MD 5,644 views
0 votes
1 answer

1)Given sfpd RDD, to create a pair RDD consisting of tuples of the form (Category. 1) in scala ,which of the following is used?

Hi, @Ritu, When creating a pair RDD from ...READ MORE

Nov 23, 2020 in Apache Spark by Gitika
• 65,910 points
5,609 views
+1 vote
1 answer

How to add package com.databricks.spark.avro in spark?

Start spark shell using below line of ...READ MORE

Jul 10, 2019 in Apache Spark by Jishnu
5,571 views
0 votes
1 answer

How to call the Debug Mode in PySpark?

As far as I understand your intentions ...READ MORE

Jul 26, 2019 in Apache Spark by ravikiran
• 4,620 points
5,553 views
0 votes
1 answer

error: identified expected but integer literal found.

Hi, You can resolve this error with a ...READ MORE

Jul 4, 2019 in Apache Spark by Gitika
• 65,910 points
5,549 views
0 votes
1 answer

Persistence Levels in Spark

Spark has various persistence levels to store ...READ MORE

Jun 8, 2018 in Apache Spark by kurt_cobain
• 9,390 points
5,517 views
0 votes
1 answer

How to compute the square root of sum of squares of numbers?

Hey, You need to follow some steps to complete ...READ MORE

Jul 23, 2019 in Apache Spark by Gitika
• 65,910 points
5,473 views
+1 vote
2 answers

Hadoop 3 compatibility with older versions of Hive, Pig, Sqoop and Spark

Hadoop 3 is not widely used in ...READ MORE

Apr 20, 2018 in Apache Spark by kurt_cobain
• 9,390 points
5,470 views
0 votes
1 answer

How to append a list in Scala?

Hey, For this purpose, we use the single ...READ MORE

Jul 26, 2019 in Apache Spark by Gitika
• 65,910 points
5,432 views
0 votes
1 answer

Read multiple xml files in Spark

You can do this using globbing. See ...READ MORE

Jul 25, 2019 in Apache Spark by Jack
5,302 views
0 votes
1 answer

Can anyone explain the sparse vector in Spark?

Hey, A sparse vector is used for storing ...READ MORE

Aug 2, 2019 in Apache Spark by Gitika
• 65,910 points
5,272 views
0 votes
0 answers

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable [closed]

Hi All I am running Scala program on ...READ MORE

May 5, 2019 in Apache Spark by Vishal

closed May 6, 2019 by Omkar 5,268 views