Spark Java Tutorial : Your One Stop Solution to Spark in Java

Java is a widely used programming language in software development, and Scala is the dominant programming language in big-data development. Used together, they make a powerful combination. In this Spark Java tutorial, we shall work with Spark programs in a Java environment. We will cover what Spark-Java is, why it is needed, how to set up the environment, a first example, and a use case on students' performance in examinations.

What is Spark-Java?

In simple terms, Spark-Java is a combined programming approach to big-data problems. Spark is written mainly in Scala, and Scala code compiles to bytecode that runs on the JVM, the same runtime that Java uses. Spark offers APIs in several languages, including Scala, Java, Python and R. Of these, Scala is the most prominent for Spark applications.

The Need for Spark-Java

The majority of software developers are comfortable working with Java at the enterprise level and rarely prefer Scala or similar languages. Spark-Java is an approach that lets these developers build and run Scala-based Spark programs and applications in the Java environment with ease.

Now that we have a brief understanding of Spark Java, let us move on to setting up the environment for it. The procedure is laid out as steps.

Setting up Spark-Java environment

Step 1:

Install the latest versions of the JDK and JRE.

Step 2:

Install the latest version of WinUtils.exe, the helper binary that Hadoop needs on Windows.

Step 3:

Install the latest version of Apache Spark.

Step 4:

Install the latest version of Apache Maven.

Step 5:

Install the latest version of Eclipse Installer.

Step 6:

Install the latest version of Scala IDE.

Step 7:

Set the home variable and the Path entry for each of the following:
- Java
  - Set a new JAVA_HOME variable.
  - Similarly, set the Path for Java Home by editing the Path variable.
- Hadoop
  - Set a new HADOOP_HOME variable.
  - Similarly, set the Path for Hadoop Home by editing the Path variable.
- Spark
  - Set a new SPARK_HOME variable.
  - Similarly, set the Path for Spark Home by editing the Path variable.
- Maven
  - Set a new MAVEN_HOME variable.
  - Similarly, set the Path for Maven Home by editing the Path variable.
- Scala
  - Set a new SCALA_HOME variable.
  - Similarly, set the Path for Scala Home by editing the Path variable.
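
Once the variables are set, it can help to confirm that a newly started process actually sees them. The following is a minimal Java sketch and not part of the original setup: the class name `EnvCheck` is made up for illustration, and the upper-case variable names are an assumption, so change them to whatever names you chose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports which of the expected home variables are not set.
class EnvCheck {
    // Assumed variable names; change them if you named yours differently.
    static final List<String> REQUIRED = Arrays.asList(
            "JAVA_HOME", "HADOOP_HOME", "SPARK_HOME", "MAVEN_HOME", "SCALA_HOME");

    // Returns the names from REQUIRED that are absent or blank in the given environment.
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // System.getenv() returns the environment of the current process.
        List<String> notSet = missing(System.getenv());
        if (notSet.isEmpty()) {
            System.out.println("All home variables are set.");
        } else {
            System.out.println("Not set: " + notSet);
        }
    }
}
```

Run it from a freshly opened command prompt, since prompts that were open before you edited the variables keep their old environment.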

Now you are set with all the requirements to run Apache Spark on Java. Let us try an example of a Spark program in Java.

Examples in Spark-Java

Before executing a Spark example program in a Java environment, we need to complete some prerequisites, listed below as steps.

Step 1:

Open a command prompt and start Spark as a master (for example, by running spark-class org.apache.spark.deploy.master.Master).

Step 2:

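Open a new command prompt and start Spark again, this time as a worker, passing the master's address (for example, spark-class org.apache.spark.deploy.worker.Worker spark://<master-ip>:7077, where 7077 is the default master port).

The master's address is shown on its web UI, which by default is at localhost:8080.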

Step 3:

Open a new command prompt and start the Spark shell, again passing the master's address (for example, spark-shell --master spark://<master-ip>:7077).

Step 4:

Now open the Eclipse Enterprise IDE, set up your workspace and start your project.

Step 5:

Set the Scala nature on your Eclipse IDE and create a new Maven project.
First, we begin with pom.xml.
The following code is the pom.xml file.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>Edureka</groupId>
    <artifactId>ScalaExample</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.2</version>
        </dependency>
        <!-- spark-sql provides SparkSession, which the application imports -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.2</version>
        </dependency>
    </dependencies>
</project>

Step 6:

Begin with your Scala application.
The following code is for the Scala application file.


package ScalaExample

import org.apache.spark.sql.SparkSession

object EdurekaApp {
  def main(args: Array[String]): Unit = {
    val logFile = "C:/spark/README.md" // Should be some file on your system
    // A single SparkSession is the entry point; local[*] runs Spark in-process on all cores
    val spark = SparkSession.builder.appName("EdurekaApp").master("local[*]").getOrCreate()
    // Read the file as a Dataset of lines and cache it, since it is scanned twice
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}

Output:

Lines with a: 62, Lines with b: 31
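
For readers more at home in Java, the filter-and-count logic above can be sketched with plain Java streams. This is an illustration only: it has no Spark dependency, it runs on an in-memory list rather than a distributed dataset, and the class name `LineCount` and the sample lines are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the filter + count pattern (no Spark involved).
class LineCount {
    // Counts the lines that contain the given text, like filter(...).count() above.
    static long countContaining(List<String> lines, String text) {
        return lines.stream().filter(line -> line.contains(text)).count();
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the contents of README.md
        List<String> lines = Arrays.asList("# Apache Spark", "big data", "Build", "");
        long numAs = countContaining(lines, "a");
        long numBs = countContaining(lines, "b");
        System.out.println("Lines with a: " + numAs + ", Lines with b: " + numBs);
    }
}
```

Spark's Java API follows the same filter-then-count shape on a `Dataset<String>`, with the difference that the work is spread across the cluster.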

Now that we have a brief understanding of Spark Java, let us move on to a use case on students' academic performance to learn Spark Java in more depth.

Students Performance in the Examination: Use Case

As in the previous example, let us set up the prerequisites and then begin with the use case. Our use case is about students' performance in examinations conducted on a few important subjects.

Let us look at the code first and then go through its operations one by one.

The following code is the pom.xml file.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>ScalaExample3</groupId>
    <artifactId>Edureka3</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>
        <!-- Spark 2.x reads CSV natively through spark-sql, so the separate
             com.databricks spark-csv_2.11 artifact (built for Scala 2.11) is left out -->
    </dependencies>
</project>

The following code is for the Scala application file.


package ScalaExample

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

object EdurekaApp {
  def main(args: Array[String]): Unit = {
    // A single SparkSession is the entry point; local[*] runs Spark in-process on all cores
    val spark = SparkSession.builder.appName("EdurekaApp3").master("local[*]").getOrCreate()
    val sqlContext = spark.sqlContext

    // Explicit schema: five string columns followed by three integer score columns
    val customizedSchema = StructType(Array(
      StructField("gender", StringType, true),
      StructField("race", StringType, true),
      StructField("parentalLevelOfEducation", StringType, true),
      StructField("lunch", StringType, true),
      StructField("testPreparationCourse", StringType, true),
      StructField("mathScore", IntegerType, true),
      StructField("readingScore", IntegerType, true),
      StructField("writingScore", IntegerType, true)))
    val pathToFile = "C:/Users/Ravikiran/Downloads/students-performance-in-exams/StudentsPerformance.csv"
    // Spark 2.x has a built-in CSV reader, so the short format name "csv" is enough
    val DF = sqlContext.read.format("csv").option("header", "true").schema(customizedSchema).load(pathToFile)
    println("We are starting from here...!")
    DF.rdd.cache()
    DF.rdd.foreach(println)
    DF.printSchema()
    // Register the DataFrame as a temporary view so it can be queried with SQL
    DF.createOrReplaceTempView("Student")
    sqlContext.sql("SELECT * FROM Student").show()
    sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore FROM Student WHERE mathScore > 75").show()
    sqlContext.sql("SELECT race, count(race) FROM Student GROUP BY race").show()
    sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore, readingScore FROM Student").filter("readingScore>90").show()
    sqlContext.sql("SELECT race, parentalLevelOfEducation FROM Student").distinct().show()
    sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore, readingScore FROM Student WHERE mathScore> 75 and readingScore>90").show()
    sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore, readingScore FROM Student").dropDuplicates().show()
    println("We have finished here...!")
    spark.stop()
  }
}

The Spark SQL statements executed above do the following:

Printing out the data using the println function.
```
DF.rdd.foreach(println)
```

Printing the schema that we designed for our data.
```
DF.printSchema()
```

Printing our DataFrame using the SELECT statement.

sqlContext.sql("SELECT * FROM Student").show()

Applying the WHERE clause to print the data of the students who scored more than 75 in maths.

sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore FROM Student WHERE mathScore > 75").show()

Using the GROUP BY and COUNT operations to find the number of students in each group.

sqlContext.sql("SELECT race, count(race) FROM Student GROUP BY race").show()
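
To make the GROUP BY and COUNT semantics concrete, here is a plain-Java sketch of the same aggregation using streams. It does not use Spark, and the class name `GroupCount` and the sample values standing in for the race column are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of GROUP BY with COUNT (no Spark involved).
class GroupCount {
    // Groups equal values together and counts how many times each one occurs.
    // A TreeMap is used only so that the printed result is sorted by key.
    static Map<String, Long> countByGroup(List<String> values) {
        return values.stream()
                .collect(Collectors.groupingBy(v -> v, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up values standing in for the race column
        List<String> races = Arrays.asList(
                "group A", "group B", "group A", "group C", "group B", "group A");
        System.out.println(countByGroup(races));
    }
}
```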

Using the filter operation to find the students who scored more than 90 in reading.

sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore, readingScore FROM Student").filter("readingScore>90").show()

Using the distinct function to find the distinct combinations of race and parental level of education in our data.

sqlContext.sql("SELECT race, parentalLevelOfEducation FROM Student").distinct.show()

Using the AND operator to combine multiple conditions in one WHERE clause.

sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore, readingScore FROM Student WHERE mathScore> 75 and readingScore>90").show()

Using the dropDuplicates function to remove duplicate entries.

sqlContext.sql("SELECT gender, race, parentalLevelOfEducation, mathScore, readingScore FROM Student").dropDuplicates().show()
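
What counts as a duplicate here is a row whose selected columns all match another row. A plain-Java sketch of that rule follows, under the assumption that each row is modelled as a list of strings. It does not use Spark, and the class name `DropDuplicates` and the sample rows are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of dropping duplicate rows (no Spark involved).
class DropDuplicates {
    // Keeps the first occurrence of each row; two rows are equal when all their fields are equal.
    static List<List<String>> dropDuplicates(List<List<String>> rows) {
        return rows.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up rows of (gender, race, mathScore)
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("female", "group B", "72"),
                Arrays.asList("male", "group C", "69"),
                Arrays.asList("female", "group B", "72"));
        System.out.println(dropDuplicates(rows));
    }
}
```

Unlike this in-memory sketch, Spark makes no promise about the order in which the surviving rows are returned.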

So, with this, we come to the end of this Spark Java tutorial. I hope it shed some light on Spark, Java and Eclipse, their features, and the various types of operations that can be performed using them.

