Which cluster type should I choose for Spark?

Question

I am new to Apache Spark, and I just came to know that Spark supports three types of cluster:

Standalone
YARN
Mesos

Since I am new to Spark, I think I should try Standalone first. But I wonder which one is recommended. Say that, in the future, I need to build a large cluster (hundreds of instances): which cluster type should I go with?

nitinrawat895 · Answer 1 · Jun 27, 2018

In my opinion, you should start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.

If you would like to run Spark alongside other applications or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.

One advantage of Mesos over both YARN and standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users are running interactive shells.

In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.

Hope it will answer your query to some extent.


zombie · Answer 2 · Jul 12, 2018

I think the best way to answer this is:

Start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
If you would like to run Spark alongside other applications or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.
One advantage of Mesos over both YARN and the standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users are running interactive shells.
In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.


shams · Answer 3 · Aug 21, 2018

Spark is agnostic to the underlying cluster manager. All of the supported cluster managers can be launched on-site or in the cloud. All have options for controlling the deployment’s resource usage and other capabilities, and all come with monitoring tools.

So how do you decide which is the best cluster manager for your use case? Here is an overview of each:

Spark Standalone
The Spark Standalone cluster manager is a simple cluster manager available as part of the Spark distribution. It has HA for the master, is resilient to worker failures, has capabilities for managing resources per application, and can run alongside an existing Hadoop deployment and access HDFS (Hadoop Distributed File System) data. The distribution includes scripts to make it easy to deploy either locally or in the cloud on Amazon EC2. It can run on Linux, Windows, or Mac OS X.

Apache Mesos
Apache Mesos, a distributed systems kernel, has HA for masters and slaves, can manage resources per application, and has support for Docker containers. It can run Spark jobs, Hadoop MapReduce, or any other service application. It has APIs for Java, Python, and C++. It can run on Linux or Mac OS X.

Hadoop YARN
Hadoop YARN, a distributed computing framework for job scheduling and cluster resource management, has HA for masters and slaves, support for Docker containers in non-secure mode, Linux and Windows container executors in secure mode, and a pluggable scheduler. It can run on Linux and Windows.
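In practice, the choice of cluster manager shows up mainly in the master URL you pass to spark-submit. Here is a rough sketch of the three URL forms. The helper names (master_url, submit_command) and the host names are made up for illustration; the URL schemes and default ports (7077 for standalone, 5050 for Mesos) are the usual defaults, so adjust them if your cluster differs.

```python
# Sketch only: map each cluster manager to the --master value spark-submit expects.
# master_url / submit_command and all host names are hypothetical helpers for illustration.

def master_url(manager, host=None, port=None):
    """Return the master URL string for a given cluster manager."""
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"  # 7077: default standalone master port
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"  # 5050: default Mesos master port
    if manager == "yarn":
        # YARN takes no host here; the ResourceManager address comes from the Hadoop config
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager}")


def submit_command(manager, app, host=None):
    """Build a spark-submit argument list for the chosen manager."""
    return ["spark-submit", "--master", master_url(manager, host), app]


if __name__ == "__main__":
    for manager, host in [("standalone", "master-host"), ("mesos", "mesos-host"), ("yarn", None)]:
        print(" ".join(submit_command(manager, "my_app.py", host)))
```

Because only the master URL changes, an application developed on a standalone cluster can usually be moved to YARN or Mesos later without code changes.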

Cluster Management Scheduling Capabilities:

On all cluster managers, jobs or actions within a Spark application are scheduled by the Spark scheduler in a FIFO fashion. Alternatively, the scheduling can be set to a fair scheduling policy where Spark assigns resources to jobs in a round-robin fashion. In addition, the memory used by an application can be controlled with settings in the SparkContext. The resources used by a Spark application can be dynamically adjusted based on the workload. Thus, the application can free unused resources and request them again when there is a demand. This is available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
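To make this a bit more concrete, here is a small sketch of the Spark properties behind these options. The property names are real Spark settings; the helper names and the specific values (executor counts, memory size) are only assumptions for illustration.

```python
# Sketch only: Spark properties behind FIFO/FAIR scheduling and dynamic allocation.
# scheduling_conf / as_submit_flags are hypothetical helpers; the values are illustrative.

def scheduling_conf(fair=False, dynamic=False, executor_memory="2g"):
    """Return a dict of Spark properties for the requested scheduling setup."""
    conf = {
        "spark.scheduler.mode": "FAIR" if fair else "FIFO",  # scheduling of jobs within one application
        "spark.executor.memory": executor_memory,
    }
    if dynamic:
        conf["spark.dynamicAllocation.enabled"] = "true"
        # The external shuffle service keeps shuffle files available after executors are released
        conf["spark.shuffle.service.enabled"] = "true"
        conf["spark.dynamicAllocation.minExecutors"] = "1"
        conf["spark.dynamicAllocation.maxExecutors"] = "10"
    return conf


def as_submit_flags(conf):
    """Render the properties as --conf key=value arguments for spark-submit."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


if __name__ == "__main__":
    print(" ".join(as_submit_flags(scheduling_conf(fair=True, dynamic=True))))
```

The same key/value pairs can also be set on the application's SparkConf object or in spark-defaults.conf.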

Spark standalone uses a simple FIFO scheduler for applications. By default, each application uses all the available nodes in the cluster. The number of nodes can be limited per application, per user, or globally. Other resources, such as memory and CPUs, can be controlled via the application’s SparkConf object.
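A toy model may help show why capping applications matters in standalone mode. This is not Spark's scheduler code, just arithmetic under the assumption that applications are served strictly in submission order and that an uncapped application takes everything that is free. In a real deployment the per-application cap is the spark.cores.max property.

```python
# Toy model only, not Spark's actual scheduler: FIFO hand-out of cores to applications.
# allocate_fifo is a hypothetical helper; cores_max stands in for spark.cores.max.

def allocate_fifo(total_cores, apps):
    """apps is a list of (name, cores_max) in submission order; cores_max=None means no cap."""
    free = total_cores
    granted = {}
    for name, cores_max in apps:
        # An uncapped application grabs every free core; a capped one takes at most its cap
        want = free if cores_max is None else min(cores_max, free)
        granted[name] = want
        free -= want
    return granted


if __name__ == "__main__":
    # Without a cap, the first application starves the second one
    print(allocate_fifo(16, [("etl", None), ("adhoc", 4)]))
    # With caps, both applications can run at the same time
    print(allocate_fifo(16, [("etl", 8), ("adhoc", 4)]))
```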

Apache Mesos has master and slave processes. The master makes offers of resources to the application (called a framework in Apache Mesos), which either accepts the offer or not. Thus, claiming available resources and running jobs is determined by the application itself. Apache Mesos allows fine-grained control of the resources in a system such as CPUs, memory, disks, and ports. Apache Mesos also offers coarse-grained control of resources, where Spark allocates a fixed number of CPUs to each executor in advance, which are not released until the application exits. Note that in the same cluster, some applications can be set to use fine-grained control while others are set to use coarse-grained control.
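The offer model is easier to see with a tiny simulation. This is a toy sketch, not the Mesos API: the function name and the offer format are invented for illustration. It only shows the idea that the framework, not the master, decides which offers to accept. On the Spark side, the property that switches between the two modes is spark.mesos.coarse.

```python
# Toy sketch of the Mesos offer model; run_offers and the dict layout are hypothetical.
# The master offers resources and the framework (the Spark application) accepts or declines.

def run_offers(offers, needs):
    """Split offers into (accepted, declined) based on what one executor needs."""
    accepted, declined = [], []
    for offer in offers:
        fits = offer["cpus"] >= needs["cpus"] and offer["mem"] >= needs["mem"]
        # The decision is made by the framework itself, per offer
        (accepted if fits else declined).append(offer)
    return accepted, declined


if __name__ == "__main__":
    offers = [
        {"slave": "s1", "cpus": 4, "mem": 8192},
        {"slave": "s2", "cpus": 1, "mem": 1024},
    ]
    accepted, declined = run_offers(offers, {"cpus": 2, "mem": 4096})
    print("accepted:", [o["slave"] for o in accepted])
    print("declined:", [o["slave"] for o in declined])
```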

Apache Hadoop YARN has a ResourceManager with two parts, a Scheduler and an ApplicationsManager. The Scheduler is a pluggable component. Two implementations are provided: the CapacityScheduler, useful in a cluster shared by more than one organization, and the FairScheduler, which ensures all applications, on average, get an equal share of resources. Both schedulers assign applications to queues, and each queue gets resources that are shared equally between them. Within a queue, resources are shared between the applications. The ApplicationsManager is responsible for accepting job submissions and starting the application-specific ApplicationMaster. In this case, the ApplicationMaster is the Spark application. In the Spark application, resources are specified in the application’s SparkConf object.
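From the Spark side, picking a YARN queue is just a matter of passing it at submit time. A minimal sketch, assuming a queue named "analytics" already exists in your scheduler configuration (the queue name, the application file, and the helper yarn_submit are all hypothetical); the spark-submit flags themselves are the standard ones.

```python
# Sketch only: build a spark-submit command that targets a specific YARN queue.
# yarn_submit is a hypothetical helper; the queue and file names are placeholders.

def yarn_submit(app, queue="default", deploy_mode="cluster", executors=2):
    """Return the spark-submit argument list for a YARN submission."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,   # cluster: the driver runs inside the ApplicationMaster
        "--queue", queue,               # equivalent to setting spark.yarn.queue
        "--num-executors", str(executors),
        app,
    ]


if __name__ == "__main__":
    print(" ".join(yarn_submit("my_app.py", queue="analytics", executors=4)))
```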

High Availability (HA)

The Spark standalone cluster manager supports automatic recovery of the master by using standby masters in a ZooKeeper quorum. It also supports manual recovery using the file system. The cluster is resilient to Worker failures regardless of whether recovery of the Master is enabled.
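As a rough sketch of what the ZooKeeper-based recovery looks like in configuration: the masters are started with a few spark.deploy.* system properties (typically via SPARK_DAEMON_JAVA_OPTS in spark-env.sh), and applications list all masters in their master URL. The property names are real; the helper names and the host names below are placeholders.

```python
# Sketch only: settings for standalone master recovery via ZooKeeper.
# standalone_ha_opts / ha_master_url are hypothetical helpers; host names are placeholders.

def standalone_ha_opts(zk_hosts, zk_dir="/spark"):
    """Return the -D options to put in SPARK_DAEMON_JAVA_OPTS on every master."""
    props = {
        "spark.deploy.recoveryMode": "ZOOKEEPER",
        "spark.deploy.zookeeper.url": ",".join(zk_hosts),
        "spark.deploy.zookeeper.dir": zk_dir,  # znode path where recovery state is kept
    }
    return " ".join(f"-D{key}={value}" for key, value in props.items())


def ha_master_url(masters):
    """Applications list every master so they can fail over to the new leader."""
    return "spark://" + ",".join(masters)


if __name__ == "__main__":
    print('SPARK_DAEMON_JAVA_OPTS="' + standalone_ha_opts(["zk1:2181", "zk2:2181", "zk3:2181"]) + '"')
    print(ha_master_url(["master1:7077", "master2:7077"]))
```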

The Apache Mesos cluster manager also supports automatic recovery of the master using Apache ZooKeeper. Tasks which are currently executing continue to do so in the case of failover.

Apache Hadoop YARN supports manual recovery using a command line utility and supports automatic recovery via a ZooKeeper-based ActiveStandbyElector embedded in the ResourceManager. Therefore, unlike Mesos and the Standalone managers, there is no need to run a separate ZooKeeper Failover Controller. ZooKeeper is only used to record the state of the ResourceManagers.

Security

Spark supports authentication via a shared secret with all the cluster managers. The standalone manager requires the user to configure each of the nodes with the shared secret. Data can be encrypted using SSL for the communication protocols. SASL encryption is supported for block transfers of data. Other options are also available for encrypting data. Access to Spark applications in the Web UI can be controlled via access control lists.
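For reference, these Spark-side features map to a handful of properties. The sketch below renders them in spark-defaults.conf style. The property names are real Spark settings; the helper names, the secret value, and the user names are placeholders, and a working SSL setup additionally needs keystore settings that are left out here.

```python
# Sketch only: Spark security properties in spark-defaults.conf form.
# security_conf / render_defaults are hypothetical helpers; secret and users are placeholders.

def security_conf(secret, viewers):
    """Return the properties for shared-secret auth, encryption, and UI ACLs."""
    return {
        "spark.authenticate": "true",
        "spark.authenticate.secret": secret,                # must match on every node in standalone mode
        "spark.authenticate.enableSaslEncryption": "true",  # encrypts block transfers
        "spark.ssl.enabled": "true",                        # also requires keystore settings (not shown)
        "spark.acls.enable": "true",
        "spark.ui.view.acls": ",".join(viewers),            # users allowed to view the Web UI
    }


def render_defaults(conf):
    """Render one 'key value' pair per line, as spark-defaults.conf expects."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(render_defaults(security_conf("change-me", ["alice", "bob"])))
```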

Mesos provides authentication for any entity interacting with the cluster. This includes the slaves registering with the master, frameworks (that is, applications) submitted to the cluster, and operators using endpoints such as HTTP endpoints. Each of these entities can be enabled to use authentication or not. Mesos’ default authentication module, Cyrus SASL, can be replaced with a custom module. Access control lists are used to authorize access to services in Mesos. By default, communication between the modules in Mesos is unencrypted. SSL/TLS can be enabled to encrypt this communication. HTTPS is supported for the Mesos WebUI.

Hadoop YARN has security for authentication, service level authorization, authentication for Web consoles, and data confidentiality. Hadoop uses Kerberos to authenticate each user and service. Service level authorization ensures that clients using Hadoop services are authorized to use them. Access to the Hadoop services can be finely controlled via access control lists. Additionally, data and communication between clients and services can be encrypted using SSL, and data transferred between the Web console and clients can be protected with HTTPS.

Monitoring

Each Apache Spark application has a Web UI to monitor the application. The Web UI shows information about tasks running in the application, executors, and storage usage. Additionally, Spark’s standalone cluster manager has a Web UI to view cluster and job statistics as well as detailed log output for each job. If an application has logged events for its lifetime, the Spark Web UI will automatically reconstruct the application’s UI after the application exits. If Spark is running on Mesos or YARN, then a UI can be reconstructed after an application exits through Spark’s history server.
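Reconstructing the UI after an application exits depends on event logging being switched on. A small sketch of the properties involved follows. The property names are real; the helper name and the HDFS path are placeholders. The application writes its events to one directory, and the history server (started with sbin/start-history-server.sh) reads from the same place.

```python
# Sketch only: event-log settings so a finished application's UI can be rebuilt.
# history_conf is a hypothetical helper; the log directory is a placeholder path.

def history_conf(log_dir):
    """Point the application's event log and the history server at the same directory."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,             # where the application writes its events
        "spark.history.fs.logDirectory": log_dir,  # where the history server reads them
    }


if __name__ == "__main__":
    for key, value in history_conf("hdfs:///spark-events").items():
        print(key, value)
```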

Apache Mesos provides numerous metrics for the master and slave nodes, accessible via a URL. These metrics include, for example, percentage and number of allocated CPUs, total memory used, percentage of available memory used, total disk space, allocated disk space, elected master, uptime of a master, slave registrations, connected slaves, etc. Per-container network monitoring and isolation are also supported.

Hadoop YARN has a Web UI for the ResourceManager and the NodeManager. The ResourceManager UI provides metrics for the cluster while the NodeManager provides information for each node and the applications and containers running on the node.

So, which cluster manager is best for your project?
Apache Spark is agnostic to the underlying cluster manager, so choosing which manager to use depends on your goals. In the sections above we discussed several aspects of Spark’s Standalone cluster manager, Apache Mesos, and Hadoop YARN, including:

Scheduling
High Availability
Security
Monitoring

All three cluster managers provide various scheduling capabilities, but Apache Mesos provides the finest-grained sharing options.

High availability is offered by all three cluster managers, but Hadoop YARN doesn’t need to run a separate ZooKeeper Failover Controller.

Security is provided on all of the managers. Apache Mesos uses a pluggable architecture for its security module with the default module using Cyrus SASL. The Standalone cluster manager uses a shared secret and Hadoop YARN uses Kerberos. All three use SSL for data encryption.

Finally, the Spark Standalone cluster manager is the easiest to get started with and provides a fairly complete set of capabilities. The scripts are simple and straightforward to use. So, if you are developing a new application, this is the quickest way to get started.