EMR Cluster Configuration and Scaling

Last Updated : 10-Oct-2020

This article is about choosing between the various compute options of master, core and task nodes on EMR clusters based on the workload profile

Instance Fleets vs Instance Groups

Instance Groups are limited to one instance type and one purchase type (reserved, demand, spot, etc) per group

Instance Fleets support multiple instances and multiple payments types and hence offer more flexibility and possible efficiency in matching the workload demands to compute resources available.

What Instance Type Should You Use?

There are several ways to add EC2 instances to a cluster, which depend on whether you use the instance groups configuration or the instance fleets configuration for the cluster.

Instance Groups

  • AUTOMATIC: Set up automatic scaling in Amazon EMR for an instance group, adding and removing instances automatically based on the value of an Amazon CloudWatch metric that you specify. For more information, see Scaling Cluster Resources.
  • MANUAL GROUP UPDATE: Manually add instances of the same type to existing core and task instance groups.
  • MANUAL ADD GROUPS: Manually add a task instance group, which can use a different instance type.

Instance Fleets

  • Add a single task instance fleet.
  • Change the target capacity for On-Demand and Spot Instances for existing core and task instance fleets.

Node Types

The master node type, which assigns tasks, doesn’t require an EC2 instance with much processing power. Consider using an m4.xlarge, or m5.xlarge.

EC2 instances for the core node type, which process tasks and store data in HDFS, need both processing power and storage capacity.  m5.xlarge is typical

EC2 instances for the task node type, which don’t store data, need only processing power. m5.xlarge is typical

Core Nodes on Spot Instances

Core nodes process data and store information using HDFS. Terminating a core instance risks data loss. For this reason, you should only run core nodes on Spot Instances when partial HDFS data loss is tolerable.

When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. In other words, if you request six Amazon EC2 instances, and only five are available at or below your maximum Spot price, the instance group won’t launch

Task Nodes on Spot Instances

The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot price has risen above your maximum Spot price, no data is lost and the effect on your cluster is minimal.

When you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can, using your maximum Spot price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at or below your maximum Spot price, Amazon EMR launches the instance group with five nodes, adding the sixth later if possible.

Instance Configurations for Application Scenarios

The following table is a quick reference to node type purchasing options and configurations that are usually appropriate for various application scenarios.

Application Scenario Master Node Core Nodes Task Nodes
Long-Running Clusters
and Data Warehouses
On-Demand On-Demand or instance-fleet mix Spot or instance-fleet mix
Cost-Driven Workloads Spot Spot Spot
Data-Critical Workloads On-Demand On-Demand Spot or instance-fleet mix
Application Testing Spot Spot Spot
Using Template: Template Post
magnifier linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram