This article is about choosing between the various compute options of master, core and task nodes on EMR clusters based on the workload profile
Instance Groups are limited to one instance type and one purchase type (reserved, demand, spot, etc) per group
Instance Fleets support multiple instances and multiple payments types and hence offer more flexibility and possible efficiency in matching the workload demands to compute resources available.
There are several ways to add EC2 instances to a cluster, which depend on whether you use the instance groups configuration or the instance fleets configuration for the cluster.
The master node type, which assigns tasks, doesn't require an EC2 instance with much processing power. Consider using an m4.xlarge, or m5.xlarge.
EC2 instances for the core node type, which process tasks and store data in HDFS, need both processing power and storage capacity. m5.xlarge is typical
EC2 instances for the task node type, which don't store data, need only processing power. m5.xlarge is typical
Core nodes process data and store information using HDFS. Terminating a core instance risks data loss. For this reason, you should only run core nodes on Spot Instances when partial HDFS data loss is tolerable.
When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. In other words, if you request six Amazon EC2 instances, and only five are available at or below your maximum Spot price, the instance group won't launch
The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot price has risen above your maximum Spot price, no data is lost and the effect on your cluster is minimal.
When you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can, using your maximum Spot price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at or below your maximum Spot price, Amazon EMR launches the instance group with five nodes, adding the sixth later if possible.
The following table is a quick reference to node type purchasing options and configurations that are usually appropriate for various application scenarios.
|Application Scenario||Master Node||Core Nodes||Task Nodes|
and Data Warehouses
|On-Demand||On-Demand or instance-fleet mix||Spot or instance-fleet mix|
|Data-Critical Workloads||On-Demand||On-Demand||Spot or instance-fleet mix|