Amazon's Arm-based Graviton2 Against AMD and Intel: Comparing Cloud Compute

Author: Andrei Frumusanu

by Andrei Frumusanu on March 10, 2020 8:30 AM EST

It’s been a year and a half since Amazon released their first-generation Graviton Arm-based processor core, publicly available in AWS EC2 as the so-called 'A1' instances. While the processor didn’t impress much in terms of performance, it was a signal and a first step towards what’s to come over the next few years.

This year, Amazon is doubling down on its silicon efforts, having announced the new Graviton2 processor last December, and planning public availability on EC2 in the next few months. The latest generation implements Arm’s new Neoverse N1 CPU microarchitecture and mesh interconnect, a combined infrastructure-oriented platform that we detailed a little over a year ago. The platform is a massive jump over previous Arm-based server attempts, and Amazon is aiming for nothing less than a leading competitive position.

Amazon’s endeavours in designing a custom SoC for its cloud services started back in 2015, when the company acquired Israel-based Annapurna Labs. Annapurna had previously worked on networking-focused Arm SoCs, mostly used in products such as NAS devices. Under Amazon, the team was tasked with creating a custom Arm server-grade chip, and the new Graviton2 is the first serious attempt at disrupting the space.

So, what is the Graviton2? It’s a 64-core monolithic server chip design, using Arm’s new Neoverse N1 cores (microarchitectural derivatives of the mobile Cortex-A76 cores) as well as Arm’s CMN-600 mesh interconnect. It’s a pretty straightforward design that is almost identical to the 64-core reference N1 platform that Arm presented a year ago. Amazon did diverge a little: the Graviton2’s CPU cores are clocked slightly lower at 2.5GHz, and the mesh interconnect includes only 32MB of L3 cache instead of 64MB. The system is backed by 8-channel DDR4-3200 memory controllers, and the SoC supports 64 PCIe4 lanes for I/O. It’s a relatively textbook implementation of the N1 platform, manufactured on TSMC’s 7nm process node.

The Graviton2’s potential is of course enabled by the new N1 cores. We’ve already seen the Cortex-A76 perform fantastically in last year’s mobile SoCs, and the N1 microarchitecture is expected to bring even better performance and server-grade features, all whilst retaining the power efficiency that’s made Arm so successful in the mobile space. The N1 cores remain very lean and efficient, at a projected ~1.4mm² for a 1MB L2 cache implementation such as on the Graviton2, and sporting excellent power efficiency at around 1W per core at the 2.5GHz frequency at which Amazon’s new chip arrives.

Total power consumption of the SoC is something that Amazon wasn’t too willing to disclose in the context of our article; the company is still holding some aspects of the design close to its chest even though we were able to test the new chipset in the cloud. Given the chip’s more conservative clock rate, Arm’s projected figure of around 105W for a 64-core 2.6GHz implementation, and Ampere’s recent disclosure of their 80-core 3GHz N1 server chip coming in at 210W, we estimate that the Graviton2 comes in somewhere between 80W as a low estimate and around 110W as a pessimistic projection.

Testing In The Cloud With EC2

Given that Amazon’s Graviton2 is a vertically integrated product specifically designed for Amazon’s needs, it makes sense that we test the new chipset in its intended environment (besides the fact that it’s not available in any other way!). For the last couple of weeks, we’ve had preview access to the new Graviton2-based “m6g” instances in Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2).

For readers unfamiliar with cloud computing, this essentially means we’ve been deploying virtual machines in Amazon’s datacentres, a service for which Amazon has become famous and which now represents a major share of the company’s revenues, powering some of the biggest internet services on the market.

An important metric determining the capabilities of such instances is their type (essentially dictating which CPU architecture and microarchitecture powers the underlying hardware) and possible subtype; in Amazon’s case this refers to platform variations designed for specialised use-cases, such as better compute capabilities or higher memory capacity.

For today’s testing we had access to the “m6g” instances which are designed for general purpose workloads. The “6” in the nomenclature designates Amazon’s 6th generation hardware in EC2, with the Graviton2 currently being the only platform holding this designation.

Instance Throughput Is Defined in vCPUs

Beyond the instance type, the other most important metric defining an instance’s capabilities is its vCPU count. “Virtual CPUs” essentially means the logical CPU cores available to the virtual machine. Amazon offers instances ranging from 1 vCPU up to 128, with the most common sizes across the most popular platforms being 2, 4, 8, 16, 32, 48, 64, and 96.

The Graviton2 being a single-socket 64-core platform without SMT means that the maximum available vCPU instance size is 64.

However, this also means that we’re in a bit of an apples-and-oranges conundrum when comparing against platforms which do come with SMT. When talking about 64 vCPU instances (“16xlarge” in EC2 lingo), this means that for a Graviton2 instance we’re getting 64 physical cores, while for an AMD or Intel system, we’d only be getting 32 physical cores with SMT. I’m sure there will be readers who consider such a comparison “unfair”; however, it’s also the positioning that Amazon is out to make in terms of delivered throughput and, most importantly, the equivalent pricing between the different instance types.
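The vCPU-to-core relationship described above can be written out as a small sketch. It assumes that every SMT hardware thread is exposed as exactly one vCPU; the `Platform` type and `physical_cores` helper are illustrative names, not part of any AWS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    smt_ways: int  # hardware threads per physical core (1 means no SMT)


def physical_cores(vcpus: int, platform: Platform) -> int:
    """Physical cores behind an instance, assuming one vCPU per hardware thread."""
    if vcpus % platform.smt_ways != 0:
        raise ValueError("vCPU count is not a multiple of the SMT width")
    return vcpus // platform.smt_ways


# SMT widths as listed for the three tested platforms.
PLATFORMS = [
    Platform("Graviton2 (m6g)", smt_ways=1),
    Platform("EPYC 7571 (m5a)", smt_ways=2),
    Platform("Xeon Platinum 8259CL (m5n)", smt_ways=2),
]

# Physical cores behind a 64 vCPU ("16xlarge") instance on each platform.
CORES_AT_64_VCPUS = {p.name: physical_cores(64, p) for p in PLATFORMS}

for name, cores in CORES_AT_64_VCPUS.items():
    print(f"{name}: 64 vCPUs -> {cores} physical cores")
```

Under that assumption a 16xlarge instance maps to 64 physical cores on the Graviton2 and 32 on either x86 platform.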

Today’s Competition

Today’s article will focus on two main competitors to the Graviton2: AMD EPYC 7571 (Zen1) powered m5a instances, and Intel Xeon Platinum 8259CL (Cascade Lake) powered m5n instances. At the time of writing, these are the most powerful instances available from the two x86 incumbents, and should provide the most interesting comparison data.

We would have loved to include AMD EPYC2 Rome based (c5a/c5ad) instances in this comparison; Amazon announced last November that it had been working on such deployments, but alas the company wasn’t willing to share preview access with us (one reason given was that the Rome C-type instances weren’t a good comparison to the Graviton2’s M-type instance, although this really doesn’t make any technical sense). As these instances get closer to preview availability, we’ll be working on a separate article to add that important piece of the puzzle to the competitive landscape.

Tested 16xlarge EC2 Instances
Instance	m6g	m5a	m5n
CPU Platform	Graviton2	EPYC 7571	Xeon Platinum 8259CL
vCPUs	64	64	64
Cores Per Socket	64	32	24 (16 instantiated)
SMT	-	2-way	2-way
CPU Sockets	1	1	2
Frequencies	2.5GHz	2.5-2.9GHz	2.9-3.2GHz
Architecture	Arm v8.2	x86-64 + AVX2	x86-64 + AVX512
µarchitecture	Neoverse N1	Zen	Cascade Lake
L1I Cache	64KB	64KB	32KB
L1D Cache	64KB	32KB	32KB
L2 Cache	1MB	512KB	1MB
L3 Cache	32MB shared	8MB shared per 4-core CCX	35.75MB shared per socket
Memory Channels	8x DDR4-3200	8x DDR4-2666 (2x per NUMA-node)	6x DDR4-2933 per socket
NUMA Nodes	1	4	2
DRAM	256GB	256GB	256GB
TDP	Estimated 80-110W?	180W	210W per socket
Price	$2.464 / hour	$2.752 / hour	$3.808 / hour

Comparing the Graviton2 m6g instances against the AMD m5a and Intel m5n instances, we’re seeing a few differences in the hardware capabilities that power the VMs. Again, the most notable difference is the fact that the Graviton2 comes with physical core counts matching the deployed vCPU number, whilst the competition counts SMT logical cores as vCPUs as well.

Another aspect of higher-vCPU count instances is that you can receive a VM that spans several sockets. AMD’s m5a.16xlarge is still able to deploy the VM on a single socket thanks to the EPYC 7571’s 32 cores; Intel’s Xeon system, however, employs two sockets, as there’s currently no deployed Intel hardware in EC2 that can offer the required vCPU count in a single socket.

Both the EPYC 7571 and the Xeon Platinum 8259CL are parts which aren’t publicly available or even listed on either company’s SKU list, so these are custom parts for the likes of Amazon for datacentre deployments.

The AMD part is a 32-core Zen1 based single-socket solution (at least for the 16xlarge instances in our testing) clocking in at 2.5GHz all-core and up to 2.9GHz in lightly threaded scenarios. The peculiarity of this system is that it’s somewhat limited by AMD’s quad-chip MCM system which has four NUMA nodes (one per chip and 2-channel memory controller), a characteristic that’s been eliminated in the newer EPYC2 Zen2 based systems. We don’t have concrete confirmation of the figure, but we suspect this is a 180W part based on the SKU number.

Intel’s Xeon Platinum 8259CL is based on the newer Cascade Lake generation CPU cores. This particular part is also specific to Amazon, and consists of 24 enabled cores per socket. To reach the 16xlarge 64 vCPU count, EC2 provides us a dual-socket system with 16 out of the 24 cores instantiated on each socket. Again, we have no confirmation on the matter, but these parts should be rated at 210W per socket, or 420W total. We do have to remind ourselves that we’re only ever using two thirds of the system’s cores in our instance, although we do have access to the full memory bandwidth and caches of the system.
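As a quick sanity check of the Xeon instance's make-up, the arithmetic can be sketched as follows. The figures are the ones quoted in the paragraph above; the 210W rating is unconfirmed, and the variable names are purely illustrative.

```python
SOCKETS = 2
CORES_PER_SOCKET = 24               # enabled cores on the Xeon Platinum 8259CL
CORES_INSTANTIATED_PER_SOCKET = 16  # cores handed to the m5n.16xlarge VM
SMT_WAYS = 2                        # hardware threads per core
TDP_PER_SOCKET_W = 210              # unconfirmed estimate

# Each instantiated core contributes one vCPU per SMT thread.
vcpus = SOCKETS * CORES_INSTANTIATED_PER_SOCKET * SMT_WAYS

# Share of the machine's physical cores that the instance actually uses.
core_fraction = CORES_INSTANTIATED_PER_SOCKET / CORES_PER_SOCKET

# Combined rating of both sockets.
total_tdp_w = SOCKETS * TDP_PER_SOCKET_W

print(f"vCPUs: {vcpus}")
print(f"Cores in use: {core_fraction:.1%}")
print(f"Total TDP: {total_tdp_w}W")
```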

The cache configuration in particular is interesting here as things differ quite a bit between platforms. The private caches of the actual CPUs themselves are relatively self-explanatory: the Graviton2 provides the highest L1 cache capacity of the trio, while its 1MB L2 is equal to that of the Xeon platform. If we were to divide the available cache on a per-thread basis, the Graviton2 leads the set at 1.5MB, ahead of the EPYC’s 1.25MB and the Xeon’s 1.05MB. The Graviton2 and Xeon systems have the distinct advantage that their last level caches are shared across the whole socket, while AMD’s L3 is shared only amongst 4-core CCX modules.
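The per-thread cache figures can be reproduced with a rough sketch that splits each core's L2 across its SMT threads and each L3 across the threads sharing it. The helper name is hypothetical, and the Xeon inputs are an assumption: the quoted figure of roughly 1.05MB is matched when one socket's 35.75MB L3 is divided across all 64 vCPUs, and other ways of apportioning the L3 would give a different value.

```python
def cache_per_thread_mb(l2_per_core_mb: float, smt_ways: int,
                        l3_mb: float, threads_sharing_l3: int) -> float:
    """L2 share plus L3 share available to one hardware thread, in MB."""
    return l2_per_core_mb / smt_ways + l3_mb / threads_sharing_l3


# Graviton2: 1MB L2 per core, no SMT, 32MB L3 shared by 64 threads.
graviton2 = cache_per_thread_mb(1.0, 1, 32.0, 64)

# EPYC 7571: 512KB L2 per core, 2-way SMT, 8MB L3 per 4-core CCX (8 threads).
epyc = cache_per_thread_mb(0.5, 2, 8.0, 8)

# Xeon 8259CL: 1MB L2 per core, 2-way SMT; ASSUMPTION: 35.75MB L3 over 64 vCPUs.
xeon = cache_per_thread_mb(1.0, 2, 35.75, 64)

for name, value in [("Graviton2", graviton2), ("EPYC", epyc), ("Xeon", xeon)]:
    print(f"{name}: {value:.2f}MB per thread")
```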

The NUMA discrepancies between the systems aren’t that important in parallel processing workloads with actual multiple processes, but they will have an impact on multi-threaded as well as single-threaded performance, and the Graviton2’s unified memory architecture will have an important advantage in a few scenarios.
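To put the memory configurations into perspective, theoretical peak DRAM bandwidth can be estimated from the channel counts and transfer rates in the table. This is a sketch assuming standard 64-bit (8-byte) DDR4 channels; these are ceilings rather than measured results, and the helper name is illustrative.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR4 channel moves 8 bytes per transfer


def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


graviton2 = peak_bandwidth_gbs(8, 3200)        # single socket, single NUMA node
epyc_total = peak_bandwidth_gbs(8, 2666)       # whole socket
epyc_per_node = peak_bandwidth_gbs(2, 2666)    # local to one of four NUMA nodes
xeon_per_socket = peak_bandwidth_gbs(6, 2933)  # local to one of two sockets
xeon_total = 2 * xeon_per_socket

print(f"Graviton2: {graviton2:.1f} GB/s")
print(f"EPYC 7571: {epyc_total:.1f} GB/s total, {epyc_per_node:.1f} GB/s per NUMA node")
print(f"Xeon 8259CL: {xeon_total:.1f} GB/s total, {xeon_per_socket:.1f} GB/s per socket")
```

The per-node figures illustrate why NUMA matters: a thread on the EPYC system only reaches a quarter of the socket's channels without crossing to another die, whereas every Graviton2 core sees all eight channels as local memory.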

Finally, there’s quite a difference in the pricing between the instances. At $2.464 per hour, the Graviton2 system edges out the AMD system in price, and is massively cheaper than the $3.808 per hour cost of the Xeon based instance. When talking about pricing, though, we do have to remember that the actual value delivered will also depend heavily on the performance and throughput of the systems, which we’ll be covering in more detail later in the article.
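The pricing gap is easier to judge when normalised. The sketch below uses the hourly prices from the table above to derive the cost per vCPU-hour and each instance's premium over the m6g; it says nothing about delivered performance, and the dictionary names are illustrative.

```python
HOURLY_PRICE_USD = {"m6g": 2.464, "m5a": 2.752, "m5n": 3.808}
VCPUS = 64  # all three are 16xlarge instances

# Cost of a single vCPU for one hour.
price_per_vcpu = {name: price / VCPUS for name, price in HOURLY_PRICE_USD.items()}

# Relative premium over the Graviton2 instance (0.0 for the m6g itself).
premium_over_m6g = {
    name: price / HOURLY_PRICE_USD["m6g"] - 1.0
    for name, price in HOURLY_PRICE_USD.items()
}

for name in HOURLY_PRICE_USD:
    print(f"{name}: ${price_per_vcpu[name]:.4f} per vCPU-hour, "
          f"{premium_over_m6g[name]:+.1%} vs m6g")
```

On these list prices the m5a costs roughly 12% more than the m6g and the m5n roughly 55% more, for the same vCPU count.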

We thank Amazon for providing us with preview access to the m6g Graviton2 instances. Aside from giving us access, neither Amazon nor any of the other mentioned companies has had any influence on our testing methodology, and we paid for our EC2 instance testing time ourselves.

96 Comments

anonomouse - Tuesday, March 10, 2020 - link
Will there be more articles on this, covering other workloads than SPEC? You see lots of academic and industry papers talking about how real cloud/hyperscaler/server workloads have deep software stacks with large instruction-side footprints and static branch footprints, whereas SPEC is really... not that. Those workloads tend to have lower IPC on all platforms, and it would be interesting to see how Graviton2 performs on those from the instruction-supply side of things (1 core) as well as how I-side bandwidth scales horizontally with thread counts given the coherent I-Cache.
Andrei Frumusanu - Tuesday, March 10, 2020 - link
Concrete suggestions in terms of workloads to look at and that can be reasonably deployed are welcome; we currently don't have a well defined test suite for such things.
FunBunny2 - Tuesday, March 10, 2020 - link
"Concrete suggestions in terms of workloads"

OLTP on RDBMS?? real one, of course, not MySql. :)
Andrei Frumusanu - Tuesday, March 10, 2020 - link
I mean an actual concrete example of such a structured benchmark, me going around doing random DB operations just opens up more criticism on why we didn't use test framework XYZ.
FunBunny2 - Tuesday, March 10, 2020 - link
here's one: https://hammerdb.com/ don't know, perhaps likely, that you can get the source and compile for any db/OS of interest. didn't say it was simple. :)
Andrei Frumusanu - Wednesday, March 11, 2020 - link
It's just I'm hearing a lot of "we want something specific" without actually specifying anything, me doing some random workload myself that isn't validated in terms of characterisation isn't in my view any better than the well understood nature of SPEC.
anonomouse - Wednesday, March 11, 2020 - link
Have you looked at the benchmarks in GCP PerfKitBenchmarker (https://github.com/GoogleCloudPlatform/PerfKitBenc... It includes benchmark versions of various popular benchmarks including variants of ycsb on different databases, oltp, cloudsuite, hadoop, and a bunch of wrapper infrastructure around running the tests on cloud providers.
anonomouse - Wednesday, March 11, 2020 - link
Okay so maybe the comment system doesn't have well with links:

https://github.com/GoogleCloudPlatform/PerfKitBenc...
http://googlecloudplatform.github.io/PerfKitBenchm...
yeeeeman - Tuesday, March 10, 2020 - link
Ok, now imagine this chip with apple custom cores. Even Zen wouldn't stand a chance.
HStewart - Tuesday, March 10, 2020 - link
You can't truly say that. Keep in mind both Apple and Amazon are aiming at their own custom environments; things are likely different in the real world.
