Exploring ARM and heterogeneous compute architecture

In June 2020, Apple announced plans to transition the Mac to what it called “a world-class custom silicon to deliver industry-leading performance and powerful new technologies.” Its custom silicon is based on the ARM SoC (System-on-Chip) architecture and is an evolution of the chips that have powered Apple’s iPhones and iPads for more than half a decade. The announcement also claimed that the new family of SoCs, custom built for the Mac, would deliver higher performance per watt and better-performing GPUs, and that access to a neural processing engine would make the Mac an amazing platform for developers to use machine learning.

In November 2020, Apple announced the M1 as the most powerful chip it had ever created and the first chip designed specifically for the Mac. It further claimed that the M1 offered the world’s fastest CPU core in low-power silicon, the best CPU performance per watt, the fastest integrated graphics in a personal computer, and breakthrough machine learning performance with the Apple Neural Engine. Power consumption and thermal output reports for the M1 Mac mini show a significant reduction on both fronts compared with its 2018 Intel-based counterpart.

In this article, we’ll look at two ARM hardware architectural features that have powered the advances of ARM-based mobile processors, including the Apple M1, in both overall performance and performance per watt. Here, performance per watt refers to the ratio of peak CPU performance to average power consumed.
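To make that ratio concrete, here is a minimal Python sketch. The function name and both sets of numbers are hypothetical illustration values, not measurements of any real chip.

```python
def performance_per_watt(peak_performance: float, average_power_watts: float) -> float:
    """Ratio of peak performance (any benchmark score) to average power consumed."""
    if average_power_watts <= 0:
        raise ValueError("average power must be positive")
    return peak_performance / average_power_watts


# Hypothetical chips: the scores and wattages below are made up for illustration.
chip_a = performance_per_watt(peak_performance=1000.0, average_power_watts=20.0)
chip_b = performance_per_watt(peak_performance=1200.0, average_power_watts=40.0)

print(chip_a, chip_b)  # 50.0 30.0
```

In this made-up comparison, chip B has the higher peak score, but chip A delivers more performance for every watt it draws.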

System-on-Chip

A System-on-Chip (SoC) is an integrated circuit that combines many components of a computer onto a single substrate. The primary advantage of SoC architecture over CPU-based PC architecture is its size. Along with microprocessors, SoCs can come integrated with one or more memory components, a graphics processing unit (GPU), digital signal processors (DSP), neural processing units (NPU), I/O controllers, custom application-specific integrated circuits (ASIC), and more. The integrated design of these components also means that they can be developed with a unified approach to performance and energy efficiency, and can deliver more performance per watt than their PC equivalents. This energy efficiency, combined with the small form factor of SoCs, makes them ideal for the mobile, consumer, wearable, and edge computing markets.

Devices of the future, like AR smart glasses, are getting lighter and smaller, yet they are more demanding on performance because they must run complex, advanced multi-compute workloads. Future SoC-based designs will therefore strive for higher performance within even smaller form factors and power envelopes, with performance delivered per watt as the guiding metric.

ARM big.LITTLE

ARM big.LITTLE technology is a heterogeneous processing architecture that uses two different types of processors arranged as two clusters. Each cluster contains a single type of processor. The “LITTLE” processors are designed for maximum power efficiency, while the “big” processors are designed to provide maximum compute performance. Both types of processors are coherent and share the same instruction set architecture (ISA). The Apple M1 is an example of an SoC that follows the big.LITTLE approach: it has four “big” high-performance cores called “Firestorm” and four “LITTLE” energy-efficient cores called “Icestorm.” Each task can be dynamically allocated to a big or LITTLE core depending on the instantaneous performance requirement of that task. With this combination of processors, the system can deliver peak performance on demand with maximum energy efficiency while staying within its thermal bounds.

ARM big.LITTLE technology has been designed to address two main requirements:

  1. At the high-performance end: high compute capability within the system’s thermal bounds
  2. At the low-performance end: very low power consumption

The big.LITTLE system has two major software execution models:

  1. CPU migration: In this model, each big core is paired with a LITTLE core. Only one core in each pair is active at any one time, and the inactive core is powered down. The active core in the pair is chosen according to current load conditions. On a system with the same core layout as the Apple M1 (four big and four LITTLE cores), the operating system would see four logical processors. Each logical processor can physically be a big or a LITTLE processor, and this choice is driven by dynamic voltage and frequency scaling (DVFS). This model requires the same number of processors in both clusters.
  2. Global task scheduling: In this model, the scheduler is aware of the differences in the performance and energy characteristics of the big and LITTLE cores. The scheduler tracks the performance requirement of each individual thread and uses that information to decide which type of processor to use. Unused processors can be powered off, and if every processor in a cluster is unused, the cluster itself can be powered off. This model can work on a big.LITTLE system with any number of processors in either cluster.
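The difference between the two models can be sketched in a few lines of Python. This is a toy model under loud assumptions: the `LOAD_THRESHOLD` cut-over value, the function names, and the load figures are all hypothetical, and a real scheduler weighs far more than a single load number.

```python
from __future__ import annotations

BIG, LITTLE = "big", "LITTLE"
LOAD_THRESHOLD = 0.6  # hypothetical cut-over point between LITTLE and big


def cpu_migration(pair_loads: list[float]) -> list[str]:
    """One logical CPU per big/LITTLE pair; only one core of each pair is active."""
    return [BIG if load > LOAD_THRESHOLD else LITTLE for load in pair_loads]


def global_task_scheduling(
    thread_loads: dict[str, float], n_big: int, n_little: int
) -> tuple[dict[str, str], list[str]]:
    """Place each thread individually; cluster sizes do not have to match."""
    total = {BIG: n_big, LITTLE: n_little}
    free = dict(total)
    placement = {}
    # Heaviest threads first, so they get first pick of the big cores.
    for name, load in sorted(thread_loads.items(), key=lambda kv: kv[1], reverse=True):
        preferred = BIG if load > LOAD_THRESHOLD else LITTLE
        fallback = LITTLE if preferred == BIG else BIG
        core = preferred if free[preferred] > 0 else fallback
        if free[core] == 0:
            raise RuntimeError("more runnable threads than cores in this toy model")
        free[core] -= 1
        placement[name] = core
    # A cluster with no thread placed on it can be powered off entirely.
    idle_clusters = [kind for kind in (BIG, LITTLE) if total[kind] > 0 and free[kind] == total[kind]]
    return placement, idle_clusters


# Four big/LITTLE pairs, each seen by the OS as one logical processor.
migration = cpu_migration([0.9, 0.2, 0.1, 0.7])

# Three light threads on an asymmetric system: two big cores, four LITTLE cores.
placement, idle = global_task_scheduling(
    {"ui": 0.2, "sync": 0.1, "mail": 0.3}, n_big=2, n_little=4
)
print(migration)  # ['big', 'LITTLE', 'LITTLE', 'big']
print(placement)  # {'mail': 'LITTLE', 'ui': 'LITTLE', 'sync': 'LITTLE'}
print(idle)       # ['big']
```

In the migration sketch the decision is made per pair, which is why both clusters need the same number of cores. In the global task scheduling sketch the decision is made per thread, so the cluster sizes can differ and an unused cluster shows up as a candidate for power-off.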

A big.LITTLE system can have at most two clusters, with a maximum of four processors of the same type in each cluster. The architecture combines two types of processors that have different microarchitectures but the same instruction set architecture to create an energy-efficient compute solution.

Shared virtual memory

On traditional PCs, a discrete GPU was built separately from the main CPU and had its own memory. Later systems came with integrated GPUs built into the processor, with the ability to access main memory and reserve part of it for their own use. But whether the GPU was discrete or integrated, applications using it for graphics processing or more general compute purposes still had to move data back and forth between main CPU memory and GPU memory, incurring latency and power penalties.

With SoC-based architectures, systems can have a unified memory architecture for all built-in compute units, including the CPU, GPU, NPU, and others, because they are all co-located and can access main memory. However, although the SoC design makes a unified view of memory physically possible, multi-compute workloads can fully realize the performance benefits only with support from the hardware memory management units, the operating system software, and the application software APIs. The Apple M1 unified memory architecture is an example of shared virtual memory (SVM) architecture.

SVM allows different processors to see the same view of available memory. Within the virtual address space of a single application, compute units such as the CPU and GPU that use the same virtual address refer to the same physical memory location. With this architecture, modern complex workloads that require machine learning, image processing, graphics rendering, and more can seamlessly leverage the available heterogeneous compute resources by passing pointers to data between them rather than moving the data around.
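As a rough analogy only, the contrast between copying data and passing a pointer can be shown with Python's built-in `bytearray` and `memoryview`. Nothing here touches a real GPU: `gpu_double_in_place` is a hypothetical stand-in for a GPU kernel, and the buffer stands in for a region of physical memory.

```python
def gpu_double_in_place(view: memoryview) -> None:
    """Hypothetical 'GPU kernel': doubles every byte through the view it is given."""
    for i in range(len(view)):
        view[i] = (view[i] * 2) % 256


# Stand-in for a region of main memory owned by the CPU side of the application.
buffer = bytearray(b"\x01\x02\x03\x04")

# Copy model (separate GPU memory): stage a copy, compute on it, copy the result back.
staged = bytearray(buffer)                 # copy into "GPU memory"
gpu_double_in_place(memoryview(staged))
cpu_sees_before_copy_back = bytes(buffer)  # the CPU's data is still the old data
copied_back = bytes(staged)                # a second copy is needed to see the result

# Shared model (SVM-style): hand over a reference to the same memory, no copies.
shared = memoryview(buffer)
gpu_double_in_place(shared)
cpu_sees_shared = bytes(buffer)            # the result is visible immediately

print(cpu_sees_before_copy_back)  # b'\x01\x02\x03\x04'
print(copied_back)                # b'\x02\x04\x06\x08'
print(cpu_sees_shared)            # b'\x02\x04\x06\x08'
```

The copy model pays for two transfers per round trip, while the shared model only passes a reference. That is the saving in latency and power that SVM is after.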

Although different processors have the same view of memory, each processor has its own private cache, which poses the problem of maintaining cache coherency between processors. To address this, ARM introduced the AXI Coherency Extensions (ACE) to its Advanced Microcontroller Bus Architecture (AMBA) protocol, which allow hardware coherency between processor clusters. For example, in a system with two processor clusters, any shared access to memory can ‘snoop’ into the other cluster’s caches to see if the data is already on chip; if not, it is fetched from external memory. The AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system.

With hardware cache coherency extending into the system, the GPU can now read any shared data directly from the CPU caches, and writes to shared memory will automatically invalidate the corresponding lines in the CPU caches. Hardware coherency also reduces the cost of sharing data between CPU and GPU, and allows tighter coupling.
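The snoop-and-invalidate behavior described above can be illustrated with a toy Python model. This is a heavily simplified sketch, not the ACE protocol: the `CoherentSystem` class and its methods are hypothetical, each cache line is a single value, and writes go straight through to memory, whereas real hardware tracks per-line coherency states.

```python
class CoherentSystem:
    """Toy write-invalidate model: one private cache per agent plus main memory."""

    def __init__(self, agents):
        self.memory = {}                              # address -> value
        self.caches = {name: {} for name in agents}   # agent -> {address: value}
        self.memory_fetches = 0                       # reads that had to go off chip

    def read(self, agent, addr):
        cache = self.caches[agent]
        if addr in cache:
            return cache[addr]
        # Snoop the other agents' caches before going to external memory.
        for other, other_cache in self.caches.items():
            if other != agent and addr in other_cache:
                cache[addr] = other_cache[addr]
                return cache[addr]
        self.memory_fetches += 1
        cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, agent, addr, value):
        # Invalidate every other agent's copy of this line.
        for other, other_cache in self.caches.items():
            if other != agent:
                other_cache.pop(addr, None)
        self.caches[agent][addr] = value
        self.memory[addr] = value  # write-through, a simplification for this sketch


system = CoherentSystem(["cpu", "gpu"])
system.write("cpu", 0x1000, 42)
gpu_value = system.read("gpu", 0x1000)       # served from the CPU's cache via snoop
system.write("gpu", 0x1000, 7)               # invalidates the CPU's copy
cpu_line_invalidated = 0x1000 not in system.caches["cpu"]
cpu_value = system.read("cpu", 0x1000)       # CPU re-reads and gets the new value

print(gpu_value, cpu_line_invalidated, cpu_value, system.memory_fetches)  # 42 True 7 0
```

In this run neither read has to go to external memory, because the data is always found in the other agent's cache. That on-chip path is what makes sharing between CPU and GPU cheaper.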

Looking ahead

ARM has followed up its big.LITTLE architecture, which uses two homogeneous clusters of up to four cores each, with the newer DynamIQ architecture, which allows a single cluster of up to eight heterogeneous CPUs. The ability to build heterogeneous compute units onto a single SoC and let them share a single view of memory provides flexibility in designing solutions for the diverse, complex, multi-compute needs of modern consumer, mobile, wearable, automotive, and edge devices.

Looking to learn more about the technical innovations and best practices powering cloud backup and data management? Visit the Innovation Series section of Druva’s blog archive.