Tech/Engineering, Innovation Series

Exploring ARM and heterogeneous compute architecture

September 01, 2021 Srikiran Gottipati, Senior Technical Director - Engineering

In June 2020, Apple announced its plans to transition the Mac to what they called “a world-class custom silicon to deliver industry-leading performance and powerful new technologies.” Their custom silicon is based on ARM SoC (System-on-Chip) architecture, and is an evolution of the chips that have powered Apple’s iPhones and iPads for more than half a decade. The announcement also claimed that the new family of SoCs, custom built for the Mac, would deliver higher performance per watt and better-performing GPUs, and that access to a neural processing engine would make the Mac an amazing platform for developers to use machine learning.

In November 2020, Apple announced the M1 as the most powerful chip it had ever created, and the first chip designed specifically for the Mac. It further claimed that the M1 delivered the world’s fastest CPU core in low-power silicon, the best CPU performance per watt, the fastest integrated graphics in a personal computer, and breakthrough machine learning performance with the Apple Neural Engine. Power consumption and thermal output reports for the M1 Mac mini show a significant reduction on both fronts compared to its 2018 Intel-based counterpart.

In this article, we’ll look into two ARM hardware architectural features that have powered the advances of ARM-based mobile processors, including the Apple M1, on both overall performance and performance per watt. Here, performance per watt refers to the ratio of peak CPU performance to average power consumed.
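As a rough illustration of that definition, the ratio can be computed as in the sketch below. The function name and the benchmark scores and power figures are made up for this example and are not measurements of any real chip.

```python
def performance_per_watt(peak_performance: float, average_power_watts: float) -> float:
    """Ratio of peak CPU performance to average power consumed."""
    if average_power_watts <= 0:
        raise ValueError("average power must be positive")
    return peak_performance / average_power_watts


# Hypothetical benchmark scores and power draws, for illustration only.
chip_a = performance_per_watt(peak_performance=1200.0, average_power_watts=15.0)
chip_b = performance_per_watt(peak_performance=1500.0, average_power_watts=45.0)

# Chip B has the higher peak score, but chip A does more work per watt.
print(f"chip A: {chip_a:.1f} per watt, chip B: {chip_b:.1f} per watt")
```

The point of the metric is visible in the comparison: a chip with a lower peak score can still come out ahead once power draw is taken into account.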

System-on-Chip

A System-on-Chip (SoC) is an integrated circuit that combines many components of a computer on a single substrate. The primary advantage of SoC architecture over CPU-based PC architecture is its size. Along with microprocessors, SoCs can integrate one or more memory components, a graphics processing unit (GPU), digital signal processors (DSPs), neural processing units (NPUs), I/O controllers, custom application-specific integrated circuits (ASICs), and more. Integrating these components also means that they can be developed with a unified approach to performance and energy efficiency, and deliver more performance per watt than their PC equivalents. This energy efficiency, combined with the small form factor of SoC-based chips, makes them ideal for the mobile, consumer, wearable, and edge computing markets.

Devices of the future, like AR smart glasses, are getting lighter and smaller, but more demanding on performance as they run complex and advanced multi-compute workloads. So, future SoC-based designs will strive to achieve higher performance within even smaller form factors and power envelopes, with performance delivered per watt as the new paradigm.

ARM big.LITTLE

ARM big.LITTLE technology is a heterogeneous processing architecture that uses two different types of processors arranged as two clusters. Each cluster contains the same type of processor. The “LITTLE” processors are designed for maximum power efficiency, while the “big” processors are designed to provide maximum compute performance. Both types of processors are coherent and share the same instruction set architecture (ISA). The Apple M1 is an example of an SoC built along these lines: it has four “big” high-performance cores called “Firestorm” and four “LITTLE” energy-efficient cores called “Icestorm.” Each task can be dynamically allocated to a big or LITTLE core depending on the instantaneous performance requirement of that task. With this combination of processors, the system can deliver peak performance on demand with maximum energy efficiency while staying within its thermal bounds.

ARM big.LITTLE technology has been designed to address two main requirements:

At the high performance end — High compute capability within the system’s thermal bounds
At the low performance end — Very low power consumption

The big.LITTLE system has two major software execution models:

CPU migration: In this model, each big core is paired with a LITTLE core. Only one core in each pair is active at any one time, and the inactive core is powered down. The active core in the pair is chosen according to current load conditions. On a system with four big and four LITTLE cores, the same core counts as the Apple M1, the operating system sees four logical processors. Each logical processor can physically be a big or LITTLE processor, and this choice is driven by dynamic voltage and frequency scaling (DVFS). This model requires the same number of processors in both clusters.
Global task scheduling: In this model, the scheduler is aware of the differences in the performance and energy characteristics of the big and LITTLE cores. The scheduler tracks the performance requirement for each individual thread, and uses that information to decide which type of processor to use. Unused processors can be powered off, and if this includes all processors, the cluster itself can be powered off. This model can work on a big.LITTLE system with any number of processors in any cluster.
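
The two execution models can be contrasted with a toy simulation. This is a minimal sketch and not how a real kernel scheduler works: the function names, the `Thread` class, and the 0.6 load threshold are all assumptions invented for this example, and the threshold merely stands in for a DVFS policy.

```python
from dataclasses import dataclass

BIG, LITTLE = "big", "LITTLE"


def cpu_migration(pair_loads, up_threshold=0.6):
    """CPU migration: exactly one active core per big/LITTLE pair.

    pair_loads holds the current load (0.0 to 1.0) seen by each logical
    processor. The threshold is a hypothetical stand-in for a DVFS policy.
    """
    return [BIG if load > up_threshold else LITTLE for load in pair_loads]


@dataclass
class Thread:
    name: str
    demand: float  # tracked performance requirement, 0.0 to 1.0


def global_task_scheduling(threads, n_big, n_little, big_demand=0.6):
    """Global task scheduling: place each thread according to its own demand.

    Returns (placement, powered_off): placement maps thread name to a core
    label, and powered_off lists the cores that were left unused.
    """
    free = {
        BIG: [f"big{i}" for i in range(n_big)],
        LITTLE: [f"LITTLE{i}" for i in range(n_little)],
    }
    placement = {}
    # The most demanding threads choose first so they get the big cores.
    for t in sorted(threads, key=lambda t: t.demand, reverse=True):
        preferred = BIG if t.demand > big_demand else LITTLE
        fallback = LITTLE if preferred == BIG else BIG
        pool = free[preferred] or free[fallback]
        if not pool:
            raise RuntimeError("more runnable threads than cores in this toy model")
        placement[t.name] = pool.pop(0)
    powered_off = free[BIG] + free[LITTLE]
    return placement, powered_off


# Four logical processors, each backed by a big/LITTLE pair.
active = cpu_migration([0.9, 0.1, 0.7, 0.2])

# Cluster sizes do not have to match under global task scheduling.
threads = [Thread("render", 0.9), Thread("mail", 0.1), Thread("audio", 0.3)]
placement, powered_off = global_task_scheduling(threads, n_big=4, n_little=2)
print(active, placement, powered_off)
```

In the second call, the three unused big cores end up in `powered_off`, mirroring the idea that idle processors can be switched off individually.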

A big.LITTLE system can have at most two clusters, with a maximum of four processors of the same type in each cluster. The architecture combines two types of processors that have different microarchitectures but share the same instruction set architecture to create an energy-efficient compute solution.
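
Those limits are easy to express as a small validity check. The sketch below is only an illustration: `Cluster`, `validate_topology`, and `supports_cpu_migration` are hypothetical names, and the rules encoded are just the ones stated in this article.

```python
from dataclasses import dataclass

MAX_CLUSTERS = 2
MAX_CORES_PER_CLUSTER = 4


@dataclass(frozen=True)
class Cluster:
    core_type: str  # for example "big" or "LITTLE"
    cores: int


def validate_topology(clusters):
    """Raise ValueError if the layout exceeds the limits described in the text."""
    if len(clusters) > MAX_CLUSTERS:
        raise ValueError(f"at most {MAX_CLUSTERS} clusters, got {len(clusters)}")
    for c in clusters:
        if not 1 <= c.cores <= MAX_CORES_PER_CLUSTER:
            raise ValueError(f"{c.core_type} cluster has {c.cores} cores")


def supports_cpu_migration(clusters):
    """CPU migration pairs cores one to one, so both clusters must match in size."""
    return len(clusters) == 2 and clusters[0].cores == clusters[1].cores


four_by_four = [Cluster("big", 4), Cluster("LITTLE", 4)]
uneven = [Cluster("big", 2), Cluster("LITTLE", 4)]

for layout in (four_by_four, uneven):
    validate_topology(layout)
    print(layout, "CPU migration possible:", supports_cpu_migration(layout))
```

Both layouts are valid big.LITTLE topologies, but only the evenly sized one could use the CPU migration model; the uneven one would need global task scheduling.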

Shared virtual memory

On traditional PCs, a discrete GPU was built separate from the main CPU and had its own separate memory. Later systems came with integrated GPUs built into the processor, with the ability to access and reserve memory for their own use. But whether the GPU was discrete or integrated, applications using it for graphics processing or more general compute purposes still had to move data back and forth between the main CPU memory and GPU memory, incurring latency and power penalties.

With SoC-based architectures, systems can have a unified memory architecture for all built-in compute units, including the CPU, GPU, NPU, and others, as they are all collocated and can access the main memory. However, though the SoC design makes it physically possible to have a unified view of the memory, it would still require support from the hardware memory management units, the operating system software, and the application software APIs for multiple compute workloads to fully realize the performance benefits. The Apple M1 unified memory architecture is an example of shared virtual memory (SVM) architecture.

SVM allows different processors to see the same view of available memory. So, within the virtual address space of a single application, all compute units like CPU and GPU using the same virtual address actually refer to the same physical memory location. With this architecture, modern complex workloads that require machine learning, image processing, graphics rendering, and more, can seamlessly leverage the available heterogeneous compute resources by passing pointers to data between them rather than moving data around.
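
The difference between passing a pointer and moving data can be mimicked in plain Python. This is only an analogy that runs entirely on the CPU: the `bytearray` stands in for physical memory, the `memoryview` plays the role of a shared pointer, and `cpu_produce` and `gpu_double` are hypothetical stage names, not real GPU code.

```python
# One buffer stands in for physical memory visible to every compute unit.
shared = bytearray(8)


def cpu_produce(view):
    """'CPU' stage: fills the buffer in place."""
    for i in range(len(view)):
        view[i] = i


def gpu_double(view):
    """'GPU' stage: works on the same bytes through the same view."""
    for i in range(len(view)):
        view[i] = view[i] * 2


# Shared-memory style: both stages receive a reference to the same bytes.
view = memoryview(shared)
cpu_produce(view)
gpu_double(view)  # only the reference is handed over; nothing is copied

# Separate-memory style: the 'GPU' works on its own copy of the data.
staging = bytearray(8)
cpu_produce(memoryview(staging))
gpu_copy = bytes(staging)                    # explicit transfer to 'GPU memory'
gpu_result = bytes(b * 2 for b in gpu_copy)  # result lives in a second buffer

print(list(shared), list(gpu_result))
```

Both paths produce the same values, but the second needs an extra buffer and an extra transfer, which is the overhead a shared view of memory avoids.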

Though different processors have the same view of the memory, each processor has its own private cache, which poses the problem of maintaining cache coherency between the different processors. ARM addressed this by adding the AXI Coherency Extensions (ACE) to its Advanced Microcontroller Bus Architecture (AMBA) protocol, which allow for hardware coherence between processor clusters. For example, in a system with two processor clusters, any shared access to memory can “snoop” into the other cluster’s caches to see if the data is already on chip; if not, it is fetched from external memory. The AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system.

With hardware cache coherency extending into the system, the GPU can now read any shared data directly from the CPU caches, and writes to shared memory will automatically invalidate the corresponding lines in the CPU caches. Hardware coherency also reduces the cost of sharing data between CPU and GPU, and allows tighter coupling.
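
The snoop-on-read and invalidate-on-write behavior can be sketched with a toy model. This is a deliberately simplified, symmetric illustration and not the ACE protocol itself: `ToyCoherentSystem` and its methods are invented for this example, real hardware tracks per-line states rather than plain dictionaries, and write-through is assumed only to keep the sketch short.

```python
class ToyCoherentSystem:
    """Two private caches over one external memory, kept coherent in software."""

    def __init__(self, memory):
        self.memory = dict(memory)            # external memory: address -> value
        self.caches = {"cpu": {}, "gpu": {}}  # one private cache per agent
        self.log = []                         # (agent, address, event) tuples

    def read(self, agent, addr):
        cache = self.caches[agent]
        if addr in cache:
            self.log.append((agent, addr, "local hit"))
            return cache[addr]
        # Snoop the other caches before going off chip.
        for other, other_cache in self.caches.items():
            if other != agent and addr in other_cache:
                self.log.append((agent, addr, f"snoop hit in {other}"))
                cache[addr] = other_cache[addr]
                return cache[addr]
        self.log.append((agent, addr, "fetched from external memory"))
        cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, agent, addr, value):
        # Invalidate stale copies of the line held by every other agent.
        for other, other_cache in self.caches.items():
            if other != agent and addr in other_cache:
                del other_cache[addr]
                self.log.append((agent, addr, f"invalidated line in {other}"))
        self.caches[agent][addr] = value
        self.memory[addr] = value  # write-through, an assumption of this sketch


system = ToyCoherentSystem({0x10: 7})
first = system.read("cpu", 0x10)   # miss everywhere, fetched from external memory
second = system.read("gpu", 0x10)  # found by snooping the CPU cache
system.write("gpu", 0x10, 9)       # drops the CPU's now stale copy
third = system.read("cpu", 0x10)   # CPU re-reads and sees the new value

for entry in system.log:
    print(entry)
```

The log shows the sequence described in the text: one off-chip fetch, a snoop hit, an invalidation triggered by the write, and a final read that observes the updated value without any explicit copy by the application.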

Looking ahead

ARM has followed up its big.LITTLE architecture, with its two homogeneous clusters of up to four cores each, with DynamIQ, a new single-cluster architecture that supports up to eight heterogeneous CPUs. The ability to build heterogeneous compute units into a single SoC and let them share a single view of memory provides flexibility in designing solutions for the diverse, complex, multi-compute needs of modern consumer, mobile, wearable, automotive, and edge devices.

Looking to learn more about the technical innovations and best practices powering cloud backup and data management? Visit the Innovation Series section of Druva’s blog archive.
