
The Rise of the Cloud OS

Online businesses must anticipate and manage hard-to-predict spikes in user activity. If a business fails to meet peak demand, it risks a service crash, which could result in downtime and hurt the customer experience. To avoid nasty surprises, companies set stringent service-level agreements (SLAs) for their IT systems.

Traditional IT systems struggle with the combination of bursty loads and strict SLAs because they have limited “elasticity.” Today’s system architects must consider the effect of fluctuating loads on both processing power and storage capacity. They must design IT systems that can handle storage at scale (e.g., big data), computing power at scale (e.g., retail or gaming), or both at once (e.g., cloud information management). To support bursty loads, IT systems are undergoing fundamental changes: they are now built on top of a cloud platform, whether public or private. The change is not limited to hardware architecture. It deeply affects the way a software engineer develops an elastic system.

At Druva, we experienced this shift as we built a service that scales to millions of users and petabytes of data. We realized that developing a cloud system is fundamentally different from developing an on-premises system, and that several aspects of a cloud system require a completely new way of thinking. In this article, I’d like to focus on the differences between OS interfaces for on-premises and cloud systems.

The Operating System is an Abstraction
Typically, a software developer doesn’t program the underlying hardware resources directly in machine language. Instead, multiple levels of abstraction, or virtualization, sit between the hardware and the application programming interface (API) used by the software developer, and they dramatically simplify development.

The first virtualization layer above the hardware is the operating system (OS). Software does not directly access a hard drive. A file system (FAT, NTFS, EXT3, etc.) virtualizes hard drives and presents the storage as files and folders. A database adds a further level of virtualization on top of these files to present the same storage as relational tables — MySQL, Oracle, and SQL Server are good examples. Each subsequent level of virtualization provides an API that can be called to access the relevant function. For instance, a file system provides APIs for creating, deleting, writing, and reading files.
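
As a minimal sketch of the file-system API described above, the following Python snippet creates, writes, reads, and deletes a file through the OS. The function name and the file name are made up for illustration.

```python
import os
import tempfile


def file_roundtrip(payload: bytes) -> bytes:
    """Create, write, read, and delete a file via the OS file-system API."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "example.bin")  # hypothetical file name

    # Create and write: the file system maps these bytes onto the drive.
    with open(path, "wb") as f:
        f.write(payload)

    # Read: the data is a stream of bytes, so we can seek and read a slice.
    with open(path, "rb") as f:
        f.seek(2)
        tail = f.read()

    # Delete the file, then the now-empty directory.
    os.remove(path)
    os.rmdir(directory)
    return tail
```

Note that the application never names a disk, a sector, or a block. It only sees files and byte offsets within them.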

As mentioned, the first level of virtualization on top of the hardware is the OS. The APIs used to access the OS today are mostly based on the Unix APIs defined in the 1970s. It’s incredible to think how much hardware has changed and improved over the last 40 years: just compare the IBM systems of the 1980s with today’s Windows machines. A single Apple iPhone 5 has 2.7 times more processing power than the 1985 Cray-2 supercomputer. And remember floppy disks? Yet the OS APIs have stayed more or less the same over the years.

There are four aspects of the hardware which are virtualized by an OS: CPU, memory, storage, and network. CPU and memory go together and can be referred to as “compute.” The Unix-style virtualized view of computing is processes and threads. Of course, you can use the Unix-style APIs to run your program on a cloud server. However, if you want to take full advantage of the scalability and availability that the cloud provides, you need cloud-native APIs.

How is a Cloud OS Different?
Let’s explore the cloud OS equivalents for processes, threads, TCP sockets, and file systems. The cloud OS replaces a file system with object storage, which offers practically unlimited storage capacity and I/O throughput. Object storage was the first paradigm shift brought in by the cloud OS and has since become the norm for storage virtualization. The object storage API is fundamentally different from a Unix file-system API. The Unix file-system API exposes data as a stream of bytes, whereas the object storage API exposes data as an object that is fetched in a single API call. The performance characteristics of object storage also differ from those of a Unix file system. Typically, object stores have higher latency and higher scalability. That is, one API call can take longer, but many API calls can be issued in parallel. Because of this change in the storage API, a software developer must think differently when building a cloud system.
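
To make the contrast concrete, here is a rough sketch of the object-storage access pattern. The InMemoryObjectStore class and its put_object/get_object methods are hypothetical stand-ins and not any provider's actual API. A real object store would sit behind a network call with noticeable latency, which is why the reads are issued in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store: whole objects, keyed by name."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put_object(self, key: str, data: bytes) -> None:
        # The whole object is written in one call; there is no seek or append.
        self._objects[key] = bytes(data)

    def get_object(self, key: str) -> bytes:
        # The whole object is returned in one call.
        return self._objects[key]


def fetch_all(store: InMemoryObjectStore, keys: List[str]) -> List[bytes]:
    """Issue the GET calls in parallel to hide per-call latency."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map() returns results in the same order as `keys`.
        return list(pool.map(store.get_object, keys))


store = InMemoryObjectStore()
for i in range(4):
    store.put_object(f"chunk-{i}", f"data-{i}".encode())

chunks = fetch_all(store, [f"chunk-{i}" for i in range(4)])
```

The design consequence is that large data sets tend to be split into many independent objects, so that throughput comes from parallelism and not from fast sequential access to one file.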

On the database side, the relational database does not scale well enough to leverage the full potential of the cloud. Cloud applications are instead built around distributed NoSQL databases that scale up or down with the load and offer better fault tolerance than traditional systems. As the name suggests, distributed NoSQL databases use distributed computing and storage resources. Hence, these databases offer higher scalability in terms of both storage capacity and transaction rate. For a software developer, the scalability offered by distributed databases comes at the cost of a change in the API. Distributed databases do not typically support SQL, the de facto relational database API. Instead, they expose APIs to store and retrieve key-value pairs. This change in API significantly influences the architecture of a cloud system.
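
The difference in API style can be sketched as follows. The relational half uses Python's built-in sqlite3 module purely as a convenient SQL engine. The KeyValueStore class is a hypothetical in-memory stand-in for a distributed key-value database, and the table, keys, and field names are invented for the example.

```python
import sqlite3
from typing import Dict, Optional

# Relational style: a declarative SQL query over a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES ('u1', 'Ada', 'pro')")
row = conn.execute("SELECT name FROM users WHERE id = ?", ("u1",)).fetchone()
sql_name = row[0]
conn.close()


class KeyValueStore:
    """Hypothetical key-value store; a real one would shard keys across nodes."""

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, str]] = {}

    def put(self, key: str, value: Dict[str, str]) -> None:
        self._items[key] = dict(value)

    def get(self, key: str) -> Optional[Dict[str, str]]:
        return self._items.get(key)


# Key-value style: the application picks the key and fetches the whole value.
kv = KeyValueStore()
kv.put("user:u1", {"name": "Ada", "plan": "pro"})
record = kv.get("user:u1")
kv_name = record["name"] if record is not None else None
```

With only put and get available, work that SQL would express as a join or an ad hoc filter usually has to be designed into the key layout or done in application code.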


The first wave of cloud computing did not affect the process- and thread-related APIs, although cloud systems did tend toward a service-oriented architecture (SOA) rather than the layered architecture found in on-premises systems. SOA is better suited to cloud systems because individual services can be deployed, scaled, and upgraded independently. As cloud systems evolve, SOA is giving way to event-driven, serverless programming. Serverless programming allows you to define the function to be called for a certain event, and that function can itself generate more events that, in turn, are processed by other functions. With this model, you don’t need to keep a server running all the time, and you don’t have to worry about running additional servers when the load increases. As more events arrive, the cloud OS simply spawns more instances of the event-handling function. Serverless programming is also a major shift from how a software programmer develops an on-premises system.
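
The event-driven model can be illustrated with a toy, single-process dispatcher. This is only a sketch of the programming model: the event names, the on decorator, and the run loop are all invented here, and in a real serverless platform the cloud provider plays the role of the loop and runs handlers on demand across many machines.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Event = Tuple[str, dict]                 # (event type, payload)
Handler = Callable[[dict], List[Event]]  # a handler may emit further events

handlers: Dict[str, Handler] = {}
log: List[str] = []


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a function as the handler for one (hypothetical) event type."""
    def register(fn: Handler) -> Handler:
        handlers[event_type] = fn
        return fn
    return register


@on("file.uploaded")
def make_thumbnail(payload: dict) -> List[Event]:
    # Handling one event generates another event.
    return [("thumbnail.created", {"name": payload["name"] + ".thumb"})]


@on("thumbnail.created")
def record_thumbnail(payload: dict) -> List[Event]:
    log.append(payload["name"])
    return []


def run(initial: Event) -> None:
    """Stand-in for the platform: deliver events until none are left."""
    queue: Deque[Event] = deque([initial])
    while queue:
        event_type, payload = queue.popleft()
        handler = handlers.get(event_type)
        if handler is not None:
            queue.extend(handler(payload))


run(("file.uploaded", {"name": "photo.jpg"}))
```

Nothing in the handlers refers to a server, a process, or a thread. The developer writes only the functions and declares which events trigger them.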

Network virtualization changes along with compute when it comes to server programming. Instead of a process listening on a TCP socket, you write a serverless function that is triggered by an event generated by a RESTful API call. As a result, you don’t need to keep a server running all the time, and you pay only for the time the event-handler function executes. The function scales out as the load (the number of RESTful API calls) increases.
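
A handler for such an API call might look roughly like the sketch below. The shape of the event dictionary (method, path, query) and of the response (status, body) is an assumption made for this example and does not match any particular provider's format. The point is that there is no socket, no accept loop, and no long-running process in the code.

```python
import json
from typing import Any, Dict


def handle_request(event: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical handler invoked once per RESTful API call."""
    if event.get("method") != "GET" or event.get("path") != "/hello":
        return {"status": 404, "body": json.dumps({"error": "not found"})}

    # Read an optional query parameter, with a default value.
    name = event.get("query", {}).get("name", "world")
    return {"status": 200, "body": json.dumps({"message": f"Hello, {name}!"})}


# The platform would build the event from the HTTP request; here we fake one.
response = handle_request(
    {"method": "GET", "path": "/hello", "query": {"name": "cloud"}}
)
```

Because the function keeps no state between calls, the platform is free to run as many copies of it in parallel as the incoming load requires.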

Some of these cloud OS APIs are quickly becoming de facto standards. For example, object storage APIs are common across all major public cloud providers. Other cloud OS APIs still vary from one public cloud provider to another.

In the harsh realities of the business world, a cloud OS offers the compute and storage capacity needed to handle massive data sets with ease. Cloud architecture is forming the foundation of a cutting-edge OS that will power new apps and services. The world is witnessing the dawn of consumer and enterprise services that will take full advantage of this storage and compute power of the cloud OS, and Druva is excited to be leading the way.
