When Multi-threading Hits a Scalability Dead End

When Multi-threading Hits a Scalability Dead End

Multithreading solves a multitude of software development problems, especially for network-centric applications that must respond to user need with near-instant performance. Unfortunately, multithreading is not enough to solve problems related to concurrency at large scale.

Addressing such problems requires changing the programming model to use asynchronous events and callback-based mechanisms. So, at Druva, we created a Python-based library called Dhaga to address large scale concurrency without requiring a significant change in the programming model.

Software developers live in a concurrent world. Threads are first-class citizens in the development process today, especially when your application performs intensive network operations, as does Druva’s own inSync. Multithreading helps the programming code flow for network operations become simple and sequential. And when our application needs enhanced performance or to improve its scalability, we increase the number of threads.

But to reach a scale that can support thousands of concurrent requests, threads aren’t enough.

We discovered several limitations in our use of multithreading:

  • The inSync client needs to back up a large number of files to the server by using network RPC calls. Developers’ typical approach to speed things up is to use threads. But multithreading brings with it the performance cost of increased memory and CPU usage; developers have to keep a balance between speed and threads.
  • Our server needs to handle concurrent connections and notifications to-and-from thousands of inSync clients. To process the connections efficiently, we use threads for serving the requests. But the ever-increasing number of inSync clients means we had to keep increasing the number of threads, which in turn consumed significant server memory and CPU.
  • Our Web server needs to handle thousands of parallel HTTP requests. Most of the work is in receiving and sending data from network sockets and passing it on to the inSync back end. That causes most of the threads to wait on network operations. As established by the C10K problem, spawning a thread for each request doesn’t scale when there are thousands of simultaneous requests to the Web server.

Asynchronous Frameworks’ Limitations

Many asynchronous frameworks, including TwistedTornado, and asyncore, help developers move away from the popular approach of using threads. These frameworks rely on non-blocking sockets and callback mechanisms. If we had used these frameworks as-is, a major part of our Druva code would have to be restructured wherever network operations were performed. That wasn’t something we wanted to do. Restructuring code would have added to the development and test cycle, thereby preventing us from meeting the scale requirements in time. Given that multiple parts of the product needed to scale, we would have had to restructure each of them — thus doubling or tripling the effort.

To avoid changing so much code, we had to move away from using existing frameworks directly. Fortunately, we found a few useful tools, and developed one of our own.

Because we wanted to control code execution during network I/O, we needed a way for micro-threads to be spawned inside a thread. We found greenlets. A “greenlet” is a more primitive notion of a micro-thread with no implicit scheduling – co-routines, in other words. Greenlets are useful when you want to control exactly when your code runs. You can build custom scheduled micro-threads because you can control when a greenlet yields control. This was perfect for us because it gave us full control over the scheduling of our code.

Tornado is a simple, non-blocking Web server framework written in Python, designed to handle thousands of asynchronous requests. We use its core components, IOLoop and IOStream. IOLoop is an I/O event loop for non-blocking sockets; it uses epoll (on Linux) or queue (on BSD and Mac OS X) if they are available, or else select() (on Windows). IOStream provides convenient wrappers for non-blocking sockets such as read and write. We hand over all the socket operations to tornado and use callbacks for triggering code on operation completion.

That was a good start, but we needed more. If we had directly used the above modules in our code, a significant amount of our RPC code would have had to change to schedule RPCs via greenlets, ensure greenlets don’t block (if a greenlet blocks, it blocks the entire thread and all its greenlets), and handle callbacks from the tornado module. Similar intrusive changes would have been needed in all other areas of inSync where we needed to address large scale concurrency.

We needed an abstraction that would manage and schedule greenlets without making them block outside inSync and tie in the scheduling to the asynchronous callbacks generated by tornado. This abstraction could then be used wherever there was a need to scale beyond threads.

Here is a sample synchronous function:

    http_client = HTTPClient()
    response = http_client.fetch(url)
    return response.body

And here is the same function rewritten to be asynchronous with an AsyncHTTPClient:

defasynchronous_fetch(url, callback):
    http_client = AsyncHTTPClient()
    http_client.fetch(url, callback=handle_response, args=(callback, ))

defhandle_response(response, callback):

If we could build an abstraction such that the callbacks are hidden within the abstraction so that application code could remain closer to the synchronous code, we concluded, it would minimize code changes. Without sharing any of our intellectual property, it would look something like the following:

def asynchronous_fetch(url):
    http_client = DhagaHTTPClient()
    response = http_client.fetch(url) #Causes suspension of current greenlet
    return response.body

Here, DhagaHTTPClient() is derived out of the regular HTTPClient and Dhaga so that the http request is fired through a greenlet. Within the Dhaga code, the greenlet issues an asynchronous RPC and suspends itself, yielding control to other Dhagas. When the http response is received, the suspended greenlet is woken up to return the response to the code above. The details of asynchronous RPC, events, and callbacks are abstracted within the Dhaga module, thus keeping the application code flow as close to synchronous flow as possible.

The code snippet above aptly demonstrates the simplicity achieved using Dhaga abstraction while still giving us better scale and performance (as exemplified by the performance and scale improvement numbers posted at the bottom). Though it’s not possible for us to open-source the code for Dhaga, we believe that the approach of building such an abstraction and scaling different parts of an application without significant code rewrite is useful to others.

Introducing Dhaga

Dhaga (originally from the Hindi language, meaning thread) is our abstraction for a lightweight thread of execution. The Dhaga class is derived from greenlet, which implements stack switching to execute multiple code flows in one operating system thread. An OS-level thread executes multiple dhagas with cooperative scheduling. Whenever a dhaga is about to wait (mostly for a RPC call to return), it yields control to the parent (that is, the execution context of the OS level thread that created it). The parent then schedules another dhaga that is ready to run. The RPC call in question is handed over to the tornado IOLoop to write to the socket asynchronously and to register a callback when it returns. When it does return, the waiting dhaga is added to the runnable queue, which is later picked up by the parent thread.

We could use Dhaga instead of threading when dealing with high-latency operations and when the number of threads required to increase the throughput is beyond reasonable limits &emdash; for example, we are using 512 dhagas in a single thread in our code.

Dhaga Scheduling

To perform cooperative scheduling between dhagas, we wrote a scheduler that works similar to an OS scheduler. A dhaga can ask the scheduler to suspend itself; any dhaga (or thread) can schedule another dhaga for execution. The Dhaga class passes these requests to the scheduler.

A scheduler runs a thread that switches between the execution flows of a set of dhagas, and  maintains a queue of runnable dhagas. Whenever the scheduler is executed, it switches to the first dhaga in the runnable queue. When the running dhaga is about to wait for a certain condition, it switches to the scheduler. The scheduler then picks the next runnable dhaga. When the condition that the first dhaga is waiting for is triggered, the dhaga is put on the runnable queue. In effect, each dhaga could be in any of the three states: running, runnable, or waiting on a condition.

To be developer friendly, we implemented Dhaga-safe objects similar to thread-safe objects for concurrency handling, such as lock, queue, and condition. These allow developers to use the same synchronization concepts for dhagas as they do for threads.

Integration with code

Since we intercepted at the lowest layer of our code (i.e sockets), we needed a few small changes in the InSync code to use non-blocking sockets. This let us move away from threading as the concurrent unit of execution. Dhagas have a similar interface to threads; they also get the target function from which to start execution. Most important: The code stayed simple and sequential.

Using Dhaga, we could access and backup many more files in parallel without creating as many operating system threads. When one dhaga fires a network RPC call to sync a file, a dhaga switch occurs; that leads to other dhagas syncing another file and similarly for other files too. We stretch the limits of number of parallel files synced on the basis of administrator-specified bandwidth setting. The increase in the number of parallel files being accessed gives us better network utilization &emdash; resulting in better overall throughput and increased backup speed. Which, of course, is a primary goal for what InSync does!

We also use dhaga in our notification server. Previously. each thread used to handle one client; now each thread can spawn thousands of dhaga, which corresponds to thousand of clients. Since that reduced the number of threads, we are seeing lower CPU usage on our servers.

Recently we added dhaga to our Web server. Now each of the HTTP requests is handled by dhaga instead of threads, so we can scale our Web server from processing a few hundred requests to thousands of concurrent requests. This leads to long polling on requests as there is not much side-effect of pending requests.

Micro-benchmarking inSync performance using Dhaga

Data Size

Time with threads (32 threads)

Time with Dhaga (4 threads * 32 dhagas)

1GB (32K files)

44 minutes

14 minutes

2.2GB (Windows Program Files)

37 minutes

15 minutes

 After changing from a threads model to dhaga model, we are seeing really positive results. We can increase the concurrency of our application much higher than what we’ve achieved  with threads (and even that was pretty impressive). We’ll continue to increase our usage of dhaga for increased scalability and throughput of other components of our product.

Interested in learning more about Druva? Get a free trial of Druva’s single dashboard for backup, availability, and governance, or find out more information by checking out these useful resources:

Try Druva


Prahlad Nishal

Prahlad is a senior software engineer on Druva's cloud team. His primary area of focus is scalability and the performance of the company's cloud architecture.


Leave a reply

Your email address will not be published. Required fields are marked *