@richardmcdougll

Towards an Elastic Elephant: Enabling Hadoop for the Cloud

October 22, 2012

by guest blogger, Tariq Magdon-Ismail (@tariqmi)

In his joint presentation at Hadoop Summit 2012 titled “Hadoop in Virtual Machines”, Richard McDougall talked about  the benefits and challenges of virtualizing Hadoop. In particular, he introduced the idea of separating Hadoop’s compute runtime from data storage  on virtual infrastructure and touched on why this architecture is both desirable and feasible in a cloud environment. In this blog post I hope to examine this topic a bit further and present some initial performance results.

 

 

The Evolution of Virtual Hadoop

The common approach to virtualizing many applications is to perform a P2V (physical-to-virtual) migration where the physical deployment is directly cloned into virtual machines. Hadoop is no different. While this is a reasonable first step, there are a few drawbacks that make this model less than ideal in a cloud environment. The first has to do with elasticity. Adding/removing Hadoop nodes to increase/reduce compute resources available to jobs is a cumbersome, coarse grained, activity due to the tight coupling between the compute runtime and data storage layers (see below.) This makes it difficult to dynamically scale the virtual cluster to either make use of spare physical capacity or relinquish it to another workload.  In other words, the virtual Hadoop cluster is not elastic.

In order to gain elasticity, we need to be able to separate the compute layer from the storage layer so that each can be independently provisioned and scaled (the details of which are the subject of the following sections.) This is the next leg in the journey to virtualize Hadoop. This level of flexibility enables Hadoop to more efficiently share resources with other workloads and consequently raise the utilization of the underlying physical infrastructure. Moreover, this split architecture allows for more efficient hosting of multiple tenants, with their own private virtual clusters. It goes beyond the level of multi-tenancy offered by Hadoop’s built-in scheduler and security controls by relying on the hypervisor for much stronger VM level security and resource isolation guarantees. Further, since each compute cluster is independent, each tenant could have their own version of the Hadoop runtime. All these characteristics combine to form a very flexible, elastic and secure service that is the end goal to providing Hadoop-as-a-service.

Data Friction

Why is quickly scaling a Hadoop cluster difficult? In typical Hadoop deployments today, the compute and storage engines (known as the DataNode and TaskTracker) run inside each node, so the lifecycle of a node is tightly coupled to its data. Powering it off means that we lose the DataNode having to replicate the data blocks it managed.

Similarly, adding a node would mean we need to rebalance the distribution of data across the cluster for the optimum use of storage.

 

This decommissioning/provisioning and the large volume of data transfers involved makes responding to changing resource pressures slow and cumbersome. A further complication is that compute and storage capacity requirements change at very different velocities and we cannot respond to these different needs independently or efficiently. A potential solution is to scale storage by adding VMs (to add DataNodes) but scale compute by growing/shrinking a VM by adding/removing virtual CPU and memory resources. This would require close coordination between the hypervisor, guest OS and Hadoop and introduces some complexities that are beyond the scope of this blog. Another potential solution is to separate Hadoop’s compute layer from its data layer. This is a more natural solution, since Hadoop inherently maintains this separation via independent DataNode and TaskTracker JVM processes loosely coupled over TCP/IP.

 

 

Separating Compute from Data

The primary reason for having co-located DataNodes and TaskTrackers, is to maintain compute-data locality. This is  important in physical clusters because of network bandwidth limitations and the impact fetching large amounts of data from a remote node would have on job performance. But, virtual networking between VMs in a host is free of physical link limitations and throughput purely a function of CPU and memory speed. This coupled with the work we are doing on virtual topology awareness for Hadoop, so that nodes are aware of physical vs virtual host locality, opens up the prospect of being able to split compute and data into separate virtual nodes in our quest for the elastic elephant.


An added benefit with this approach is that the operational model is firmly established. It is a widely accepted technique in distributed application architectures in the cloud to add/remove VM instances as a way of scaling the workload. But a key question remains: what about job performance?

Performance

In the co-located setup data packets transfer between the DataNode and a Task over a loopback interface within the Guest OS. But when you split the JVMs into two separate virtual machines packets that are sent by Hadoop pass through many more layers before they get to their destination. Lets look at the typical path taken by a message that is transmitted from VM. The message is first processed by TCP/IP in the guest operating system. After the required headers have been added to the packet, it is sent to the device driver in the virtual machine. Once the packet has been received by the virtual device, it is passed on to the virtual I/O stack in ESX for additional processing, where the destination of the packet is also determined. If the packet is destined for a VM on the same host, it is forwarded over vSwitch to that virtual machine and not sent on the wire (yellow path in the figure above.) All other packets are sent to the physical NIC’s device driver in ESX to be sent out on the physical network (red path above.) Unlike VM-to-physical machine communication, VM-to-VM throughput is not limited by the speed of network cards since packets travel through memory. It solely depends on the CPU architecture and clock speeds. For example, with two Intel Xeon @ 2.80GHz processors running ESX 4.1, Linux VM- to-VM throughput approaches 27Gbps, which is nearly three times the rate supported by the 10Gbps network cards available on the market today.

Given the impressive raw virtual network bandwidth we can drive within a host today, we’ve started to gather some data on what impact it would have to Hadoop running in a split configuration. Following is the result of a 4 node TeraSort run where each physical node is an 8 core, 96GB machine with 16 disks. The HDFS replication factor was set to 2. Note also that these experiments were conducted with Hadoop 0.23 where the NodeManager replaced the TaskTracker from older versions.

 

In terms of both elapsed time and CPU cycles/byte used the split configuration is within 10% of the combined case. The primary reason we are focused on cycles/byte here instead of just CPU utilization is that we want to quantify the tax imposed by the split architecture on data movement.  In addition, CPU cycles afford us a way to normalize out CPU frequency and core count variations across different machines and tests, so that results are directly comparable. Note, even though the graphs above suggest that the split case is faster than combined, this difference is within experimental noise and is not statistically significant.

Apart from the TeraSort experiments above, we also ran TestDFSIO on a single node to understand the performance of raw I/O.

As with TeraSort, bandwidth in the split case is within 10% of combined, but cycles/byte is much higher. As an optimization we’ve done some early prototype work using VMCI (instead of TCP/IP) to communicate between Tasks and the local DataNode. With VMCI we were able to reduce the cycles/byte needed for the TestDFSIO read test.

Just the Beginning…

While this is still early days, preliminary data suggests that a split Hadoop model over virtual networking is quite viable. It adds the necessary flexibility to Hadoop with a tolerable impact on performance, and enables the cloud delivery of Hadoop. But, we still have more work to do to understand and tame this elastic beast. Watch this space!

 

 

@richardmcdougll

Richard McDougall

vSphere Storage, Big Data

Richard McDougall is the CTO for Storage and Availability at VMware. He is responsible for the technical strategy for core vSphere storage and application storage services, including Big Data, Hadoop ... More

Leave a Reply