
A Preview of Distributed Storage

September 7, 2012

VMware shared a technology preview of Distributed Storage at VMworld 2012 as part of our broader storage strategy. Steve Herrod described this technology as “virtual SAN” in his Day 1 keynote at VMworld.

Distributed Storage (DS), currently under development by VMware engineers, is a distributed layer of software running natively as part of the ESX hypervisor. It aggregates the hosts’ local storage devices (SSD and HDD) and makes them appear as a single pool of storage shared across all hosts. In other words, we are doing with local storage what we have done in the past with CPU and memory – virtualize the physical resources of ESX hosts and turn them into pools that can be carved up and assigned to VMs and applications according to their QoS requirements.

The result is a converged platform, where compute and storage resources can grow in tandem according to evolving workload needs. Managing physical storage is extremely simple – clicking a radio button enables DS. DS reports on the storage utilization trends in the cluster (space and throughput). When necessary the administrator may add disks and/or hosts to the cluster. That’s it!

Beyond that, the administrator can focus entirely on virtual machines and virtual disks. No need to deal with esoteric RAID options, cache configurations, LUN management, zoning, masking and such. DS is fully compatible with the new generation of Policy-Based Storage Management to be introduced in vSphere. The administrator specifies the required policies for their VM (including availability, reliability, performance reservations and limits, to mention a few) and DS provisions, monitors and reports on policy compliance during the lifecycle of the VM. If some components (hosts, disks, network) fail, or the workload simply changes so that some policy is violated, DS takes automatic remediation action in the background – it reconfigures the data of the affected VMs and optimizes the utilization of resources across the cluster. And it does all that while minimizing the impact on regular workload execution.
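To make this policy model concrete, here is a purely illustrative sketch of what a per-VM policy and a background compliance check might look like. The names `StoragePolicy`, `DiskState` and `check_compliance` are invented for illustration and are not part of any vSphere API.

```python
# Hypothetical sketch of policy-driven compliance checking (not the actual
# vSphere Policy-Based Storage Management API).
from dataclasses import dataclass

@dataclass
class StoragePolicy:
    replicas: int          # availability: number of mirror copies required
    iops_reservation: int  # performance floor, in IOPS
    iops_limit: int        # performance cap, in IOPS

@dataclass
class DiskState:
    live_replicas: int     # replicas currently healthy
    measured_iops: int     # observed throughput

def check_compliance(policy: StoragePolicy, state: DiskState) -> list[str]:
    """Return a list of policy violations; an empty list means compliant."""
    violations = []
    if state.live_replicas < policy.replicas:
        violations.append("availability: missing replicas")
    if state.measured_iops > policy.iops_limit:
        violations.append("performance: exceeding IOPS limit")
    return violations

policy = StoragePolicy(replicas=2, iops_reservation=500, iops_limit=2000)
# One replica lost (e.g., a host failed) -> a violation that would trigger
# background remediation:
print(check_compliance(policy, DiskState(live_replicas=1, measured_iops=800)))
# ['availability: missing replicas']
```

In this model, remediation is simply whatever action clears the violation list – recreating a replica elsewhere, or throttling/reshaping the object.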

Motivation and technology trends

For years now, data has been growing at an exponential rate and this trend is expected to continue for a while. Industry analysts expect data to grow 9x between 2010 and 2015. This growth is driven by a combination of traditional and new-generation applications such as social media and big-data analytics. Consider the challenges and cost of storing the hundreds of terabytes of your enterprise social portal data on a traditional disk array. The sheer volume and processing requirements of such applications pose enormous challenges to existing data management paradigms.

At the same time, even modest server platforms pack impressive computational power. By 2013, servers with 4 sockets and 32 logical CPUs per socket (16 cores x 2 threads), for a total of 128 logical CPUs, will be commonplace. HDD capacities are also growing rapidly and are expected to reach up to 60TB by 2016 (ref. Seagate’s announcement in March 2012). Such disks offer capacity virtually for free (at just a few cents per GB), but their throughput, and especially their I/O operations per second (IOPS), have not grown as fast. Thankfully, flash-based local storage is becoming a viable tier between CPU/memory and HDDs from a price-performance standpoint. At a cost of approximately 1 cent per IOPS, flash devices offer plenty of cheap throughput – at least two orders of magnitude cheaper per IOPS than conventional disks.
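The figures above can be sanity-checked with a little arithmetic. The drive prices below are assumed 2012-era ballpark numbers, chosen only to match the per-IOPS costs quoted in the text; they are not vendor quotes.

```python
# Back-of-the-envelope check of the figures quoted above.
sockets, cores, threads = 4, 16, 2
logical_cpus = sockets * cores * threads
print(logical_cpus)  # 128

# Assumed ballpark 2012 prices (illustrative, not vendor data):
hdd_iops, hdd_price = 150, 300.0       # e.g., a 2TB SATA HDD
ssd_iops, ssd_price = 20_000, 200.0    # e.g., a 200GB SATA SSD

print(round(hdd_price / hdd_iops, 2))  # 2.0  -> ~$2 per HDD IOPS
print(round(ssd_price / ssd_iops, 2))  # 0.01 -> ~$0.01 per SSD IOPS, ~200x cheaper
```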

Thus the obvious question is: why can’t we use the cheap capacity of local HDDs and the cheap IOPS of SSDs (or PCIe flash devices), together with a small tax on the large CPU resources of modern servers, to build a new generation of distributed storage platforms? With systems such as Google’s GFS, VMware’s VMFS and DRS, and Facebook’s vast analytics farms, engineers have learned how to build large, reliable distributed systems. The challenge here is management simplicity: an IT professional should not need a PhD in complexity theory to manage a storage cluster. Indeed, the desire we see for a new take on storage solutions is driven not only by the need for low storage CAPEX but also by the need for lower operational costs. Virtualization is the catalyst for making such solutions accessible and manageable for a wide range of applications, both old and new.

Distributed Storage differentiation

VMware is not the only company to recognize these technology trends and how they can be turned into a radically new model of delivering storage in the datacenter. A plethora of companies, and a few shipping products, introduce some form of distributed, software-based storage platform as an alternative to traditional SANs. Many of these technologies are specific to virtualization and often involve a converged compute and storage architecture. A main theme of many of these solutions is simplicity of storage management, a big pain point in virtualized environments.

What is unique about VMware’s Distributed Storage technology?

  • Natively implemented as part of the ESX hypervisor, for improved resource efficiency and low latency.
  • Scales to the size of a vSphere cluster (32 nodes as of vSphere 5.1) and can manage the data and I/O workloads of thousands of VMs.
  • Intuitively integrated with vSphere management concepts and UI, by making it a property of a vSphere Cluster.
  • Natively integrated with ESX and Cluster resource management (DRS) for holistic CPU, memory, storage and network control.
  • The first storage platform to fully support a VM-oriented storage management approach using VMware’s new Policy-Based Storage Management stack.
  • Builds on VMware’s proven record of developing enterprise-class, scalable distributed software including our flagship VMFS and DRS products.

Integration with vSphere solutions and management

To the administrator, a DS datastore looks the same as a VMFS or NFS datastore. It exposes a single file system namespace with VM metadata residing in sub-directories of the datastore. By default, all hosts in the cluster have access to the DS datastore, even those that don’t have any local disks. VMs can be provisioned on any host and have their data on DS. A VM may be registered and run on any host irrespective of what hosts and disks its data may be distributed on.

Existing VMware solutions that require shared storage (e.g., HA, vMotion, DRS) work seamlessly with DS. For example, if a host becomes overloaded, DRS may decide to migrate VMs out from that host and to other hosts in the cluster. The VMs will be migrated safely using VMFS locking semantics and continue running and accessing their state on the DS datastore.

Lastly, DS integrates with the existing vSphere data management features used today with VMFS or NFS storage, including snapshots (using delta disks), linked clones, vSphere Replication (vR) and vStorage APIs for Data Protection (vADP).

Design approaches

Let’s go over the primary design approaches behind the Distributed Storage technology:

Scalability and clustering. DS is a highly scalable platform that can grow to tens of hosts. In principle, DS could be represented by a new management abstraction in vSphere (say a “Distributed Storage Cluster”) but for usability and integration reasons, we decided to make it a property of vSphere Cluster. A cluster with DS enabled has a single DS datastore accessible by all the hosts in the cluster. Of course, those hosts may also mount any other VMFS and NFS datastores. Not all hosts in the cluster need to be identical and not even all hosts need to have local storage to participate and have access to the DS datastore. Note, however, that if a host contributes local storage devices, it has to contribute at least one SSD. DS makes those checks automatically.

Administrators may choose the disk “auto-claim” mode – DS claims and utilizes any local devices (SAS or SATA) that do not contain other partitions. Alternatively, administrators may manually select the devices that each host contributes to DS.
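The auto-claim eligibility rule described above can be sketched in a few lines. The names and the `Device` representation here are hypothetical; real ESX device claiming is considerably more involved.

```python
# Hypothetical sketch of the "auto-claim" rule: claim local SAS/SATA devices
# that carry no existing partitions; skip everything else.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    bus: str         # "SAS", "SATA", "USB", ...
    partitions: int  # number of existing partitions on the device

def auto_claim(devices: list[Device]) -> list[str]:
    return [d.name for d in devices
            if d.bus in ("SAS", "SATA") and d.partitions == 0]

disks = [Device("naa.600a", "SAS", 0),    # eligible: empty local SAS disk
         Device("naa.600b", "SATA", 2),   # skipped: has existing partitions
         Device("mpx.usb0", "USB", 0)]    # skipped: not a SAS/SATA device
print(auto_claim(disks))  # ['naa.600a']
```

Manual mode would simply replace the filter with the administrator’s explicit device list.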

Object-based storage. DS stores and manages data in the form of flexible data containers called objects. Think of an object as a logical volume that has its data and metadata distributed and accessed across the entire cluster. In the ESX storage stack, those objects appear as devices. DS can store and manage tens of thousands of objects in a single cluster. These are mutable, strongly consistent objects, unlike the “blob” objects of cloud object stores (S3, Azure, etc.).

For each VM provisioned on a DS datastore, an object is created for each virtual disk of the VM, plus a container object which holds a VMFS volume and stores all of the metadata files of the VM. DS exposes a single namespace (like VMFS and NFS datastores) and enforces VMFS locking semantics for the metadata of every VM as required for solutions such as HA and vMotion.
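To make the object layout concrete, here is an illustrative (not authoritative) sketch of how one VM with two virtual disks might map onto DS objects; the names and structure are assumptions for illustration.

```python
# Hypothetical mapping of a VM onto DS objects: one container object holding
# a small VMFS volume with the VM's metadata files, plus one object per
# virtual disk.
vm_objects = {
    "vm-home": {  # container object: namespace for the VM's metadata
        "contents": ["web01.vmx", "web01.vmsd", "vmware.log"],
    },
    "disks": [
        {"object": "vdisk-1", "backs": "web01.vmdk"},
        {"object": "vdisk-2", "backs": "web01_1.vmdk"},
    ],
}

# Each virtual disk is its own object, so DS can place, replicate and
# reconfigure it individually.
print(len(vm_objects["disks"]))  # 2
```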

DS provisions and manages each object individually. For example, as the diagram below illustrates, to create the object for a virtual disk, DS takes into account: 1) the policies specified by the administrator for this specific virtual disk; 2) the cluster resources and their utilization at the time of provisioning. Based on that, it decides how to distribute the object in the cluster. For example:

  • According to the availability policy it decides how many replicas to create.
  • According to the performance policy, it decides how much SSD to allocate for caching for each replica or if necessary how many stripes to create for each replica.

In other words, DS creates a RAID configuration over the network for every single object.
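The two decisions above can be sketched as a toy placement function. The logic and the 100 MB/s per-HDD figure are assumptions for illustration, not the actual DS placement algorithm.

```python
# Toy placement sketch (hypothetical logic): replica count comes from the
# availability policy; stripe count from the throughput the HDDs must deliver.
import math

def plan_object(availability_replicas: int,
                required_mb_s: float,
                hdd_mb_s: float = 100.0) -> dict:
    # Enough stripes so the HDDs behind one replica can sustain the workload.
    stripes = max(1, math.ceil(required_mb_s / hdd_mb_s))
    # The result is effectively a network RAID-1 mirror of RAID-0 stripe sets.
    return {"replicas": availability_replicas, "stripes_per_replica": stripes}

print(plan_object(availability_replicas=2, required_mb_s=250))
# {'replicas': 2, 'stripes_per_replica': 3}
```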

As the DS cluster and the workloads evolve over time, DS monitors the compliance of each virtual disk with its policies and, if necessary, replaces or reconfigures parts or all of it, either to bring the object back into compliance or to optimize the utilization of cluster resources. DS actively throttles the storage and network throughput used for reconfiguration to minimize its impact on normal workload execution.

Replication for data reliability and availability. DS uses RAID-1 (synchronous replication) across hosts to meet the availability and reliability policies of objects. The number of replicas depends on those policies’ values (e.g., the number of 9s of availability). One may ask, “why not a more space-efficient approach such as RAID-5 or RAID-6?” The quick answer is that mirroring is actually cheaper. The details are outside the scope of this blog, but the main idea is this: because RAID-5 and RAID-6 require a read-modify-write cycle for less-than-full-stripe writes, many writes would incur additional read operations at the HDD, which would require more and smaller drives and would actually increase the overall cost of the system. The essential point is that SSD cost is approximately $7/GB and $0.01/IOPS, whereas HDD cost is under $0.10/GB and $2/IOPS, so we burn the cheapest resources: SSD IOPS and HDD space.
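A back-of-the-envelope comparison using the HDD costs quoted above shows why small random writes tip the balance toward mirroring. The workload sizes and the 4-I/Os-per-small-write figure for RAID-5 are assumptions for illustration.

```python
# Rough cost comparison behind the "mirroring is cheaper" argument.
hdd_cost_per_gb   = 0.10   # $/GB   (figure from the text)
hdd_cost_per_iops = 2.0    # $/IOPS (figure from the text)

capacity_gb = 1000         # logical data to store (assumed workload)
write_iops  = 1000         # sustained small random writes reaching the HDDs

# RAID-1, two replicas: 2x space, 2 HDD I/Os per small write.
raid1 = 2 * capacity_gb * hdd_cost_per_gb + 2 * write_iops * hdd_cost_per_iops

# RAID-5 (3+1): only 1.33x space, but each small write costs 4 HDD I/Os
# (read old data + old parity, write new data + new parity).
raid5 = (4/3) * capacity_gb * hdd_cost_per_gb + 4 * write_iops * hdd_cost_per_iops

# IOPS, not capacity, dominate the HDD bill, so mirroring wins:
print(round(raid1), round(raid5))
```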

DS may create multiple stripes (across HDD spindles) for each replica, if necessary, for example when a replica needs to be broken down in smaller chunks to fit on disks, or to meet the throughput requirements of sequential workloads.

SSDs for performance acceleration. Few workloads are sequential. Especially in virtualized environments, where thousands of VMs on a cluster share storage, the aggregate workload is decidedly random and thus not ideal for HDDs.

To address the performance challenges of random workloads, DS uses SSDs in front of HDDs for both read caching and write buffering. The amount of SSD assigned for read caching per object replica is determined by the object’s performance policies and an optional cache profile specification. DS’s replication algorithms intelligently route read operations to different replicas to maximize the read hit ratio on each replica’s cache (equivalently, to reduce the required cache size). Write operations are first replicated and then persisted in the write buffers of every replica before the operation completes. DS uses a state-of-the-art elevator algorithm to retire writes from SSD to HDD; it takes into account properties of contemporary HDDs (such as proximal I/O) to extract the last bit of a disk’s potential bandwidth.
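One simple way to realize the read-routing idea, sketched here as an assumption rather than DS’s actual algorithm, is to map contiguous address ranges to fixed replicas, so that each replica’s SSD cache holds a disjoint working set and the caches effectively add up.

```python
# Hypothetical read-routing sketch: reads for a given logical block always go
# to the same replica, so replica caches do not duplicate each other.
def route_read(block: int, num_replicas: int, chunk: int = 256) -> int:
    # Route contiguous chunks of blocks to the same replica to keep locality.
    return (block // chunk) % num_replicas

# With 2 replicas, blocks 0-255 hit replica 0, blocks 256-511 hit replica 1, ...
reads = [route_read(b, num_replicas=2) for b in (0, 100, 300, 600)]
print(reads)  # [0, 0, 1, 0]
```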

In summary, Distributed Storage is a software-based distributed platform that converges the compute and storage resources of ESX hosts. It provides enterprise-class features and performance with a much simpler management experience for the user. Stay tuned for more updates about this new exciting technology currently under development at VMware.


Christos Karamanolis

Chief Architect and Principal Engineer, Storage and Availability

Christos is the Chief Architect and a Principal Engineer in the Storage and Availability Engineering Organization at VMware. He has over 20 years of research and development experience in the fields of distributed systems, fault tolerance, storage and storage management.
