The architecture of Sheepdog is fully symmetric; there is no central node such as a meta-data server. This design enables following features.
- Linear scalability in performance and capacityWhen more performance or capacity is needed, Sheepdog can be grown linearly by simply adding new machines to the cluster.
- No single point of failureEven if a machine fails, the data is still accessible through other machines.
- Easy administrationThere is no config file about cluster’s role. When administrators launch Sheepdog programs at the newly added machine, Sheepdog automatically detects the added machine and begins to configure it as a member of the cluster.
Architecture
Sheepdog is a storage system that provides a simple key-value interface to Sheepdog client (qemu block driver). Sheepdog is consists of multiple nodes.
Sheepdog consists of only one server (we call collie) and patched QEMU/KVM.
Virtual Disk Image (VDI)
A Sheepdog client divides a VM image into fixed-size objects (4 MB by default) and store them on the distributed storage system. Each object is identified by globally unique 64 bit id, and replicated to multiple nodes.
Object
Sheepdog objects are grouped into two types.
- VDI Object: A VDI object contains metadata for a VM image such as image name, disk size, creation time, etc.
- Data Object: A VM images is divided into a data object. Sheepdog client generally access this object.
Sheepdog uses consistent hashing to decide where objects store. Consistent hashing is a scheme that provides hash table functionality, and the addition or removal of nodes does not significantly change the mapping of objects. I/O load is balanced across the nodes by features of hash table. A mechanism of distributing the data not randomly but intelligently is a future work.
Each node is placed on consistent hashing ring based on its own id. To determine where to store the object, Sheepdog client gets the object id, finds the corresponding point on the ring, and walk clockwise to determine the target nodes.
VDI Operation
In most cases, Sheepdog clients can access their images independently because we do not allow for clients to access the same image at the same time. But some VDI operations (e.g. cloning VDI, locking VDI) must be done exclusively because the operations updating global information. To implement this in the highly available system, we use a group communication system (GCS). Group communication systems provide specific guarantees such as total ordering of messages. We use corosync, one of most famous GCS.
No comments:
Post a Comment