Tuesday 14 December 2010

Armitage – Cyber Attack Management & GUI For Metasploit

Armitage – Cyber Attack Management & GUI For Metasploit: "Armitage is a graphical cyber attack management tool for Metasploit that visualizes your targets, recommends exploits, and exposes the advanced capabilities of the framework. Armitage aims to make Metasploit usable for security practitioners who understand hacking but don’t use Metasploit every day. If you want to learn Metasploit and grow...



Read the full post at darknet.org.uk



"

Big Just Got Bigger - 5 Terabyte Object Support in Amazon S3

Big Just Got Bigger - 5 Terabyte Object Support in Amazon S3: "


Today, Amazon S3 announced a new breakthrough in supporting customers with large files by increasing the maximum supported object size from 5 gigabytes to 5 terabytes. This allows customers to store and reference a large file as a single object instead of smaller 'chunks'. When combined with the Amazon S3 Multipart Upload release, this dramatically improves how customers upload, store and share large files on Amazon S3.


Who has files larger than 5GB?



Amazon S3 has always been a scalable, durable and available data repository for almost any customer workload. However, as use of the cloud has grown, so have the file sizes customers want to store in Amazon S3 as objects. This is especially true for customers managing HD video or data from data-intensive instruments such as genomic sequencers. For example, a 2-hour movie on Blu-ray can be 50 gigabytes. The same movie stored in an uncompressed 1080p HD format is around 1.5 terabytes.


By supporting such large object sizes, Amazon S3 better enables a variety of interesting big data use cases. For example, a movie studio can now store and manage their entire catalog of high definition origin files on Amazon S3 as individual objects. Any movie or collection of content could be easily pulled into Amazon EC2 for transcoding on demand and moved back into Amazon S3 for distribution through edge locations throughout the world with Amazon CloudFront. Or, BioPharma researchers and scientists can stream genomic sequencer data directly into Amazon S3, which frees up local resources and allows scientists to store, aggregate, and share human genomes as single objects in Amazon S3. Any researcher anywhere in the world then has access to a vast genomic data set with the on-demand compute power for analysis, such as Amazon EC2 Cluster GPU Instances, previously only available to the largest research institutions and companies.


Multipart Upload and moving large objects into Amazon S3



To make uploading large objects easier, Amazon S3 also recently announced Multipart Upload, which allows you to upload an object in parts. You can create parallel uploads to better utilize your available bandwidth and even stream data into Amazon S3 as it's being created. Also, if a given upload runs into a networking issue, you only have to restart that part, not the entire object, allowing you to recover quickly from intermittent network errors.


Multipart Upload isn't just for customers with files larger than 5 gigabytes. With Multipart Upload, you can upload any object larger than 5 megabytes in parts. So, we expect customers with objects larger than 100 megabytes to extensively use Multipart Upload when moving their data into Amazon S3 for a faster, more flexible upload experience.
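As a rough illustration of the flow (a sketch for illustration, not AWS documentation), here is a minimal Python example using the boto library; the bucket name, file name, and 100 MB part size are placeholders, and retry/error handling is omitted:

    # Sketch: upload a large file to S3 in parts using boto (placeholder names throughout).
    import os
    from io import BytesIO
    from boto.s3.connection import S3Connection

    PART_SIZE = 100 * 1024 * 1024               # 100 MB per part (the minimum part size is 5 MB)

    conn = S3Connection()                        # credentials come from the environment/boto config
    bucket = conn.get_bucket('my-example-bucket')
    mp = bucket.initiate_multipart_upload('big-file.bin')

    with open('big-file.bin', 'rb') as f:
        part_num = 0
        while True:
            chunk = f.read(PART_SIZE)
            if not chunk:
                break
            part_num += 1
            # Each part is an independent upload, so a failed part can be retried
            # on its own without restarting the whole object.
            mp.upload_part_from_file(BytesIO(chunk), part_num)

    mp.complete_upload()                         # S3 assembles the parts into one object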


More information



For more information on Multipart Upload and managing large objects in Amazon S3, see Jeff Barr's blog posts on Amazon S3 Multipart Upload and Large Object Support as well as the Amazon S3 Developer Guide.

"

Amazon Route 53 DNS Service

Amazon Route 53 DNS Service: "


Even working in Amazon Web Services, I'm finding the frequency of new product announcements and updates a bit dizzying. It's amazing how fast the cloud is taking shape and the feature set is filling out. Utility computing has really been on fire over the last 9 months. I've never seen an entire new industry created and come fully to life this fast. Fun times.

Before joining AWS, I used to say that I had an inside line on what AWS was working on and what new features were coming in the near future. My trick? I went to AWS customer meetings and just listened. AWS delivers what customers are asking for with such regularity that it's really not all that hard to predict new product features soon to be delivered. This trend continues with today's announcement. Customers have been asking for a Domain Name Service with consistency and, today, AWS is announcing the availability of Route 53, a scalable, highly redundant, and reliable global DNS service.

The Domain Name System is essentially a global, distributed database that allows various pieces of information to be associated with a domain name. In the most common case, DNS is used to look up the numeric IP address for a domain name. So, for example, I just looked up Amazon.com and found that one of the addresses being used to host Amazon.com is 207.171.166.252. And, when your browser accessed this blog (assuming you came here directly rather than using RSS) it would have looked up perspectives.mvdirona.com to get an IP address. This mapping is stored in a DNS "A" (address) record. Other popular DNS records are CNAME (canonical name), MX (mail exchange), and SPF (Sender Policy Framework). A full list of DNS record types is at http://en.wikipedia.org/wiki/List_of_DNS_record_types. Route 53 currently supports:

  • A (address record)
  • AAAA (IPv6 address record)
  • CNAME (canonical name record)
  • MX (mail exchange record)
  • NS (name server record)
  • PTR (pointer record)
  • SOA (start of authority record)
  • SPF (sender policy framework)
  • SRV (service locator)
  • TXT (text record)
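To make the A-record lookup described above concrete, here is a small Python sketch that uses only the standard library; the hostname is simply the example from this post, and the answer comes from whatever resolver the local machine is configured to use:

    # Sketch: resolve a hostname's A (address) records with the Python standard library.
    import socket

    hostname = 'perspectives.mvdirona.com'      # example name from the post
    # getaddrinfo asks the local resolver, which queries DNS on a cache miss.
    results = socket.getaddrinfo(hostname, None, socket.AF_INET)
    addresses = sorted(set(r[4][0] for r in results))
    print(hostname, '->', addresses)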




 



DNS, on the surface, is fairly simple and easy to understand. What is difficult with DNS is providing absolute rock-solid stability at scales ranging from a request per day on some domains to billions on others. Running DNS rock-solid, low-latency, and highly reliable is hard. And it's just the kind of problem that loves scale. Scale allows more investment in the underlying service and supports a wide, many-datacenter footprint.

The AWS Route 53 service is hosted in a global network of edge locations including the following 16 facilities:



  • United States
      • Ashburn, VA
      • Dallas/Fort Worth, TX
      • Los Angeles, CA
      • Miami, FL
      • New York, NY
      • Newark, NJ
      • Palo Alto, CA
      • Seattle, WA
      • St. Louis, MO
  • Europe
      • Amsterdam
      • Dublin
      • Frankfurt
      • London
  • Asia
      • Hong Kong
      • Tokyo
      • Singapore




 



Many DNS lookups are resolved in local caches but, when there is a cache miss, the request needs to be routed back to the authoritative name server. The right approach to answering these requests with low latency is to route to the nearest datacenter hosting an appropriate DNS server. In Route 53 this is done using anycast. Anycast is a cool routing trick where the same IP address range is advertised to be at many different locations. Using this technique, the same IP address range is advertised as being in each of the worldwide fleet of datacenters. This results in the request being routed to the nearest facility from a network perspective.

Route 53 routes to the nearest datacenter to deliver low-latency, reliable results. This is good, but Route 53 is not the only DNS service that is well implemented over a globally distributed fleet of datacenters. What makes Route 53 unique is that it's a cloud service. Cloud means the price is advertised rather than negotiated. Cloud means you make an API call rather than talking to a sales representative. Cloud means it's a simple API and you don't need professional services or a customer support contact. And cloud means it's running NOW rather than tomorrow morning when the administration team comes in. Offering a rock-solid service is half the battle, but it's the cloud aspects of Route 53 that are most interesting.

Route 53 pricing is advertised and available to all:

  • Hosted Zones: $1 per hosted zone per month
  • Requests: $0.50 per million queries for the first billion queries per month and $0.25 per million queries over 1 billion per month
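As a quick, unofficial back-of-envelope example of that price list, a single hosted zone answering 2 billion queries in a month works out to about $751:

    # Sketch: estimate a monthly Route 53 bill at the advertised prices (illustrative only).
    hosted_zones = 1
    queries = 2 * 10**9                          # 2 billion queries in the month

    zone_cost = hosted_zones * 1.00              # $1 per hosted zone per month
    first_billion = min(queries, 10**9)
    remainder = max(queries - 10**9, 0)
    query_cost = (first_billion / 1e6) * 0.50 + (remainder / 1e6) * 0.25

    print('Monthly cost: $%.2f' % (zone_cost + query_cost))   # -> Monthly cost: $751.00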




 



You can have it running in less time than it took to read this posting. Go to: Route 53 Details. You don't need to talk to anyone, negotiate a volume discount, hire a professional services team, call the customer support group, or wait until tomorrow. Make the API calls to set it up and, on average, 60 seconds later you are fully operating.

--jrh

James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

From Perspectives."

Thursday 9 December 2010

Expanding the Cloud with DNS - Introducing Amazon Route 53

Expanding the Cloud with DNS - Introducing Amazon Route 53: "I am very excited that today we have launched Amazon Route 53, a high-performance and highly-available Domain Name System (DNS) service. DNS is one of the fundamental building blocks of internet applications and has been high on the wish list of our customers for some time. Route 53 has the business properties that you have come to expect from an AWS service: fully self-service and programmable, with transparent pay-as-you-go pricing and no minimum usage commitments.


Some fundamentals on Naming



Naming is one of the fundamental concepts in Distributed Systems. Entities in a system are identified through their name, which is separate from the way you would choose to access that entity, the address at which that access point resides, and the route to take to get to that address.


A simple example is the situation with Persons and Telephones; a person has a name, a person can have one or more telephones, and each phone can have one or more telephone numbers. To reach an individual you look him or her up in your address book, select a phone (home, work, mobile), and then a number to dial. The number is used to route the call through the myriad of switches to its destination. The person is the entity with its name, the phones are access points, and the phone numbers are addresses.


Names do not necessarily need to be unique, but it makes life a lot easier if that is the case. There is more than one Werner Vogels in this world and although I never get emails, snail mail or phone calls meant for any of my peers, I am sure they are somewhat frustrated when they type our name into a search engine :-).


In distributed systems we use namespaces to ensure that we can create rich naming without having to continuously worry about whether these names are indeed globally unique. Often these namespaces are hierarchical in nature such that it becomes easier to manage them and to decentralize control, which makes the system more scalable.
The naming system that we are all most familiar with in the internet is the Domain Name System (DNS) that manages the naming of the many different entities in our global network; its most common use is to map a name to an IP address, but it also provides facilities for aliases, finding mail servers, managing security keys, and much more. The DNS namespace is hierarchical in nature and managed by organizations called registries in different countries. Domain registrars are the commercial interface between the DNS registries and those wishing to manage their own namespace.


DNS is an absolutely critical piece of the internet infrastructure. If it is down or does not function correctly, almost everything breaks down. It would not be the first time that a customer thought his EC2 instance was down when in reality it was some name server somewhere that was not functioning correctly.


DNS looks relatively simple on the outside, but is pretty complex on the inside. To ensure that this critical component of the internet scales and is robust in the face of outages, replication is used pervasively using epidemic style techniques. The DNS is one of those systems that rely on Eventual Consistency to manage its globally replicated state.


While registrars manage the namespace in the DNS naming architecture, DNS servers are used to provide the mapping between names and the addresses used to identify an access point. There are two main types of DNS servers: authoritative servers and caching resolvers. Authoritative servers hold the definitive mappings. Authoritative servers are connected to each other in a top down hierarchy, delegating responsibility to each other for different parts of the namespace. This provides the decentralized control needed to scale the DNS namespace.


But the real robustness of the DNS system comes through the way lookups are handled, which is what caching resolvers do. Resolvers operate in a completely separate hierarchy, which is bottom-up: starting with software caches in a browser or the OS, to a local resolver, to a regional resolver operated by an ISP or a corporate IT service. Caching resolvers are able to find the right authoritative server to answer any question, and then use eventual consistency to cache the result. Caching techniques ensure that the DNS system doesn't get overloaded with queries.
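As a rough illustration of the two roles (a sketch for illustration, not part of the announcement), the snippet below uses the third-party dnspython library's classic query() API to ask the local caching resolver first and then one of the zone's authoritative servers directly; example.com is just an example zone:

    # Sketch: contrast a lookup through the local caching resolver with a query sent
    # straight to an authoritative server, using dnspython (pip install dnspython).
    import dns.resolver

    name = 'example.com'                         # example zone; answers will vary

    # 1. Normal path: ask the local caching resolver.
    cached = dns.resolver.query(name, 'A')
    print('via caching resolver :', [r.address for r in cached])

    # 2. Find the zone's authoritative name servers and ask one of them directly.
    ns_host = str(list(dns.resolver.query(name, 'NS'))[0].target)
    ns_ip = list(dns.resolver.query(ns_host, 'A'))[0].address

    direct = dns.resolver.Resolver(configure=False)
    direct.nameservers = [ns_ip]                 # bypass every cache in the hierarchy
    print('via authoritative NS :', [r.address for r in direct.query(name, 'A')])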


The Domain Name System is a wonderful practical piece of technology; it is a fundamental building block of our modern internet. As always there are many improvements possible, and many in the area of security and robustness are always in progress.


Amazon Route 53



Amazon Route 53 is a new service in the Amazon Web Services suite that manages DNS names and answers DNS queries. Route 53 provides Authoritative DNS functionality implemented using a world-wide network of highly-available DNS servers.
Amazon Route 53 sets itself apart from other DNS services that are being offered in several ways:


A familiar cloud business model: A complete self-service environment with no sales people in the loop. No upfront commitments are necessary and you only pay for what you have used. The pricing is transparent and no bundling is required and no overage fees are charged.


Very fast update propagation times: One of the difficulties with many of the existing DNS services is their very long update propagation times; sometimes it may take up to 24 hours before updates are received at all replicas. Modern systems require much faster update propagation to, for example, deal with outages. We have designed Route 53 to propagate updates very quickly and give the customer the tools to find out when all changes have been propagated.


Low-latency query resolution: The query resolution functionality of Route 53 is based on anycast, which will route the request automatically to the DNS server that is closest. This achieves very low latency for queries, which is crucial for the overall performance of internet applications. Anycast is also very robust in the presence of network or server failures, as requests are automatically routed to the next closest server.


No lock-in: While we have made sure that Route 53 works really well with other Amazon services such as Amazon EC2 and Amazon S3, it is not restricted to use within AWS. You can use Route 53 with any of the resources and entities that you want to control, whether they are in the cloud or on premises.


We chose the name 'Route 53' as a play on the fact that DNS servers respond to queries on port 53. But in the future we plan for Route 53 to also give you greater control over the final aspect of distributed system naming, the route your users take to reach an endpoint. If you want to learn more about Route 53 visit http://aws.amazon.com/route53 and read the blog post at the AWS Developer weblog.

"

Sunday 5 December 2010

Kafka : A high-throughput distributed messaging system.

Kafka : A high-throughput distributed messaging system.: "

Found an interesting new open source project which I hadn't heard about before. Kafka is a messaging system used by LinkedIn to serve as the foundation of their activity stream processing.

Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.

Kafka is aimed at providing a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. Such ad hoc solutions are viable for providing logging data to an offline analysis system like Hadoop, but are very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.
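To get a feel for the publish-subscribe model, here is a minimal sketch using the kafka-python client (a later client library, not the original Scala one); the broker address and the 'page_views' topic are placeholders:

    # Sketch: publish and consume activity-stream events with kafka-python.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    for page in ('/home', '/search?q=kafka', '/profile/42'):
        # Page-view events are appended to the 'page_views' topic (a partitioned log).
        producer.send('page_views', page.encode('utf-8'))
    producer.flush()

    consumer = KafkaConsumer('page_views',
                             bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest',
                             consumer_timeout_ms=5000)       # stop iterating when idle
    for message in consumer:
        print(message.partition, message.offset, message.value)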

The use for activity stream processing makes Kafka comparable to Facebook’s Scribe or Cloudera’s Flume, though the architecture and primitives are very different for these systems and make Kafka more comparable to a traditional messaging system. See our design page for more details.





"

Almost half of cloud revenues from storage!

Almost half of cloud revenues from storage!: "

A new report from the 451 Group says that the cloud computing marketplace will reach $16.7bn in revenue by 2013. Even more interesting, however, the Group reports that cloud-based storage will play a starring role in cloud growth, accounting for nearly 40% of the core cloud pie in 2010. “We view storage as the most fertile sector, and predict that cloud storage will experience the strongest growth in the cloud platforms segment,” the report says.


More insights from the report…


Including the large and well-established software-as-a-service (SaaS) category, cloud computing will grow from revenue of $8.7bn in 2010 to $16.7bn in 2013, a compound annual growth rate (CAGR) of 24%.


The core cloud computing market will grow at a much more rapid pace as the cloud increasingly becomes a mainstream IT strategy embraced by corporate enterprises and government agencies. Excluding SaaS revenue, cloud-delivered platform and infrastructure services will grow from $964m in revenue in 2010 to $3.9bn in 2013 – a CAGR of 60% – the report said. The core market includes platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS) offerings, as well as the cloud-delivered software used to build and manage a cloud environment, which The 451 Group calls ‘software infrastructure as a service’ (SIaaS).



"

Pragmatic Programming Techniques: Scalable System Design Patterns

Pragmatic Programming Techniques: Scalable System Design Patterns
Looking back 2.5 years after my previous post on scalable system design techniques, I've observed the emergence of a set of commonly used design patterns. Here is my attempt to capture and share them.

Load Balancer

In this model, there is a dispatcher that determines which worker instance will handle a request based on different policies. The application should ideally be "stateless" so that any worker instance can handle the request.

This pattern is deployed in almost every medium to large web site setup.
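A minimal in-process sketch of the dispatcher idea, with a round-robin policy and plain Python callables standing in for real worker servers:

    # Sketch: a round-robin dispatcher in front of a pool of stateless workers.
    import itertools

    class Dispatcher(object):
        def __init__(self, workers):
            self._pool = itertools.cycle(workers)    # round-robin policy

        def handle(self, request):
            worker = next(self._pool)                # stateless: any worker will do
            return worker(request)

    workers = [lambda req, i=i: 'worker-%d handled %r' % (i, req) for i in range(3)]
    lb = Dispatcher(workers)
    for r in ('GET /a', 'GET /b', 'GET /c', 'GET /d'):
        print(lb.handle(r))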



Scatter and Gather

In this model, the dispatcher multicasts the request to all workers in the pool. Each worker computes a local result and sends it back to the dispatcher, which consolidates them into a single response and sends it back to the client.

This pattern is used in search engines like Yahoo and Google to handle users' keyword search requests.
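A small sketch of the scatter-and-gather flow, with threads standing in for the worker pool and a made-up search_shard function as the per-worker computation:

    # Sketch: scatter a query to every worker, then gather and merge the partial results.
    from concurrent.futures import ThreadPoolExecutor

    def search_shard(shard_id, query):
        # Each worker computes a local result over its own slice of the index.
        return ['%s-hit-%d-from-shard-%d' % (query, i, shard_id) for i in range(2)]

    def scatter_gather(query, num_shards=4):
        with ThreadPoolExecutor(max_workers=num_shards) as pool:
            partials = pool.map(search_shard, range(num_shards), [query] * num_shards)
        return [hit for partial in partials for hit in partial]   # consolidate

    print(scatter_gather('kafka'))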



Result Cache

In this model, the dispatcher first looks up whether the request has been made before and tries to return the previous result, in order to save the actual execution.

This pattern is commonly used in large enterprise applications. Memcached is a very commonly deployed cache server.
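A minimal sketch of the idea, with an in-process dict standing in for a memcached cluster and a deliberately slow function standing in for the real work:

    # Sketch: a dispatcher that checks a result cache before doing the expensive work.
    import time

    cache = {}                          # stand-in for a memcached cluster

    def expensive_query(key):
        time.sleep(0.1)                 # pretend this is a slow database call
        return 'result-for-%s' % key

    def dispatch(key):
        if key in cache:                # cache hit: skip the actual execution
            return cache[key]
        result = expensive_query(key)
        cache[key] = result             # remember it for next time
        return result

    print(dispatch('user:42'))          # slow: misses the cache
    print(dispatch('user:42'))          # fast: served from the cache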



Shared Space

This model is also known as "Blackboard": all workers monitor information in the shared space and contribute partial knowledge back to the blackboard. The information is continuously enriched until a solution is reached.

This pattern is used in JavaSpaces and also in the commercial product GigaSpaces.



Pipe and Filter

This model is also known as "Data Flow Programming": all workers are connected by pipes through which data flows.

This pattern is a very common EAI pattern.
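A tiny sketch of the pattern using Python generators as the pipes and filters:

    # Sketch: a pipeline of filters connected by "pipes" (here, Python generators).
    def read_source(lines):
        for line in lines:
            yield line

    def strip_comments(pipe):
        for line in pipe:
            if not line.startswith('#'):
                yield line

    def to_upper(pipe):
        for line in pipe:
            yield line.upper()

    raw = ['# header', 'order placed', 'payment received']
    pipeline = to_upper(strip_comments(read_source(raw)))   # filters composed via pipes
    for item in pipeline:
        print(item)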



Map Reduce

This model targets batch jobs where disk I/O is the major bottleneck. It uses a distributed file system so that disk I/O can be done in parallel.

This pattern is used in many of Google's internal applications, as well as in the open source Hadoop parallel processing framework. I also find this pattern applicable in many application design scenarios.
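A toy sketch of the map, shuffle, and reduce phases on a word count, the canonical example (a real job would run many mappers and reducers on separate machines over a distributed file system):

    # Sketch: word count expressed as map, shuffle, and reduce phases.
    from collections import defaultdict

    documents = ['the quick brown fox', 'the lazy dog', 'the quick dog']

    # Map phase: emit (key, value) pairs; each mapper would read its own file split.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce phase: combine the values for each key.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)                       # {'the': 3, 'quick': 2, 'dog': 2, ...}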



Bulk Synchronous Parallel

This model is based on lock-step execution across all workers, coordinated by a master. Each worker repeats the following steps until the exit condition is reached, i.e., when there are no more active workers.
  1. Each worker reads data from its input queue
  2. Each worker performs local processing based on the data read
  3. Each worker pushes its local result along its direct connection
This pattern has been used in Google's Pregel graph processing model as well as the Apache Hama project.
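A toy sketch of the superstep loop described above, with workers arranged in a ring and the exit condition reached once no worker changes state (here they converge on the global maximum):

    # Sketch: lock-step supersteps in the Bulk Synchronous Parallel style.
    values = [3, 9, 4, 7]                    # one value per worker
    n = len(values)
    inbox = [[] for _ in range(n)]
    for i in range(n):                       # seed the first superstep
        inbox[(i + 1) % n].append(values[i])

    while True:
        # Steps 1 and 2: each worker reads its input queue and does local processing.
        new_values = [max([values[i]] + inbox[i]) for i in range(n)]
        inbox = [[] for _ in range(n)]
        changed = [i for i in range(n) if new_values[i] != values[i]]
        values = new_values
        if not changed:                      # exit condition: no more active workers
            break
        # Step 3: active workers push their result along the ring, then all wait
        # at the barrier before the next superstep.
        for i in changed:
            inbox[(i + 1) % n].append(values[i])

    print(values)                            # every worker has converged on 9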



Execution Orchestrator

This model is based on an intelligent scheduler / orchestrator that schedules ready-to-run tasks (based on a dependency graph) across a cluster of dumb workers.

This pattern is used in Microsoft's Dryad project.
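A minimal sketch of the orchestrator idea, with a hypothetical dependency graph of tasks and the scheduler releasing whatever is ready to run (a real orchestrator would ship these to remote workers):

    # Sketch: run tasks as soon as the tasks they depend on have finished.
    deps = {                                 # task -> set of prerequisite tasks (a DAG)
        'extract': set(),
        'clean':   {'extract'},
        'join':    {'extract'},
        'report':  {'clean', 'join'},
    }

    done = set()
    while len(done) < len(deps):
        # Ready-to-run tasks: every dependency has already completed.
        ready = [t for t, d in deps.items() if t not in done and d <= done]
        for task in ready:                   # dispatch these to idle workers
            print('running', task)
            done.add(task)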



Although I have tried to cover the whole set of commonly used design patterns for building large-scale systems, I am sure I have missed some other important ones. Please drop me a comment with feedback.

Also, there is a whole set of scalability patterns around the data tier that I haven't covered here. This includes some very basic patterns underlying NoSQL, and it is worth taking a deep look at some leading implementations.

Friday 3 December 2010

The Full Stack, Part I

The Full Stack, Part I: "

One of my most vivid memories from school was the day our chemistry teacher let us in on the Big Secret: every chemical reaction is a joining or separating of links between atoms. Which links form or break is completely governed by the energy involved and the number of electrons each atom has. The principle stuck with me long after I'd forgotten the details. There existed a simple reason for all of the strange rules of chemistry, and that reason lived at a lower level of reality. Maybe other things in the world were like that too.

 

xkcd.com/435

 

A "full-stack programmer" is a generalist, someone who can create a non-trivial application by themselves. People who develop broad skills also tend to develop a good mental model of how different layers of a system behave. This turns out to be especially valuable for performance & optimization work. No one can know everything about everything, but you should be able to visualize what happens up and down the stack as an application does its thing. An application is shaped by the requirements of its data, and performance is shaped by how quickly hardware can throw data around.

 

Consider this harmless-looking SQL query:

 

DELETE FROM some_table WHERE id = 1234;

 

If the id column is not indexed, this code will usually result in a table scan: all of the records in some_table will be examined one-by-one to see if id equals 1234. Let's assume id is the indexed primary key. That's as good as it gets, right? Well, if the table is in InnoDB format it will result in one disk seek, because the data is stored next to the primary key and can be deleted in one operation. If the table is MyISAM it will result in at least two seeks, because indexes and data are stored in different files. A hard drive can only do one seek at a time, so this detail can make the difference between 1X and 2X transactions per second. Digging deeper into how these storage engines work, you can find ways to trade safety for even more speed.

 

The shape of the data

One way to visualize a system is how its data is shaped and how it flows. Here are some useful factors to think about:

  • Working data size: This is the amount of data a system has to deal with during normal operation. Often it is identical to the total data size minus things like old logs, backups, inactive accounts, etc. In time-based applications such as email or a news feed the working set can be much smaller than the total set. People rarely access messages more than a few weeks old.

  • Average request size: How much data does one user transaction have to send over the network? How much data does the system have to touch in order to serve that request? A site with 1 million small pictures will behave differently from a site with 1,000 huge files, even if they have the same data size and number of users. Downloading a photo and running a web search involve similar-sized answers, but the amounts of data touched are very different.

  • Request rate: How many transactions are expected per user per minute? How many concurrent users are there at peak (your busiest period)? In a search engine you may have 5 to 10 queries per user session. An online ebook reader might see constant but low volumes of traffic. A game may require multiple transactions per second per user.

  • Mutation rate: This is a measure of how often data is added, deleted, and edited. A webmail system has a high add rate, a lower deletion rate, and an almost-zero edit rate. An auction system has ridiculously high rates for all three.

  • Consistency: How quickly does a mutation have to spread through the system? For a keyword advertising bid, a few minutes might be acceptable. Trading systems have to reconcile in milliseconds. A comments system is generally expected to show new comments within a second or two.

  • Locality: This has to do with the probability that a user will read item B if they read item A. Or to put it another way, what portion of the working set does one user session need access to? On one extreme you have search engines. A user might want to query bits from anywhere in the data set. In an email application, the user is guaranteed to only access their inbox. Knowing that a user session is restricted to a well-defined subset of the data allows you to shard it: users from India can be directed to servers in India.

  • Computation: what kinds of math do you need to run on the data before it goes out? Can it be precomputed and cached? Are you doing intersections of large arrays? The classic flight search problem requires lots of computation over lots of data. A blog does not.

  • Latency: How quickly are transactions supposed to return success or failure? Users seem to be ok with a flight search or a credit card transaction taking their time. A web search has to return within a few hundred milliseconds. A widget or API that outside systems depend on should return in 100 milliseconds or less. More important is to maintain application latency within a narrow band. It is worse to answer 90% of queries in 0.1 seconds and the rest in 2 seconds, rather than all requests in 0.2 seconds.

  • Contention: What are the fundamental bottlenecks? A pizza shop's fundamental bottleneck is the size of its oven. An application that serves random numbers will be limited by how many random-number generators it can employ. An application with strict consistency requirements and a high mutation rate might be limited by lock contention. Needless to say, the more parallelizability and the less contention, the better.

 

This model can be applied to a system as a whole or to a particular feature like a search page or home page. It's rare that all of the factors stand out for a particular application; usually it's 2 or 3. A good example is ReCAPTCHA. It generates a random pair of images, presents them to the user, and verifies whether the user spelled the words in the images correctly. The working set of data is small enough to fit in RAM, there is minimal computation, a low mutation rate, low per-user request rate, great locality, but very strict latency requirements. I'm told that ReCAPTCHA's request latency (minus network latency) is less than a millisecond.

 

A horribly oversimplified model of computation

aturingmachine.com
How an application is implemented depends on how real computers handle data. A computer really does only two things: read data and write data. Now that CPU cycles are so fast and cheap, performance is a function of how fast it can read or write, and how much data it must move around to accomplish a given task. For historical reasons we draw a line at operations over data on the CPU or in memory and call that 'CPU time'. Operations that deal with storage or network are lumped under 'I/O wait'. This is terrible because it doesn't distinguish between a CPU that's doing a lot of work, and a CPU that's waiting for data to be fetched into its cache.[0] A modern server works with five kinds of input/output, each one slower but with more capacity than the last:

  • Registers & CPU cache (1 nanosecond): These are small, expensive and very fast memory slots. Memory controllers try mightily to keep this space populated with the data the CPU needs. A cache miss means a 100X speed penalty. Even with a 95% hit rate, CPU cache misses waste half the time.

  • Main memory (10^2 nanoseconds): If your computer was an office, RAM would be the desk scattered with manuals and scraps of paper. The kernel is there, reserving Papal land-grant-sized chunks of memory for its own mysterious purposes. So are the programs that are either running or waiting to run, network packets you are receiving, data the kernel thinks it's going to need, and (if you want your program to run fast) your working set. RAM is hundreds of times slower than a register but still orders of magnitude faster than anything else. That's why server people go to such lengths to jam more and more RAM in.

  • Solid-state drive (10^5 nanoseconds): SSDs can greatly improve the performance of systems with working sets too large to fit into main memory. Being 'only' one thousand times slower than RAM, solid-state devices can be used as ersatz memory. It will take a few more years for SSDs to replace  magnetic disks. And then we'll have to rewrite software tuned for the RAM / magnetic gap and not for the new reality.

  • Magnetic disk (10^7 nanoseconds): Magnetic storage can handle large, contiguous streams of data very well. Random disk access is what kills performance. The latency gap between RAM and magnetic disks is so great that it's hard to overstate its importance. It's like the difference between having a dollar in your wallet and having your mom send you a dollar in the mail. The other important fact is that access time varies wildly. You can get at any part of RAM or SSD in about the same time, but a hard disk has a physical metal arm that swings around to reach the right part of the magnetic platter.

  • Network (10^6 to 10^9 nanoseconds): Other computers. Unless you control that computer too, and it's less than a hundred feet away, network calls should be a last resort.

Trust, but verify

The software stack your application runs on is well aware of the memory/disk speed gap, and does its best to juggle things around such that the most-used data stays in RAM. Unfortunately, different layers of the stack can disagree about how best to do that, and often fight each other pointlessly. My advice is to trust the kernel and keep things simple. If you must trust something else, trust the database and tell the kernel to get out of the way.

 

Thumbs and envelopes

I'm using approximate powers-of-ten here to make the mental arithmetic easier. The actual numbers are less neat. When dealing with very large or very small numbers it's important to get the number of zeros right quickly, and only then sweat the details. Precise, unwieldy numbers usually don't help in the early stages of analysis. [1]

 

Suppose you have ten million (10^7) users, each with 10MB (10^7) bytes of data, and your network uplink can handle 100 megabits (10^7 bytes) per second. How long will it take to copy that data to another location over the internet? Hmm, that would be 10^7 seconds, or about 4 months: not great, but close to reasonable. You could use compression and multiple uplinks to bring the transfer time down to, say, a week. If the approximate answer had been not 4 but 400 months, you'd quickly drop the copy-over-the-internet idea and look for another answer.
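The same back-of-envelope arithmetic, written out as a quick Python check (the numbers are the rough powers of ten from the paragraph above, not measurements):

    # Sketch: copy 10^7 users x 10^7 bytes over a ~10^7 bytes/sec uplink.
    users = 10**7
    bytes_per_user = 10**7                   # ~10MB each
    uplink_bytes_per_sec = 10**7             # 100 megabits/s is roughly 10^7 bytes/s

    total_bytes = users * bytes_per_user                  # 10^14 bytes
    seconds = total_bytes / float(uplink_bytes_per_sec)   # 10^7 seconds
    print('%.0f days, about %.1f months' % (seconds / 86400, seconds / (86400 * 30)))
    # -> roughly 116 days, about 3.9 months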

 

movies.example.com

So can we use this model to identify the performance gotchas of an application? Let's say we want to build a movies-on-demand service like Netflix or Hulu. Videos are professionally produced and between 20 and 200 minutes long. You want to support a library of 100,000 (10^5) films and 10^5 concurrent users. For simplicity's sake we'll consider only the actual watching of movies and disregard browsing the website, video encoding, user comments & ratings, logs analysis, etc.

  • Working data size: The average video is 40 minutes long, and the bitrate is 300kbps. 40 * 60 * 300,000 / 8 is about 10^8 bytes. Times 10^5 videos means that your total working set is 10^13 bytes, or 10TB.

  • Average request size: A video stream session will transfer somewhere between 10^7 and 10^9 bytes. In Part One we won't be discussing networking issues, but if we were this would be cause for alarm.

  • Request rate: Fairly low, though the concurrent requests will be high. Users should have short bursts of browsing and long periods of streaming.

  • Mutation rate: Nearly nil.

  • Consistency: Unimportant except for user data. It would be nice to keep track of what place they were in a movie and zip back to that, but that can be handled lazily (eg in a client-side cookie).

  • Locality: Any user can view any movie. You will have the opposite problem of many users accessing the same movie.

  • Computation: If you do it right, computation should be minimal. DRM or on-the-fly encoding might eat up cycles.

  • Latency: This is an interesting one. The worst case is channel surfing. In real-world movie services you may have noticed that switching streams or skipping around within one video takes a second or two in the average case. That's at the edge of user acceptability.

  • Contention: How many CPU threads do you need to serve 100,000 video streams? How much data can one server push out? Why do real-world services seem to have this large skipping delay? When multiple highly successful implementations seem to have the same limitation, that's a strong sign of a fundamental bottleneck.

 

It's possible to build a single server that holds 10TB of data, but what about throughput? A hundred thousand streams at 300kbps (10^5 * 3 * 10^5) is 30 gigabits per second (3 * 10^10). Let's say that one server can push out 500mbps in the happy case. You'll need at least 60 servers to support 30gbps. That implies about 2,000 concurrent streams per server, which sounds almost reasonable. These guesses may be off by a factor of 2 or 4 but we're in the ballpark.
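The same estimate as a quick Python check, using the rough numbers above (the 500mbps-per-server figure is the assumption from the text):

    # Sketch: movies.example.com throughput estimate.
    concurrent_streams = 10**5
    bitrate_bps = 3 * 10**5                        # 300kbps per stream
    total_bps = concurrent_streams * bitrate_bps   # 3 * 10^10 = 30 gigabits per second

    per_server_bps = 5 * 10**8                     # assume one server pushes ~500mbps
    servers = total_bps // per_server_bps          # -> 60 servers
    streams_per_server = concurrent_streams // servers   # -> ~1,700 streams each
    print(total_bps, servers, streams_per_server)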

 

You could store a copy of the entire 10TB library on each server, but that's kind of expensive. You probably want either:

      
  • A set of origin servers and a set of streaming servers. The origins are loaded with disks. The streamers are loaded with RAM. When a request comes in for a video, the streamer first checks to see if it has a local cache. If not, it contacts the origins and reads it from there.
  • A system where each video is copied to only a few servers and requests are routed to them. This might have problems with unbalanced traffic.

 

An important detail is the distribution of popularity of your video data. If everyone watches the same 2GB video, you could just load the whole file into the RAM of each video server. On the other extreme, if 100,000 users each view 100,000 different videos, you'd need a lot of independent spindles or SSDs to keep up with the concurrent reads. In practice, your traffic will probably follow some kind of power-law distribution in which the most popular video has X users, the second-most has 0.5X users, the third-most 0.33X users, and so on. On one hand that's good; the bulk of your throughput will be served hot from RAM. On the other hand that's bad, because the rest of the requests will be served from cold storage.
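To put a rough number on that, here is a small sketch assuming a Zipf-like curve over the 10^5 videos (the 1/k weighting is an assumption for illustration, not measured traffic):

    # Sketch: share of requests going to the most popular N titles under a 1/k power law.
    videos = 10**5
    weights = [1.0 / k for k in range(1, videos + 1)]
    total = sum(weights)

    for top_n in (100, 1000, 10000):
        share = sum(weights[:top_n]) / total
        print('top %5d videos -> %4.1f%% of requests' % (top_n, share * 100))
    # Roughly: top 100 ~ 43%, top 1,000 ~ 62%, top 10,000 ~ 81%. The head fits in RAM,
    # but a long tail of requests still falls through to cold storage.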

 

Whatever architecture you use, it looks as though the performance of movies.example.com will depend almost completely on the random seek time of your storage devices. If I were building this today I would give both SSDs and non-standard data prefetching strategies a serious look.

 

It's been fun

This subject is way too large for a short writeup to do it justice. But absurd simplifications can be useful as long as you have an understanding of the big picture: an application's requirements are shaped by the data, and implementations are shaped by the hardware's ability to move data. Underneath every simple abstraction is a world of details and cleverness. The purpose of the big fuzzy picture is to point you where to start digging.

 

Carlos Bueno, an engineer at Facebook, thinks it's turtles all the way down.

 

 

Notes

[*] This article is part of Perf Planet's 2010 Performance Calendar.

 

[0] Fortunately there is a newish tool for Linux called 'perf counters'.

 

[1] Jeff Dean of Google deserves a lot of credit for popularizing the 'numbers you should know' approach to performance and systems work. As my colleague Keith Adams put it, 'The ability to quickly discard bad solutions, without actually building them, is a lot of what good systems programming is. Some of that is instinct, some experience, but a lot of it is algebra.'


"

Monday 29 November 2010

Design — Sheepdog Project

Design — Sheepdog Project


The architecture of Sheepdog is fully symmetric; there is no central node such as a meta-data server. This design enables the following features.
  • Linear scalability in performance and capacity
    When more performance or capacity is needed, Sheepdog can be grown linearly by simply adding new machines to the cluster.
  • No single point of failure
    Even if a machine fails, the data is still accessible through other machines.
  • Easy administration
    There is no config file describing the cluster's roles. When administrators launch the Sheepdog programs on a newly added machine, Sheepdog automatically detects the added machine and begins to configure it as a member of the cluster.

Architecture

Sheepdog is a storage system that provides a simple key-value interface to the Sheepdog client (a QEMU block driver). Sheepdog consists of multiple nodes.
Comparing the Sheepdog architecture with a regular cluster file system architecture:
Sheepdog consists of only one server program (which we call collie) and a patched QEMU/KVM.
Sheepdog components

Virtual Disk Image (VDI)

A Sheepdog client divides a VM image into fixed-size objects (4 MB by default) and stores them on the distributed storage system. Each object is identified by a globally unique 64-bit id and replicated to multiple nodes.
Virtual disk image

Object

Sheepdog objects are grouped into two types.
  • VDI Object: A VDI object contains metadata for a VM image such as image name, disk size, creation time, etc.
  • Data Object: A VM image is divided into data objects. Sheepdog clients generally access these objects.
Sheepdog uses consistent hashing to decide where objects are stored. Consistent hashing is a scheme that provides hash table functionality in which the addition or removal of nodes does not significantly change the mapping of objects. I/O load is balanced across the nodes by the properties of the hash table. A mechanism for distributing the data not randomly but intelligently is future work.
Each node is placed on the consistent hashing ring based on its own id. To determine where to store an object, the Sheepdog client takes the object id, finds the corresponding point on the ring, and walks clockwise to determine the target nodes.
Consistent hashing
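A simplified Python sketch of that placement scheme (an illustration, not Sheepdog's actual code): nodes and objects are hashed onto a ring, and target nodes are found by walking clockwise from the object's position:

    # Sketch: placing nodes and objects on a consistent-hashing ring.
    import hashlib
    from bisect import bisect

    def ring_position(key):
        # Map any key onto a fixed circular space using a stable hash.
        return int(hashlib.sha1(key.encode('utf-8')).hexdigest(), 16) % (2**32)

    nodes = ['node-a', 'node-b', 'node-c']
    ring = sorted((ring_position(n), n) for n in nodes)

    def locate(object_id, replicas=2):
        # Find the object's point on the ring, then walk clockwise to pick target nodes.
        start = bisect(ring, (ring_position(object_id),))
        return [ring[(start + i) % len(ring)][1] for i in range(replicas)]

    print(locate('vdi:alice-disk0:obj:0042'))
    # Adding or removing one node only remaps the objects adjacent to it on the ring.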

VDI Operation

In most cases, Sheepdog clients can access their images independently because we do not allow clients to access the same image at the same time. But some VDI operations (e.g. cloning or locking a VDI) must be done exclusively because these operations update global information. To implement this in a highly available way, we use a group communication system (GCS). Group communication systems provide specific guarantees such as total ordering of messages. We use corosync, one of the most widely used GCSes.
Cluster communication

Friday 26 November 2010

Scalability | Harvard Computer Science Lecture

Scalability | Harvard Computer Science Lecture: "


Watch it on Academic Earth

LECTURE DESCRIPTION

Professor David J. Malan discusses scalability as it pertains to building dynamic websites.

COURSE DESCRIPTION

Today's websites are increasingly dynamic. Pages are no longer static HTML files but instead generated by scripts and database calls. User interfaces are more seamless, with technologies like Ajax replacing traditional page reloads. This course teaches students how to build dynamic websites with Ajax and with Linux, Apache, MySQL, and PHP (LAMP), one of today's most popular frameworks. Students learn how to set up domain names with DNS, how to structure pages with XHTML and CSS, how to program in JavaScript and PHP, how to configure Apache and MySQL, how to design and query databases with SQL, how to use Ajax with both XML and JSON, and how to build mashups. The course explores issues of security, scalability, and cross-browser support and also discusses enterprise-level deployments of websites, including third-party hosting, virtualization, colocation in data centers, firewalling, and load-balancing.