Wednesday 26 January 2011

InfoQ: Asynchronous, Event-Driven Web Servers for the JVM: Deft and Loft

Asynchronous, event-driven architectures have been gaining a lot of attention lately, mostly with respect to JavaScript and Node.js. Deft and Loft are two solutions that bring "asynchronous purity" to the JVM.

InfoQ had an interview with Roger Schildmeijer, one of the two creators, about these two non-blocking web server frameworks:

InfoQ: What is Deft and Loft?

Roger: So before I start to describe what Deft and Loft are I would like to start from the beginning. In September 2009 Facebook open sourced a piece of software called Tornado, a relatively simple, non-blocking web server framework written in Python, designed to handle thousands of simultaneous connections. Tornado gained traction pretty quickly and became quite popular because of its strength and simplistic design. At this time a lot of developers out there became aware of the "c10k problem" (from Wikipedia: "The C10k problem is the numeronym given to a limitation that most web servers currently have which limits the web server's capabilities to only handle about ten thousand simultaneous connections.")

In the late summer of 2010 Jim Petersson and I started to discuss and design an asynchronous non-blocking web server/framework running on the JVM using pure Java NIO. I would say that the main reason for this initiative was our curiosity about the potential speed improvements that could be achieved with a system similar to Tornado but written in Java. We knew that we could never create an API as clean and simple as Tornado's.

(Clean APIs have never been the advantage of Java, if you ask me)

We got something up and running within the first 48h and saw some extraordinary (very non-scientific) benchmark results. This was the feedback we aimed for and Deft was born.

Just to clarify, I would be the last person to suggest that someone should throw away their current system using Tornado (or some other asynchronous non-blocking web server) and replace it with Deft. Deft is still pretty young and has a lot to learn (there are a lot of features still to be implemented). By the time this interview is published I hope that the next stable Deft release, 0.2.0, will be ready and released.

After a couple of weeks of Deft hacking we started to discuss what Deft would look like if we had used another language, like Scala. It would be very gratifying to create something with nice performance but also a system with a clean and elegant syntax. This was the seed for another project, and Loft was born. To make another important clarification: Loft is still very much in its infancy and there is no stable release available yet.

The main features that set Loft apart from Deft are:

  1. it's written in Scala
  2. it uses a Scala feature called continuations, to make asynchronous programming easier (explained in detail below)
  3. the initial version will be a pure Tornado clone

InfoQ: Would you like to explain to us the motivation behind those architectural decisions?

Roger: We wanted to create something similar to Tornado that runs on the Java Virtual Machine. There are a lot of good (multi-threaded) open source web servers and web frameworks (Apache Tomcat and Jetty are two popular examples) out there already, and we didn't want to compete with those beasts. Instead we wanted to build something that was good at handling thousands of simultaneous connections. The single-threaded approach was already tested (and proven successful) by frameworks like Tornado.

InfoQ: What is a typical development workflow in Deft - from coding and debugging, up to deploying and monitoring?

Roger: This is a very good question and something that is very important to address. However, because Deft is so new and its user base is currently small, I'm afraid I can't give an example of a typical development workflow. That is something we hope a growing community will help flesh out.

The biggest difference from the coding that most Java developers do is that everything executes inside a single-threaded environment. The benefits of coding that way are that you don't have to use explicit locks and you don't have to think about deadlocks caused by inadequately synchronized code. The downside is that you are not allowed to do blocking operations. A blocking operation will stall the entire server and make it unresponsive.
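To make the single-threaded model concrete, here is a minimal event loop in plain Java NIO. This is an illustrative sketch, not Deft's actual code; the class name and structure are mine:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// One thread, one selector, no locks anywhere: the loop only reacts
// to channels the selector reports as ready, so nothing can block it.
public class MiniLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);               // a blocking accept here would stall everything
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.register(selector, SelectionKey.OP_ACCEPT);

        // Kick off one client connection so the loop has an event to handle.
        SocketChannel client = SocketChannel.open(server.getLocalAddress());

        int accepted = 0;
        while (accepted == 0) {                        // the event loop
            selector.select();                         // wait until something is ready
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    server.accept();                   // guaranteed ready, so this won't block
                    accepted++;
                }
            }
            selector.selectedKeys().clear();
        }
        System.out.println("accepted=" + accepted);
        client.close();
        server.close();
        selector.close();
    }
}
```

A real server would also register the accepted channel for OP_READ and dispatch its events from the same loop; the point is that every handler runs on this one thread.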

Deft 0.2.0 will contain an experimental JMX API used for monitoring. Some examples of things that could be monitored are the number of pending timeouts/keep-alive timeouts, number of registered IO handlers in the event loop and the select timeout of the selector.
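Since the Deft JMX API is experimental and its MBean names aren't given here, the sketch below illustrates only the monitoring mechanism itself: it reads an attribute from one of the JVM's built-in platform MBeans. Deft's own counters would be registered on, and queried from, the same MBean server:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxPeek {
    public static void main(String[] args) throws Exception {
        // Every JVM hosts a platform MBean server; application MBeans
        // (such as event-loop counters) register alongside the built-ins.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
        Object cpus = server.getAttribute(os, "AvailableProcessors");
        System.out.println(cpus); // number of processors visible to the JVM
    }
}
```

Tools like jconsole read the same attributes remotely over an RMI connector, which is what makes JMX convenient for production monitoring.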

InfoQ: Are there any benchmarks that evaluate Deft's performance?

Roger: We have made some (non-scientific) benchmarks against a simple web server using a single request handler that responds with a 200 OK HTTP status code and writes "hello world" to the client (the code for the benchmarks is also available on github.com/rschildmeijer). Against a simple "hello world" server we have seen speed improvements by a factor of 8-10x compared to Tornado. The full results from the benchmarks are available on http://deftserver.org/.

InfoQ: How does Deft compare with other solutions like Tornado, Node.js or Netty?

Roger: Tornado and Node.js are the two other asynchronous web servers that we used in the benchmark. We didn't include Netty because it felt a little bit like comparing apples with oranges. But I wouldn't be surprised if Netty showed numbers equal to (or greater than) the results we have seen for Deft. Netty, the successor to Apache MINA, is a really cool socket framework written by a really smart guy (Trustin Lee).

InfoQ: What was the motivation for Loft and how does it use continuations?

Roger: So finally time to show some code! (I would like to (once again) remind you that Loft is pretty much still in the starting blocks, and the code snippets for Loft show the proposed design.)

A simple request handler for Loft could look something like this:

  @Asynchronous
  def get() {
    val http = AsyncHTTPClient()
    reset {
      val id = database get("roger_schildmeijer");             // async call
      val result = http fetch("http://127.0.0.1:8080/" + id); // async call
      write(result)
      finish
    }
  }

  val application = Application(Map("/".r -> this))

  def main(args: Array[String]) {
    val httpServer = HTTPServer(application)
    httpServer listen(8888)
    IOLoop start
  }
}

The main method contains the canonical way to start a Loft instance. The ExampleHandler.get is where things start to get interesting. Inside the method two asynchronous calls are made. Asynchronous programming is often conducted by supplying a callback as an additional parameter, and that callback will be called when the result is ready. And if you have two or more consecutive asynchronous calls, you (usually) will have to chain these calls together.

E.g.:

  database get("roger_schildmeijer", (id) =>
    http fetch("http://127.0.0.1:8080/" + id, (result) =>
      write(result)));
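The same chain can be written out in plain Java, with hypothetical stand-ins for the database and HTTP calls, to show why consecutive asynchronous calls nest one callback inside another:

```java
import java.util.function.Consumer;

public class CallbackChain {
    // Hypothetical synchronous stubs standing in for async operations:
    // each takes a callback that receives the result when it is "ready".
    static void dbGet(String key, Consumer<String> cb) { cb.accept("id-42"); }
    static void httpFetch(String url, Consumer<String> cb) { cb.accept("body-for-" + url); }

    public static void main(String[] args) {
        // Each result exists only inside the previous callback, so every
        // additional async step pushes the logic one level deeper.
        dbGet("roger_schildmeijer", id ->
            httpFetch("http://127.0.0.1:8080/" + id, result ->
                System.out.println(result)));
    }
}
```

Continuations (and, later, constructs like async/await) exist precisely to flatten this nesting back into straight-line code.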

So what is actually going on in the ExampleHandler.get method above?

You might have noticed the "reset" word in the method; this indicates that some Scala continuations are about to happen. Continuations, as a concept, are hard to grasp. But when the magic disappears, continuations are just functions representing another point in the program (containing information such as the process's current stack). If you call the continuation, it will cause execution to automatically switch to the point that function represents. (Actually, you use restricted versions of them every time you do exception handling.)

Just so I don't confuse anyone: the two asynchronous methods "get" and "fetch" must be implemented in a certain way in order for this example to work.

InfoQ: The asynchronous, event-driven paradigm has gained lots of attention lately. Why do you think is that and how do you see this trend evolving in the future?

Roger: One reason that systems like Tornado, Node.js and Netty have received a lot of attention in recent years is the big social networks that need to maintain a huge number of mostly idle connections.

As long as you and I use services like Facebook and Twitter, I think the need for systems like Deft, Loft and Tornado will exist.

As a final note I would like to add that we are looking for contributors that are interested in supporting Deft (current status: two committers, two contributors) and/or Loft (two committers).

Tuesday 18 January 2011

Google Megastore: The Data Engine Behind GAE

Megastore is the data engine supporting the Google Application Engine. It's a scalable structured data store providing full ACID semantics within partitions but lower consistency guarantees across partitions.

I wrote up some notes on it back in 2008 in Under the Covers of the App Engine Datastore and posted Phil Bernstein's excellent notes from a 2008 SIGMOD talk: Google Megastore. But there has been remarkably little written about this datastore over the intervening couple of years until this year's CIDR conference papers were posted. CIDR 2011 includes Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

My rough notes from the paper:

· Megastore is built upon BigTable
· BigTable supports fault-tolerant storage within a single datacenter
· Synchronous replication based upon Paxos and optimized for long-distance inter-datacenter links
· Partitioned into a vast space of small databases, each with its own replicated log
· Each log is stored across a Paxos cluster
· Because they are so aggressively partitioned, each Paxos group only has to accept logs for operations on a small partition. However, the design does serialize updates on each partition
· 3 billion writes and 20 billion read transactions per day
· Support for consistency is unusual for a NoSQL database but driven by (what I believe to be) the correct belief that inconsistent updates make many applications difficult to write (see I Love Eventual Consistency but ...)
· Data model:
  · The data model is declared in a strongly typed schema
  · There are potentially many tables per schema
  · There are potentially many entities per table
  · There are potentially many strongly typed properties per entity
  · Repeating properties are allowed
  · Tables can be arranged hierarchically where child tables point to root tables
  · Megastore tables are either entity group root tables or child tables
  · The root table and all child tables are stored in the same entity group
· Secondary indexes are supported:
  · Local secondary indexes index a specific entity group and are maintained consistently
  · Global secondary indexes index across entity groups; they are asynchronously updated and eventually consistent
  · Repeated indexes: support indexing repeated values (e.g. photo tags)
  · Inline indexes provide a way to denormalize data from source entities into a related target entity as a virtual repeated column
· There are physical storage hints:
  · "IN TABLE" directs Megastore to store two tables in the same underlying BigTable
  · "SCATTER" attribute prepends a 2-byte hash to each key to cool hot spots on tables with monotonically increasing values like dates (e.g. a history table)
  · "STORING" clause on an index supports index-only access by redundantly storing additional data in the index. This avoids the double access often required of a secondary index lookup: finding the correct entity and then selecting the correct properties from that entity through a second table access. By pulling values up into the secondary index, the base table doesn't need to be accessed to obtain these properties
· 3 levels of read consistency:
  · Current: last committed value
  · Snapshot: value as of the start of the read transaction
  · Inconsistent reads: used for cross-entity-group reads
· Update path:
  · A transaction writes its mutations to the entity group's write-ahead log and then applies the mutations to the data (write-ahead logging)
  · A write transaction always begins with a current read to determine the next available log position. The commit operation gathers mutations into a log entry, assigns an increasing timestamp, and appends it to the log, which is maintained using Paxos
· Update rates within an entity group are seriously limited:
  · When there is log contention, one writer wins and the rest fail and must be retried
  · Paxos only accepts a very limited update rate (order 10^2 updates per second)
· The paper reports that "limiting updates within an entity group to a few writes per second per entity group yields insignificant write conflicts"
· Implication: programmers must shard aggressively to get even moderate update rates, and consistent update across shards is only supported using two-phase commit, which is not recommended
· Cross entity group updates are supported by:
  · Two-phase commit, with the fragility that it brings
  · Queueing and asynchronously applying the changes
· Excellent support for backup and redundancy:
  · Synchronous replication to protect against media failure
  · Snapshots and incremental log backups
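The contention behavior in the update path can be sketched with a toy model in plain Java. This is my own illustration, not Google's code: an AtomicLong stands in for the Paxos-maintained log position, so it models only the "one writer wins, the rest retry" outcome, not replication or durability:

```java
import java.util.concurrent.atomic.AtomicLong;

public class LogPositionContention {
    // Stand-in for the replicated log's next available position.
    static final AtomicLong nextLogPosition = new AtomicLong(0);

    // A commit succeeds only if the log position observed by the transaction's
    // initial "current read" is still the next free slot when it appends.
    static boolean tryCommit(long observedPosition) {
        return nextLogPosition.compareAndSet(observedPosition, observedPosition + 1);
    }

    public static void main(String[] args) {
        long pos = nextLogPosition.get();   // the "current read" of the log position
        boolean first = tryCommit(pos);     // first writer claims the slot...
        boolean second = tryCommit(pos);    // ...a concurrent writer at the same position loses and must retry
        System.out.println(first + " " + second);
    }
}
```

This is why aggressive partitioning matters: the fewer writers share one entity group's log, the less often a transaction observes a stale position and has to retry.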
Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent reads and full ACID update semantics is impressive, although being limited to updating an entity group at no more than a "few per second" is constraining.

Thanks to Zhu Han, Reto Kramer, and Chris Newcombe for all sending this paper my way.
--jrh

From Perspectives.

Riak's Bitcask - A Log-Structured Hash Table for Fast Key/Value Data

How would you implement a key-value storage system if you were starting from scratch? The approach Basho settled on with Bitcask, their new backend for Riak, is an interesting combination of using RAM to store a hash map of file pointers to values and a log-structured file system for efficient writes. In this excellent Changelog interview, some folks from Basho describe Bitcask in more detail.
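A toy version of that idea in plain Java (my own sketch, not Basho's code): every write is appended to a log file, while an in-memory hash map remembers each key's latest file offset, so a read costs one hash lookup plus at most one disk seek:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

public class TinyBitcask {
    private final RandomAccessFile log;
    private final Map<String, long[]> index = new HashMap<>(); // key -> {offset, length}

    TinyBitcask(String path) throws IOException {
        log = new RandomAccessFile(path, "rw");
    }

    void put(String key, String value) throws IOException {
        byte[] bytes = value.getBytes("UTF-8");
        long offset = log.length();
        log.seek(offset);                 // append-only: writes are sequential
        log.write(bytes);
        index.put(key, new long[] { offset, bytes.length });
    }

    String get(String key) throws IOException {
        long[] entry = index.get(key);    // one hash lookup...
        if (entry == null) return null;
        byte[] buf = new byte[(int) entry[1]];
        log.seek(entry[0]);               // ...then at most one disk seek
        log.readFully(buf);
        return new String(buf, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        String path = Files.createTempFile("bitcask", ".log").toString();
        TinyBitcask db = new TinyBitcask(path);
        db.put("k", "v1");
        db.put("k", "v2");                // stale value stays in the log; the index moves on
        System.out.println(db.get("k"));  // prints v2
    }
}
```

The real Bitcask adds record headers, CRCs, hint files for fast restarts, and periodic merging to reclaim the space left by stale values.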


Google Megastore - 3 Billion Writes and 20 Billion Read Transactions Daily

A giant step into the fully distributed future has been taken by the Google App Engine team with the release of their High Replication Datastore. The HRD is targeted at mission critical applications that require data replicated to at least three datacenters, full ACID semantics for entity groups, and lower consistency guarantees across entity groups.

This is a major accomplishment. Few organizations can implement a true multi-datacenter datastore. Other than SimpleDB, how many other publicly accessible database services can operate out of multiple datacenters? Now that capability can be had by anyone. But there is a price, literally and otherwise. Because the HRD uses three times the resources of Google App Engine's Master/Slave datastore, it will cost three times as much. And because it is a distributed database, with all that implies in the CAP sense, developers will have to be very careful in how they architect their applications, because costs have increased, reliability has increased, complexity has increased, and performance has decreased. This is why HRD is targeted at mission critical applications; you gotta want it, otherwise the Master/Slave datastore makes a lot more sense.

The technical details behind the HRD are described in this paper, Megastore: Providing Scalable, Highly Available Storage for Interactive Services. This is a wonderfully written and accessible paper, chock-full of useful and interesting details. James Hamilton wrote an excellent summary of the paper in Google Megastore: The Data Engine Behind GAE. There are also a few useful threads in Google Groups that go into more detail about how it works, its costs, and its performance (the original announcement, performance comparison).



A Plea to Software Vendors from Sysadmins - 10 Do's and Don'ts - ACM Queue

What can software vendors do to make the lives of sysadmins a little easier?

THOMAS A. LIMONCELLI, GOOGLE


A friend of mine is a grease monkey: the kind of auto enthusiast who rebuilds engines for fun on a Saturday night. He explained to me that certain brands of automobiles were designed in ways to make the mechanic's job easier. Others, however, were designed as if the company had a pact with the aspirin industry to make sure there are plenty of mechanics with headaches. He said those car companies hate mechanics. I understood completely because, as a system administrator, I can tell when software vendors hate me. It shows in their products.
A panel discussion at CHIMIT (Computer-Human Interaction for Management of Information Technology) 2009 discussed a number of do's and don'ts for software vendors looking to make software that is easy to install, maintain, and upgrade. This article highlights some of the issues uncovered. CHIMIT is a conference that focuses on computer-human interaction for IT workers—the opposite of most CHI research, which is about the users of the systems that IT workers maintain. This panel turned the microscope around and gave system administrators a forum to share how they felt about the speakers who were analyzing them.
Here are some highlights:
1. DO have a "silent install" option. One panelist recounted automating the installation of a software package on 2,000 desktop PCs, except for one point in the installation when a window popped up and the user had to click OK. All other interactions could be programmatically eliminated through a "defaults file." Linux/Unix tools such as Puppet and Cfengine should be able to automate not just installation, but also configuration. Deinstallation procedures should not delete configuration data, but there should be a "leave no trace" option that removes everything except user data.
2. DON'T make the administrative interface a GUI. System administrators need a command-line tool for constructing repeatable processes. Procedures are best documented by providing commands that we can copy and paste from the procedure document to the command line. We cannot achieve the same repeatability when the instructions are: "Checkmark the 3rd and 5th options, but not the 2nd option, then click OK." Sysadmins do not want a GUI that requires 25 clicks for each new user. We want to craft the commands to be executed in a text editor or generate them via Perl, Python, or PowerShell.
3. DO create an API so that the system can be remotely administered. An API gives us the ability to do things with your product you didn't think of. That's a good thing. System administrators strive to automate, and automate to thrive. The right API lets me provision a service automatically as part of the new employee account creation system. The right API lets me write a chat bot that hangs out in a chat room to make hourly announcements of system performance. The right API lets me integrate your product with a USB-controlled toy missile launcher. Your other customers may be satisfied with a "beep" to get their attention; I like my way better (http://www.kleargear.com/5004.html).
4. DO have a configuration file that is an ASCII file, not a binary blob. This way the files can be checked into a source-code control system. When the system is misconfigured it becomes important to be able to "diff" against previous versions. If the file can't be uploaded back into the system to re-create the same configuration, then we can't trust that you're giving us all the data. This prevents us from cloning configurations for mass deployment or disaster recovery. If the file can be edited and uploaded back into the system, then we can automate the creation of configurations. Archives of configuration backups make for interesting historical analysis [1].
5. DO include a clearly defined method to restore all user data, a single user's data, and individual items (for example, one e-mail message). The method to make backups is a prerequisite, obviously, but we care primarily about the restore procedures.
6. DO instrument the system so that we can monitor more than just, "Is it up or down?" We need to be able to determine latency, capacity, and utilization, and we need to be able to collect this data. Don't graph it yourself. Let us collect and analyze the raw data so we can make the "pretty picture" graphs that our nontechnical management will understand. If you aren't sure what to instrument, imagine the system being completely overloaded and slow: what parameters would we need to be able to find and fix the problem?
7. DO tell us about security issues. Announce them publicly. Put them in an RSS feed. Tell us even if you don't have a fix yet; we need to manage risk. Your PR department doesn't understand this, and that's OK. It is your job to tell them to go away.
8. DO use the built-in system logging mechanism (Unix syslog or Windows Event Logs). This allows us to leverage preexisting tools that collect, centralize, and search the logs. Similarly, use the operating system's built-in authentication system and standard I/O systems.
9. DON'T scribble all over the disk. Put binaries in one place, configuration files in another, data someplace else. That's it. Don't hide a configuration file in /etc and another one in /var. Don't hide things in \Windows. If possible, let me choose the path prefix at install time.
10. DO publish documentation electronically on your Web site. It should be available, linkable, and findable on the Web. If someone blogs about a solution to a problem, they should be able to link directly to the relevant documentation. Providing a PDF is painfully counterproductive. Keep all old versions online. The disaster recovery procedure for a 5-year-old, unsupported, pathetically outdated installation might hinge on being able to find the manual for that version on the Web.
Software is not just bits to us. It has a complicated life cycle: procurement, installation, use, maintenance, upgrades, deinstallation. Often vendors think only about the use (and some seem to think only about the procurement). Features that make software more installable, maintainable, and upgradable are usually afterthoughts. To be done right, these things need to be part of the design from the beginning, not bolted on later.
Be good to the sysadmins of the world. As one panelist said, "The inability to rapidly deploy your product affects my ability to rapidly purchase your products."
I should point out that this topic wasn't the main point of the CHIMIT panel. It was a very productive tangent. When I suggested that each panelist name his or her single biggest "don't," I noticed that the entire audience literally leaned forward in anticipation. I was pleasantly surprised to see software developers and product managers alike take an interest. Maybe there's hope, after all.

REFERENCES

1. Plonka, D., Tack, A. J. 2009. An analysis of network configuration artifacts. In Proceedings of the 23rd Large Installation System Administration Conference (November): 79-91.

ACKNOWLEDGMENTS

I would like to thank the members of the panel: Daniel Boyd, Google; Æleen Frisch, Exponential Consulting and author; Joseph Kern, Delaware Department of Education; and David Blank-Edelman, Northeastern University and author. I was the panel organizer and moderator. I would also like to thank readers of my blog, www.EverythingSysadmin.com, for contributing their suggestions.