Sunday, 26 June 2011

SIGMOD 2011 in Athens

SIGMOD 2011 in Athens: "

Earlier this week, I was in Athens Greece
attending annual conference of the ACM
Machinery Special Interest Group on Management of Data
.
SIGMOD is one of the top two database events held each year attracting academic researchers
and leading practitioners from industry.

I kicked off the conference with the Plenary
keynote
. In this talk
I started with a short retrospection on the industry over the last 20 years. In my
early days as a database developer, things were moving incredibly quickly. Customers
were loving our products, the industry was growing fast and yet the products really
weren’t all that good. You know you are working on important technology when customers
are buying like crazy and the products aren’t anywhere close to where they should
be.

In my first release as lead architect on DB2
20 years ago, we completely rewrote the DB2 database engine process model moving from
a process-per-connected-user model to a single process where each connection only
consumes a single thread supporting many more concurrent connections. It was a fairly
fundamental architectural change completed in a single release. And in that same release,
we improved TPC-A performance
a booming factor of 10 and then did 4x more in the next release. It was a fun time
and things were moving quickly.


From the mid-90s through to around
2005, the database world went through what I refer to as the dark ages. DBMS code
bases had grown to the point where the smallest was more than 4 million lines of code,
the commercial system engineering teams would no longer fit in a single building,
and the number of database companies shrunk throughout the entire period down to only
3 major players. The pace of innovation was glacial and much of the research during
the period was, in the words of Bruce Lindsay, “polishing the round ball”. The problem
was that the products were actually passably good, customers didn’t have a lot of
alternatives, and nothing slows innovation like large teams with huge code bases.

In the last 5 years, the database
world has become exciting again. I’m seeing more opportunity in the database world
now than any other time in the last 20 years. It’s now easy to get venture funding
to do database products and the number of and diversity of viable products is exploding.
My talk focused on what changed, why it happened, and some of the technical backdrop
influencing.


A background thesis of the talk is that cloud
computing solves two of the primary reasons why customers used to be stuck standardizing
on a single database engine even though some of their workloads may have run poorly.
The first is cost. Cloud computing reduces costs dramatically (some of the cloud economics
argument: http://perspectives.mvdirona.com/2009/04/21/McKinseySpeculatesThatCloudComputingMayBeMoreExpensiveThanInternalIT.aspx)
and charges by usage rather than via annual enterprise license. One of the favorite
lock-ins of the enterprise software world is the enterprise license. Once you’ve signed
one, you are completely owned and it’s hard to afford to run another product. My
fundamental rule of enterprise software is that any company that can afford to give
you 50% to 80% reduction from “list price” is pretty clearly not a low margin operator.
That is the way much of the enterprise computing world continues to work: start with
a crazy price, negotiate down to a ½ crazy price, and then feel like a hero while
you contribute to incredibly high profit margins.

Cloud computing charges by the use in small
increments and any of the major database or open source offerings can be used at low
cost. That is certainly a relevant reason but the really significant factor is the
offloading of administrative complexity to the cloud provider. One
of the primary reasons to standardize on a single database is that each is so complex
to administer, that it’s hard to have sufficient skill on staff to manage more than
one. Cloud offerings like AWS
Relational Database Service
transfer
much of the administrative work to the cloud provider making it easy to chose the
database that best fits the application and to have many specialized engines in use
across a given company.

As costs fall, more workloads
become practical and existing workloads get larger. For
example, If analyzing three months of customer usage data has value to the business
and it becomes affordable to analyze two years instead, customers correctly want to
do it. The plunging cost of computing is fueling database size growth at a super-Moore
pace requiring either partitioned (sharded) or parallel DB engines.

Customers now have larger and
more complex data problems, they need the products always online, and they are now
willing to use a wide variety of specialized solutions if needed. Data intensive workloads
are growing quickly and never have there been so many opportunities and so many unsolved
or incompletely solved problems. It’s a great time to be working on database systems.





· Proceedings
extended abstract: http://www.sigmod2011.org/keynote_1.shtml


What Larry Page really needs to do to return Google to its startup roots | Slacy’s Blog

What Larry Page really needs to do to return Google to its startup roots | Slacy’s Blog

What Larry Page really needs to do to return Google to its startup roots


I worked at Google from 2005-2010, and saw the company go through many changes, and a huge increase in staff. Most importantly, I saw the company go from a place where engineers were seen as violent disruptors and innovators, to a place where doing things “The Google Way” was king, and where thinking outside the box was discouraged and even chastised. So, here’s a quick list of things I think Larry could do to bring the startup feel back to Google:

Let engineers do what they do best, and forget the rest.

This is probably the most important single point. Engineers at Google spend way too much time fussing about with everything other than engineering and product design. Focusing on shipping great, innovative products needs to be put before all else. Here’s a quick rundown of engineering frustrations at Google when I left:
  • Compiling & fixing other people’s code. This is a huge problem for the C++ developers at Google. They spend massive amounts of time compiling (and bug fixing) “the world” to make their project work. This needs to end. Put an end to source-code distributions for cross-team dependencies. Make teams (bigtable, GFS, Stubby, Chubby, etc.) deliver binaries & headers in some reasonable format.
  • Machine Resource Requests for products in the “less than a petabyte” class. Just hand out the resources pro-bono, track usage, and if they exceed some very high limit, then start charging. Why is this a struggle?
  • LCE & SRE “blockers”. Having support for Launch Coordination & Site Reliability is great, but when these people say “you can’t launch unless…” then you know they’re being a hindrance, and not a help.
  • Meetings. Seriously, people are drenched in “status update” and “team” meetings. If your company has to have “No meetings Thursday” then you’re doing it wrong. How about “No meetings except for Thursday”. That would make for a productive engineering team, not the other way around.
  • Weekly Snippets, perf, etc. I was continually amazed by the amount of “extra cruft work” that goes on. I know it sounds important, but engineers should be coding & designing.
  • Perf, Interviews & lengthly interview feedback. The old fashioned model of getting together in a room to discuss a candidate is way more efficient. Make sure that every single engineer in the building is participating in the interview process to spread the load more evenly. Don’t let the internal recruiters pick engineers for interviews, as they have favoritism and are improperly motivated. Limit to 1 interview per week, maximum. Make a simple system for “I can’t make this interview” and “I think this resume looks shitty and don’t want to talk to this candidate.”
  • Discourage of open source software. There is so much innovation going on in the open source world: Hadoop, MongoDB, Redis, Cassandra, memcached, Ruby on Rails, Django, Tornado (web framework), and many, many other products put Google infrastructure to shame when it comes to ease-of-use and product focus. Engineers are discouraged from using these systems, to the point where they’re chastised for even thinking of using anything other than Bigtable/Spanner and GFS/Colossus for their products.

Get rid of the proprietary cluster management system.

Yes, seriously. What they have is a glorified batch-scheduling system that makes modern datacenters feel like antiquated mainframes. Dedicated machines and resources are what startups have, so give them to your best engineers, and they’ll do great things. You should have learned this from the teragoogle team. Start building a better, Virtual Machine based system where engineers can own & manage machine images themselves, all the way down to the operating system, dependencies, etc. If more structure is needed, use existing open source packages or develop new systems in house, and open source those. Build new “non-standard” data centers that don’t use the old system, and that every engineer can use.
The cluster management system’s fatal flaw is that it requires too large of an ecosystem, and pidgeon-holes running jobs into a far too restrictive container. It doesn’t allow persistent local disk storage, since jobs can be terminated and relocated at any time. Services running there are then cajoled into using Bigtable and/or Colossus for their persistent storage, which rules out virtually all other external database systems (MySQL, etc.). This is an antiquated and overly constrained model for job allocation.

Switch to team-based distributed source control.

Teams or large related teams should manage their own source code. Provide git-based hosting, and nothing else. Cross-team deliverables should be done at the binary release level, not at the source code level. Hard Makefile-type dependencies between teams need to be abolished.

Rethink the “lots of redundant, unreliable hardware” mantra.

Having to launch a simple service in multiple datacenters around the world, and having to deal with near-weekly datacenter maintenance shutdowns is unacceptable for an agile startup. Startups need to focus on product, not process and infrastructure. One persistent Amazon EC2 instance is much more valuable than a 100 batch scheduled jobs in a cluster that goes down for maintenance every week. Stop doing that.

Eliminate NIH-syndrome

Google has a very, very strong NIH (Not Invented Here) syndrome. Alternate solutions (Hadoop, MongoDB, Redis, Cassandra, MySQL, RabbitMQ, etc.) are all seen as technically inferior and poorly engineered systems. Google needs to get off it’s high horse, and look at what’s happening outside of it’s organization. Hugely scalable services like Twitter are built on almost entirely open source stack, and they’re doing it really efficiently. Open source solutions have a product-focus that’s missing from much of Google’s infrastructure for infrastructure’s sake engineering endeavors. Focusing on the product first, and using any available solution is the agile way to experiment in new spaces.
Additionally, by eliminating the NIH syndrome, Google needs to allow these open systems into it’s production environment. Amazon and RackSpace have nailed this with reliable, virtual hosting solutions, and this is allowing services built on those platforms to be portable, efficient, and agile.

Remember that small, special-purpose is more agile than big, general-purpose.

Google is famously good at building huge pieces of infrastructure that solve big, important problems. GFS & Colossus for file storage, Bigtable, Blobstore and Spanner for structured data storage, Caffeine for document storage and indexing.
But, when faced with a new problem or new requirements, projects are expected to pidgeon-hole their needs into one of these systems, or be chastised for “doing it wrong”. Additionally, when your application needs inevitably don’t fit or grow out of existing infrastructure capabilities, requests for improvement or enhancement are lost in the noise. This means small teams are crippled by the lack of agility of these monstrous systems.
Google’s engineers need to think & act like startup founders. Only develop what’s absolutely necessary to get your job done. Simplicity counts. Complex systems are hard to learn, debug, and maintain. Keep it small and focused.

Implement an in-house incubator.

Do this right now. When a current employee comes to you and submits their resignation letter, and says they’re joining a startup, you should immediately respond with “Oh! Well, let me tell you about our in-house startup incubator…”
Put smart people together in a room, let them think freely about products and infrastructure, and good things will come of it. In fact, I might argue that every Staff level engineer or higher should “go on sabbatical” to the in-house incubator for a period of a minimum of 6 months. Rotate people in & out, and let them bring their incubator learnings back on to the main campus. Have one incubator per geography, at a minimum, possibly more. Let people choose their best freinds/coworkers, and go off and do something great for 6 months. No managers, no meetings, no supervision.

Make it very clear that good, small ideas matter.

This is so important. One of the things I heard over and over was “If your product isn’t a billion-dollar idea, then it’s not worth Google’s time.” This message sucks. What you’re saying is “your great idea that might make millions per year is less important than a small tweak to ads or search”. Even if it’s true, you need to foster innovation of much lesser initial impact.
Google acquisitions of companies in the $5-50mm range means that at some level, small businesses are valued. Make this very clear. It sucks to have someone say “your $5mm idea isn’t big enough” on the one hand, and then watch Google buy up companies for $5mm each. This is bad precedent.

Eliminate internal language and framework cronyism.

By this, I mean: “Stop forcing people to do things The Google Way”. There were several times where I had seen “unGoogly” system desgins get shot down because they didn’t use Bigtable, GFS, Colossus, Spanner, MegaStore, BlobStore, or any of the other internal systems.
For example, languages like Python are shunned upon because they’re “too slow for web frontends”. Let teams use whatever tools and languages they want, and are most efficient in. Don’t pass judgement on infrastructure, pass judgement on Products. If someone launches a great system based on Oracle and a bunch of Perl CGI scripts running on Sun Sparc 5′s, then you should praise them. If they’re crushed under load, then praise them even more for their success.
Engineers at Google spend huge amounts of their time being forced to prematurely optimize their backend and frontend infrastructure. Most of the time, this benefits no one, as small products never get big enough to need such heavyweight systems, and are bogged down with the cost of multiple redundancy, and by using poorly behaved internal APIs that don’t meet direct product needs.

Make a general purpose cloud for internal use.

Amazon EC2 is a better ecosystem for fast iteration and innovation than Google’s internal cluster management system. EC2 gives me reliability, and an easy way to start and stop entire services, not just individual jobs. Long-running processes and open source code are embraced by EC2, running a replicated sharded MongoDB instance on EC2 is almost a breeze. Google should focus on making a system that works within the entire Open Source ecosystem.

Acknowledge that 20% time is a lie.

Virtually no one I knew in my entire career there had an effective use of 20% time. There are stories about how some products are launched exclusively via 20% time, and I’ve seen people use their 20% time to effectively search for a new internal position, but for the vast majority of engineers, 20% time is a myth.
I think it’s a great idea, and it needs to be made effective. 1 day per week isn’t reasonable (you can’t get enough done in just one day and it’s hard to carry momentum). 1 week per month would be great, but doesn’t do justice to your “main” project. Something needs to budge here, and engineers need to be encouraged to take large amounts of time exploring new ideas and new directions. Really fostering internal tools and collaboration might be the right answer. I’m not sure, maybe they should just give up on it and give everyone a 20% raise. Oh wait, they did that already.

Repeat your mistakes.

Engineers learn by doing, and learn by making mistakes. Having rules about system design puts unnecessary constraints on thinking and products. Having internal lore around things like “Google will never let another thing like Orkut ever happen again” is blatantly wrong. Orkut was (and still is) a huge success, period. None of the infrastructure stuff matters. Even recent mistakes (Wave, etc.) should be praised and engineers should be encouraged to repeat those mistakes.

“Google Scale” is a myth.

Yes, I said it.
Google Search (the product) requires vast resources. Almost nothing else does, and yet is constrained and forced to run “at Google scale” when it’s completely unnecessary.
Giving engineers the freedom to think & design out of the box with respect to infrastructure and systems means you’ll be more efficient in the long run. Providing reliable platforms and data centers means you’ll have less redundancy, and be more efficient.
Given that a single machine can easily have 64GB of RAM, 10TB of disk, and 8 CPUs, it’s amazing that any product launch needs more than just a couple of that class of machine. Let engineers push the boundaries, make mistakes, and run on the edge.
A small system that falls down under load is a huge success
A large system that’s wasting resources and has only a few users is a huge failure.

Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB

Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB: "Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB:
Dhanji R. Prasanna leaving Google:


Here is something you’ve may have heard but never quite believed before: Google’s vaunted scalable software infrastructure is obsolete. Don’t get me wrong, their hardware and datacenters are the best in the world, and as far as I know, nobody is close to matching it. But the software stack on top of it is 10 years old, aging and designed for building search engines and crawlers. And it is well and truly obsolete.

Protocol Buffers, BigTable and MapReduce are ancient, creaking dinosaurs compared to MessagePack, JSON, and Hadoop. And new projects like GWT, Closure and MegaStore are sluggish, overengineered Leviathans compared to fast, elegant tools like jQuery and mongoDB. Designed by engineers in a vacuum, rather than by developers who have need of tools.



Maybe it is just the disappointment of someone whose main project was killed

. Or maybe it is true. Or maybe it is just another magic triangle:

Agility Scalability Coolness factor Triangle

Edward Ribeiro mentioned a post from another ex-Googler which points out similar issues with Google’s philosophy.


Original title and link: Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB (NoSQL databases © myNoSQL)

"

Thursday, 23 June 2011

High Scalability - High Scalability - 35+ Use Cases for Choosing Your Next NoSQL Database



Now we get to the point of considering use cases and which systems might be appropriate for those use cases.

What Are Your Options?

First, let's cover what are the various data models. These have been adapted from Emil Eifrem andNoSQL databases.
Document Databases
  • Lineage: Inspired by Lotus Notes.
  • Data model: Collections of documents, which contain key-value collections.
  • Example: CouchDB, MongoDB
  • Good at: Natural data modeling. Programmer friendly. Rapid development. Web friendly, CRUD.
Graph Databases
  • Lineage: Euler and graph theory.
  • Data model: Nodes & relationships, both which can hold key-value pairs
  • Example: AllegroGraph, InfoGrid, Neo4j
  • Good at: Rock complicated graph problems. Fast.
Relational Databases
  • Lineage: E. F. Codd in A Relational Model of Data for Large Shared Data Banks
  • Data Model: a set of relations
  • Example: VoltDB, Clustrix, MySQL
  • Good at: High performing, scalable OLTP. SQL access. Materialized views. Transactions matter. Programmer friendly transactions.
Object Oriented Databases
  • Lineage: Graph Database Research
  • Data Model: Objects
  • Example: Objectivity, Gemstone
Key-Value Stores
  • Lineage: Amazon's Dynamo paper and Distributed HashTables.
  • Data model: A global collection of KV pairs.
  • Example: Membase, Riak
  • Good at: Handles size well. Processing a constant stream of small reads and writes. Fast. Programmer friendly.
BigTable Clones
  • Lineage: Google's BigTable paper.
  • Data model: Column family, i.e. a tabular model where each row at least in theory can have an individual configuration of columns.
  • Example: HBase, Hypertable, Cassandra
  • Goog at: Handles size well. Stream massive write loads. High availability. Multiple-data centers. MapReduce.
Data Structure Servers
  • Lineage: ?
  • Example: Redis
  • Data model: Operations over dictionaries, lists, sets and string values.
  • Good at: Quirky stuff you never thought of using a database for before.
Grid Databases
  • Lineage: Data Grid and Tuple Space research.
  • Data Model: Space Based Architecture
  • Example: GigaSpaces, Coherence
  • Good at: High performance and scalable transaction processing.

What Should Your Application Use?

  • Key point is to rethink how your application could work differently in terms of the different data models and the different products. Right data model for the right problem. Right product for the right problem.
  • To see what models might help your application take a look at What The Heck Are You Actually Using NoSQL For? In this article I tried to pull together a lot of unconventional use cases of the different qualities and features developers have used in building systems.
  • Match what you need to do with these use cases. From there you can backtrack to the products you may want to include in your architecture. NoSQL, SQL, it doesn't matter.
  • Look at Data Model + Product Features + Your Situation. Products have such different feature sets it's almost impossible to recommend by pure data model alone.
  • Which option is best is determined by your priorities.

If Your Application Needs...

  • complex transactions because you can't afford to lose data or if you would like a simple transaction programming model then look at a Relational or Grid database.
    • Example: an inventory system that might want full ACID. I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!
  • to scale then NoSQL or SQL can work. Look for systems that support scale-out, partitioning, live addition and removal of machines, load balancing, automatic sharding and rebalancing, and fault tolerance.
  • to always be able to write to a database because you need high availability then look at Bigtable Clones which feature eventual consistency.
  • to handle lots of small continuous reads and writes, that may be volatile, then look at Document or Key-value or databases offering fast in-memory access. Also consider SSD.
  • to implement social network operations then you first may want a Graph database or second, a database like Riak that supports relationships. An in- memory relational database with simple SQL joins might suffice for small data sets. Redis' set and list operations could work too.

If Your Application Needs...

  • to operate over a wide variety of access patterns and data types then look at a Document database, they generally are flexible and perform well.
  • powerful offline reporting with large datasets then look at Hadoop first and second, products that support MapReduce. Supporting MapReduce isn't the same as being good at it.
  • to span multiple data-centers then look at Bigtable Clones and other products that offer a distributed option that can handle the long latencies and are partition tolerant.
  • to build CRUD apps then look at a Document database, they make it easy to access complex data without joins.
  • built-in search then look at Riak.
  • to operate on data structures like lists, sets, queues, publish-subscribe then look at Redis. Useful for distributed locking, capped logs, and a lot more.
  • programmer friendliness in the form of programmer friendly data types like JSON, HTTP, REST, Javascript then first look at Document databases and then Key-value Databases.

If Your Application Needs...

  • transactions combined with materialized views for real-time data feeds then look at VoltDB. Great for data-rollups and time windowing.
  • enterprise level support and SLAs then look for a product that makes a point of catering to that market. Membase is an example.
  • to log continuous streams of data that may have no consistency guarantees necessary at all then look at Bigtable Clones because they generally work on distributed file systems that can handle a lot of writes.
  • to be as simple as possible to operate then look for a hosted or PaaS solution because they will do all the work for you.
  • to be sold to enterprise customers then consider a Relational Database because they are used to relational technology.
  • to dynamically build relationships between objects that have dynamic properties then consider a Graph Database because often they will not require a schema and models can be built incrementally through programming.
  • to support large media then look storage services like S3. NoSQL systems tend not to handle large BLOBS, though MongoDB has a file service.

If Your Application Needs...

  • to bulk upload lots of data quickly and efficiently then look for a product supports that scenario. Most will not because they don't support bulk operations.
  • an easier upgrade path then use a fluid schema system like a Document Database or a Key-value Database because it supports optional fields, adding fields, and field deletions without the need to build an entire schema migration framework.
  • to implement integrity constraints then pick a database that support SQL DDL, implement them in stored procedures, or implement them in application code.
  • a very deep join depth the use a Graph Database because they support blisteringly fast navigation between entities.
  • to move behavior close to the data so the data doesn't have to be moved over the network then look at stored procedures of one kind or another. These can be found in Relational, Grid, Document, and even Key-value databases.

If Your Application Needs...

  • to cache or store BLOB data then look at a Key-value store. Caching can for bits of web pages, or to save complex objects that were expensive to join in a relational database, to reduce latency, and so on.
  • a proven track record like not corrupting data and just generally working then pick an established product and when you hit scaling (or other issues) use on of the common workarounds (scale-up, tuning, memcached, sharding, denormalization, etc).
  • fluid data types because your data isn't tabular in nature, or requires a flexible number of columns, or has a complex structure, or varies by user (or whatever), then look at Document, Key-value, and Bigtable Clone databases. Each has a lot of flexibility in their data types.
  • other business units to run quick relational queries so you don't have to reimplement everything then use a database that supports SQL.
  • to operate in the cloud and automatically take full advantage of cloud features then we may not be there yet.

If Your Application Needs...

  • support for secondary indexes so you can look up data by different keys then look at relational databases and Cassandra's new secondary index support.
  • creates an ever-growing set of data (really BigData) that rarely gets accessed then look at Bigtable Clone which will spread the data over a distributed file system.
  • to integrate with other services then check if the database provides some sort of write-behind syncing feature so you can capture database changes and feed them into other systems to ensure consistency.
  • fault tolerance check how durable writes are in the face power failures, partitions, and other failure scenarios.
  • to push the technological envelope in a direction nobody seems to be going then build it yourself because that's what it takes to be great sometimes.
  • to work on a mobile platform then look at CouchDB/Mobile couchbase.

Which Is Better?

  • Moving for a 25% improvement is probably not a reason to go NoSQL.
  • Benchmark relevancy depends on the use case. Does it match your situation(s)?
  • Are you a startup that needs to release a product as soon as possible and you are playing around with ideas? Both SQL and NoSQL can make an argument.
  • Performance may be equal on one box, but what happens when you need N?
  • Everything has problems, if you look at Amazon forums it's EBS is slow, or my instances won't reply, etc. For GAE it's the datastore is slow or X. Every product which people are using will have problems. Are you OK with the problems of the system you've selected?

Saturday, 18 June 2011

LexisNexis open-sources its Hadoop killer

LexisNexis open-sources its Hadoop killer: "

LexisNexis is releasing a set of open-source, data-processing tools that it says outperforms Hadoop and even handles workloads Hadoop presently can’t. The technology (and new business line) is called HPCC Systems, and was created 10 years ago within the LexisNexis Risk Solutions division that analyzes huge amounts of data for its customers in intelligence, financial services and other high-profile industries. There have been calls for a legitimate alternative to Hadoop, and this certainly looks like one.


According to Armando Escalante, CTO of LexisNexis Risk Solutions, the company decided to release HPCC now because it wanted to get the technology into the community before Hadoop became the de facto option for big data processing. Escalante told me during a phone call that he thinks of Hadoop as “a guy with a machete in front of a jungle — they made a trail,” but that he thinks HPCC is superior.


But in order to compete for mindshare and developers, he said, the company felt it had to open-source the technology. One big thing Hadoop has going for it is its open-source model, Escalante explained, which attracts a lot of developers and a lot of innovation. If his company wanted HPCC to “remain relevant” and keep improving through new use cases and ideas from a new community, the time for release was now and open source had to be the model.


Hadoop, of course, is the Apache Software Foundation project created several years ago by then-Yahoo employee Doug Cutting. It has become a critical tool for web companies — including Yahoo and Facebook — to process their ever-growing volumes of unstructured data, and is fast making its way into organizations of all types and sizes. Hadoop has spawned a number of commercial distributions and products, too, including from Cloudera, EMC  and IBM.


How HPCC works


Hadoop relies on two core components to store and process huge amounts of data: the Hadoop Distributed File System and Hadoop MapReduce. However, as Cloudant CEO Mike Miller explained in a post over the weekend, MapReduce is still a relatively complex language for writing parallel-processing workflows. HPCC seeks to remedy this with its Enterprise Control Language.


Escalante says ECL is a declarative, data-centric language that abstracts a lot of the work necessary within MapReduce. For certain tasks that take a thousand lines of code in MapReduce, he said, ECL only requires 99 lines. Furthermore, he explained, ECL doesn’t care how many nodes are in the cluster because the system automatically distributes data across however many nodes are present. Technically, though, HPCC could run on just a single virtual machine. And, says Escalante, HPCC is written in C++ — like the original Google MapReduce  on which Hadoop MapReduce is based — which he says makes it inherently faster than the Java-based Hadoop version.


HPCC offers two options for processing and serving data: the Thor Data Refinery Cluster and the Roxy Rapid Data Delivery Cluster. Escalante said Thor — so named for its hammer-like approach to solving the problem — crunches, analyzes and indexes huge amounts of data a la Hadoop. Roxie, on the other hand, is more like a traditional relational database or database warehouse that even can serve transactions to a web front end.


We didn’t go into detail on HPCC’s storage component, but Escalante noted that it does utilize a distributed file system, although it can support a variety of off-node storage architectures and/or local solid-state drives.


He added that in order to ensure LexisNexis wasn’t blinded by “eating its own dogfood,” his team hired a Hadoop expert to kick the tires on HPCC. The consultant was impressed, Escalante said, but did note some shortcomings that the team addressed as it readied the technology for release. It also built a converter for migrating Hadoop applications written in the Pig language to ECL.


Can HPCC Systems actually compete?


The million-dollar question is whether HPCC Systems can actually attract an ecosystem of contributors and users that will help it rise above the status of big data also-ran. Escalante thinks it can, in large part because HPCC already has been proven in production dealing with LexisNexis Risk Solutions’ 35,000 data sources, 5,000 transactions per second and large, paying customers. He added that the company also will provide enterprise licenses and proprietary applications in addition to the open-source code. Plus, it already has potential customers lined up.


It’s often said that competition means validation. Hadoop has moved from a niche set of tools to the core of a potentially huge business that’s growing every day, and even Microsoft has a horse in this race with its Dryad set of big data tools. Hadoop has already proven itself, but the companies and organizations relying on it for the their big data strategies can’t rest on their laurels.


Image courtesy of Flickr user NileGuide.com.


Related content from GigaOM Pro (subscription req’d):






"

Tuesday, 14 June 2011

InfoQ: Netflix’s Cloud Data Architecture

InfoQ: Netflix’s Cloud Data Architecture
Siddharth Anand overviews Netflix’s business model, then he explains why they chose Amazon AWS, and how they moved their data into the cloud using a NoSQL solution.