Tuesday 13 July 2010

High Performance Computing Hits the Cloud

High Performance Computing (HPC) is defined by Wikipedia as:

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. Today, computer systems approaching the teraflops-region are counted as HPC-computers. The term is most commonly associated with computing used for scientific research or computational science. A related term, high-performance technical computing (HPTC), generally refers to the engineering applications of cluster-based computing (such as computational fluid dynamics and the building and testing of virtual prototypes). Recently, HPC has come to be applied to business uses of cluster-based supercomputers, such as data warehouses, line-of-business (LOB) applications, and transaction processing.

Predictably, I use the broadest definition of HPC, including data-intensive computing and all forms of computational science. It still includes the old stalwart applications of weather modeling and weapons research, but the broader definition takes HPC from a niche market to being a big part of the future of server-side computing. Multi-thousand-node clusters operating at teraflop rates and running simulations over massive data sets are how petroleum exploration is done; it's how advanced financial instruments are (partly) understood; it's how brick-and-mortar retailers do shelf-space layout and optimize their logistics chains; it's how automobile manufacturers design safer cars through crash simulation; it's how semiconductor designs are simulated; it's how aircraft engines are engineered to be more fuel efficient; and it's how credit card companies measure fraud risk. Today, at the core of any well-run business is a massive data store; they all have that. The measure of a truly advanced company is the depth of analysis, simulation, and modeling run against this data store. HPC workloads are incredibly important today, and the market segment is growing very quickly, driven by the plunging cost of computing and the business value of understanding large data sets deeply.

High Performance Computing is one of those important workloads that many argue can't move to the cloud. Interestingly, HPC has had a long history of supposedly not being able to make a transition and then, subsequently, making that transition faster than even the most optimistic would have guessed possible. In the early days of HPC, most of the workloads were run on supercomputers. These were purpose-built, scale-up servers made famous by Control Data Corporation and later by Cray Research, with the Cray 1 broadly covered in the popular press. At that time, many argued that slow processors and poorly performing interconnects would prevent computational clusters from ever being relevant for these workloads. Today, more than three-quarters of the fastest HPC systems in the world are based upon commodity compute clusters.

The HPC community uses the Top-500 list as the tracking mechanism for the fastest systems in the world. The goal of the Top-500 is to provide a scale and performance metric for a given HPC system. Like all benchmarks, it is a good thing in that it removes some of the manufacturer hype, but benchmarks always fail to fully characterize all workloads. They abstract performance to a single metric or a small set of metrics, which is useful, but this summary data can't faithfully represent all possible workloads. Nonetheless, in many communities, including HPC and Relational Database Management Systems, benchmarks have become quite important. The Top-500 list itself depends upon LINPACK as the benchmark.
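
As a concrete (and heavily simplified) illustration of what LINPACK actually measures, here is a single-node sketch in Python with NumPy: solve a dense system Ax = b and report the achieved floating-point rate. The real Top-500 benchmark, HPL, does the same thing with a distributed LU factorization across an entire cluster; the problem size below is just a placeholder for illustration.

```python
# Illustrative sketch only: LINPACK measures the rate of solving a dense
# system of linear equations Ax = b. HPL (the Top-500 benchmark) runs a
# distributed LU factorization across the whole cluster; this single-node
# NumPy version just shows the metric itself.
import time
import numpy as np

n = 4000                                 # problem size (HPL runs use far larger n)
A = np.random.rand(n, n)
b = np.random.rand(n)

start = time.perf_counter()
x = np.linalg.solve(A, b)                # LU factorization + triangular solves
elapsed = time.perf_counter() - start

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # standard HPL operation count
print(f"n={n}: {elapsed:.2f} s, {flops / elapsed / 1e9:.1f} GFLOPS")
```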

Looking at the most recent Top-500 list, published in June 2010, we see that Intel processors now dominate with 81.6% of the entries. It is very clear that the HPC move to commodity clusters has happened. The move that "couldn't happen" is nearly complete, and the vast majority of very high-scale HPC systems are now based upon commodity processors.

What about HPC in the cloud, the next "it can't happen" for HPC? In many respects, HPC workloads are a natural for the cloud in that they are incredibly high scale and consume vast machine resources. Some HPC workloads are incredibly spiky, with mammoth clusters needed for only short periods of time. For example, semiconductor design simulation workloads are incredibly computationally intensive and need to be run at high scale, but only during some phases of the design cycle. Having more resources to throw at the problem can get a design completed more quickly and allow just one more verification run that could save millions by avoiding a design flaw. Using cloud resources, this massive fleet of servers can change size over the course of the project or be freed up when it is no longer productively needed. Cloud computing is ideal for these workloads.

Other HPC uses tend to be more steady state, and yet these workloads still gain real economic advantage from the economies of extreme scale available in the cloud. See Cloud Computing Economies of Scale (talk, video) for more detail.

When I dig deeper into "steady state HPC workloads," I often learn they are steady state because of an existing constraint rather than the fundamental nature of the work. Is there ever value in running one more simulation or one more modeling run a day? If someone on the team had a good idea or a new approach to the problem, would it be worth being able to test that theory on real data without interrupting the production runs? More resources, if not accompanied by additional capital expense or a long-term utilization commitment, are often valuable even for what we typically call steady state workloads. For example, I'm guessing BP, as it battles the Gulf of Mexico oil spill, is running more oil well simulations and tidal flow analysis jobs than originally called for in its 2010 server capacity plan.

No workload is flat and unchanging. Apparent flatness is just a product of a highly constrained provisioning model that can't adapt quickly to changing workload quantities. It's a model from the past.

There is no question that there is value in being able to run HPC workloads in the cloud. What makes many folks view HPC as non-cloud-hostable is that these workloads need high-performance, direct access to the underlying server hardware without the overhead of the virtualization common in most cloud computing offerings, and many of these applications need very high-bandwidth, low-latency networking. A big step towards this goal was made earlier today when Amazon Web Services announced the EC2 Cluster Compute Instance type.
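
As a rough sketch of what provisioning such a cluster looks like programmatically (my own illustration using the boto3 SDK, which post-dates this announcement), the key idea is to launch the instances into a "cluster" placement group so they land on the same low-latency fabric. The AMI ID and instance count below are placeholders.

```python
# Hedged sketch (not from the original post): provisioning Cluster Compute
# instances inside a "cluster" placement group so they share the low-latency,
# full bisection bandwidth network. Uses the modern boto3 SDK; the AMI ID is
# a placeholder and cc1.4xlarge is the instance type announced in this post.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A placement group with the "cluster" strategy packs instances onto the
# non-blocking cluster network.
ec2.create_placement_group(GroupName="hpc-cluster", Strategy="cluster")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",             # placeholder: an HVM AMI of your choice
    InstanceType="cc1.4xlarge",
    MinCount=8,
    MaxCount=8,
    Placement={"GroupName": "hpc-cluster"},
)
print([i["InstanceId"] for i in response["Instances"]])
```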

The cc1.4xlarge instance specification:

· 23GB of 1333MHz DDR3 Registered ECC memory
· 64GB/s main memory bandwidth (a quick check follows this list)
· 2 x Intel Xeon X5570 (quad-core Nehalem)
· 2 x 845GB 7200RPM HDDs
· 10Gbps Ethernet Network Interface
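
A quick back-of-the-envelope check of the memory bandwidth number (my own arithmetic, assuming the X5570's three DDR3-1333 channels per socket):

```python
# Hedged sanity check (my arithmetic, not from the post): the 64GB/s figure
# is consistent with DDR3-1333 on Nehalem, which has three memory channels
# per socket and two sockets per node.
transfers_per_sec = 1333e6       # DDR3-1333: 1333 MT/s
bytes_per_transfer = 8           # 64-bit channel
channels_per_socket = 3          # Nehalem (X5570) integrated memory controller
sockets = 2

bandwidth = transfers_per_sec * bytes_per_transfer * channels_per_socket * sockets
print(f"{bandwidth / 1e9:.1f} GB/s")   # ~64.0 GB/s
```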

It's this last point that I'm particularly excited about. The difference between just a bunch of servers in the cloud and a high-performance cluster is the network. Bringing 10GigE direct to the host isn't that common in the cloud, but it's not particularly remarkable either. What is more noteworthy is that it is a full bisection bandwidth network within the cluster. It is common industry practice to statistically multiplex network traffic over an expensive network core with far less than full bisection bandwidth. Essentially, a gamble is made that not all servers in the cluster will transmit at full interface speed at the same time. For many workloads this actually is a good bet and one that can be safely made. For HPC workloads and other data-intensive applications like Hadoop, it's a poor assumption and leads to vast wasted compute resources waiting on a poorly performing network.

Why provide less than full bisection bandwidth? Basically, it's a cost problem. Because networking gear is still built on a mainframe design point, it's incredibly expensive. As a consequence, these precious resources need to be very carefully managed, and oversubscription levels of 60 to 1 or even over 100 to 1 are common. See Datacenter Networks are in my Way for more on this theme.
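
To make the oversubscription arithmetic concrete, here is a small sketch with hypothetical numbers (a rack of 40 servers with 10Gbps NICs and two 10Gbps uplinks; these are my illustrative figures, not numbers from the post):

```python
# Hedged illustration (my numbers): oversubscription is the ratio of the
# bandwidth hosts could offer to the bandwidth the network actually provides
# toward the core. Full bisection bandwidth requires uplink capacity equal to
# the hosts' aggregate capacity.
servers_per_rack = 40
nic_gbps = 10
uplinks = 2
uplink_gbps = 10

offered = servers_per_rack * nic_gbps    # 400 Gbps of host capacity
core = uplinks * uplink_gbps             # 20 Gbps toward the core
print(f"oversubscription: {offered / core:.0f}:1")   # -> 20:1
```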

For me, the most interesting aspect of the newly announced Cluster Compute instance type is not the instance at all. It's the network. These servers are on a full bisection bandwidth cluster network. All hosts in a cluster can communicate with other nodes in the cluster at the full capacity of the 10Gbps fabric at the same time without blocking. Clearly not all can communicate with a single member of the fleet at the same time, but the network can support all members of the cluster communicating at full bandwidth in unison. It's a sweet network, and it's the network that makes this a truly interesting HPC solution.

Each Cluster Compute Instance is $1.60 per instance hour. It's now possible to inexpensively access millions of dollars' worth of servers connected by a high-performance, full bisection bandwidth network. An hour with a 1,000-node high-performance cluster costs $1,600. Amazing.
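
The arithmetic behind that claim, spelled out (a trivial sketch using only the price quoted above):

```python
# Quick arithmetic behind the $1,600 figure, using the launch price above.
price_per_instance_hour = 1.60   # USD per cc1.4xlarge instance hour
nodes = 1000
hours = 1

total = price_per_instance_hour * nodes * hours
print(f"${total:,.2f} for a {nodes}-node cluster for {hours} hour")   # -> $1,600.00
```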

As a test of the instance type and network prior to going into beta, Matt Klein, one of the HPC team engineers, cranked up LINPACK using an 880-server sub-cluster. It's a good test in that it stresses the network and yields a comparative performance metric. I'm not sure what Matt expected when he started the run, but the result he got just about knocked me off my chair when he sent it to me last Sunday. Matt's experiment yielded a booming 41.82 TFlop Top-500 run.
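
A rough back-of-the-envelope on how that result compares to theoretical peak, under my own assumptions (2.93GHz X5570 cores retiring 4 double-precision flops per cycle via SSE; these assumptions are mine, not from the post):

```python
# Hedged estimate (my assumptions): compare the measured 41.82 TFlop LINPACK
# result to theoretical peak for 880 dual-socket X5570 nodes.
nodes = 880
cores_per_node = 2 * 4            # two quad-core X5570s per node
ghz = 2.93                        # X5570 base clock (assumption)
flops_per_cycle = 4               # DP SSE: 2 adds + 2 multiplies per cycle

rpeak_tflops = nodes * cores_per_node * ghz * flops_per_cycle / 1e3
rmax_tflops = 41.82               # the measured LINPACK result

print(f"Rpeak ~{rpeak_tflops:.1f} TFlops, efficiency ~{rmax_tflops / rpeak_tflops:.0%}")
# -> Rpeak ~82.5 TFlops, efficiency ~51%
```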

For those of you as excited as I am, the detailed results of the Top-500 LINPACK run are worth a look.

This is phenomenal performance for a pay-as-you-go EC2 instance. But what makes it much more impressive is that this result would place the EC2 Cluster Compute instance at #146 on the Top-500. It also appears to scale well, which is to say bigger numbers look feasible if more nodes were allocated to LINPACK testing. As fun as that would be, it is time to turn all these servers over to customers, so we won't get another run, but it was fun.

You can now have one of the biggest supercomputers in the world for your own private use for $1.60 per instance per hour. I love what's possible these days.

Welcome to the cloud, HPC!

--jrh

James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

From Perspectives.
