Thursday, 26 August 2010

Designing a Web Application with Scalability in Mind

Designing a Web Application with Scalability in Mind: "Max Indelicato, a Software Development Director and former Chief Software Architect, has written a post on how to design a web application for scalability. He suggests choosing the right deployment and storage solutions, designing a scalable data store and schema, and using abstraction layers. By Abel Avram"
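
As a rough illustration of the "abstraction layers" suggestion, the sketch below hides the storage backend behind a small interface so it can be swapped as traffic grows. The interface and class names are made up for this example; they are not taken from the article.

    // The application talks to a narrow storage interface, so the backing store can be
    // swapped (in-memory, single database, sharded cluster) without touching callers.
    public interface UserStore {
        void save(String userId, String profileJson);
        String find(String userId);   // returns null when the user is unknown
    }

    // A trivial single-node implementation to start with.
    final class InMemoryUserStore implements UserStore {
        private final java.util.concurrent.ConcurrentMap<String, String> data =
                new java.util.concurrent.ConcurrentHashMap<>();

        @Override public void save(String userId, String profileJson) { data.put(userId, profileJson); }
        @Override public String find(String userId) { return data.get(userId); }
    }

Later, the same UserStore interface can be implemented on top of a sharded or replicated store, which is where the "scalable data storage and schema" choices come in.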

Tuesday, 24 August 2010

Tech talk: Get your distributed pub/sub on… Ben Reed talks about Hedwig

Tech talk: Get your distributed pub/sub on… Ben Reed talks about Hedwig: "


Hedwig

Benjamin Reed (Yahoo! Research)

June 7, 2010


ABSTRACT


Hedwig is a large-scale, cross-data-center publish/subscribe service developed at Yahoo! Research. We needed a scalable, fault-tolerant publish/subscribe service with strong delivery and ordering guarantees to maintain the consistency of replicas of datasets in different data centers. We found that we could use ZooKeeper, a coordination service, and BookKeeper, a distributed write-ahead logging service, to build such a service. To present Hedwig I will first review two of the services it is built on: ZooKeeper and BookKeeper. I will then present the motivation for Hedwig, review its design, and present its current status.


BIOGRAPHY


Benjamin Reed has worked for almost two decades in the industry: from an intern working on CAD/CAM systems, to shipping and receiving applications in OS/2, AIX, and CICS, to operations, to system admin research and Java frameworks at IBM Almaden Research (11 years), until finally arriving at Yahoo! Research (3 years ago) to work on distributed computing problems. His main interests are large-scale processing environments and highly available and scalable systems. Dr. Reed’s research project at IBM grew into OSGi, which is now in application servers, IDEs, cars, and mobile phones. While at Yahoo!, he has also worked on Pig and ZooKeeper, Apache projects for which he is a committer. Benjamin has a Ph.D. in Computer Science from the University of California, Santa Cruz.


[This video is part of LinkedIn's tech talk series.]

"

6 Ways to Kill Your Servers - Learning How to Scale the Hard Way

6 Ways to Kill Your Servers - Learning How to Scale the Hard Way: "


This is a guest post by Steffen Konerow, author of the High Performance Blog.


Learning how to scale isn’t easy without any prior experience. Nowadays you have plenty of websites like highscalability.com to get some inspiration, but unfortunately there is no solution that fits all websites and needs. You still have to think on your own to find a concept that works for your requirements. So did I.


A few years ago, my bosses came to me and said “We’ve got a new project for you. It’s the relaunch of a website that already has 1 million users a month. You have to build the website and make sure we’ll be able to grow afterwards”. I was already an experienced coder, but not in these dimensions, so I had to start learning how to scale – the hard way.



"

Sunday, 22 August 2010

Hoff’s 5 Rules Of Cloud Security…

Hoff’s 5 Rules Of Cloud Security…: "
Mike Dahn pinged me via Twitter with an interesting and challenging question:



I took this as a challenge to articulate my answer in five minutes or less, in succinct, bulleted form. I timed it: 4 minutes and 48 seconds. Loaded with snark and Hoffacino-fueled dogma.

Here goes:

  1. Get an Amazon Web Services account, instantiate a couple of AMIs as though you were deploying a web-based application with sensitive information that requires resilience, security, survivability and monitoring (see the sketch after this list). If you have never done this and you’re in security spouting off about the insecurities of Cloud, STFU and don’t proceed to step 2 until you do. AWS (IaaS) puts much of the burden on you to understand what needs to be done to secure Cloud-based services, which is why I focus on it. It’s also accessible and available to everyone.

  2. Take some time to be able to intelligently understand that as abstracted as much of Cloud is in terms of the lack of exposed operational moving parts, you still need to grok architecture holistically in order to be able to secure it — and the things that matter most within it. Building survivable systems, deploying securable (and as secure as you can make it) code, focusing on protecting information and ensuring you understand system design and The Three R’s (Resistance, Recognition, Recovery) is pretty darned important. That means you have to understand how the Cloud provider actually works so when they don’t you’ll already have planned around that…

  3. Employ a well-developed risk assessment/management framework and perform threat modeling. See OCTAVE, STRIDE/DREAD, FAIR. Understanding whether an application or datum is OK to move to “the Cloud” isn’t nuanced. It’s a simple application of basic, straightforward and prudent risk management. If you’re not doing that now, Cloud is the least of your problems. As I’ve said in the past “if your security sucks now, you’ll be pleasantly surprised by the lack of change when you move to Cloud.”

  4. Proceed to the Cloud Security Alliance website and download the guidance. Read it. Join one or more of the working groups and participate to make Cloud Security better in any way you believe you have the capacity to do so. If you just crow about how “more secure” the Cloud is or how “horribly insecure by definition” it is, it’s clear you’ve not done steps 1-3. Skip 1-3, go to #5 and then return to #1.

  5. Use common sense. There ain’t no patch for stupid. Most of us inherently understand that this is a marathon and not a sprint. If you take steps 1-4 seriously you’re going to be able to logically have discussions and make decisions about what deployment models and providers suit your needs. Not everything will move to the Cloud (public, private or otherwise) but a lot of it can and should. Being able to lay out a reasonable timeline is what moves the needle. Being an ideologue on either side of the tarpit does nobody any good. Arguing is for Twitter, doing is for people who matter.
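
For anyone who really has never done step 1, the sketch below shows roughly what it looks like with the AWS SDK for Java: launch one small instance from an AMI. The credentials, AMI id, key pair and security group names are placeholders, and error handling is omitted.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.RunInstancesRequest;
    import com.amazonaws.services.ec2.model.RunInstancesResult;

    public class LaunchInstanceSketch {
        public static void main(String[] args) {
            AmazonEC2Client ec2 = new AmazonEC2Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));   // placeholder credentials
            ec2.setEndpoint("ec2.us-east-1.amazonaws.com");

            RunInstancesRequest request = new RunInstancesRequest()
                    .withImageId("ami-00000000")        // placeholder AMI id
                    .withInstanceType("m1.small")
                    .withMinCount(1)
                    .withMaxCount(1)
                    .withKeyName("my-keypair")          // placeholder key pair
                    .withSecurityGroups("web-tier");    // placeholder security group

            RunInstancesResult result = ec2.runInstances(request);
            System.out.println("Launched: "
                    + result.getReservation().getInstances().get(0).getInstanceId());
        }
    }

Even this small exercise forces you to deal with credentials, key pairs and security groups, which is exactly the kind of burden step 1 is talking about.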

Cloud is only rocket science if you’re NASA and using the Cloud for rocket science. Else, for the rest of us, it’s an awesome platform upon which we leverage various opportunities to improve the way in which we think about and implement the practices and technology needed to secure the things that matter most to us.

/Hoff

(Yeah, I know. Not particularly novel or complex, right? Nope. That’s the point. Just like “How to Kick Ass in Information Security — Hoff’s Spiritually-Enlightened Top Ten Guide to Health, Wealth and Happiness“)

"

Tuesday, 17 August 2010

Scaling an AWS infrastructure - Tools and Patterns

Scaling an AWS infrastructure - Tools and Patterns: "


This is a guest post by Frédéric Faure (architect at Ysance); you can follow him on Twitter.

How do you scale an AWS (Amazon Web Services) infrastructure? This article will give you a detailed reply in two parts: the tools you can use to make the most of Amazon’s dynamic approach, and the architectural model you should adopt for a scalable infrastructure.

I base my report on experience gained in several AWS production projects in casual gaming (Facebook), e-commerce infrastructures and mainstream GIS (Geographic Information Systems). It’s true that my experience in gaming (IsCool, The Game) is currently the most representative in terms of scalability, due to the number of users (over 800 thousand DAU – daily active users – at peak usage and over 20 million page views every day); however, my experiences in e-commerce and GIS (currently underway) provide a different view of scalability, taking into account the various problems of availability and data management. I will therefore attempt to provide a detailed overview of the factors to take into account in order to optimise the dynamic nature of an infrastructure constructed in a Cloud Computing environment and, in this case, in the AWS environment.

"

Thursday, 12 August 2010

Three Papers on Load Balancing

Three Papers on Load Balancing: "
Found a reference to these three papers on load balancing in the Cassandra mailing list.

Simple Efficient Load Balancing Algorithms for Peer-to-Peer Systems


David R. Karger, Matthias Ruhl:


Load balancing is a critical issue for the efficient operation of peer-to-peer networks. We give two new load-balancing protocols whose provable performance guarantees are within a constant factor of optimal. Our protocols refine the consistent hashing data structure that underlies the Chord (and Koorde) P2P network. Both preserve Chord’s logarithmic query time and near-optimal data migration cost.
Our first protocol balances the distribution of the key address space to nodes, which yields a load-balanced system when the DHT maps items “randomly” into the address space. To our knowledge, this yields the first P2P scheme simultaneously achieving O(log n) degree, O(log n) look-up cost, and constant-factor load balance (previous schemes settled for any two of the three).
Our second protocol aims to directly balance the distribution of items among the nodes. This is useful when the distribution of items in the address space cannot be randomized—for example, if we wish to support range searches on “ordered” keys. We give a simple protocol that balances load by moving nodes to arbitrary locations “where they are needed.” As an application, we use the last protocol to give an optimal implementation of a distributed data structure for range searches on ordered data.


PDF available ☞ here.
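
A minimal sketch of the consistent-hashing-with-virtual-nodes structure the paper builds on (the baseline, not the paper's improved protocols): each node is hashed to several points on a ring, and a key belongs to the first node clockwise from the key's hash. The class and the toy hash function are only illustrative.

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ConsistentHashRing {
        private final TreeMap<Long, String> ring = new TreeMap<>();
        private final int virtualNodesPerNode;

        public ConsistentHashRing(int virtualNodesPerNode) {
            this.virtualNodesPerNode = virtualNodesPerNode;
        }

        // Map each node to several points ("virtual nodes") on the ring.
        public void addNode(String node) {
            for (int i = 0; i < virtualNodesPerNode; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        // A key is owned by the first node at or after the key's hash, wrapping around.
        public String nodeFor(String key) {
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        // A toy 64-bit hash (FNV-1a); a production system would use a stronger function.
        private static long hash(String s) {
            long h = 0xcbf29ce484222325L;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x100000001b3L;
            }
            return h;
        }
    }

With enough virtual nodes per physical node, keys spread roughly evenly, and adding or removing a node only remaps the keys adjacent to its points on the ring.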

Load Balancing in Structured P2P Systems


Ananth Rao, Karthik Lakshminarayanan, Sonesh Surana, Richard Karp, Ion Stoica:


Most P2P systems that provide a DHT abstraction distribute objects among “peer nodes” by choosing random identifiers for the objects. This could result in an O(log N) imbalance. Besides, P2P systems can be highly heterogeneous, i.e. they may consist of peers that range from old desktops behind modem lines to powerful servers connected to the Internet through high-bandwidth lines. In this paper, we address the problem of load balancing in such P2P systems.
We explore the space of designing load-balancing algorithms that uses the notion of “virtual servers”. We present three schemes that differ primarily in the amount of information used to decide how to re-arrange load. Our simulation results show that even the simplest scheme is able to balance the load within 80% of the optimal value, while the most complex scheme is able to balance the load within 95% of the optimal value.


PS document available ☞ here.
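
A toy illustration of the "virtual servers" notion (deliberately simpler than any of the paper's three schemes): each physical node hosts several virtual servers, and load is rebalanced by handing whole virtual servers from the heaviest node to the lightest one. Class names and the greedy policy are invented for the example.

    import java.util.ArrayDeque;
    import java.util.Comparator;
    import java.util.Deque;
    import java.util.List;

    class PhysicalNode {
        final String name;
        final Deque<Double> virtualServerLoads = new ArrayDeque<>();   // one entry per virtual server

        PhysicalNode(String name) { this.name = name; }

        double totalLoad() {
            return virtualServerLoads.stream().mapToDouble(Double::doubleValue).sum();
        }
    }

    class Rebalancer {
        // Greedily moves virtual servers until no move narrows the heavy/light gap.
        static void rebalance(List<PhysicalNode> nodes) {          // assumes a non-empty list
            while (true) {
                PhysicalNode heavy = nodes.stream()
                        .max(Comparator.comparingDouble(PhysicalNode::totalLoad)).get();
                PhysicalNode light = nodes.stream()
                        .min(Comparator.comparingDouble(PhysicalNode::totalLoad)).get();
                if (heavy == light || heavy.virtualServerLoads.isEmpty()) return;
                double moved = heavy.virtualServerLoads.peekFirst();
                // Only move if the receiver stays no more loaded than the sender afterwards.
                if (moved <= 0 || light.totalLoad() + moved > heavy.totalLoad() - moved) return;
                light.virtualServerLoads.addLast(heavy.virtualServerLoads.removeFirst());
            }
        }
    }

The paper's actual schemes differ in how much load information nodes exchange before deciding which virtual servers to move, and where they move them.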

Simple Load Balancing for Distributed Hash Tables


John Byers, Jeffrey Considine, Michael Mitzenmacher:


Distributed hash tables have recently become a useful building block for a variety of distributed applications. However, current schemes based upon consistent hashing require both considerable implementation complexity and substantial storage overhead to achieve desired load balancing goals. We argue in this paper that these goals can be achieved more simply and more cost-effectively. First, we suggest the direct application of the “power of two choices” paradigm, whereby an item is stored at the less loaded of two (or more) random alternatives. We then consider how associating a small constant number of hash values with a key can naturally be extended to support other load balancing strategies, including load-stealing or load-shedding, as well as providing natural fault-tolerance mechanisms.


PS document available ☞ here.
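
A small sketch of the "power of two choices" placement the abstract describes: hash each key with two different hash functions and store the item on whichever of the two candidate nodes currently holds fewer items. Node names and the toy hash functions are only illustrative.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TwoChoicePlacement {
        private final List<String> nodes;                        // candidate storage nodes
        private final Map<String, Integer> itemsPerNode = new HashMap<>();

        public TwoChoicePlacement(List<String> nodes) {
            this.nodes = nodes;
            nodes.forEach(n -> itemsPerNode.put(n, 0));
        }

        // Store the item on the less loaded of the two candidates and return the choice.
        public String place(String key) {
            String a = nodes.get(Math.floorMod(hash(key, 0x9e3779b9), nodes.size()));
            String b = nodes.get(Math.floorMod(hash(key, 0x85ebca6b), nodes.size()));
            String chosen = itemsPerNode.get(a) <= itemsPerNode.get(b) ? a : b;
            itemsPerNode.merge(chosen, 1, Integer::sum);
            return chosen;
        }

        // Two "independent" hash functions derived from different seeds (toy quality only).
        private static int hash(String key, int seed) {
            int h = seed;
            for (int i = 0; i < key.length(); i++) {
                h = h * 31 + key.charAt(i);
            }
            return h;
        }
    }

Lookups then probe both candidate nodes, which is the small price paid for the much tighter load balance.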

Happy reading!


Three Papers on Load Balancing originally posted on the NoSQL blog: myNoSQL

"

Hadoop: The Problem of Many Small Files

Hadoop: The Problem of Many Small Files: "Hadoop: The Problem of Many Small Files:
On why storing small files in HDFS is inefficient and how to solve this issue using Hadoop Archive:


When there are many small files stored in the system, these small files occupy a large portion of the namespace. As a consequence, the disk space is underutilized because of the namespace limitation. In one of our production clusters, there are 57 million files of sizes less than 128 MB, which means that these files contain only one block. These small files use up 95% of the namespace but only occupy 30% of the disk space.



Hadoop: The Problem of Many Small Files originally posted on the NoSQL blog: myNoSQL

"