Monday, 12 July 2010

Making data Vanish


Given how hard it is to save data you want to keep (see The Universe hates your data), losing data on the web should be easy. It isn’t, because data gets stored in so many places in its travels.


Problem

But the power of the web means that silliness can now be stored and found with the speed of a Google search. You don’t want sexy love notes – or pictures – to a former flame posted after infatuation ends.


Or maybe you want to discuss relationship, health or work problems with a friend over email – and don’t want your musings to be later shared with others. Wouldn’t it be nice to know that such messages will become unreadable even if your friend is unreliable?


Researchers built a prototype service – Vanish – that seeks to:



. . . ensure that all copies of certain data become unreadable after a user-specified time, without any specific action on the part of a user, without needing to trust any single third party to perform the deletion, and even if an attacker obtains both a cached copy of that data and the user’s cryptographic keys and passwords.


That’s a tall order. Their first proof-of-concept failed. But they are continuing the fight.


Vanish

In Vanish: Increasing Data Privacy with Self-Destructing Data, Roxana Geambasu, Tadayoshi Kohno, Amit A. Levy, and Henry M. Levy of the University of Washington computer science department present an architecture and a prototype to do just that.


Ironically, the project relies on the same P2P infrastructure that preserves and distributes data: the distributed hash table (DHT) built into BitTorrent’s Vuze client.


The basic idea is this: Vanish encrypts your data with a random key, splits the key into pieces, sprinkles those pieces across random nodes of the DHT, and then destroys its local copy of the key. You tell the system when the key should expire, and your data goes poof!


They developed a data structure called a Vanishing Data Object (VDO) that encapsulates user data and prevents the content from persisting. And the data becomes unreadable even if the attacker gets a pristine copy of the VDO from before its expiration and all the associated keys and passwords.


Here’s a timeline for that attack:

[Figure: VDO attack timeline, courtesy of the Vanish team]

DHT overview



A DHT is a distributed, peer-to-peer (P2P) storage network. . . . DHTs like Vuze generally exhibit a put/get interface for reading and storing data, which is implemented internally by three operations: lookup, get, and store. The data itself consists of an (index, value) pair. Each node in the DHT manages a part of an astronomically large index name space (e.g., 2^160 values for Vuze).


DHTs are available, scalable, broadly distributed and decentralized with rapid node churn. All these properties are ideal for an infrastructure that has to withstand a wide variety of attacks.
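
As a sketch of that put/get interface, here’s a toy in-process DHT in Python with a drastically shrunken index space – the real Vuze DHT spans 2^160 indices across millions of unreliable nodes, so treat this as illustration only:

```python
import hashlib

class ToyDHT:
    """Minimal put/get interface of a DHT: each node manages a slice of
    a large index name space, and indices are derived by hashing keys."""
    def __init__(self, n_nodes=8, index_bits=16):
        self.space = 2 ** index_bits          # toy stand-in for 2^160
        self.nodes = [{} for _ in range(n_nodes)]

    def _index(self, key: bytes) -> int:
        return int.from_bytes(hashlib.sha1(key).digest(), "big") % self.space

    def _lookup(self, index):
        # Route the index to the node managing that slice of the space.
        return self.nodes[index * len(self.nodes) // self.space]

    def put(self, key: bytes, value):
        self._lookup(self._index(key))[self._index(key)] = value

    def get(self, key: bytes):
        return self._lookup(self._index(key)).get(self._index(key))

dht = ToyDHT()
dht.put(b"K3", "third share of the data key")
print(dht.get(b"K3"))   # 'third share of the data key'
```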


Vanish architecture



Data (D) is encrypted (E) with key (K) to produce ciphertext (C). Then K is split into N shares – K1,…,KN – and distributed across the DHT at indices derived from a random access key (L) by a secure pseudo-random number generator. The key split uses a redundant erasure code so that a user-definable subset of the N shares can reconstruct the key.


The erasure codes are needed because DHTs lose data to node churn. It is a bug that becomes a feature when the goal is secure destruction of data.
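
The paper uses Shamir’s threshold secret sharing for the key split. Here’s a minimal sketch of the scheme in Python – the prime field, N, and threshold below are illustrative toy parameters, not Vanish’s actual ones:

```python
import random

P = 2**127 - 1  # a public Mersenne prime; all arithmetic is mod P

def split_key(secret: int, n: int, k: int):
    """Split `secret` into n shares so that any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    poly = lambda x: sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def recover_key(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * -xj % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P  # den^-1 mod P
    return secret

K = random.randrange(P)            # the random data key
shares = split_key(K, n=10, k=7)   # sprinkle these across DHT nodes
assert recover_key(shares[:7]) == K   # any 7 shares suffice
assert recover_key(shares[:6]) != K   # 6 shares reveal nothing useful
```

Once enough share-holding DHT nodes have churned away that fewer than the threshold remain, the key – and with it the VDO’s contents – is gone for good.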


Prototype

They built a Firefox plug-in for Gmail to create self-destructing emails and another – FireVanish – for making any text in a web input box self-destructing. They also built a file app, so you can make any file self-destructing. Handy for Word backup files that you don’t want to keep around.


The major change to the Vuze BitTorrent client was less than 50 lines of code to prevent lookup sniffing attacks. Those changes only affect the client, not the DHT.


The Vanish prototype was cracked by a group of researchers at UT Austin, Princeton, and the University of Michigan. They found that an eavesdropper could collect the key shards from the DHT and reassemble the “vanished” content.


Who is going to collect all the shard-like pieces on DHTs? Other than the NSA and other major intelligence services, probably no one. For extra security the data can be encrypted before VDO encapsulation.


The StorageMojo take

The Internet is paid for with our loss of privacy. Young people may think it no great loss; check back in 20 years and we’ll see what you think then.


It is slowly dawning on the public that their lives are an open book on the Internet. Expect a growing market for private communication and storage if ease-of-use and trust issues can be resolved.


You don’t have to be Tiger Woods to want to keep your private life private. I hope the Vanish team succeeds.


Courteous comments welcome, of course. Figures courtesy of the Vanish team.





Long tailed workloads and the return of hierarchical storage management



Hierarchical storage management (HSM), also called tiered storage management, is back but in a different form. HSM exploits the access-pattern skew across data sets by placing cold, seldom-accessed data on slow, cheap media and frequently accessed data on fast, near media. In the old days, HSM typically referred to systems mixing robotically managed tape libraries with hard disk drive staging areas. HSM actually never went away – it’s just a very old technique that exploits data access pattern skew to reduce storage costs. Here’s an old unit from FermiLab.

Hot data, or data currently being accessed, is stored on disk, and old data that has not been recently accessed is stored on tape. It’s a great way to drive costs down below all-disk levels while avoiding the people costs of manual tape management and (slightly) reducing the latency of accessing data on tape.

The basic concept shows up anywhere there is data access locality, or skew in the access patterns, where some data is rarely accessed and some data is frequently accessed. Since evenly distributed, non-skewed access patterns only show up in benchmarks, this concept works on a wide variety of workloads. Processors have cache hierarchies where the top of the hierarchy is a very expensive register file, with multiple layers of increasingly large caches between the register file and memory. Database management systems have large in-memory buffer pools insulating access to slow disk. Many very high-scale services like Facebook have mammoth in-memory caches insulating access to slow database systems. In the Facebook example, they have 2TB of Memcached in front of their vast MySQL fleet: Velocity 2010.

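The Memcached-in-front-of-MySQL arrangement is the classic look-aside cache. Here’s a minimal sketch of the pattern in Python – the dict stands in for memcached, and the TTL and names are illustrative:

```python
import time

class LookasideCache:
    """Cache-aside pattern: try the fast tier, fall back to the slow
    store on a miss, then populate the fast tier for next time."""
    def __init__(self, backing_store, ttl_seconds=60):
        self.store = backing_store   # slow tier (e.g., a MySQL fleet)
        self.cache = {}              # fast tier (stand-in for memcached)
        self.ttl = ttl_seconds

    def get(self, key):
        hit = self.cache.get(key)
        if hit is not None:
            value, expires = hit
            if time.time() < expires:
                return value                     # fast path: memory
        value = self.store[key]                  # slow path: database
        self.cache[key] = (value, time.time() + self.ttl)
        return value

database = {"user:42": {"name": "Alice"}}        # hypothetical slow store
cache = LookasideCache(database)
print(cache.get("user:42"))   # miss: reads the database, then caches
print(cache.get("user:42"))   # hit: served from memory
```
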
Flash memory again opens up the opportunity to apply HSM concepts to storage. Rather than using slow tape and fast disk, we use (relatively) slow disk and fast NAND flash. There are many approaches to implementing HSM over flash memory and hard disk drives. Facebook implemented Flashcache, which tracks access patterns at the logical volume layer (below the filesystem) in Linux, with hot pages written to flash and cold pages to disk. LSI’s CacheCade product is a good example of an implementation done at the disk controller level. Others have done it in application-specific logic, where hot indexing structures are put on flash and cold data pages are written to disk. Yet another approach, which showed up around 3 years ago, is the Hybrid Disk Drive.

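Here’s a sketch of the block-level idea behind a Flashcache-style approach – promote recently touched blocks to flash and let cold blocks stay on disk. LRU stands in for real access-pattern tracking, and the capacities are toy numbers:

```python
from collections import OrderedDict

class FlashTier:
    """Hot blocks live on 'flash'; everything else stays on 'disk'."""
    def __init__(self, flash_blocks=4):
        self.flash = OrderedDict()     # block_id -> data, in LRU order
        self.capacity = flash_blocks

    def read(self, block_id, disk):
        if block_id in self.flash:                 # flash hit: fast path
            self.flash.move_to_end(block_id)
            return self.flash[block_id]
        data = disk[block_id]                      # miss: slow disk read
        self.flash[block_id] = data                # promote the hot block
        if len(self.flash) > self.capacity:
            self.flash.popitem(last=False)         # evict the coldest
        return data

disk = {i: f"block-{i}" for i in range(100)}       # stand-in for the HDD
tier = FlashTier()
for b in [1, 2, 1, 3, 1, 9]:
    tier.read(b, disk)
print(list(tier.flash))    # the resident hot set: [2, 3, 1, 9]
```
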
Hybrid drives combine large, persistent flash memory caches with disk drives in a single package. When they were first announced, I was excited by them, but I got over the excitement once I started benchmarking. It looked like a good idea, but the performance really was unimpressive.

Hybrid drives still look like a good idea, and now we actually have respectable performance with the Seagate Momentus XT. This part is actually aimed at the laptop market, but I’m always watching client progress to understand what can be applied to servers. This finally looks like it’s heading in the right direction. See the AnandTech article on this drive for more performance data: Seagate’s Momentus XT Reviewed, Finally a Good Hybrid HDD. I still slightly prefer the Facebook FlashCache approach, but these hybrid drives are worth watching.

Thanks to Greg Linden for sending this one my way.

--jrh

James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

From Perspectives.

What’s new in CDH3 b2: HBase


Over the last two years, Cloudera has helped a great number of customers achieve their business objectives on the Hadoop platform. In doing so, we’ve confirmed time and time again that HDFS and MapReduce provide an incredible platform for batch data analysis and other throughput-oriented workloads. However, these systems don’t inherently provide subsecond response times or allow random-access updates to existing datasets. HBase, one of the new components in CDH3b2, addresses these limitations by providing a real time access layer to data on the Hadoop platform.


HBase is a distributed database modeled after BigTable, an architecture that has been in use for hundreds of production applications at Google since 2006. The primary goal of HBase is simple: HBase provides an immensely scalable data store with soft real-time random reads and writes, a flexible data model suitable for structured or complex data, and strong consistency semantics for each row. As HBase is built on top of the proven HDFS distributed storage system, it can easily scale to many TBs of data, transparently handling replicated storage, failure recovery, and data integrity.


In this post I’d like to highlight two common use cases for which we think HBase will be particularly effective:


Analysis of continuously updated data


Many applications need to analyze data that is always changing. For example, an online retail or web presence has a set of customers, visitors, and products that is constantly being updated as users browse and complete transactions. Social or graph-analysis applications may have an ever-evolving set of connections between users and content as new relationships are discovered. In web mining applications, crawlers ingest newly updated web pages, RSS feeds, and mailing lists.


With data access methods available for all major languages, it’s simple to interface data-generating applications like web crawlers, log collectors, or web applications to write into HBase. For example, the next generation of the Nutch web crawler stores its data in HBase. Once the data generators insert the data, HBase enables MapReduce analysis on either the latest data or a snapshot at any recent timestamp.
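
For a feel of the write path, here’s a sketch in Python using the happybase Thrift client – the table name, column family, and row key are invented for the example, and it assumes an HBase Thrift server is running on localhost:

```python
import happybase  # Python client for HBase's Thrift gateway

connection = happybase.Connection("localhost")
table = connection.table("webpages")   # hypothetical crawl table

# Random-access write: lands in HBase immediately, no batch job needed.
table.put(b"com.example/index.html", {
    b"content:html": b"<html>...</html>",
    b"content:fetched_at": b"2010-07-12T10:00:00Z",
})

# Soft real-time random read of the latest version of the row.
row = table.row(b"com.example/index.html")
print(row[b"content:html"])
```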


User-facing analytics


Most traditional data warehouses focus on enabling businesses to analyze their data in order to make better business decisions. Some algorithm or aggregation is run on raw data to generate rollups, reports, or statistics giving internal decision-makers better information on which to base product direction, design, or planning. Hadoop enables those traditional business intelligence use cases, but also enables data mining applications that feed directly back into the core product or customer experience. For example, LinkedIn uses Hadoop to compute the “people you may know” feature, improving connectivity inside their social graph. Facebook, Yahoo, and other web properties use Hadoop to compute content- and ad-targeting models to improve stickiness and conversion rates.


These user-facing data applications rely on the ability not just to compute the models, but also to make the computed data available for latency-sensitive lookup operations. For these applications, it’s simple to integrate HBase as the destination for a MapReduce job. Extremely efficient incremental and bulk loading features allow the operational data to be updated while simultaneously serving traffic to latency-sensitive workloads. Compared with alternative data stores, the tight integration with other Hadoop projects as well as the consolidation of infrastructure are tangible benefits.


HBase in CDH3 b2


We at Cloudera are pretty excited about HBase, and are happy to announce it as a first-class component in Cloudera’s Distribution for Hadoop v3. HBase is a younger project than the other ecosystem projects, but we are committed to doing the work necessary to stabilize and improve it to the same level of robustness and operability you are used to from the rest of the Cloudera platform.


Since the beginning of this year, we have put significant engineering into the HBase project, and will continue to do so for the future. Among the improvements contributed by Cloudera and newly available in CDH3 b2 are:



  • HDFS improvements for HBase – along with the HDFS team at Facebook, we have contributed a number of important bug fixes and improvements for HDFS specifically to enable and improve HBase. These fixes include a working sync feature that allows full durability of every HBase edit, plus numerous performance and reliability fixes.

  • Failure-testing frameworks – we have been testing HBase and HDFS under a variety of simulated failure conditions in our lab using a new framework called gremlins. Our mantra is that if HBase can remain available on a cluster infested with gremlins, it will be ultra-stable on customer clusters as well.

  • Incremental bulk load – this feature allows efficient bulk load from MapReduce jobs into existing tables.


What’s up next for HBase in CDH3?


Our top priorities for HBase right now are stability, reliability, and operability. To that end, some of the projects we’re currently working on with the rest of the community are:



  • Improved Master failover – the HBase master already has automatic failover capability – the HBase team is working on improving the reliability of this feature by tighter integration with Apache ZooKeeper.

  • Continued fault testing and bug fixing – the fault testing effort done over the last several months has exposed some bugs that aren’t yet solved – we’ll fix these before marking CDH3 stable.

  • Better operator tools and monitoring – work is proceeding on an hbck tool (similar to filesystem fsck) that allows operators to detect and repair errors on a cluster. We’re also improving monitoring and tracing to better understand HBase performance in production systems.


And, as Charles mentioned, we see CDH as a holistic data platform. To that end, several integration projects are already under way:



  • Flume integration – capture logs from application servers directly into HBase

  • Sqoop integration – efficiently import data from relational databases into HBase

  • Hive integration – write queries in a familiar SQL dialect in order to analyze data stored in HBase.


Get HBase


If HBase sounds like it’s a good fit for one of your use cases, I encourage you to head on over to the HBase install guide and try it out today. I’d also welcome any feedback you have – HBase, like CDH3 b2, is still in a beta form, and your feedback is invaluable for improving it as we march towards stability.

Sunday, 11 July 2010

Metamorphic code: Information from Answers.com


While polymorphic viruses encrypt their functional code to avoid pattern recognition, such a virus still needs to decrypt that code – unmodified from infection to infection – in order to execute. Metamorphic viruses instead change their code to an equivalent one (i.e., code doing essentially the same thing), so that a mutated virus never has the same executable code in memory (not even at runtime) as the original virus that constructed the mutation. This modification can be achieved using techniques like inserting NOP instructions, swapping registers, changing flow control with jumps, or reordering independent instructions. Metamorphic code is usually more effective than polymorphic code: unlike with polymorphic viruses, anti-virus products cannot simply use emulation techniques to defeat metamorphism, since metamorphic code may never reveal code that remains constant from infection to infection.
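
A toy illustration of those transformations in Python – a rewriter that applies register swapping and NOP insertion to a pseudo-assembly program (real metamorphic engines transform actual machine code, not strings):

```python
import random

def mutate(program):
    """Rewrite a pseudo-assembly program into an equivalent variant by
    consistently renaming registers and inserting NOPs. A toy model of
    metamorphism, not a real code-transformation engine."""
    regs = ["r0", "r1", "r2", "r3"]
    renamed = dict(zip(regs, random.sample(regs, len(regs))))  # register swap
    out = []
    for instr in program:
        op, *args = instr.split()
        out.append(" ".join([op] + [renamed.get(a, a) for a in args]))
        if random.random() < 0.3:
            out.append("nop")                                  # NOP insertion
    return out

original = ["mov r0 5", "mov r1 7", "add r0 r1", "ret"]
print(mutate(original))  # same behavior, different instruction sequence
print(mutate(original))  # and different again on every mutation
```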

Thursday, 8 July 2010

Strategy: Recompute Instead of Remember Big Data


Professor Lance Fortnow, in his blog post Drowning in Data, says complexity has taught him this lesson: When storage is expensive, it is cheaper to recompute what you've already computed. And that's the world we now live in: Storage is pretty cheap but data acquisition and computation are even cheaper.


Jouni, one of the commenters, thinks the opposite is true: storage is cheap, but computation is expensive. When you are dealing with massive data, the size of the data set is very often determined by the amount of computing power available for a certain price. With such data – n scaling in proportion to available compute – a linear-time algorithm takes O(1) seconds to finish, while a quadratic-time algorithm requires O(n) seconds. But as computing power (and with it n) increases exponentially over time, the quadratic algorithm gets exponentially slower.


For me it's not a matter of which is true – both positions can be true – but what's interesting is to think that storage and computation are in some cases fungible. Your architecture can decide which tradeoffs to make based on the cost of resources and the nature of your data. I'm not sure, but this seems like a new degree of freedom in the design space.
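
A back-of-envelope way to frame that tradeoff – every price below is a made-up illustrative number, not a real cloud rate:

```python
# When is recomputing cheaper than remembering?
storage_cost_per_gb_month = 0.10   # $/GB-month to keep the result around
compute_cost_per_hour     = 0.085  # $/hour to rebuild it on demand
rebuild_hours             = 2.0    # time to recompute the dataset
result_size_gb            = 500
rebuilds_per_month        = 1      # how often the data is actually needed

keep = result_size_gb * storage_cost_per_gb_month
redo = rebuilds_per_month * rebuild_hours * compute_cost_per_hour
print(f"store: ${keep:.2f}/mo   recompute: ${redo:.2f}/mo")
print("recompute wins" if redo < keep else "store wins")
```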

Sunday, 4 July 2010

Hadoop Summit 2010


I didn’t attend the Hadoop Summit this year or last, but I was at the inaugural event back in 2008 and it was excellent. This year the Hadoop Summit was held on June 29, again in Santa Clara. The agenda for the 2010 event is at: Hadoop Summit 2010 Agenda. Since I wasn’t able to be there, Adam Gray of the Amazon AWS team was kind enough to pass on his notes and let me use them here:

Key Takeaways
· Yahoo and Facebook operate the world’s largest Hadoop clusters – 4,000 and 2,300 nodes with 70 and 40 petabytes respectively. They run full cluster replicas to assure availability and data durability.
· Yahoo released Hadoop security features with Kerberos integration, most useful for long-running multi-tenant Hadoop clusters.
· Cloudera released a paid enterprise version of Hadoop with cluster management tools and several database connectors, and announced support for Hadoop security.
· Amazon Elastic MapReduce announced expand/shrink cluster functionality and paid support.
· Many Hadoop users use the service in conjunction with NoSQL databases like HBase or Cassandra.

Keynotes
Yahoo had the opening keynote with talks by Blake Irving, Chief Products Officer, Shelton Shugar, SVP of Cloud Computing, and Eric Baldeschwieler, VP of Hadoop. They talked about Yahoo’s scale, including 38k Hadoop servers, 70 PB of storage, and more than 1 million monthly jobs, with half of those jobs written in Apache Pig. Further, their agility is improving significantly despite this massive scale – within 7 minutes of a homepage click they have a completely reconfigured preference model for that user and an updated homepage. This would not be possible without Hadoop. Yahoo believes that Hadoop is ready for enterprise use at massive scale and that their use case proves it. Further, a recent study found that 50% of enterprise companies are strongly considering Hadoop, with the most commonly cited reason being agility. Initiatives over the last year include further investment and improvement in Hadoop 0.20, integration of Hadoop with Kerberos, and the Oozie workflow engine.

Next, Peter Sirota gave a keynote for Amazon Elastic MapReduce that focused on how the service makes combining the massive scalability of MapReduce with the web-scale infrastructure of AWS more accessible, particularly to enterprise customers. He also announced several new features, including expanding and shrinking the cluster size of running job flows, support for spot instances, and premium support for Elastic MapReduce. Further, he discussed Elastic MapReduce’s involvement in the ecosystem, including integration with Karmasphere and Datameer. Finally, Scott Capiello, Senior Director of Products at MicroStrategy, came on stage to discuss their integration with Elastic MapReduce.

Cloudera followed with talks by Doug Cutting, the creator of Hadoop, and Charles Zedlewski, Senior Director of Product Management. They announced Cloudera Enterprise, a version of their software that includes production support and additional management tools. These tools include improved data integration and authorization management that leverages Hadoop’s security updates. And they demoed a web UI for using these management tools.

The final keynote was given by Mike Schroepfer, VP of Engineering at Facebook. He talked about Facebook’s scale: 36 PB of uncompressed storage, 2,250 machines with 23k processors, and 80-90 TB of growth per day. Their biggest challenge is getting all that data into Hadoop clusters. Once the data is there, 95% of their jobs are Hive-based. To ensure reliability they replicate critical clusters in their entirety. As for traffic, the average user spends more time on Facebook than on the next 6 web pages combined. To improve user experience, Facebook is continually improving the response time of their Hadoop jobs. Currently updates can occur within 5 minutes; however, they see this eventually moving below 1 minute. Since this is often an acceptable wait time for changes to appear on a webpage, it will open up a whole new class of applications.

Discussion Tracks
After lunch the conference broke into three distinct discussion tracks: Developers, Applications, and Research. These tracks had several interesting talks, including one by Jay Kreps, Principal Engineer at LinkedIn, who discussed LinkedIn’s data applications and infrastructure. He believes that their business data is ideally suited for Hadoop due to its massive scale but relatively static nature. This supports large amounts of computation being done offline. Further, he talked about their use of machine learning to predict relationships between users. This requires scoring 120 billion relationships each day using 82 Hadoop jobs. Lastly, he talked about LinkedIn’s in-house-developed workflow management tool, Azkaban, an alternative to Oozie.

Eric Sammer, Solutions Architect at Cloudera, discussed some best practices for dealing with massive data in Hadoop. Particularly, he discussed the value of using workflows for complex jobs, incremental merges to reduce data transfer, and the use of Sqoop (SQL to Hadoop) for bulk relational database imports and exports. Yahoo’s Amit Phadke discussed using Hadoop to optimize online content. His recommendations included leveraging Pig to abstract out the details of MapReduce for complex jobs and taking advantage of the parallelism of HBase for storage. There was also significant interest in the challenges of using Hadoop for graph algorithms including a talk that was so full that they were unable to let additional people in.

Elastic MapReduce Customer Panel
The final session was a customer panel of current Amazon Elastic MapReduce users chaired by Deepak Singh. Participants included Netflix, Razorfish, Coldlight Solutions, and Spiral Genetics. Highlights include:

· Razorfish discussed a case study in which a combination of Elastic MapReduce and Cascading allowed them to take a customer to market in half the time with a 500% return on ad spend. They discussed how using EMR has given them much better visibility into their costs, allowing them to pass this transparency on to customers.

· Netflix discussed their use of Elastic MapReduce to set up a Hive-based data warehousing infrastructure. They keep a persistent cluster with data backups in S3 to ensure durability. Further, they reduce the amount of data transfer through pre-aggregation and preprocessing of data.

· Spiral Genetics talked about how they had to leverage AWS to reduce capital expenditures. By using Amazon Elastic MapReduce they were able to set up a running job in 3 hours. They are also excited to see spot instance integration.

· Coldlight Solutions said that buying half a million dollars of infrastructure wasn’t even an option when they started. Now it is, but they would rather focus on their strength, machine learning, and Amazon Elastic MapReduce allows them to do this.


Saturday, 3 July 2010

Filesystem Interfaces to NoSQL Solutions


Sounds like it is becoming a trend. So far we have:



Jeff Darcy explains why having a filesystem interface to NoSQL storage can make sense:



Voldemort is pretty easy to use, but providing a filesystem interface makes it possible for even “naive” programs that know nothing about Voldemort to access data stored there. It also provides things like directories, UNIX-style owners and groups, and byte-level access to arbitrarily large files. Lastly, it handles some common contention/consistency issues in what I hope is a useful way, instead of requiring the application to deal with those issues itself.
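
To make the idea concrete, here’s a toy sketch of mapping filesystem-style paths onto a flat key-value store – a dict stands in for Voldemort, and the real FUSE plumbing, ownership, and consistency handling are omitted:

```python
class KVFilesystem:
    """Toy filesystem facade over a flat key-value store. Files live
    under 'file:' keys; a 'dir:' index per parent makes listing work,
    so naive path-oriented programs can find their data."""
    def __init__(self):
        self.kv = {}          # stand-in for the NoSQL store

    def write(self, path: str, data: bytes):
        self.kv["file:" + path] = data
        # Maintain a directory index so listings work (absolute paths only).
        parent = path.rsplit("/", 1)[0] or "/"
        self.kv.setdefault("dir:" + parent, set()).add(path)

    def read(self, path: str) -> bytes:
        return self.kv["file:" + path]

    def listdir(self, dirpath: str):
        return sorted(self.kv.get("dir:" + dirpath, set()))

fs = KVFilesystem()
fs.write("/notes/todo.txt", b"ship it")
print(fs.read("/notes/todo.txt"))   # b'ship it'
print(fs.listdir("/notes"))         # ['/notes/todo.txt']
```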


☞ pl.atyp.us