Tuesday 31 December 2013

ŷhat | 10 Books for Data Enthusiasts

Source: http://blog.yhathq.com/posts/ten-data-books.html
August 11, 2013

Over the last few years, I've invested a lot of time exploring various areas of data analysis and software development. Going down the proverbial coding rabbit hole, I've quietly accumulated a lot of books on various subjects.
This is a post about 10 data books that I've gotten a lot of mileage out of and that really have legs.
  1. Programming Collective Intelligence by Toby Segaran

    Synopsis
    An overview of machine learning and the key algorithms in use today. Each chapter outlines a problem, defines an approach to solving it using a particular algorithm, and then gives you all the sample code you need to solve it.
    Why you should read it
    One of my favorite books (non-technical and technical). I try to re-read it at least once per year. Great explanations of how you can make machine learning useful.
    Everyone has something to learn from PCI. My only criticism: the code is indented with 2 spaces instead of 4. Nitpicky, but annoying. Despite the fact that this is one of the oldest books on the list, it has managed to stay extremely relevant in the ever-changing landscape of data analysis tools.
  2. Machine Learning for Hackers by Drew Conway and John Myles White

    Synopsis
    A series of real world case studies and solutions which use machine learning. This is a very practical approach to machine learning. The visuals are great and there are plenty of code samples to go around. A few of the chapters focusing on text classification/regression are particularly well done.
    Why you should read it
    I was on the pre-order list for this one. It was a grueling 3 months on the waiting list, but when it arrived, Machine Learning for Hackers didn't disappoint. The code examples are optimized for readability rather than performance, which makes it much easier to follow along in the book (and translate them to other languages if need be). The code examples were also translated into Python, so I've included the Python logo even though it's not actually in the book.
  3. Super Crunchers by Ian Ayres

    Synopsis
    A collection of stories about data, modeling, and analysis, Super Crunchers tells how data and analysis are used in practice. Some of the examples are a little dated, but the core message stands the test of time.
    Why you should read it
    It's a lot higher level than most of the books on this list, and is geared toward people who might not actually be doing the analysis or the modeling. Still, Super Crunchers is a great read, and if you happen to be an analyst or data scientist, this will give you some insight into how the rest of the world views your work (for better or worse). The most important takeaway from the book is not necessarily what algorithms or technologies are being applied, but how they're being applied and how they're changing the way that companies use their data.
  4. Python for Data Analysis by Wes McKinney

    Synopsis
    A few years ago Wes McKinney took one for the team. He quit his job and wrote pandas, the open source Python package for wrangling data. Naturally Wes is the best person to write the book on pandas. The title may be a little misleading but Python for Data Analysis shows you the ins and outs of using pandas to improve your workflow.
    Why you should read it
    pandas is a must-have for doing analysis with Python. This book focuses more on munging, wrangling, and formatting data (not modeling, which many people incorrectly assume). So if you need to brush up on your data wrangling (and you probably do), grab this off the shelf.
  5. R Cookbook by Paul Teetor

    Synopsis
    Pretty straightforward. A series of recipes for problems frequently encountered when doing analysis. Things like: building a regression model, merging data, imputing values, file I/O, etc.
    Why you should read it
    R can be a prickly language. The syntax is a little strange when you first start, everything is in tabular form, and weird stuff just tends to happen in general. This is the perfect book for when you have a question like:
    "I just want to loop through a bunch of files and combine them together. I know exactly how I'd do it in Python, but how the heck do I do it in R?"
    I strongly recommend this book if you're learning R, especially if you're coming from another programming language. It'll sit on your desk at work forever and you're guaranteed to pick it up at least a couple times per week.
  6. The Signal and the Noise by Nate Silver

    Synopsis
    A great overview of how predictions impact different parts of our lives. The book follows a similar pattern to Super Crunchers, telling stories related to data and prediction, and then tying them all together at the end. A great, quick read for anyone interested in data or analysis.
    Why you should read it
    Just because it's on The Internet doesn't mean it's true. Same goes with data. If you stare at a chart for long enough, a trend begins to emerge. The Signal and the Noise does a great job at teaching you when to throw up a warning flag when someone hands you some analysis.
  7. Visualize This by Nathan Yau

    Synopsis
    This is essentially the first couple years of Nathan Yau's blog, Flowing Data, in book format. There are great code examples to go along with some truly spectacular visuals.
    Why you should read it
    You can't show off your work without some nifty data visuals. This book takes you step by step and shows you how easy it is to construct great looking charts, maps, and other visuals if you use the right tools.
  8. ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham

    Synopsis
    The name pretty much sums it up. This book shows you how to use ggplot2 by walking you through some examples and gradually adding complexity.
    Why you should read it
    If you're going to use R, you're inevitably going to be using ggplot2. ggplot2 is one of the most popular R packages and probably the standard for making great looking visualizations. Who better to teach you how to use ggplot2 than the package's creator, Hadley Wickham? The book provides some core examples for making basic plots, and then expands on each of these by detailing some of the more in-depth and advanced features of ggplot2, which makes it great for both beginners and advanced users.
  9. The NLTK Books by Jacob Perkins, Steven Bird, Ewan Klein, and Edward Loper

    Synopsis
    The Natural Language Toolkit (NLTK) is an excellent Python library for processing text and language. It has great APIs that can preprocess, classify, and help analyze your text. The Cookbook and the freely available online book serve as the instruction manuals for using NLTK.
    Why you should read it
    Text analytics is really fun. Some of the examples in the NLTK books are really just magical (the text classification chapter is particularly cool). Some of the code examples use a lot of Python syntactic sugar, which can make them a little difficult to read for someone who is new to Python, but the breadth of examples more than makes up for it. Top it all off with a really amazing library and it makes for a great read.
  10. Think Stats by Allen B. Downey

    Synopsis
    This book provides a gentle overview to statistics and a nice tutorial on using Python as well. It's sort of a crash course in statistics for those of us who chose to major in something less mathy in school.
    Why you should read it
    It's short, sweet, and to the point. Think Stats serves as the introduction to statistics course that many people missed out on in school. If you need to brush up on CDFs, PDFs, Normal Variates, or the Central Limit Theorem, then this is the book you're looking for. Also not a bad way to learn Python while picking up some stats skills.

Other Books

A few others didn't quite make the list, but we still love them. Let us know if there are any others you think we missed!


Top Posts of 2013: Big Data Beyond MapReduce: Google's Big Data Papers | Architects Zone

Source: http://architects.dzone.com/articles/big-data-beyond-mapreduce

Mainstream Big Data is all about MapReduce, but when looking at real-time data, limitations of that approach are starting to show. In this post, I'll review Google's most important Big Data publications and discuss where they are (as far as they've disclosed).

MapReduce, Google File System and Bigtable: the mother of all big data algorithms

Chronologically, the first paper is on the Google File System from 2003, which is a distributed file system. Basically, files are split into chunks that are stored in a redundant fashion on a cluster of commodity machines. (Every article about Google has to include the term "commodity machines"!)
Next up is the MapReduce paper from 2004. MapReduce has become synonymous with Big Data. Legend has it that Google used it to compute their search indices. I imagine it worked like this: they had all the crawled web pages sitting on their cluster, and every day or so they ran MapReduce to recompute everything.
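As a rough illustration of that programming model (my own toy sketch in JavaScript, not Google's API), a single-process word count in the map/shuffle/reduce style looks something like this:

// Toy word count in the MapReduce style: map emits (key, value) pairs,
// a shuffle step groups them by key, and reduce folds each group.
// Everything runs in one process here; the point is the programming model,
// not the distributed runtime.
function mapReduce(inputs, map, reduce) {
  var groups = {};
  inputs.forEach(function (input) {
    map(input).forEach(function (pair) {                // "map" phase
      var key = pair[0], value = pair[1];
      (groups[key] = groups[key] || []).push(value);    // "shuffle" phase
    });
  });
  var output = {};
  Object.keys(groups).forEach(function (key) {          // "reduce" phase
    output[key] = reduce(key, groups[key]);
  });
  return output;
}

// Hypothetical "crawled pages"
var pages = ['big data big ideas', 'data beyond mapreduce'];

var counts = mapReduce(
  pages,
  function (page) { return page.split(' ').map(function (w) { return [w, 1]; }); },
  function (word, ones) { return ones.length; }
);
// counts => { big: 2, data: 2, ideas: 1, beyond: 1, mapreduce: 1 }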
Next up is the Bigtable paper from 2006, which has become the inspiration for countless NoSQL databases like Cassandra, HBase, and others. About half of the architecture of Cassandra is modeled after Bigtable, including the data model, SSTables, and write-ahead logs (the other half being Amazon's Dynamo database for the peer-to-peer clustering model).

Percolator: Handling individual updates

Google didn't stop with MapReduce. In fact, with the exponential growth of the Internet, it became impractical to recompute the whole search index from scratch. Instead, Google developed a more incremental system, which still allowed for distributed computing.
Now here is where it gets interesting, particularly compared to the common messaging from mainstream Big Data. For example, they reintroduced transactions, something NoSQL still tells you that you don't need or cannot have if you want scalability.
In the Percolator paper from 2010, they describe how Google keeps its web search index up to date. Percolator is built on existing technologies like Bigtable, but adds transactions and locks on rows and tables, as well as notifications for changes in the tables. These notifications are then used to trigger the different stages in a computation. This way, the individual updates can "percolate" through the database.
This approach is reminiscent of stream processing frameworks (SPFs) like Twitter's Storm or Yahoo's S4, but with an underlying database. SPFs usually use message passing and no shared data. This makes it easier to reason about what is happening, but also has the problem that there is no way to access the result of the computation unless you manually store it somewhere in the end.
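To make the notification idea above more tangible, here is a rough single-process sketch (my own simplification, not Percolator's actual interface) of how an observer on a column can trigger the next stage of a computation:

// Rough sketch of the "observer" idea: a write to a watched column triggers
// a follow-up computation, so updates percolate through the table instead of
// waiting for the next full batch job. The table, observer registry, and
// write path are my own simplifications.
var table = {};            // rowKey -> { column: value }
var observers = {};        // column -> [callback]

function observe(column, callback) {
  (observers[column] = observers[column] || []).push(callback);
}

function write(rowKey, column, value) {
  // A real system would take row-level locks and commit atomically here.
  table[rowKey] = table[rowKey] || {};
  table[rowKey][column] = value;
  (observers[column] || []).forEach(function (cb) { cb(rowKey, value); });
}

// Stage 1: a crawler stores a page; stage 2 runs automatically.
observe('rawHtml', function (rowKey, html) {
  write(rowKey, 'wordCount', html.split(/\s+/).length);
});

write('http://example.com', 'rawHtml', 'hello percolating world');
// table['http://example.com'].wordCount === 3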

Pregel: Scalable graph computing

Eventually, Google also had to start mining graph data like the social graph in an online social network, so they developed Pregel, published in 2010.
The underlying computational model is much more complex than in MapReduce: basically, you have worker threads for each node, which are run in parallel iteratively. In each so-called superstep, the worker threads can read messages in the node's inbox, send messages to other nodes, set and read values associated with nodes or edges, or vote to halt. Computations are run until all nodes have voted to halt. In addition, there are also Aggregators and Combiners which compute global statistics.
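As a toy illustration of that superstep loop (a single-process JavaScript sketch, not Google's actual Pregel API), here is the classic "propagate the maximum value" example:

// Minimal sketch of the superstep loop: every vertex reads its inbox,
// optionally sends messages, and votes to halt; the run ends when no vertex
// is active and no messages are in flight.
function runPregel(vertices, compute) {
  var inboxes = {}, active = {};
  Object.keys(vertices).forEach(function (id) { inboxes[id] = []; active[id] = true; });

  var anyActive = true;
  while (anyActive) {
    var outboxes = {};
    Object.keys(vertices).forEach(function (id) {
      var messages = inboxes[id];
      if (!active[id] && messages.length === 0) return;  // stays halted
      active[id] = true;                                  // a message re-activates it
      compute(vertices[id], messages, {
        send: function (target, msg) { (outboxes[target] = outboxes[target] || []).push(msg); },
        voteToHalt: function () { active[id] = false; }
      });
    });
    Object.keys(vertices).forEach(function (id) { inboxes[id] = outboxes[id] || []; });
    anyActive = Object.keys(vertices).some(function (id) {
      return active[id] || inboxes[id].length > 0;
    });
  }
  return vertices;
}

// Propagate the maximum value through a small graph.
var graph = {
  a: { value: 3, edges: ['b'] },
  b: { value: 6, edges: ['a', 'c'] },
  c: { value: 2, edges: ['b'] }
};
runPregel(graph, function (vertex, messages, ctx) {
  var incoming = Math.max.apply(null, [vertex.value].concat(messages));
  if (incoming > vertex.value || messages.length === 0) {
    vertex.value = incoming;
    vertex.edges.forEach(function (n) { ctx.send(n, vertex.value); });
  }
  ctx.voteToHalt();
});
// graph.a.value === graph.b.value === graph.c.value === 6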
The paper shows how to implement a number of algorithms like Google's PageRank, shortest path, or bipartite matching. My personal feeling is that Pregel requires even more rethinking on the side of the implementor than MapReduce or SPFs.

Dremel: Online visualizations

Finally, in another paper from 2010, Google describes Dremel, an interactive database with an SQL-like language for structured data. So instead of tables with fixed fields like in an SQL database, each row is something like a JSON object (of course, Google uses its own protocol buffer format). Queries are pushed down to servers and then aggregated on their way back up, using a clever data format for maximum performance.
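The "push the query down, aggregate on the way back up" idea can be sketched in a few lines of JavaScript (the shard layout and record shape below are invented purely for illustration, not Dremel's real format):

// Each leaf server computes a partial aggregate over its own shard of nested,
// JSON-like records; parent nodes only merge partials.
var shards = [
  [{ user: { country: 'US' }, clicks: 3 }, { user: { country: 'DE' }, clicks: 1 }],
  [{ user: { country: 'US' }, clicks: 2 }]
];

// Leaf-level work: e.g. SELECT user.country, SUM(clicks) GROUP BY user.country
function leafAggregate(records) {
  var partial = {};
  records.forEach(function (r) {
    var key = r.user.country;
    partial[key] = (partial[key] || 0) + r.clicks;
  });
  return partial;
}

// Root-level work: merge the partial results from the leaves.
function merge(partials) {
  var total = {};
  partials.forEach(function (p) {
    Object.keys(p).forEach(function (k) { total[k] = (total[k] || 0) + p[k]; });
  });
  return total;
}

merge(shards.map(leafAggregate));   // => { US: 5, DE: 1 }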

Big Data beyond MapReduce

Google didn't stop at MapReduce; they developed other approaches for applications where MapReduce wasn't a good fit, and I think this is an important message for the whole Big Data landscape. You cannot solve everything with MapReduce. You can make it faster by getting rid of the disks and moving all the data into memory, but there are tasks whose inherent structure makes it hard for MapReduce to scale.
Open source projects have picked up on the more recent ideas and papers by Google. For example, Apache Drill is reimplementing the Dremel framework, while projects like Apache Giraph and Stanford's GPS are inspired by Pregel.
There are still other approaches as well. I'm personally a big fan of stream mining (not to be confused with stream processing), which aims to process event streams with bounded computational resources by resorting to approximation algorithms. Noel Welsh has some interesting slides on the topic.
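As one classic example of that bounded-resource mindset (my own pick, not from the slides), reservoir sampling keeps a uniform random sample of a stream of unknown length in fixed memory:

// Reservoir sampling: keep a uniform random sample of k items from a stream
// using O(k) memory, as an illustration of "approximate answers with bounded
// resources".
function makeReservoir(k) {
  var sample = [], seen = 0;
  return {
    offer: function (item) {
      seen += 1;
      if (sample.length < k) {
        sample.push(item);
      } else {
        // Replace a random slot with probability k / seen, which keeps every
        // item seen so far equally likely to be in the sample.
        var j = Math.floor(Math.random() * seen);
        if (j < k) sample[j] = item;
      }
    },
    current: function () { return sample.slice(); }
  };
}

var reservoir = makeReservoir(100);
for (var i = 0; i < 1e6; i++) reservoir.offer(i);   // pretend this is an event stream
reservoir.current().length;                          // 100 items, uniformly sampled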


Wednesday 18 December 2013

Using RequireJS with Angular - Inline Block's Blog


Since attending Fluent Conf 2013 and watching the many AngularJS talks and seeing the power of its constructs, I wanted to get some experience with it.
Most of the patterns for structuring the code of single-page web apps use some sort of dependency management for all the JavaScript instead of global controllers or other similarly bad things. Many of the AngularJS examples seem to follow these bad-ish patterns. Using angular.module('name', []) helps with this problem (why don't they show more angular.module() usage in their tutorials?), but you can still end up with a bunch of dependency loading issues (at least without hardcoding your load order in your header). I even spent time talking to a few engineers with plenty of experience with Angular, and they all seemed to be okay with just using something like Ruby's asset pipeline to include your files (into a global scope) and making sure everything ends up in one file in the end via their build process. I don't really like that, but if you are fine with it, I'd suggest you do what you are most comfortable with.

Why RequireJS?

I love using RequireJS. You can async load your dependencies and basically remove all globals from your app. You can use r.js to compile all your JavaScript into a single file and minify that easily, so that your app loads quickly.
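For reference, a minimal r.js build profile for a layout like the one below might look roughly like this. This is my own sketch, not part of the original post; the file and module names simply mirror the example that follows.

// build.js - a minimal r.js build profile (my sketch, not from the post).
// Run it with: node r.js -o build.js
({
  baseUrl: 'javascripts',
  mainConfigFile: 'javascripts/main.js', // reuse the paths/shim config from main.js
  name: 'main',                          // the entry module to trace dependencies from
  out: 'javascripts/main-built.js',      // everything concatenated and minified here
  optimize: 'uglify',
  // CDN-hosted libraries can't be inlined, so leave them external:
  paths: {
    'jQuery': 'empty:',
    'angular': 'empty:',
    'angular-resource': 'empty:'
  }
})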
So how does this work with Angular? You'd think it would be easy when making single-page web apps. You need your 'module', a.k.a. your app. You add the routing to your app, but to have your routing you need the controllers, and to have your controllers you need the module they belong to. If you do not structure your code, and what you load in with RequireJS, in the right order, you end up with circular dependencies.

Example

So below is my directory structure. My module/app is called "mainApp".
My base public directory:
directory listing
index.html
- javascripts
    - controllers/
    - directives/
    - factories/
    - modules/
    - routes/
    - templates/
    - vendors/
      require.js
      jquery.js
    main.js
    require.js
- stylesheets/
  ...
Here is my boot file, aka my main.js.
javascripts/main.js
require.config({
  baseUrl: '/javascripts',
  paths: {
    'jQuery': '//ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min',
    'angular': '//ajax.googleapis.com/ajax/libs/angularjs/1.0.7/angular',
    'angular-resource': '//ajax.googleapis.com/ajax/libs/angularjs/1.0.7/angular-resource'
  },
  shim: {
    'angular': {'exports': 'angular'},
    'angular-resource': {deps: ['angular']},
    'jQuery': {'exports': 'jQuery'}
  }
});

require(['jQuery', 'angular', 'routes/mainRoutes'], function ($, angular, mainRoutes) {
  $(function () { // using jQuery because it will run this even if DOM load already happened
    angular.bootstrap(document, ['mainApp']);
  });
});
You'll notice how I am not loading my mainApp in. Basically, we are bringing in the last thing that needs to be configured for the app to load, to prevent circular dependencies. Since the routes need the mainApp controllers and the controllers need the mainApp module, we just have them directly include mainApp.js.
Also we are configuring require.js to bring in angular and angular-resource (angular-resource so we can do model factories).
Here is my super simple mainApp.js
javascripts/modules/mainApp.js
define(['angular', 'angular-resource'], function (angular) {
  return angular.module('mainApp', ['ngResource']);
});
And here is my mainRoutes file:
javascripts/routes/mainRoutes.js
define(['modules/mainApp', 'controllers/listCtrl'], function (mainApp) {
  return mainApp.config(['$routeProvider', function ($routeProvider) {
    $routeProvider.when('/', {controller: 'listCtrl', templateUrl: '/templates/List.html'});
  }]);
});
You will notice I require the listCtrl but never actually use its reference. Including it adds it to my mainApp module so it can be used.
Here is my super simple controller:
javascripts/controllers/listCtrl.js
define(['modules/mainApp', 'factories/Item'], function (mainApp) {
  mainApp.controller('listCtrl', ['$scope', 'Item', function ($scope, Item) {
    $scope.items = Item.query();
  }]);
});
So you'll notice I have to include that mainApp again, so I can add the controller to it. I also have a dependency on Item, which in this case is a factory. The reason I include it is so that it gets added to the app and the dependency injection works. Again, I don't actually reference it; I just let dependency injection do its thing.
Let's take a look at this factory really quick.
javascripts/factories/Item.js
define(['modules/mainApp'], function (mainApp) {
  mainApp.factory('Item', ['$resource', function ($resource) {
    return $resource('/item/:id', {id: '@id'});
  }]);
});
Pretty simple, but again, we have to pull in that mainApp module to add the factory to it.
So finally, let's look at our index.html. Most of it is simple stuff, but the key part is the ng-view portion, which tells Angular where to place the view. Even if you don't use document in your bootstrap and opt to use a specific element, you still need this ng-view.
index.html
<!DOCTYPE html>
<html>
<head>
  <title>Angular and Require</title>
  <script src="/javascripts/require.js" data-main="javascripts/main"></script>
</head>
<body>
  <div class="page-content">
    <ng:view></ng:view>
  </div>
</body>
</html>
Posted by Inline Block on Jun 6th, 2013. Tags: amd, angular, angularjs, coding, javascript, requirejs



Monday 9 December 2013

PayPal Switches from Java to JavaScript



by Abel Avram on Nov 29, 2013
PayPal has decided to use JavaScript from browser all the way to the back-end server for web applications, giving up legacy code written in JSP/Java.
Jeff Harrell, Director of Engineering at PayPal, has explained in a couple of blog posts (Set My UI Free Part 1: Dust JavaScript Templating, Open Source and More; Node.js at PayPal) why they made the decision and what conclusions came out of switching their web application development from Java/JSP to a complete JavaScript/Node.js stack.
According to Harrell, PayPal's websites had accumulated a good deal of technical debt, and they wanted a "technology stack free of this which would enable greater product agility and innovation." Initially, there was a significant divide between front-end engineers working in web technologies and back-end ones coding in Java. When a UX person wanted to sketch up some pages, they had to ask Java programmers to do some back-end wiring to make it work. This did not fit with their Lean UX development model:
At the time, our UI applications were based on Java and JSP using a proprietary solution that was rigid, tightly coupled and hard to move fast in. Our teams didn't find it complimentary to our Lean UX development model and couldn't move fast in it so they would build their prototypes in a scripting language, test them with users, and then later port the code over to our production stack.
They wanted a "templating [solution that] must be decoupled from the underlying server technology and allow us to evolve our UIs independent of the application language" and that would work with multiple environments. They decided to go with Dust.js, a templating framework backed by LinkedIn, plus Twitter's Bootstrap and Bower, a package manager for the web. Additional pieces added later were LESS, RequireJS, Backbone.js, Grunt, and Mocha.
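For a feel of what such a decoupled template looks like, here is a tiny, hypothetical Dust.js example (the template and data are mine, not PayPal's). The compiled template is plain JavaScript, so it can be rendered in the browser or on any server that embeds a JavaScript engine.

// Hypothetical Dust.js example: compile a template once, then render it with
// plain JSON data. Requires the dustjs-linkedin package.
var dust = require('dustjs-linkedin');

var source = 'Hello {name}! You have {#alerts}{msg} {/alerts}';
dust.loadSource(dust.compile(source, 'greeting'));   // register the template under a name

dust.render('greeting', { name: 'PayPal', alerts: [{ msg: 'one' }, { msg: 'two' }] },
  function (err, out) {
    console.log(out);   // "Hello PayPal! You have one two "
  });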
Some of PayPal's pages have been redesigned but they still had some of the legacy stack:
… we have legacy C++/XSL and Java/JSP stacks, and we didn't want to leave these UIs behind as we continued to move forward. JavaScript templates are ideal for this. On the C++ stack, we built a library that used V8 to perform Dust renders natively – this was amazingly fast! On the Java side, we integrated Dust using a Spring ViewResolver coupled with Rhino to render the views.
At that time, they also started using Node.js for prototyping new pages, concluding that it was "extremely proficient" and deciding to try it in production. For that, they also built Kraken.js, a "convention layer" placed on top of Express, a Node.js-based web framework. (PayPal has recently open sourced Kraken.js.) The first application to be built in Node.js was the account overview page, which is one of the most accessed PayPal pages, according to Harrell. But because they were afraid the app might not scale well, they decided to create an equivalent Java application to fall back on in case the Node.js one didn't work out. Following are some conclusions regarding the development effort required for both apps:
                   Java/Spring     JavaScript/Node.js
Set-up time        0               2 months
Development        ~5 months       ~3 months
Engineers          5               2
Lines of code      unspecified     66% of the (unspecified) Java count
The JavaScript team needed 2 months for the initial setup of the infrastructure, but with fewer people they created an application with the same functionality in less time. Running the test suite on production hardware, they concluded that the Node.js app was performing better than the Java one, serving:
Double the requests per second vs. the Java application. This is even more interesting because our initial performance results were using a single core for the node.js application compared to five cores in Java. We expect to increase this divide further.
and having
35% decrease in the average response time for the same page. This resulted in the pages being served 200ms faster— something users will definitely notice.
As a result, PayPal began using the Node.js application in beta in production, and has decided that "all of our consumer facing web applications going forward will be built on Node.js," while some of the existing ones are being ported to Node.js.
One of the benefits of using JavaScript from browser to server is, according to Harrell, the elimination of a divide between front and back-end development by having one team "which allows us to understand and react to our users' needs at any level in the technology stack."


