Friday, 22 February 2013

Apache Mahout: Scalable machine learning and data mining

The goal of the Apache Mahout™ project is to build scalable machine learning libraries.

Mahout currently has

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance Java collections (previously Colt collections)
  • A vibrant community
  • and much more cool stuff to come by this summer thanks to Google Summer of Code
By scalable we mean:

  • Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized so that non-distributed algorithms also perform well.
  • Scalable to support your business case. Mahout is distributed under the commercially friendly Apache Software License.
  • Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.
Currently Mahout mainly supports four use cases:

  • Recommendation mining takes users' behavior and from that tries to find items users might like.
  • Clustering takes, for example, text documents and groups them into sets of topically related documents.
  • Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
  • Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
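
To make the frequent itemset idea above a little more concrete, here is a minimal single-machine sketch in Python. It is not Mahout's Parallel Frequent Pattern mining implementation and does not use Hadoop; it simply counts how often pairs of items show up in the same group. The function name `frequent_pairs`, the `min_support` parameter and the toy shopping carts are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support=2):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so ("bread", "milk") and ("milk", "bread")
        # are counted as the same pair.
        items = sorted(set(basket))
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Toy shopping carts, invented for illustration only.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "coffee"],
]

for pair, n in sorted(frequent_pairs(baskets).items()):
    print(pair, n)
```

Counting every pair in memory like this only works for small inputs, which is the gap that distributed approaches such as the map/reduce implementations mentioned above are meant to fill.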

Wednesday, 13 February 2013

Data for dummies: 6 data-analysis tools anyone can use

If you care only about the cutting edge of machine learning and how to manage petabytes of big data, you might want to quit reading now and just come to our Structure:Data conference in March. But if you’re a normal person dealing with mere normal data, you’ll probably want to stick around. Although your data might not be that big or complex, that doesn’t mean it isn’t worth looking at in a new light.
With that in mind, here are six of the best free tools I’ve come across for helping us mere mortals analyze our data without having to know too much about, well, anything (I’d keep an eye on the still-under-wraps Datahero, too). I’ve gathered some personal data and tracked down some interesting public data sets to help demonstrate what a novice can do with them. Someone with more skills can certainly do a lot more, and larger datasets will provide greater statistical significance.

BigML

BigML is to machine learning what Blue Moon is to Belgian ales: a simple approach to something generally more complex, but also rather accessible and good enough to do the job in a pinch. I explained the service more thoroughly in a recent post about it being used to generate predictions of Kickstarter success, but here’s how it works, in a nutshell: Users upload and format data (which is actually pretty easy), BigML discovers the myriad relationships between the variables and creates a predictive model, and users enter hypothetical data and receive a prediction.
I’m pretty bad when it comes to entering my data into Fitbit (see disclosure), but I was relatively good for a month this summer as I prepped for the Warrior Dash, and that’s the data I used to demonstrate BigML. This prediction of how many calories I can expect to burn in a day would work a lot better if I had a bigger sample size and hadn’t occasionally forgotten to log calories and hours slept, but you get the point. The first image is the model the service generated; the second is the prediction interface.
[Images: the BigML model of calories burned; the BigML prediction interface]

Google Fusion Tables

The user interface for Google Fusion Tables isn’t what I’d call pretty (“sparse” is probably a better description), but the still-in-experimental-mode visualization tool sure is easy if your data is nicely formatted. I created this interactive map simply by uploading a publicly available dataset about gun violence and clicking the button to create a map:
[Image: Fusion Tables map of the gun violence dataset]
For this simple comparison of gun ownership and gun homicide rates, I just checked the countries by which I wanted to filter the chart. Easy:
[Image: chart comparing gun ownership and gun homicide rates by country]

Infogram

If you have really simple data — like a few columns and a handful of rows — Infogram might be the easiest to use of the bunch. The company launched last year with a variety of infographic templates, but it has since expanded to include a large number of charts and graphs, too (including line, pie, pictorial, treemap and bubble). Furthermore, it gives sample data, which you can use as an example to enter your own or format the table you want to upload, and the interactive charts embed nicely into web pages (ours, at least).
Here are the top 10 things I ate during the time I was logging food via Fitbit, excluding copious amounts of beer, water, coffee and Diet Pepsi that I didn’t record.
In July, I made this chart with Infogram comparing infrastructure spending trends among internet companies.
And here’s a sample of the simplest chart in the world.

Many Eyes

Many Eyes is a free web service run by IBM that includes a wide variety of visualizations ranging from maps to pie charts to scatter plots. But what makes it stand apart from the others is the suite of text-analysis tools it offers — not only are they fairly novel, but all they require users to do is paste a page of plain text into the web interface and press a button to visualize it. I used it to analyze the last 15 posts I’ve written for GigaOM.
What did I find? For starters, I use the words “data,” “Facebook” and “users” a lot.
[Image: visualization of my most-used words]
When it comes to two-word combinations, “big data,” “data centers” and “hard drives” are among the biggies.
[Image: visualization of my most-used two-word combinations]
This one is particularly interesting, showing how I tend to form phrases around certain words with common conjunctions, or just a space, in between.
[Image: phrases formed around the word “data”]
Apparently, out of 10,013 words, I only used “cloud” 20 times. I usually followed it up with “provider,” “servers,” “computing,” “-based” and “providers.”
[Image: words that follow “cloud”]
For fun, I also made a word cloud based on a couple of months’ worth of Fitbit food logs. It turns out, you can take the boy out of Wisconsin, but …
[Image: word cloud of Fitbit food logs]
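
For readers curious what sits underneath this kind of text analysis, the single-word and two-word counts described above can be roughly approximated in a few lines of Python. This is only a sketch and says nothing about how Many Eyes works internally; the function names, the simple ASCII-only tokenizer and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_words(text, n=3):
    """Return the n most frequent single words with their counts."""
    return Counter(tokenize(text)).most_common(n)


def top_bigrams(text, n=3):
    """Return the n most frequent two-word combinations with their counts."""
    words = tokenize(text)
    # zip(words, words[1:]) pairs each word with the word that follows it.
    return Counter(zip(words, words[1:])).most_common(n)


# Sample text invented for illustration; paste in your own plain text instead.
sample = "Big data needs data centers. Data centers store big data."

print(top_words(sample))
print(top_bigrams(sample))
```

A real analysis would usually also drop very common words such as “the” and “and” before counting, which this sketch does not attempt.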

Statwing

Statwing might be my favorite of the bunch, if only because it’s so simple yet actually tries to teach users about statistics. You upload data, check the variables you’re concerned with, and it plots their relationship. (It also can describe the variables by highlighting the sample size, minimum, maximum, mean, median and standard deviation.) Graphs are accompanied by explanations as to how strong the correlation is based on various statistical metrics, as well as the results of a linear regression model.
To demonstrate Statwing, I went back to the Fitbit data. Of the variables that Fitbit tracks, some correlations are easy to predict (e.g., steps and calories burned), but I was kind of surprised to see that the 86 minutes a day I spent being fairly active really weren’t that good of an expenditure of my time.
[Image: Statwing analysis of the Fitbit data]
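
If you want to see roughly what those summary numbers involve, the sketch below computes the same kinds of figures (sample size, minimum, maximum, mean, median, standard deviation, a correlation coefficient and a simple linear regression line) using only Python's standard library. It is not Statwing's method, and the step and calorie values are made-up toy numbers, not my Fitbit data; the function and variable names are likewise assumptions.

```python
import math
import statistics


def describe(values):
    """Summarize one variable: size, range, center and spread."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy values invented for illustration: daily steps (in thousands) and calories burned.
steps_k = [4, 6, 8, 10, 12]
calories = [2000, 2150, 2200, 2350, 2400]

print(describe(calories))
print(pearson_r(steps_k, calories))
print(fit_line(steps_k, calories))
```

With only five points, none of these figures would mean much, which is the same sample-size caveat that applies to my own data.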

Tableau Public

Tableau Public, the only free version of the popular business-intelligence software, was clearly designed with business users in mind. It expects a lot of structure in the data, and although you can edit almost every aspect of it within the application to get it into usable shape, the service doesn’t offer much guidance if you don’t speak the language of BI (it also requires Windows). But the software is very good at deciphering the characteristics of different variables, the drag-and-drop operation makes it kind of easy to experiment and the wide array of visualizations looks really nice.
Using my Fitbit data (and here’s where you see how lax I am at data entry), I created a line graph comparing the calories I ate each day with the calories I burned. Assuming I didn’t go crazy eating on the days I forgot to make entries, the good news is I never ate more calories than I burned. (Note: Although these are static images, Tableau Public actually lets you embed interactive charts, which I’ve used in the past on several occasions, but they don’t always fit well within our pages.)
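[Image: Tableau line graph of calories eaten versus calories burned]
Here’s one I played around with a while back charting Amazon’s “Other” revenue against the number of objects stored in Amazon S3.
[Image: Amazon’s “Other” revenue charted against the number of objects stored in Amazon S3]
Finally, here is my first-ever (I think) Tableau chart, which uses the raw data on government takedown requests that Google provided along with its Transparency Report in October 2011. You can read that post and play with the interactive version here.
[Image: Tableau chart of government takedown requests from Google’s Transparency Report]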

There is, however, one disclaimer that applies to all of these tools: I didn’t get into cleaning and formatting data, which can be a somewhat arduous process. Many tools expect some sort of structure to the data — the X axis to be in columns and the Y axis in rows, measurements without units (e.g., grams), etc. — that just isn’t present if you’re downloading an Excel or CSV file rather than creating it yourself. Sometimes, with comprehensive datasets like your Fitbit Premium data, you’ll have to separate or combine the relevant data into new spreadsheet tables before uploading it to a service. But once you have the data ready to go, these tools can help you analyze it, visualize it and hopefully glean some insights from it.
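
As a rough idea of what that cleanup can look like, here is a small Python sketch that keeps only the columns of interest from a CSV export and strips units from the measurements. The file layout, column names and values are hypothetical and invented for illustration (they are not the actual Fitbit export format), and the helper names `strip_units` and `clean_rows` are assumptions as well.

```python
import csv
import io
import re


def strip_units(value):
    """Return the leading number in a cell such as '250 g' as a float, or None if absent."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", value)
    return float(match.group(1)) if match else None


def clean_rows(csv_text, columns):
    """Keep only the named columns and convert their values to unit-free numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: strip_units(row[col]) for col in columns} for row in reader]


# Hypothetical export; the column names and values are invented for illustration.
raw = """date,calories_in,calories_out,sleep
2012-07-01,2100 kcal,2600 kcal,7.5 h
2012-07-02,,2450 kcal,6 h
"""

for row in clean_rows(raw, ["calories_in", "calories_out"]):
    print(row)
```

Missing entries come back as `None` rather than zero, so forgotten log days stay visible instead of quietly skewing an average. The sketch does not handle thousands separators such as “2,100”, which would need extra care.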
Disclosure: Fitbit is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, founder of Giga Omni Media, is also a venture partner at True.


