Classifying Data with Discriminant Analysis

Bree Stanwyck


Nov 21, 2011


Classify ALL the data!

Cluster analysis methods have been gaining popularity as a way of relating pieces of data in large datasets to one another. Examples in social networking are obvious: friends on Facebook cluster into cliques and communities, which cluster into even larger groups. Demographics and other marketing research can also be aided by sorting prospective customers into groups based on preference.

When the clusters are known and ample training data is available, discriminant analysis is particularly effective at classifying new data. Discriminant analysis methods are available in the R programming language (something we’ve discussed a bit in the past) through the standard MASS package. However, R can be cumbersome to use by itself (and the syntax still seems a bit bizarre to me personally), so I used RinRuby, a gem that gives direct access to R methods and data, to put a nice Ruby wrapper around it. Below is some example code that analyzes a well-known test dataset for classification, the Fisher iris data.

Say our iris data is contained as an array (“training_rows”) of hashes that each look like

(where the species is given as 1, 2, or 3). We can load our data into a DiscriminantAnalysis instance and start predicting in just a few lines:

Every prediction comes as a hash with a confidence score to check the quality of the classification. This code uses a linear analysis (i.e., it separates classes by lines or planes); the pretty scatterplot up top was generated using the iris data with quadratic analysis, which can be achieved in the code above by simply replacing “init_lda_analysis” with “init_qda_analysis”.
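To make the linear case concrete, here is a minimal pure-Ruby sketch of two-class linear discriminant analysis on two features. It is not the gem's code and does not call R: the key names (petal_length, petal_width, species), the helper names (fit_lda, predict, and so on), and the handful of toy rows are all assumptions made up for illustration. It estimates class means and a shared covariance matrix, then returns a prediction hash with a confidence value (the posterior probability of the winning class).

```ruby
# Toy rows shaped like the training data described above.
# Keys and values are illustrative assumptions, not real iris measurements.
TRAINING_ROWS = [
  { petal_length: 1.4, petal_width: 0.2, species: 1 },
  { petal_length: 1.3, petal_width: 0.3, species: 1 },
  { petal_length: 1.5, petal_width: 0.2, species: 1 },
  { petal_length: 1.6, petal_width: 0.4, species: 1 },
  { petal_length: 4.7, petal_width: 1.4, species: 2 },
  { petal_length: 4.5, petal_width: 1.5, species: 2 },
  { petal_length: 4.9, petal_width: 1.6, species: 2 },
  { petal_length: 4.0, petal_width: 1.3, species: 2 }
].freeze

FEATURES = [:petal_length, :petal_width].freeze

# Per-feature mean of a group of rows.
def mean_vector(rows)
  FEATURES.map { |f| rows.sum { |r| r[f] } / rows.size.to_f }
end

# Pooled within-class covariance (2x2). Sharing one covariance matrix
# across classes is the assumption that makes the boundary linear.
def pooled_covariance(groups, means)
  n = groups.values.sum(&:size)
  cov = [[0.0, 0.0], [0.0, 0.0]]
  groups.each do |label, rows|
    rows.each do |r|
      d = FEATURES.each_with_index.map { |f, i| r[f] - means[label][i] }
      2.times { |i| 2.times { |j| cov[i][j] += d[i] * d[j] } }
    end
  end
  cov.map { |row| row.map { |v| v / (n - groups.size) } }
end

def invert_2x2(m)
  det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
  raise ArgumentError, "singular covariance matrix" if det.abs < 1e-12
  [[m[1][1] / det, -m[0][1] / det],
   [-m[1][0] / det, m[0][0] / det]]
end

def fit_lda(rows)
  groups = rows.group_by { |r| r[:species] }
  means = groups.transform_values { |g| mean_vector(g) }
  priors = groups.transform_values { |g| g.size / rows.size.to_f }
  { means: means, priors: priors,
    inv_cov: invert_2x2(pooled_covariance(groups, means)) }
end

# Linear discriminant score: x' S^-1 mu - 0.5 * mu' S^-1 mu + log(prior)
def score(model, label, x)
  mu = model[:means][label]
  s_mu = model[:inv_cov].map { |row| row[0] * mu[0] + row[1] * mu[1] }
  linear = x[0] * s_mu[0] + x[1] * s_mu[1]
  offset = 0.5 * (mu[0] * s_mu[0] + mu[1] * s_mu[1])
  linear - offset + Math.log(model[:priors][label])
end

# Returns a hash with the winning class and its posterior probability.
def predict(model, row)
  x = FEATURES.map { |f| row[f] }
  scores = model[:means].keys.to_h { |label| [label, score(model, label, x)] }
  max = scores.values.max
  weights = scores.transform_values { |s| Math.exp(s - max) } # stable softmax
  label, weight = weights.max_by { |_, w| w }
  { species: label, confidence: weight / weights.values.sum }
end

model = fit_lda(TRAINING_ROWS)
prediction = predict(model, { petal_length: 1.5, petal_width: 0.3 })
puts prediction.inspect
```

A quadratic analysis would differ mainly by estimating a separate covariance matrix for each class instead of pooling them, which is what lets the class boundaries curve.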

R allows for tons of manipulation of the analysis once it’s loaded, some of which has been built into the DiscriminantAnalysis class (scatterplots, accuracy and significance testing, etc.).
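Accuracy testing, for instance, boils down to comparing predictions against known labels on rows held out from training. Here is a minimal sketch in plain Ruby; the helper names and the label arrays are hypothetical and are not part of the gem:

```ruby
# Hypothetical helper: fraction of predictions that match the known labels.
def accuracy(actual, predicted)
  raise ArgumentError, "length mismatch" unless actual.size == predicted.size
  return 0.0 if actual.empty?
  actual.zip(predicted).count { |a, b| a == b } / actual.size.to_f
end

# Hypothetical helper: counts keyed by [actual, predicted] pairs,
# which shows where the misclassifications fall.
def confusion_counts(actual, predicted)
  actual.zip(predicted).tally
end

# Made-up labels for illustration (species coded as 1, 2, or 3).
actual    = [1, 1, 2, 2, 3, 3]
predicted = [1, 1, 2, 3, 3, 3]

puts accuracy(actual, predicted)
puts confusion_counts(actual, predicted).inspect
```

The confusion counts are often more useful than the single accuracy figure, since they show which pairs of classes the analysis mixes up.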

Since we’re all about contributing to the open source community at Highgroove, I’ve packaged these methods into a gem called harlequin (named, cheesily enough, after one of the irises in Fisher’s data). If you have a working copy of R on your machine, you can get the code above running by just adding the proper requires for the gem. Feel free to check it out and help make it more awesome!

Have you used clustering methods before? How do you deal with heavy-duty data processing in Ruby?

Mark Dalrymple

Reviewer Big Nerd Ranch

