Carl Meyer REU Projects

Summer 2012 REU Project

Unsupervised Learning by Data Clustering

Student Participants

Mindy Hong, Emory University

Robert Pearce, NC State University

Kevin Valakuzhy, University of North Carolina at Chapel Hill

Advisors

Carl D. Meyer (Faculty Advisor, NC State)

Shaina Race (Graduate Student Advisor, NC State)

Project Description

Data mining is one of the fastest-growing disciplines in mathematics and computer science today. Advances in data collection and storage have allowed companies and scientific researchers to create huge stores of data in the hope that data miners will be able to discern valuable information from them. The vast majority of data mining models are examples of supervised learning: a model is created using training and test data for which the variable to be predicted is known, and the goal is to minimize the prediction error. We focused on unsupervised data mining techniques, which aim to detect patterns and structure in unlabeled data, where no error or accuracy value can be assigned to the final result. The emphasis is on clustering algorithms. Many existing clustering algorithms are inadequate in two respects: they require prior knowledge of the number k of clusters in the data, and their underlying assumptions make them ineffective in certain situations. The work revolves around consensus clustering, a method that seeks to rectify the latter problem by combining the results of multiple clustering algorithms into one final grouping. The goal is to investigate a novel method, iterative consensus clustering (ICC), that addresses both problems: determining the best value of k and improving cluster determination.
We show that iterating consensus clustering techniques widens the eigengap associated with the Perron cluster of eigenvalues and thus gives a more definitive and accurate estimate of the number of clusters.
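The link between a consensus of several clusterings and an eigengap-based estimate of k can be illustrated with a small sketch. This is not the project's ICC implementation, and the iteration step is not shown. It assumes NumPy is available, and the function names `consensus_matrix` and `estimate_k`, the toy labelings, and the choice to normalize the consensus matrix into a stochastic matrix are all assumptions made for illustration.

```python
import numpy as np

def consensus_matrix(labelings):
    """Entry (i, j) counts the clusterings that put points i and j together."""
    labels = np.asarray(labelings)                      # shape: (runs, points)
    together = labels[:, :, None] == labels[:, None, :]
    return together.sum(axis=0).astype(float)

def estimate_k(consensus):
    """Estimate k as the number of eigenvalues before the largest eigengap.

    Assumption: the consensus matrix C is scaled to D^{-1/2} C D^{-1/2},
    which has the same eigenvalues as the row-stochastic matrix D^{-1} C
    but is symmetric, so a symmetric eigensolver can be used.
    """
    d = consensus.sum(axis=1)                           # positive: diagonal = number of runs
    scaled = consensus / np.sqrt(np.outer(d, d))
    eigvals = np.sort(np.linalg.eigvalsh(scaled))[::-1] # descending order
    gaps = eigvals[:-1] - eigvals[1:]
    k = int(np.argmax(gaps)) + 1
    return k, eigvals, gaps[k - 1]

# Hypothetical output of three clustering runs on six points.
# Cluster label values are arbitrary; only co-membership matters.
labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition as the first run, labels swapped
    [0, 0, 1, 1, 1, 1],   # one point assigned differently
]

C = consensus_matrix(labelings)
k, eigvals, gap = estimate_k(C)
print("estimated k:", k)
print("eigenvalues:", np.round(eigvals, 3))
print("largest eigengap:", round(float(gap), 3))
```

In this toy input the eigenvalues close to 1 play the role of the Perron cluster, and the position of the largest gap below them is read off as the estimate of k. A wider gap makes that reading less ambiguous.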

Article and Poster Presentations

The article "Iterated Consensus Clustering: A Technique We Can All Agree On" explains the details.

A poster presentation was given at the Undergraduate Research Symposium, McKimmon Center, NC State University, in August 2012.