Carl Meyer REU Projects

Summer 2010 REU Project

Cluster Analytics and Data Mining

Student Participants

David Benson-Putnins (Oxford University & University of Michigan)

Margaret Bonfardin (Washington University in St. Lousis)

Meagan Magnoni (Ithaca College & RPI)

Daniel Martin (Davidson College)

Advisors

Carl D. Meyer (Faculty Advisor, NC State)

Chuck Wessell (Graduate Student Advisor, NC State)

Project Description

Given a collection of raw data, the object is to determine hidden patterns by partitioning the data into clusters (which may or may not be disjoint), where the items in a given cluster share or exhibit some sort of commonality that is not immediately apparent due to the sheer volume or the diverse nature of the data. Two fundamental application areas will be under investigation.

The first application involves clustering DNA microarray data. Clustering DNA microarray data is a fundamental problem for genomic scientists because in addition to helping reveal hidden genetic components relevant to the development of diseases, it is a valuable diagnostic tool. For example, without knowing all of the specific genetic factors governing a disease such as leukemia, an individual's genetic propensity for developing a particular type of leukemia can be inferred by incorporating their DNA data into a known paradigm (such as the ALL-AML leukemia data set described below) and applying an appropriate clustering technique on the augmented data. By observing which cluster (if any) the individual falls into (along with the strength of the clustering effect), a preemptive diagnosis can be made. Much of this research will involve ALL-AML leukemia data and associated studies from the MIT-Harvard Broad Institute of Genome Research.

In the second application text data will be extracted from the world wide web. The goal is to first facilitate machine reading of web documents and then to cluster these documents into sets of common topics. The motivation stems from the need to automate the evaluation of product reviews and information that appear across the internet to help formulate a consensus of opinion about specific products and their features. This application will typically involve thousands of documents, each of which is described by the relative presence of hundreds or thousands of vocabulary words.

The tools employed are elementary probability and statistics, networks and graphs, linear algebra, numerical analysis and some scientific computing principles. A bit of basic knowledge of biology won't hurt.

Results & Conclusions

The purpose was to investigate the fields of data mining and cluster analysis. We began our research into this area by learning about several known clustering techniques such as k-means and hierarchical clustering. Using two sets of documents, the REU paragraphs and the Blackstone documents, we created a search engine using latent semantic indexing and clustered the documents using non-negative matrix factorization. We also used two spectral techniques, the Fiedler Method and the MinMaxCut Method, to cluster these documents as well as Fisher's Iris data set and a data set of gene expression levels for leukemia patients. In addition, we explored some of the theory behind the Fiedler Method and discovered some interesting features of the Fiedler vector and proved some results when the Fiedler vector contains zero entries or when it is associated with a non-simple eigenvalue. More details are in the paper that was published in the SIAM Undergraduate Research Online Journal (SIURO) concerning the hidden patterns that our methods revealed in Fisher's Iris data set.

Summary

We surveyed different clustering algorithms.

We created a search engine using latent semantic indexing.

We used the theory of consensus clustering in conjunction with k-means, non-negative matrix factorization, and MinMaxCut method for graph partitioning to search for and reveal hidden patterns in the Blackstone document collection, the well-known leukemia data set of Golub, Slonim, et. al., and Fisher's Iris data set. Results are detailed in our SIAM article "Spectral Clustering and Visualization: A Novel Clustering of Fisher's Iris Data Set, SIURO, Vol 4, pp. 1-15, 2011.

We created a computer tool for visualizing cluster connectivity.

We created a mathematical technique and wrote an algorithm for determining the optimal number of clusters in a data set.

We researched aspects of the Fiedler method and proved results concerning Fiedler vector for some nonstandard cases.

Publications and Poster Presentation

Spectral Clustering and Visualization: A Novel Clustering of Fisher's Iris Data Set Published in SIAM Undergraduate Research Online (SIURO), Vol 4, pp. 1-15, 2011

The poster presentation given at the Undergraduate Research Symposium, McKimmon Center, NC State University, August,2010.

Final Report

The final project report