|
- Student Participants
- David Benson-Putnins (Oxford University & University of Michigan)
- Margaret Bonfardin (Washington University in St. Lousis)
- Meagan Magnoni (Ithaca College & RPI)
- Daniel Martin (Davidson College)
- Advisors
- Carl D. Meyer (Faculty Advisor, NC State)
- Chuck Wessell (Graduate Student Advisor, NC State)
- Project Description
- Given a collection of raw data, the object is to determine hidden
patterns by partitioning the data into clusters (which may or may not be
disjoint), where the items in a given cluster share or exhibit some sort of
commonality that is not immediately apparent due to the sheer volume or the
diverse nature of the data. Two fundamental application areas will be
under investigation.
-
The first application involves clustering DNA microarray data.
Clustering DNA microarray data is a fundamental problem for genomic
scientists because in addition to helping reveal hidden genetic components
relevant to the development of diseases, it is a valuable diagnostic tool.
For example, without knowing all of the specific genetic factors governing
a disease such as leukemia, an individual's genetic propensity for
developing a particular type of leukemia can be inferred by incorporating
their DNA data into a known paradigm (such as the ALL-AML leukemia data set
described below) and applying an appropriate clustering technique on the
augmented data. By observing which cluster (if any) the individual falls
into (along with the strength of the clustering effect), a preemptive
diagnosis can be made. Much of this research will involve ALL-AML leukemia
data and associated studies from the MIT-Harvard Broad Institute of Genome
Research.
-
In the second application text data will be extracted from the world
wide web. The goal is to first facilitate machine reading of web documents
and then to cluster these documents into sets of common topics. The
motivation stems from the need to automate the evaluation of product
reviews and information that appear across the internet to help formulate a
consensus of opinion about specific products and their features. This
application will typically involve thousands of documents, each of which is
described by the relative presence of hundreds or thousands of vocabulary
words.
-
The tools employed are elementary probability and statistics,
networks and graphs, linear algebra, numerical analysis and some
scientific computing principles. A bit of basic knowledge of biology won't
hurt.
- Results & Conclusions
- The purpose was to investigate the fields of data mining and cluster analysis. We began our research into this area by learning
about several known clustering techniques such as k-means and hierarchical clustering. Using two sets of documents, the REU
paragraphs and the Blackstone documents, we created a search engine using latent semantic indexing and clustered the documents using
non-negative matrix factorization. We also used two spectral techniques, the Fiedler Method and the MinMaxCut Method, to cluster
these documents as well as Fisher's Iris data set and a data set of gene expression levels for leukemia patients.
In addition, we explored some of the theory behind the Fiedler Method and discovered some interesting features of the Fiedler vector
and proved some results when the Fiedler vector contains zero entries or when it is associated with a non-simple eigenvalue.
More details are in the paper that was published in
the SIAM Undergraduate Research Online Journal (SIURO) concerning the hidden patterns that our methods revealed in Fisher's Iris data
set.
- Summary
- We surveyed different clustering algorithms.
- We created a search engine using latent semantic indexing.
- We used the theory of consensus clustering in conjunction with k-means, non-negative matrix
factorization, and MinMaxCut method for graph partitioning to search for and reveal hidden patterns in the Blackstone document
collection, the well-known leukemia data set of Golub, Slonim, et. al., and Fisher's Iris data set. Results are detailed in our SIAM article "Spectral Clustering and Visualization: A Novel
Clustering of Fisher's Iris Data Set, SIURO, Vol 4, pp. 1-15, 2011.
- We created a computer tool for visualizing cluster connectivity.
- We created a mathematical technique and wrote an algorithm for determining the optimal number of clusters in a data set.
- We researched aspects of the Fiedler method and proved results concerning Fiedler vector for some nonstandard cases.
- Publications and Poster Presentation
- Final Report
|