|Department:||Faculty of Engineering and Environment|
|Full text PDF:||http://nrl.northumbria.ac.uk/21438/|
With the advent of microarray technology, it is possible to monitor the expression of tens of thousands of genes in parallel. To gain useful biological knowledge, it is necessary to study the data and identify the underlying patterns, which challenges conventional mathematical models. Clustering has been used extensively in gene expression data analysis to detect groups of related genes. The assumption in clustering gene expression data is that co-expression indicates co-regulation, so clustering should identify genes that share similar functions. Microarray data contain a great deal of uncertain and imprecise information. Fuzzy c-means (FCM) is an efficient model for dealing with this type of data. However, it treats all samples equally and cannot differentiate noise from meaningful data. In this thesis, motivated by the preservation of local structure, a locally weighted FCM is proposed which concentrates on the samples in a neighborhood. Experiments show that the proposed method is not only robust to noise, but also identifies clusters with biological significance. Because FCM is sensitive to initialization and to the choice of parameters, its clustering results lack stability and biological interpretability. In this thesis, a new clustering approach is proposed which computes gene similarity in kernel space. It not only finds nonlinear relationships between gene expression profiles, but also identifies clusters of arbitrary shape. In addition, an initialization scheme based on Parzen density estimation is presented. The objective function is modified by adding a new weighting parameter, which accentuates the samples in high-density areas. Furthermore, a parameter selection algorithm is incorporated into the proposed approach, which automatically finds the optimal values for the parameters in the clustering process.
Experiments on synthetic data and real gene expression data show that the proposed method substantially outperforms conventional models in terms of stability and biological significance. Time-series gene expression is a special kind of microarray data. Applications of FCM rarely consider the characteristics of time series. In this work, a fuzzy clustering approach (FCMS) is proposed that uses splines to smooth time-series expression profiles, minimizing noise and random variation so that the general trend of expression can be identified. In addition, FCMS introduces a new geometric term, the radius of curvature, to capture trend information between splines. Results demonstrate that the new method has substantial advantages over FCM for time-series expression data.
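
For orientation, the conventional FCM baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of standard fuzzy c-means only, not the locally weighted, kernel-based, or spline-based variants proposed in the thesis. The function name `fuzzy_c_means`, its parameters (`c`, `m`, `iters`, `tol`, `seed`), and the random initialization are assumptions made for the example.

```python
import random

def fuzzy_c_means(points, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means; a hypothetical helper, not the thesis's method."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )
        # Membership update from squared Euclidean distances to each centre.
        shift = 0.0
        for i in range(n):
            d2 = [
                max(sum((points[i][d] - centers[k][d]) ** 2
                        for d in range(dim)), 1e-12)
                for k in range(c)
            ]
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d2]
            s = sum(inv)
            new_row = [v / s for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        # Stop once no membership changes by more than tol.
        if shift < tol:
            break
    return centers, u
```

Each row of the returned membership matrix sums to 1, and every sample contributes to every centre with a weight that depends only on its distance. This is the equal treatment of samples that the abstract identifies as the source of FCM's sensitivity to noise.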