name: quotation
layout: true
class: center, middle, inverse
---
# An Overview on Cluster Analysis
### By Mingbo Cheng
### 2019-1-17
.footnote[powered by remark.js]
---
layout: false
# OUTLINE
- Where Is Clustering Applied
- What kinds of Clustering Methods
- How to Evaluate
- What Makes Better Performance
- Conclusion
- References
---
# OUTLINE
- **Where Is Clustering Applied**
- What kinds of Clustering Methods
- How to Evaluate
- What Makes Better Performance
- Conclusion
- References
---
## Facebook Sentiment Analysis through LDA
---
### Google News Clustering
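A minimal sketch of MinHash, the technique the notes for this slide attribute to the 2007 Google News paper. The user names, article ids, hash construction (seeded MD5), and signature length are illustrative assumptions, not details taken from that paper.

```python
import hashlib


def minhash_signature(items, n_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [
        min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        )
        for seed in range(n_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# hypothetical click histories: sets of article ids
alice = {"a1", "a2", "a3", "a4"}
bob = {"a1", "a2", "a3", "a5"}
carol = {"z7", "z8", "z9"}

sig_alice = minhash_signature(alice)
sig_bob = minhash_signature(bob)
sig_carol = minhash_signature(carol)

print(estimated_jaccard(sig_alice, sig_bob))    # overlapping histories
print(estimated_jaccard(sig_alice, sig_carol))  # disjoint histories
```

Users whose signatures agree in many positions have similar click histories, so signature values can serve as cluster keys.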
???
A 2007 Google paper, "Google news personalization: scalable online collaborative filtering", describes MinHash as a probabilistic clustering method.
---
### Anomaly Detection
https://www.semanticscholar.org/paper/Traffic-Anomaly-Detection-Using-K-Means-Clustering-M%C3%BCnz-Li/634e2f1a20755e7ab18e8e8094f48e140a32dacd
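One simple way clustering can support anomaly detection, sketched under assumptions: given cluster centroids learned from normal data, flag any point that lies far from all of them. The centroids, points, and threshold here are hypothetical, and this is not necessarily the procedure of the linked paper.

```python
from math import dist


def flag_anomalies(points, centroids, threshold):
    """True for each point farther than `threshold` from every centroid."""
    return [min(dist(p, c) for c in centroids) > threshold for p in points]


# hypothetical centroids, e.g. from a k-means run on normal traffic
centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.2, 0.9), (7.5, 8.1), (4.5, 15.0)]
flags = flag_anomalies(points, centroids, threshold=2.0)
print(flags)
```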
???
1. PayPal: account thieves behave differently from legitimate users.
2. Aircraft engines: detect whether an engine is good enough.
---
#### Clustering cancer gene expression data: A comparative study
???
2008, BMC Bioinformatics

7 clustering algorithms:
- shared nearest-neighbor algorithm (SNN)
- nearest neighbors (NN)
- spectral clustering (SPC): similarity matrix, dimensionality reduction
- mixture of multivariate Gaussians (FMG)
- average linkage (AL)
- complete linkage (CL)
- single linkage (SL)

4 proximity measures:
- Pearson's correlation coefficient (P)
- Cosine (C)
- Spearman's correlation coefficient (SP)
- Euclidean distance (E), in four versions: Z0, Z1, Z2, Z3

Genetic conditions (can be seen as illnesses): Down's syndrome
---
# OUTLINE
- Where Is Clustering Applied
- **What kinds of Clustering Methods**
- How to Evaluate
- What Makes Better Performance
- Conclusion
- References
---
## Typical cluster models
- Connectivity models
https://www.datanovia.com/en/courses/hierarchical-clustering-in-r-the-essentials/
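A minimal sketch of a connectivity model: agglomerative clustering with single linkage, which repeatedly merges the two clusters whose closest members are nearest. The sample points and the stopping rule (a fixed cluster count) are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster distance is the smallest pairwise point distance (single linkage).
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        # pick the pair of clusters with the smallest single-linkage distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i] += clusters[j]  # merge cluster j into cluster i (i < j)
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = single_linkage(points, 3)
print(result)
```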
???
connectivity models: https://www.datanovia.com/en/courses/hierarchical-clustering-in-r-the-essentials/
---
## Typical cluster models
- Centroid models
https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/26182/versions/11/screenshot.jpg
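A minimal sketch of a centroid model, assuming Lloyd's k-means algorithm with caller-supplied starting centroids. The data and initial centroids are made up for illustration.

```python
from math import dist


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # update step: move each centroid to the mean of its group;
        # an empty group keeps its previous centroid
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
```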
???
k-means
---
## Typical cluster models
- Distribution models
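A minimal sketch of a distribution model, assuming a one-dimensional Gaussian mixture fitted with expectation-maximization (EM). The data, the initial means, and the variance floor are illustrative choices.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_gmm_1d(data, means, n_iter=50):
    """EM for a 1-D Gaussian mixture; `means` gives the initial component means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [
                w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_gmm_1d(data, means=[0.0, 6.0])
print(weights, means, variances)
```

Unlike k-means, each point gets a soft responsibility per component rather than a hard label.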
---
## Typical cluster models
- Density models
https://medium.com/@elutins/dbscan-what-is-it-when-to-use-it-how-to-use-it-8bd506293818
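A minimal sketch of DBSCAN, the density model covered by the linked article. The points, `eps`, and `min_pts` are illustrative; in this sketch a point counts itself among its neighbors.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```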
---
## Typical cluster models
- Subspace models
http://cw.fel.cvut.cz/old/_media/courses/a6m33bin/biclustering.pdf
---
## Typical cluster models
- Subspace models
http://cw.fel.cvut.cz/old/_media/courses/a6m33bin/biclustering.pdf
???
different time point, if the gene express
---
## Typical cluster models
- Graph-based models
https://en.wikipedia.org/wiki/Clique_(graph_theory)
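A minimal sketch of the clique idea behind graph-based models: treat each maximal clique (a set of nodes that are all pairwise connected and cannot be extended) as a candidate cluster. It assumes the basic Bron-Kerbosch enumeration; the example graph is made up.

```python
def maximal_cliques(graph):
    """Bron-Kerbosch enumeration; `graph` maps each node to its neighbor set."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-processed nodes
        if not p and not x:
            cliques.append(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
cliques = maximal_cliques(graph)
print(cliques)
```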
---
## Typical cluster models
- Neural models
http://www.lohninger.com/helpcsuite/kohonen_network_-_background_information.htm
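A minimal sketch of a Kohonen network (self-organizing map), assuming a one-dimensional grid of units with a Gaussian neighborhood and linearly decaying learning rate and radius. The data, initial weights, and hyperparameters are illustrative.

```python
from math import dist, exp


def best_unit(x, weights):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda u: dist(x, weights[u]))


def train_som(data, weights, n_epochs=50, lr0=0.5, radius0=1.0):
    """Train a 1-D self-organizing map; `weights` holds one vector per unit."""
    weights = [list(w) for w in weights]  # copy so the input is not modified
    for epoch in range(n_epochs):
        # learning rate and neighborhood radius both decay over time
        decay = 1.0 - epoch / n_epochs
        lr, radius = lr0 * decay, max(radius0 * decay, 1e-3)
        for x in data:
            bmu = best_unit(x, weights)
            for u, w in enumerate(weights):
                # units near the best-matching unit on the grid move more
                h = exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 0.8), (0.8, 0.9)]
initial = [(0.2, 0.2), (0.4, 0.4), (0.6, 0.6), (0.8, 0.8)]
trained = train_som(data, initial)
print(best_unit((0.0, 0.0), trained), best_unit((1.0, 1.0), trained))
```

After training, points from the two groups map to different units, and each point's best-matching unit acts as its cluster label.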
???
tutorial: http://www.pitt.edu/~is2470pb/Spring05/FinalProjects/Group1a/tutorial/som.html
---
# OUTLINE
- Where Is Clustering Applied
- What kinds of Clustering Methods
- **How to Evaluate**
- What Makes Better Performance
- Conclusion
- References
---
## Cluster evaluation
- Internal Evaluation
  - Davies-Bouldin index
  - Dunn index
- External Evaluation
  - Purity
  - Classification Criteria
???
Internal: based on the data itself
1. Davies-Bouldin index: centroids are far from each other and, within each cluster, points are near each other. Smaller is better.
2. Dunn index: dense and well-separated clusters. Higher is better.

External: uses data not used for clustering (labeled data)
- Purity: trivially perfect when each point is its own cluster; problems with imbalanced data
- Classification criteria: precision, recall, F-measure (true positives, false positives, true negatives, false negatives)
---
# OUTLINE
- Where Is Clustering Applied
- What kinds of Clustering Methods
- How to Evaluate
- **What Makes Better Performance**
- Conclusion
- References
---
## Data Preprocessing
- Dimensionality Reduction
- Noise Reduction
- Standardization
- Scaling
- Ranking
???
- Dimensionality reduction: PCA, non-negative matrix factorization (NMF)
- Noise reduction: in LDA, drop high-frequency words; "the" may offer us no information
- Standardization: zero mean, then unit variance
- Scaling: height and weight, S or L shirt size; different dimensions contribute differently
- Ranking: Function(x_i)
---
## Similarity Measures
- Minkowski Metric
q=1: Manhattan distance
q=2: Euclidean distance
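The q=1 and q=2 cases above are instances of one formula. A minimal sketch, with a hypothetical function name and sample vectors; note the absolute value, which keeps the distance non-negative for odd q.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length sequences."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, 1)  # |3| + |4|
euclidean = minkowski(x, y, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```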
$$d(x, y) = L\_q(x, y) = \sqrt[q]{\sum\_{i=1}^n|x\_i-y\_i|^q}$$
- Kullback-Leibler Divergence
- Cosine
???
Cosine: included angle
---
# OUTLINE
- Where Is Clustering Applied
- What kinds of Clustering Methods
- How to Evaluate
- What Makes Better Performance
- **Conclusion**
- References
---
## Conclusion
- Clustering Applications
- Kinds of Clustering Methods
- Clustering Evaluation
- Gain a Better Performance
---
## References
https://en.wikipedia.org/wiki/Cluster_analysis
Costa IG: Clustering cancer gene expression data: A comparative study
Mahamed G.H. Omran et al.: An overview of clustering methods
T. Soni Madhulatha: An overview on clustering methods
Taoufiq Zarra et al.: Topic Modeling and Sentiment Analysis in Facebook to Enhance Students Learning
Abhinandan Das et al.: Google News Personalization: Scalable Online Collaborative Filtering
---
## References cont.
Sara C. Madeira: Biclustering Gene Expression Data
https://www.youtube.com/watch?v=uCaPP4blYAg
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf
http://www.lohninger.com/helpcsuite/kohonen_network_-_background_information.htm
http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schulte/theses/phd/algorithm.pdf
---
## References cont.
Marcilio C. P. de Souto et al.: Comparative Study on Normalization Procedures for Cluster Analysis of Gene Expression Datasets
https://datascience.stackexchange.com/questions/22795/do-clustering-algorithms-need-feature-scaling-in-the-pre-processing-stage
http://mlwiki.org/index.php/SNN_Clustering
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC430175/
---
template: inverse
# Thanks!
## Q&A