Model-based clustering and classification for data science: with applications in R

By: Bouveyron, Charles
Contributor(s): Celeux, Gilles (co-author) | Murphy, T. Brendan (co-author) | Raftery, Adrian E. (co-author)
Material type: Text
Series: Cambridge Series in Statistical and Probabilistic Mathematics; 50
Publisher: Cambridge: Cambridge University Press, 2019
Description: xvii, 427 p.: col. ill.; includes bibliographical references and index
ISBN: 9781108494205
Subject(s): Mathematical statistics | Cluster analysis | Statistics - Classification | R (Computer program language)
DDC classification: 519.53
Summary: Cluster analysis finds groups in data automatically. Most methods have been heuristic and leave open such central questions as: how many clusters are there? Which method should I use? How should I handle outliers? Classification assigns new observations to groups given previously classified observations, and also has open questions about parameter tuning, robustness and uncertainty assessment. This book frames cluster analysis and classification in terms of statistical models, thus yielding principled estimation, testing and prediction methods, and sound answers to the central questions. It builds the basic ideas in an accessible but rigorous way, with extensive data examples and R code; describes modern approaches to high-dimensional data and networks; and explains such recent advances as Bayesian regularization, non-Gaussian model-based clustering, cluster merging, variable selection, semi-supervised and robust classification, clustering of functional data, text and images, and co-clustering. Written for advanced undergraduates in data science, as well as researchers and practitioners, it assumes basic knowledge of multivariate calculus, linear algebra, probability and statistics.
Publisher URL: https://www.cambridge.org/in/academic/subjects/statistics-probability/statistical-theory-and-methods/model-based-clustering-and-classification-data-science-applications-r?format=HB
Item type: Books
Current location: Vikram Sarabhai Library (On Display)
Item location: Slot 1423 (0 Floor, East Wing)
Collection: Non-fiction
Call number: 519.53 B6M6
Status: Available
Barcode: 203258

Table of contents:

1. Introduction
1.1. Cluster Analysis
1.1.1. From Grouping to Clustering
1.1.2. Model-based Clustering
1.2. Classification
1.2.1. From Taxonomy to Machine Learning
1.2.2. Model-based Discriminant Analysis
1.3. Examples
1.4. Software
1.5. Organization of the Book
1.6. Bibliographic Notes

2. Model-based Clustering: Basic Ideas
2.1. Finite Mixture Models
2.2. Geometrically Constrained Multivariate Normal Mixture Models
2.3. Estimation by Maximum Likelihood
2.4. Initializing the EM Algorithm
2.4.1. Initialization by Hierarchical Model-based Clustering
2.4.2. Initialization Using the smallEM Strategy
2.5. Examples with Known Number of Clusters
2.6. Choosing the Number of Clusters and the Clustering Model
2.7. Illustrative Analyses
2.7.1. Wine Varieties
2.7.2. Craniometric Analysis
2.8. Who Invented Model-based Clustering?
2.9. Bibliographic Notes

3. Dealing with Difficulties
3.1. Outliers
3.1.1. Outliers in Model-based Clustering
3.1.2. Mixture Modeling with a Uniform Component for Outliers
3.1.3. Trimming Data with tclust
3.2. Dealing with Degeneracies: Bayesian Regularization
3.3. Non-Gaussian Mixture Components and Merging
3.4. Bibliographic Notes

4. Model-based Classification
4.1. Classification in the Probabilistic Framework
4.1.1. Generative or Predictive Approach
4.1.2. An Introductory Example
4.2. Parameter Estimation
4.3. Parsimonious Classification Models
4.3.1. Gaussian Classification with EDDA
4.3.2. Regularized Discriminant Analysis
4.4. Multinomial Classification
4.4.1. The Conditional Independence Model
4.4.2. An Illustration
4.5. Variable Selection
4.6. Mixture Discriminant Analysis
4.7. Model Assessment and Selection
4.7.1. The Cross-validated Error Rate
4.7.2. Model Selection and Assessing the Error Rate
4.7.3. Penalized Log-likelihood Criteria

5. Semi-supervised Clustering and Classification
5.1. Semi-supervised Classification
5.1.1. Estimating the Model Parameters through the EM Algorithm
5.1.2. A First Experimental Comparison
5.1.3. Model Selection Criteria for Semi-supervised Classification
5.2. Semi-supervised Clustering
5.2.1. Incorporating Must-link Constraints
5.2.2. Incorporating Cannot-link Constraints
5.3. Supervised Classification with Uncertain Labels
5.3.1. The Label Noise Problem
5.3.2. A Model-based Approach for the Binary Case
5.3.3. A Model-based Approach for the Multi-class Case
5.4. Novelty Detection: Supervised Classification with Unobserved Classes
5.4.1. A Transductive Model-based Approach
5.4.2. An Inductive Model-based Approach
5.5. Bibliographic Notes

6. Discrete Data Clustering
6.1. Example Data
6.2. The Latent Class Model for Categorical Data
6.2.1. Maximum Likelihood Estimation
6.2.2. Parsimonious Latent Class Models
6.2.3. The Latent Class Model as a Cluster Analysis Tool
6.2.4. Model Selection
6.2.5. Illustration on the Carcinoma Data Set
6.2.6. Illustration on the Credit Data Set
6.2.7. Bayesian Inference
6.3. Model-based Clustering for Ordinal and Mixed Type Data
6.3.1. Ordinal Data
6.3.2. Mixed Data
6.3.3. The ClustMD Model
6.3.4. Illustration of ClustMD: Prostate Cancer Data
6.4. Model-based Clustering of Count Data
6.4.1. Poisson Mixture Model
6.4.2. Illustration: Velib Data Set
6.5. Bibliographic Notes

7. Variable Selection
7.1. Continuous Variable Selection for Model-based Clustering
7.1.1. Clustering and Noisy Variables Approach
7.1.2. Clustering, Redundant and Noisy Variables Approach
7.1.3. Numerical Experiments
7.2. Continuous Variable Regularization for Model-based Clustering
7.2.1. Combining Regularization and Variable Selection
7.3. Continuous Variable Selection for Model-based Classification
7.4. Categorical Variable Selection Methods for Model-based Clustering
7.4.1. Stepwise Procedures
7.4.2. A Bayesian Procedure
7.4.3. An Illustration
7.5. Bibliographic Notes

8. High-dimensional Data
8.1. From Multivariate to High-dimensional Data
8.2. The Curse of Dimensionality
8.2.1. The Curse of Dimensionality in Model-based Clustering and Classification
8.2.2. The Blessing of Dimensionality in Model-based Clustering and Classification
8.3. Earlier Approaches for Dealing with High-dimensional Data
8.3.1. Unsupervised Dimension Reduction
8.3.2. The Dangers of Unsupervised Dimension Reduction
8.3.3. Supervised Dimension Reduction for Classification
8.3.4. Regularization
8.3.5. Constrained Models
8.4. Subspace Methods for Clustering and Classification
8.4.1. Mixture of Factor Analyzers (MFA)
8.4.2. Extensions of the MFA Model
8.4.3. Parsimonious Gaussian Mixture Models (PGMM)
8.4.4. Mixture of High-dimensional GMMs (HD-GMM)
8.4.5. The Discriminative Latent Mixture (DLM) Models
8.4.6. Variable Selection by Penalization of the Loadings
8.5. Bibliographic Notes

9. Non-Gaussian Model-based Clustering
9.1. Multivariate t-Distribution
9.2. Skew-normal Distribution
9.3. Skew-t Distribution
9.3.1. Restricted Skew-t Distribution
9.3.2. Unrestricted Skew-t Distribution
9.4. Box-Cox Transformed Mixtures
9.5. Generalized Hyperbolic Distribution
9.6. Example: Old Faithful Data
9.7. Example: Flow Cytometry
9.8. Bibliographic Notes

10. Network Data
10.1. Introduction
10.2. Example Data
10.2.1. Sampson's Monk Data
10.2.2. Zachary's Karate Club
10.2.3. AIDS Blogs
10.2.4. French Political Blogs
10.2.5. Lazega Lawyers
10.3. Stochastic Block Model
10.3.1. Inference
10.3.2. Application
10.4. Mixed Membership Stochastic Block Model
10.4.1. Inference
10.4.2. Application
10.5. Latent Space Models
10.5.1. The Distance Model and the Projection Model
10.5.2. The Latent Position Cluster Model
10.5.3. The Sender and Receiver Random Effects
10.5.4. The Mixture of Experts Latent Position Cluster Model
10.5.5. Inference
10.5.6. Application
10.6. Stochastic Topic Block Model
10.6.1. Context and Notation
10.6.2. The STBM Model
10.6.3. Links with Other Models and Inference
10.6.4. Application: Enron E-mail Network
10.7. Bibliographic Notes

11. Model-based Clustering with Covariates
11.1. Examples
11.1.1. CO2 and Gross National Product
11.1.2. Australian Institute of Sport (AIS)
11.1.3. Italian Wine
11.2. Mixture of Experts Model
11.2.1. Inference
11.3. Model Assessment
11.4. Software
11.4.1. flexmix
11.4.2. mixtools
11.4.3. MoEClust
11.4.4. Other
11.5. Results
11.5.1. CO2 and GNP Data
11.5.2. Australian Institute of Sport
11.5.3. Italian Wine
11.6. Discussion
11.7. Bibliographic Notes

12. Other Topics
12.1. Model-based Clustering of Functional Data
12.1.1. Model-based Approaches for Functional Clustering
12.1.2. The fclust Method
12.1.3. The funFEM Method
12.1.4. The funHDDC Method for Multivariate Functional Data
12.2. Model-based Clustering of Texts
12.2.1. Statistical Models for Texts
12.2.2. Latent Dirichlet Allocation
12.2.3. Application to Text Clustering
12.3. Model-based Clustering for Image Analysis
12.3.1. Image Segmentation
12.3.2. Image Denoising
12.3.3. Inpainting Damaged Images
12.4. Model-based Co-clustering
12.4.1. The Latent Block Model
12.4.2. Estimating LBM Parameters
12.4.3. Model Selection
12.4.4. An Illustration
12.5. Bibliographic Notes



Powered by Koha