Since I’m on a PCA roll, I decided to skip Chapter 3 on pre-processing and go to Chapter 4 of Chemometrics with R by Ron Wehrens on principal component analysis (PCA). You can have a look at my summary of the first two chapters here.
Things that I learned and things that I found interesting:
- He has some really good descriptions of PCA. Very clear. (Or perhaps it’s finally starting to sink in?) “…PCA deﬁnes new variables, consisting of linear combinations of the original ones, in such a way that the ﬁrst axis is in the direction containing most variation. Every subsequent new variable is orthogonal to previous variables, but again in the direction containing most of the remaining variation.”
- Here’s another description of PCA that he gives. “…provides a direct mapping of high-dimensional data into a lower-dimensional space containing most of the information in the original data.”
- The largest possible number of principal components (PCs) is the number of rows or the number of columns (whichever is smaller) of the data matrix X.
- Using all the PCs to describe the data space is equivalent to using the original matrix, only rotated so that its ‘axes’ are in the direction of highest variation
- Advantages of PCA: simple, unique analytical solution optimizing a clear and unambiguous criterion (maximum variance), often leads to more interpretable data representation
- Underlying assumption of PCA: variation equals information
- Singular value decomposition (SVD) is usually used in PCA algorithms instead of eigendecomposition on the covariance/correlation matrix of the data since it’s numerically more stable.
- “The loadings… give the weights of the original variables in the PCs.” So a variable (feature) with a low loading for a given PC contributes little to that PC.
- “The scores… constitute the coordinates in the space of the latent variables. Put differently: these are the coordinates of the samples as we see them from our new PCA viewpoint.”
- Sometimes (especially in cases where there’s a large difference in number of rows and columns) the SVD algorithm is faster when instead of applying it to the mean-centered data matrix X, it is applied to XTX or XXT, whichever is smaller.
- He presents some statistical methods for getting confidence intervals for the explained variances and for testing the equality of explained variances of individual PCs. Usually though, informal graphical methods are used to pick the number of PCs e.g. visual inspection of the scree plot.
- “In other applications, notably with spectral data, one can sometimes check the loadings for structure. If there is no more structure – in the form of peak-like shapes – present in the loadings of higher PCs, they can safely be discarded.”
- In case the data set is split e.g. into a training (calibration) set and a test (validation) set, if the training set is mean-centered and auto-scaled, make sure to scale the test set with the mean and standard deviation of the training set. “…both halves should have the same point of origin.”
- The prcomp function in R by default mean-centers but does not auto-scale the data
- The princomp function in R also does PCA but uses eigendecomposition in contrast to prcomp that uses SVD – since SVD is more stable, prcomp is preferred for regular applications
- However, princomp allows you to provide a specific covariance matrix that will be used for the decomposition, which can be useful in some situations
The remaining sections of Chapter 4 talk about multidimensional scaling, independent component analysis, projection pursuit and factor analysis. I’ll blog on these in a future post.
Journal Article: Chemometrics in Spectroscopy. Part 1. Classical Chemometrics – by Paul Geladi (2003) – 1.0 – 3.2
A few things I learned from it:
1. Every two years there is a review of chemometrics (conveniently titled ‘Chemometrics’) in the ACS journal ‘Analytical Chemistry’. The latest one is the 2010 review which I plan on reading through and blogging about in the near future.
2. There’s a sub-field of chemometrics called statistical design or design of experiments that I really haven’t seen much of up to this point of my learning. I must learn more about this.
3. Chemometrics makes use of the fact that many variables in the data matrix are correlated (e.g. spectral intensities at neighboring wavelengths often carry similar information). Reducing the data matrix to get rid of redundant information makes it easier to understand. Hopefully the residual contains just noise and/or information that is not as useful.
4. Visualization of data is important in chemometrics. Lots of plots, graphs, figures.
5. He goes over an example in detail of how to carry out principal component analysis (PCA) on a 7 by 12 data matrix of spectra for data exploration and how to interpret what you get out of the PCA.
- In PCA, the goal is to explain as much as possible the total sum of squares of (the data matrix) X with a minimum number of principal components (PCs). This is achieved by making the score vectors orthogonal and the loading vectors orthonormal.
- It’s important to know how much of the total sum of squares of X goes into the model and how much is left in the residual.
- Check the distribution of ‘noise’ in the residual to make sure it really is noise and there is no structure
- The result in PCA is unique but for the sign. Changing the sign of a score-loading vector pair doesn’t change the solution.
- The number of latent variables (PCs) to keep is often chosen subjectively, rarely objectively.
- Identify outliers and explain them. Errors in sampling?data handling?analysis?number rounding? Or sign of existence of unknown classes of objects?
- The application of a pre-processing scaling type (linear? logarithmic? exponential?) is in many cases ad hoc. You try the analysis without scaling, see what goes wrong and then see what kind of scaling you can apply to correct it by trial-and-error.
- Variables that contain mostly background noise will have low loadings
- Principal components have no natural connection to chemistry. The loadings with their constraint of orthogonality rarely have chemical meaning. Sometimes rotated loadings may be better in this respect e.g. orthomax, quartimax, varimax rotation
6. Several methods for clustering/classification were mentioned – distance-based methods (Euclidean, Minkowski, Mahalanobis, etc), Bayesian modeling (clusters described by multivariate normal distribution), fuzzy clustering, nearest-neighbor methods
I’ll write about the rest of the paper (on curve resolution and multivariate calibration) at a later date.
Finished watching the last three video tutorials on PCA from the website of the Quality and Technology Department of Food Science of the University of Copenhagen. The topics covered are how to choose the number of PCA components using cross-validation, and outliers. I also plan on watching their videos on partial least squares (PLS) regression, spectral pre-processing and variable selection and blogging on those.
A few things I learned:
- validation of a PCA model includes: selecting number of principal components (PCs), checking for problematic samples (outliers), checking for possibly irrelevant variables – these steps are interdependent (e.g. removing an outlier might change the number of PCs required) so this is an iterative process
- avoid underfitting (too few PCs) and overfitting (too many PCs)
- in PCA, can compare the explained variance on calibration (training) set with explained variance on cross-validation (test) set – the calibration/training set is the one used to make the PCA model, the cross-validation/test set is the data set used to check how well the model works
- the explained calibration variance will always increase with the number of PCs; the explained cross-validation variance might increase at first with number of PCs but at some point will start to decrease due to overfitting
- can compare the explained variance on calibration (training) set with explained variance on cross-validation (test) set for individual variables – this helps to check for irrelevant/disruptive variables
- want difference between explained calibration variance and explained cross-validation variance to be as small as possible – for a perfect model, the difference will be zero
- cross-validation only gives an indication of the number of components to keep – have to use your knowledge about the data to decide how many components to keep – don’t just use the number of components automatically suggested by the software as they are subject to effects of outliers and noise
- outlier – due to: measurement error? wrong label? noise? most interesting sample?
- be careful with arbitrarily removing outliers – if remove 50% of data calling them outliers, then model will only represent (at best) 50% of the population
- outliers might be an indication that more sampling needs to be done to sufficiently represent the populations/phenomena being modeled
- two kinds of outliers: extreme but in the model, and outside the model
- detecting outliers: ‘extreme but in the model’ outliers will have high scores in the score plot; ‘outside the model’ outliers will have high residuals
- testing for outliers with Hotelling’s T2 – multivariate generalization of student’s t-test; tests the score values of a specific sample compared to the variation in the remaining samples taking the covariance structure of the data into account; CAUTION: assumes multivariate normal distribution therefore if have multiple clusters and an outlier in between the clusters, won’t be found by Hotelling’s T2 ellipse (but can be seen visually in score plot)
- sometimes easier to see outliers in a score plot than in the raw data (e.g. spectra)
- leverage ~ squared Mahalanobis distance ~ linearly related to Hotelling’s T2
- Euclidean distance (circle): doesn’t take structure of data into account; might not be useful since some directions might be more important than others
- Mahalanobis distance (ellipse): distance to the center takes the position of all points into account; takes covariance structure into account
- high leverage means unique/extreme/outlying sample
- residuals usually assumed to be small, randomly distributed and of similar size; CAUTION: residuals could also contain non-noise aspects of the data that the model was unable to capture(model error) so be careful with assumptions
- compare size(variance) of E (residual matrix) with data: big E means low variance explained
- high residuals for one variable (column) indicates irrelevant/inconsistent variable
- high residuals for one sample (row) indicates outlier
- high residual for single elements might indicate measurement error
- influence plot: residual variance for each sample vs leverage (for a specific number of components)
- in influence plot, often see movement of outliers from upper left to lower right
- “…outliers may be indicative of data points that belong to a different population than the rest of the sample set.” – Wikipedia
Ended with the following recipe for doing PCA:
1. Decide on pre-processing: centering? to scale or not to scale?
2. Run PCA with a few components
3. Use scores and residual plots to identify outliers in these first few components
4. Decide how to handle outliers: remove bad outliers? get more data to improve sampling?
5. Use cross-validation to decide on number of components
6. Look for outliers and go back to Step 4 if any changes in the data
7. Assess model more carefully: make sure calibrated and validated fit is as expected, interpret score plots, interpret loadings plot, visualize residuals, etc
8. Use the model
His final words: don’t blindly accept the defaults given to you by the software
We’ve spent essentially all the lecture time this week going through a mathematical derivation of support vector machines (SVMs)
And we’re still deriving!
I’ll refer you to the lecture notes here (labeled Lecture notes 3) for all the glorious details.
A few highlights from the lecture and terms mentioned during the derivation though:
- margins: given a binary classification problem, SVMs seek to maximize the margin between the two classes
- kernels: enable SVMs to be used to perform non-linear classification by mapping low-dimensional data to high-dimensional feature space
- functional margins
- separating hyperplane
- geometric margins
- optimal margin classifier
- convex optimization
- quadratic programming
- constrained optimization problem
- Lagrange duality, Lagrange multipliers, generalized Lagrangian
- primal optimization problem
- dual optimization problem
- Karush-Kuhn-Tucker (KKT) conditions, KKT dual complementarity condition
- support vectors
- inner products (between input feature vectors)
- feature mapping
- Gaussian kernel
- Kernel matrix
- positive semi-definite matrix
- Mercer’s theorem
- polynomial kernel
- kernel trick
- sequential minimal optimization (SMO) algorithm
- coordinate ascent
This week we started on classification in earnest with decision tree learning.
Decision trees are an example of supervised learning where a labeled training set is used to learn the classifier (the tree) after which the performance of the tree is evaluated on a labeled test set and a loss function related to misclassification error is calculated.
Decision trees are not unique i.e. there could be more than one tree that fits the data. The tree found cannot usually be guaranteed to be the globally optimal one when the greedy forward algorithm is used.
Decision trees have trouble capturing structure not parallel to the feature axes. Consider for instance a two-class data set with two features x and y that is perfectly separable by the line x+y=1. While this would be an easy problem for a linear classifier, a decision tree would have to be quite complex to achieve the same level of separation.
Went over Hunt’s algorithm for growing decision trees.
Given various options on how to split a tree node, how do you decide which one to pick? The most commonly used measures are the Gini impurity index and the information gain/deviance/entropy. These are calculated for the various split options and the one that minimizes the Gini index/entropy (similar to maximizing purity of node) is chosen.
One could also use misclassification error to decide on a ‘best split’. Since this involves a function that is not smooth (unlike Gini index and entropy) this is more difficult to optimize numerically. Also, there might be splits where no gain in misclassification error is seen but there would be a gain in Gini and information.
Binary splits are usually preferred to multi-way splits since the number of data points per leaf node drops faster with multi-way splits than with binary splits.
Trees that are too deep tend to overfit the data. (You don’t want a classifier to rote-learn and memorize the training set data by heart as this will make it very poor at making predictions on other data sets.) In general, the test (misclassification) error will initially decrease as the number of nodes on the tree increase but at some point will flatten out and start to increase due to overfitting.
Pruning is usually done to avoid having a tree that is too deep.
Some criteria that could be used to terminate tree growth and prevent it from getting too deep and overfitting are:
– stop if node is pure i.e. all members of same class
– stop if node population is less than some threshold (to avoid the problem of having too little data in a node to make a statististically sound split)
– stop if expanding current node does not improve some impurity measure e.g. GINI, information/entropy/deviance
– stop if class distribution of instances are independent of the available features e.g test using chi-squared test
The rpart library of R allows one to specify a loss matrix which penalizes certain misclassification errors more than others e.g. making false negatives more unfavorable than false positives.
Looked at the confusion matrix and measures of performance for evaluating a classifier:
– accuracy/simple matching coefficient (SMC): what fraction of the data was correctly classified? [(TP + TN)/(TP + TN + FP + FN)]
– precision: of the data that was classified positive, what fraction was correctly classified? [TP/(TP + FP)]
– specificity: of the data that was actually negative, what fraction was correctly classified? [TN/(TN + FP)]
– recall/sensitivity: of the data that was actually positive, what fraction was correctly classified? [TP/(TP + FN)]
– F: 2(recall)(precision)/(recall + precision)
– the ideal classifier will have all these metrics equal 1
The textbook for the class is Introduction to Data Mining by Tan, Steinbach and Kumar. The slides used in class closely follow the slides by the book authors that you can download here on the book website. This week’s lectures are all based on the chapter four slides.
While it’s nice to hear how the scores and loadings of principal component analysis (PCA) are related to the singular value decomposition (SVD) of the data matrix, I think some of us are more visually oriented and want to see what exactly it is that PCA does to the coordinate system and how it reduces the dimensionality of the problem. I found a bunch of short video tutorials on how PCA works that go about describing PCA in a very visual and (to me) intuitive way on the website of the Quality and Technology Department of Food Science of the University of Copenhagen (what a mouthful!).
So far I’ve only watched the first four videos (PCA Intro 1 & 2 and PCA Theory 1 & 2) but they also have video tutorials on partial least squares (PLS) regression, spectral pre-processing and variable selection.
A few things I learned from the PCA videos:
- data does not equal information although one can get information from data; similarly, information does not equal knowledge although one can gain knowledge from information. (Apparently, they even named a pyramid after this principle)
- before doing PCA on a data set, ask yourself whether the data is representative and relevant. An extreme example is trying to use data on blood types, weight and height to try and determine a person’s income. Can the data you have be used to make the predictions you wish to make? Answering this question requires domain knowledge (and some ‘common sense’ I suppose).
- on a PCA score plot, two observations being close to each other implies similarity with respect to the variables explained by the principal components (PCs) being looked at; for a 2D score plot, this holds to the extent that the samples are close to the plane described by the PC axes (if the variance explained by the PCs is high then the observations/samples will be close to the plane)
- original (before PCA) axes can be projected to the new (after PCA) coordinate axes. From the projections can evaluate how important the variable are: if close to origo, then they don’t influence the model much, if close to each other, they carry essentially the same information
- using a biplot to show both samples and variables on the same plot
- some operations that are usually carried out on the data matrix before PCA are mean-centering (subtracting from each column its mean) and unit variance scaling/auto-scaling (dividing each column by its standard deviation) – this results in your variables having a mean of zero and standard deviation of one; the order doesn’t matter i.e mean-centering then auto-scaling is the same as auto-scaling then mean-centering
- PCA highlights high variance between the variables you give it – is high variance in the variables used in your PCA relevant for your problem?
- PCA is good for: exploratory analysis/hypothesis generating, dimension reduction, outlier detection and classification (e.g. in SIMCA)
Watch the videos for more details and better explanations. I plan on watching the rest of the videos on how to decide how many PCA components to keep and outliers next and will blog on those as well. Stay tuned.
Oh, they also have a lot of chemometrics-related links on their main website.
A continuation of the lecture summaries of the Autumn 2012 Data Mining and Analysis class at Stanford.
This is my summary of the third week of lecture in the Stanford Data Mining and Analysis (STATS 202) class.
- started on graphical exploratory data analysis
- visualization useful for detecting general patterns/trends and for detecting outliers
- went over stem-and-leaf plots, histograms, probability densities and cumulative distribution functions, quantile plots, boxplots, contour plots, heatmaps, correlation matrices and pairs plots
- went over multidimensional scaling
- discussed multidimensional arrays
- started on classification