Principal Components Analysis

“The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set” (Jolliffe 2002). The goal of PCA is to replace a large number of correlated variables with a smaller set of uncorrelated principal components. Each component is an optimally weighted linear combination of the original variables, with the weights derived from the correlation (or covariance) matrix of the data. Because components are ordered by the amount of variance they explain, the first few account for the largest share of the total variance and can be retained for use in subsequent regression models without concern about multicollinearity among them. Consider using this technique when: (1) you have a large data set with correlated variables, (2) these variables capture some underlying constructs that you aim to describe, and (3) you do not want to include all of the original measures in subsequent analyses.
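
As a rough illustration of the idea described above, the following Python sketch derives principal components from the correlation matrix of a small simulated data set. It is a minimal example under stated assumptions, not a recommended analysis pipeline: the function name pca_from_correlation, the simulated data, and the choice to keep two components are all hypothetical and chosen only for illustration.

```python
import numpy as np


def pca_from_correlation(X, n_components):
    """Illustrative PCA via eigendecomposition of the correlation matrix.

    Hypothetical helper for this example; returns component scores,
    the weights (eigenvectors), and the proportion of variance explained.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each variable (z-scores); PCA on the correlation matrix
    # is equivalent to PCA on standardized data.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the original variables (columns).
    R = np.corrcoef(X, rowvar=False)
    # eigh is for symmetric matrices; it returns eigenvalues in ascending
    # order, so reorder them from largest to smallest.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each eigenvalue is the variance of its component; their sum equals
    # the number of variables.
    explained = eigvals / eigvals.sum()
    weights = eigvecs[:, :n_components]
    # Component scores: weighted linear combinations of the z-scores.
    scores = Z @ weights
    return scores, weights, explained


# Simulated data (an assumption for illustration): four variables driven by
# one shared latent construct, plus two unrelated noise variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
correlated = np.hstack(
    [latent + 0.3 * rng.normal(size=(200, 1)) for _ in range(4)]
)
X = np.hstack([correlated, rng.normal(size=(200, 2))])

scores, weights, explained = pca_from_correlation(X, n_components=2)
print("Proportion of variance explained:", np.round(explained, 3))
print("Correlation between retained components:")
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```

The retained scores are uncorrelated with one another by construction, which is why they can be entered together in a later regression model. In practice, how many components to keep is a judgment call that the texts listed below discuss in detail.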


Textbooks & Chapters

Excellent introduction for those planning on using PCA for questionnaire development:
Dunteman, G.H., Principal Components Analysis. Quantitative Applications in the Social Sciences. Sage Publications, Inc., 1989.

Excellent resource for those interested in learning more about the theoretical underpinnings of PCA:
Jolliffe, I.T., Principal Component Analysis. Second Edition. Springer, 2002.

Methodological Articles

In Genetic Epidemiology:
Moser, K.L., et al., Comparison of three methods for obtaining principal components from family data in genetic analysis of complex disease. Genetic epidemiology, 2001. 21 Suppl 1: p. S726-31.

Application Articles

PCA in Genome-Wide Association Studies:
Price, A.L., et al., Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics, 2006. 38(8): p. 904-9

PCA in Nutritional Epidemiology:
Navarro Silvera, S.A., et al., Principal component analysis of dietary and lifestyle patterns in relation to risk of subtypes of esophageal and gastric cancer. Annals of epidemiology, 2011. 21(7): p. 543-50.

Hu, F.B., Dietary pattern analysis: a new direction in nutritional epidemiology. Current opinion in lipidology, 2002. 13(1): p. 3-9.

PCA in Cardiovascular Disease Epidemiology:
Okin, P.M., et al., Principal component analysis of the T wave and prediction of cardiovascular mortality in American Indians: the Strong Heart Study. Circulation, 2002. 105(6): p. 714-9.

PCA in Social Epidemiology:
Hurtado, D., I. Kawachi, and J. Sudarsky, Social capital and self-rated health in Colombia: the good, the bad and the ugly. Social science & medicine, 2011. 72(4): p. 584-90.

PCA in Oral Epidemiology:
Bueno, R.E., S.J. Moyses, and S.T. Moyses, Millennium development goals and oral health in cities in southern Brazil. Community dentistry and oral epidemiology, 2010.

PCA in Psychiatric Epidemiology:
Stewart, S.E., et al., Principal components analysis of obsessive-compulsive disorder symptoms in children and adolescents. Biological psychiatry, 2007. 61(3): p. 285-91.


For a brief tutorial on the method and underlying statistics used in PCA:
L.I. Smith. A Tutorial on Principal Component Analysis. pp 1-26, February 2002 Available at:


