This page briefly describes splines as an approach to nonlinear trends and then provides an annotated resource list.
Defining the problem
Many of our initial decisions about regression modeling are based on the form of the outcome under investigation. Yet the form of our predictor variables also warrants attention. Continuous predictors sometimes, but not always, assumed to have a normal distribution (particularly when conducting multiple imputation, factor analysis, or other procedures that rely on a normality assumption). Some distributions can be readily transformed to more closely approximate normality. Common transformations include log-transformation, and polynomial transformations such as the square or cube.
Another assumption that can be addressed through variable transformation is the assumption that a exposures and covariates linearly predict the outcome (either on the original, log, or logit scale, depending on the link used). In other words, the generalized linear model “g(y) = constant + βX + error” implies that a one unit increase in any “X” variable predicts a fixed difference in “g(y)” that can be described by a single number: β. Thus, we make no distinction between a one unit increase in X near the minimum value of X and a one unit increase in X near the maximum value of X. Sometimes, this linearity is hypothesized directly, or incorporated more vaguely into our idea of a dose-response. However, our hypotheses are often less strict about the shape of the association, specifying only that the association should be monotonic (defined as having a slope that does not change sign, and is thus either increasing (+) or decreasing (-) but not switching between these). Occasionally, a specific non-linear association may be hypothesized, such as a u- or j-shaped curve. And, of course, there may exploratory or predictive analyses without a pre-specified hypothesis in which the investigators simply wish to describe the shape and strength of the association across the range of X when X is the exposure of interest.
Linearity assumptions for potential confounders also warrant attention, since departures from linearity may open the door to residual confounding. Consider for example an analysis of height adjusting for age. Depending on the range of ages in the study, the assumption of linearity may be implausible since we know that a child’s height increases with age while adult height is relatively stable, and perhaps starting to decline with age after many decades. Stating that one has “adjusted for age” may hide the inadequate model fit for this potential confounder, which might be more adequately characterized by adding a squared term (age*age) or other transformation. Likewise, the efficiency of age as a precision variable or in a predictive model would be limited by an unjustified assumption of linearity.
Linearity can be compared with more flexible forms in order to test whether added complexity significantly improves the fit to the data. If so, these flexible forms (including splines) may allow us to better characterize the shape and strength of the association of interest. A likelihood ratio approach to testing nested models is described at
In brief, a likelihood ratio test can be conducted by comparing the log-likelhoods from two models fit to the same set of observations (check for identical N), where the more complex model has one or more additional parameters. The difference in log-likelihoods is multiplied by 2, this value is compared to a chi-squared distribution with the degrees of freedom equal to the number of additional parameters in the more complex model. So, as a simple example, a model with X could be compared to a model with X and X2 by looking up the p-value for a twice the difference in log-likelihoods in a chi-squared table with 1 degree of freedom. A significant p-value indicates that the more complex model fits the data significantly better. If the p-value is non-significant, the simpler model might be considered sufficient.
Options for analysis of continuous predictors
A continuous predictor, either the exposure of interest or another covariate, can be analyzed in a range of methods from relatively simple to more complex and flexible.
1. Enter the predictor as is, assuming linearity; this captures the “first order trend” and can be easily summarized with one slope parameter, but may underestimate the strength of a truly non-linear association
2. Create one dichotomous variable or several dichototomous variables (dummy indicators) based on whether the predictor exceeds some specific values (e.g. published guidelines) or percentiles; this assumes zero slope within each exposure group and an abrupt change in the predicted outcome at the threshold.
3. Generate a grouped linear variable based on quantiles (tertiles, quartiles, quintiles, deciles) or pre-specified cutpoints, such that the categories can be treated continuously (e.g., a single variable with quartiles coded as the lowest category = 0 (reference), the second lowest = 1, the second highest = 2, and the highest = 3); this assumes zero slope within each group, an abrupt change in the predicted outcome at the threshold, and a linear increase across groups; significance tests from such grouped linear variables are often described as providing a “p-for-trend”
4. Maintaining the variable as continuous, but applying a log-transformation, square transformation, or other single polynomial transformation; this assumes that a 1-unit change in the transformed variable has similar meaning across the range of exposure, and the choice of transformation will affect which observations have more leverage to affect model fit (log-transformation or square root transformation tends to decrease the leverage of high values, while higher order transformation like the square or cube tend to increase the leverage of high values relative to using an untransformed version of the predictor)
5. Visually explore the shape of association through use of splines or other complex forms, and test whether these improve model fit
Understanding cubic or smoothing splines
Working with splines allows for powerful exploratory data visualization and tests of non-linearity. Splines are generally a flexible, data driven approach. There are many possibilities for how to fit splines, but we’ll focus on a popular in epidemiology: restricted cubic splines. These are usually adequate for the shapes of interest in epidemiology (e.g., U-shaped, J-shaped, or S-shaped curves; associations with a ceiling or floor effect that approach an asymptote).
To understand more complex splines, let’s start with a simple linear spline. Let’s imagine our X has a mean 0 and a standard deviation of 1, and is used to predict a continuous outcome Y. We select a set of cutpoints within the range of X that we label as knots; for this example, we place knots at -1, 0, and 1. We then use a scatterplot of Y vs X, and separately fit a line for the 4 regions: where X is below -1, where X is between -1 and 0, where X is between 0 and 1, and where X is greater than 1. Each line segment could have a slope and intercept estimated by ignoring the data outside of that region. However, we traditionally go one step farther to force the lines to meet at the same predicted Y value at each knot, so that there are no large “breaks” or “jumps” in the shape. This continuity constraint also limits the degrees of freedom used, since we now only need to estimate one intercept and 4 slopes.
Cubic splines add further flexibility by allowing the shape within each region to be based on a cubed transformation rather than a line. An additional “continuity” constraint is added by doing so, forcing the slopes and the rate of change in the slope to converge at each knot. This provides a smooth shape with no abrupt “corners”. The complexity of the shapes you can fit increases with the number of knots used, and can also be altered by the placement of the knots. However, to limit the inflections near the extremes of the data where observations may be sparse, we often perform restricted cubic splines, constrained to an approximately linear fit near the minimum and maximum.
Smoothing splines, while potentially similar to cubic splines in appearance, are fitted through generalized additive models instead of generalized linear models. Generalized additive models such as g(y) = constant + f(X) + error” include a flexible function of X optimized to predict the outcome. A balance is struck between a best fit line and interpolation among all unique values of X by using a penalty for changes in slope. The more degrees of freedom used for such splines, the more flexible the form will be. The degrees of freedom used can be pre-specified or estimated, and degrees of freedom used need not be an integer.
Tips for implementing splines
Use splines to visualize and test deviations from linearity in your associations of interest and those with continuous confounders.
If you think a simple transformation adequately fits the observed shape, you may not want to include a spline in your final model.
Planning your strategy in advance may help to avoid biased p-values (e.g., use 4 knots with default placement, use smoothing spline penalty that minimizes the AIC)
Splines can be fitted for one or more continuous predictors, and other predictors can be included in your model as well.
Splines can be summarized visually by graphing the predicted outcome versus the predictor on it’s original scale, or by displaying a table of predicted outcome values for several selected values of X.
When visualizing splines, be cautious of the wide confidence intervals that can arise where data are sparse — these can be distracting to the eye despite containing little information, and trimming them out of the graph may facilitate communication of results.
Textbooks & Chapters
March LC, Cormier DR. Spline regression models. SAGE Publications, Inc, 2002
This short text (62 pages) gives an accessible description of various spline approaches, including interpretation of spline output from software.
Fox J. Applied regression analysis and generalized linear models. SAGE Publications, Inc, 2008.
This text is not specific to the topic of splines and fractional polynomials, but does take a flexible approach to visualization throughout. These techniques are covered along with simpler (kernel density estimation, loess smoothing) and more complex (generalized additive and non-parametric modeling) approaches.
Greenland S. Dose-response and trend analysis in epidemiology: alternatives to categorical analysis. Epidemiology 1995;6(4):356-65.
Jacoby WG. Loess: a nonparametric, graphical tool for depicting relationships between variables. Electoral Studies 2000;19(4):577-613.
Steenland K, Deddens JA. A practical guide to dose-response analyses and risk assessment in occupational epidemiology. Epidemiology 2004;15(1):63-70.
Strasak AM, Umlauf N, Pfeiffer RM, Lang S. Comparing penalized splines and fractional polynomials for flexible modeling of the effects of continuous predictors. Computational Statistics & Data Analysis 2010;55(4):1540.
Durrleman S, Simon R. Flexible regression models with cubic splines. Stat Med 1989;8(5):551-61.
Wood S. Stable and efficient multiple smoothing parameter estimation for generalized additive models. JASA 2004;99:637-686
Hastie T, Tibshirani R. Generalized additive models for medical research. Statistical Methods in Medical Research 1995;4(3):187.
(Note that there is a typo in the column headers of Table 1 in this paper, as pointed out by a student. The column labeled “p value” contains negative numbers and numbers greater than 1, so these cannot be p values. They likely are the z-score test statistics on which p values would be based.)
Roberts S, Martin MA. The question of nonlinearity in the dose-response relation between particulate matter air pollution and mortality: can Akaike’s Information Criterion be trusted to take the right turn? Am J Epidemiol 2006;164(12):1242-50.
(For references discussing the problems with dichotomization of continuous variables, see Relative Risk Regression)
Strand BH, Kuh D, Shah I, Guralnik J, Hardy R. Childhood, adolescent and early adult body mass index in relation to adult mortality: results from the British 1946 birth cohort. J Epidemiol Community Health 2012;66(3):225-32.
This survival analysis use cubic splines to describe the relationship of BMI with mortality
Diez Roux AV, Ranjit N, Powell L, Jackson S, Lewis TT, Shea S, Wu C. Psychosocial factors and coronary calcium in adults without clinical cardiovascular disease. Ann Intern Med 2006;144(11):822-31.
This example uses a cubic spline to explore the association of psychosocial factors with coronary calcification (a log-transformed continuous outcome).
Hankinson JL, Odencrantz JR, Fedan KB. Spirometric reference values from a sample of the general U.S. population. Am J Respir Crit Care Med 1999;159(1):179-87.
This prediction-focused analysis used lower-order polynomials and piecewise polynomials to create lung function reference equations.
Lovasi GS, Lemaitre RN, Siscovick DS, Dublin S, Bis JC, Lumley T, Heckbert SR, Smith NL, Psaty BM. Amount of leisure-time physical activity and risk of nonfatal myocardial infarction. Ann Epidemiol 2007;17(6):410-6.
This study uses a smoothing spline to display the association between physical activity and case-control status in a study of myocardial infarction.
On simple transformations
On smoothers or splines (cubic, smoothing, or otherwise)
This describes local regression (PROC LOESS in SAS).
This is a posted chapter on spline curves with nice embedded graphs to illustrate polynomial transformations and continuity constraints as they relate to splines
This website shows examples of splines that extend beyond the cubic and smoothing versions discussed in class in order to capture more complex shapes
This describes how to run PROC GAM for smoothing splines in SAS
SAS code for non-linear relations and nice graphs for logistic and proportional hazards models