Differential Item Functioning









In brief, differential item functioning (DIF) occurs when groups (such as those defined by gender, ethnicity, age, or education) have different probabilities of endorsing a given item on a multi-item scale after controlling for overall scale scores. An item is labeled as having DIF when people with the same level of the latent trait but from different groups have unequal probabilities of giving a particular response. An item is labeled as non-DIF when people with the same level of the latent trait have equal probabilities of giving that response (for example, endorsing the item or answering it correctly), regardless of group membership. DIF should be studied because we are often interested in comparing groups, and such comparisons are only meaningful if the items function the same way in each group. It is important to note that DIF, also described as a lack of measurement invariance, matters for comparisons of a measure across time as well as across groups.


Benign vs. Adverse DIF

An important distinction is between “benign” and “adverse” DIF (Breslau et al., 2008). Benign DIF occurs when the groups differ in their probabilities of endorsing an item because the item taps a dimension of the underlying trait or attribute measured in the scale that manifests differently between the groups. Adverse DIF occurs when groups differ in their probabilities of endorsing an item because of artifactual elements in the measurement process, such as different understandings of a word or phrase used in the item. Benign DIF is not a form of measurement error, whereas adverse DIF is. That is, benign DIF reflects real group differences in the manifestation of the underlying trait or attribute whereas adverse DIF reflects biases in the measurement process.

There are no unambiguous quantitative methods to distinguish benign from adverse DIF; rather, distinguishing between them, or minimizing the degree of adverse DIF in your data, can be accomplished using several approaches, including focus groups with your groups of interest, follow-up in-depth interviews, and careful review of items by content experts in the area of the scale’s topic. The goal with each of these strategies is to create scale items where groups’ differential probabilities of endorsing the items are due to real group differences on the construct each item is tapping rather than to extraneous factors such as differential interpretation of words, etc.

It is important to keep in mind that the presence of adverse DIF in any one item, or in two or more items, does not guarantee that observed group differences in the overall measure will change once these items are identified and removed. This lack of guarantee is because two or more items with DIF may operate in opposing directions and balance each other out or because significant group differences in one or more items may not be sufficient to significantly alter group differences in the overall scale.
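
The balancing-out scenario can be made concrete with a small numeric sketch in Python. It assumes a two-parameter logistic item response model and entirely hypothetical item difficulties (the function name p_endorse and all numbers are illustrative, not taken from any study): two items show DIF of equal size in opposite directions, and a third shows none.

```python
import math

def p_endorse(theta, difficulty, discrimination=1.0):
    """Probability of endorsing an item under a 2PL model (illustrative)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

theta = 0.0  # the same trait level in both groups

# Hypothetical item difficulties for each group.
reference = [0.0, 0.0, 0.0]
focal = [0.5, -0.5, 0.0]  # item 1 harder, item 2 easier, item 3 unchanged

# Item-level gap in endorsement probability (focal minus reference).
item_gaps = [p_endorse(theta, f) - p_endorse(theta, r)
             for f, r in zip(focal, reference)]

# Gap in the expected scale score (sum of endorsement probabilities).
scale_gap = sum(item_gaps)

print("item-level gaps:", [round(g, 4) for g in item_gaps])
print("scale-level gap:", round(scale_gap, 4))
```

In this toy setup the two item-level gaps have opposite signs and offset each other at the chosen trait level, so the expected scale score is the same in both groups even though two of the three items show DIF. At other trait levels, or with unequal DIF magnitudes, the cancellation would generally be only partial.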

Detecting adverse DIF is important when comparing groups on a multi-item measure because we want to know if observed group differences on the measure are valid or biased estimates.

Testing for DIF

Multiple data analytic strategies exist for detecting DIF, many of which are discussed in the articles below.

  • Item Response Theory / Logistic Regression – Perform chi-square difference (likelihood-ratio) tests to compare the fit of a model that assumes the factor loading (discrimination) and threshold (severity) parameters are the same across the two groups (no DIF) with the fit of a model that allows them to differ (DIF).

  • Multiple Indicator Multiple Cause (MIMIC) – Use structural equation models and include covariates as predictors of the factor and also of the specific items.
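
As a rough illustration of the logistic regression approach, the Python sketch below (standard library only) simulates one dichotomous item with uniform DIF and compares a model containing only the matching variable with a model that adds group and a group-by-score interaction, in the spirit of Swaminathan and Rogers (1990). The simulated data, the function names (solve, fit_logistic, lr_dif_test), and the use of the simulated trait itself as the matching variable are all assumptions made for illustration; in practice the matching variable is typically the observed scale score, and dedicated statistical software would normally be used.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, y, iters=25):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return beta, loglik

def lr_dif_test(score, group, y):
    """Chi-square difference test (2 df) of the no-DIF model against the DIF model."""
    reduced = [[1.0, s] for s in score]                         # matching variable only
    full = [[1.0, s, g, s * g] for s, g in zip(score, group)]   # + group, + interaction
    _, ll_reduced = fit_logistic(reduced, y)
    _, ll_full = fit_logistic(full, y)
    g2 = max(0.0, 2.0 * (ll_full - ll_reduced))
    p_value = math.exp(-g2 / 2.0)  # chi-square survival function for df = 2
    return g2, p_value

# Simulated data (an assumption): group 1 is less likely to endorse the item
# than group 0 at the same trait level, i.e. uniform DIF.
random.seed(0)
score, group, y = [], [], []
for i in range(2000):
    g = i % 2
    theta = random.gauss(0.0, 1.0)
    prob = 1.0 / (1.0 + math.exp(-(1.2 * theta - 1.0 * g)))
    score.append(theta)
    group.append(float(g))
    y.append(1 if random.random() < prob else 0)

g2, p_value = lr_dif_test(score, group, y)
print("chi-square difference:", round(g2, 2), "p-value:", p_value)
```

A small p-value suggests that group membership predicts the item response beyond the matching variable, which flags the item for DIF. The test alone does not say whether that DIF is benign or adverse.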


Textbooks & Chapters

  • Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
    This is the classic textbook on differential item functioning. It highlights methods for testing test items that function differently for different groups.

  • de Ayala, R. J. (2008). The theory and practice of item response theory. New York, NY: Guilford.
    The textbook is focused on item response theory overall, but discusses differential item functioning and item bias. It also focuses on methods for identifying DIF.

Methodological Articles

  • Teresi J. and Fleishman J. Differential item functioning and health assessment. Quality of Life Research. 2007; 16(1): 33-42.
    This article provides a clear, brief introduction to DIF and explains why it matters. It also nicely summarizes strategies for detecting DIF.

  • Schmitt N and Kuljanin G. Measurement invariance: Review of practice and implications. Human Resource Management Review. 2008; 18(4): 210-222.
    This review discusses the implications of measurement invariance: “a measurement instrument violates invariance when two individuals from different populations [who] are identical on the underlying construct score differently on it.”

  • Swaminathan H. and Rogers H. Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement. 1990; 27(4): 361-370.
    This article describes the use of logistic regression procedures to detect DIF.

Application Articles

  • Breslau J. et al. Differential item functioning between ethnic groups in the epidemiological assessment of depression. The Journal of Nervous and Mental Disease. 2008; 196:297-306.
    This article provides an applied example using SIBTEST statistical software to detect DIF in U.S. blacks and whites in a diagnostic interview for major depression.

  • Crane P. et al. A comparison of three sets of criteria for determining the presence of differential item functioning using ordinal logistic regression. Quality of Life Research. 2007; 16 (1): 69-84.
    This article provides applied examples of three different logistic regression strategies for testing DIF, including Swaminathan and Rogers's.

  • Kwakkenbos et al. The comparability of English, French, and Dutch scores on the Functional Assessment of Chronic Illness Therapy – fatigue (FACIT-F): an assessment of differential item functioning in patients with systemic sclerosis. PLoS One. 2014; 9(3): e91979.
    The authors used multiple-indicator, multiple-cause (MIMIC) models (a type of structural equation model) to assess DIF in different language versions of the Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F) which is commonly used to assess fatigue in rheumatic diseases.

  • Lewis T. et al. Racial/ethnic differences in responses to the everyday discrimination scale: a differential item functioning analysis. American Journal of Epidemiology. 2012; 175(5): 391-401.
    The authors used multiple-indicator, multiple-cause (MIMIC) models (a type of structural equation model) to examine DIF on the Everyday Discrimination Scale (EDS) by race/ethnicity in the Study of Women’s Health Across the Nation (SWAN) cohort.

  • Uebelacker L. et al. Use of item response theory to understand differential item functioning of DSM-IV major depression symptoms by race, ethnicity, and gender. Psychological Medicine. 2009; 39:591-601.
    The authors used IRTLRDIF software (which relies on the LRT statistic) to test for DIF in major depression diagnostic interviews, comparing racial/ethnic and gender groups.

