Dietary Assessment

Test-Retest Reliability

The degree to which a measurement method produces the same result on repeated application to the same underlying reality — the statistical core of what consumers mean when they ask whether a scale, app, or survey is "consistent."

By James Oliver · Editor & Publisher · Updated April 18, 2026

Key takeaways

Test-retest reliability measures the reproducibility of a method across repeated applications to the same subject or meal.
It is typically reported as an intraclass correlation coefficient (ICC); under Cicchetti's conventional thresholds, ICC ≥ 0.75 counts as "excellent" and 0.60 to 0.74 as "good."
Reliability is necessary but not sufficient for validity: a reliably biased method gives the same wrong answer every time.
Dietary-recall instruments are notoriously poor on test-retest within a day (memory failure) and better across weeks (averaging effect).

Test-retest reliability is the degree to which a measurement method produces the same result when applied repeatedly to the same underlying reality. It is the statistical formalisation of the informal question: if I weigh this apple twice, or log this meal twice, or recall this day's intake from two different prompts, do I get the same answer? In dietary assessment, test-retest reliability is distinct from — and necessary but not sufficient for — validity.

The typical quantification

Test-retest reliability is most often reported as an intraclass correlation coefficient (ICC) for continuous outcomes or as Cohen's kappa for categorical ones. The ICC is a ratio of between-subject variance to total variance: a method where most of the observed variance comes from actual differences between subjects (rather than from measurement noise on the same subject) has a high ICC. Conventional thresholds in dietary-assessment methodology, following Cicchetti's 1994 guidance, are roughly:

ICC < 0.40: poor reliability.
ICC 0.40 to 0.59: fair.
ICC 0.60 to 0.74: good.
ICC ≥ 0.75: excellent.
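
The variance-ratio definition and the thresholds above can be sketched in a few lines of Python. This is a minimal illustration, assuming the one-way random-effects form, ICC(1,1), which is only one of several ICC variants and not necessarily the one a given study reports. The function names and the intake figures are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of rows, one row per subject, each row holding
    the same number of repeated measurements of that subject.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 subjects with the same number (>= 2) of repeats")

    grand_mean = mean(x for row in ratings for x in row)
    subject_means = [mean(row) for row in ratings]

    # Between-subject mean square: real differences between subjects.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: disagreement between repeats on one subject.
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row
    ) / (n * (k - 1))

    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("all measurements are identical; ICC is undefined")
    return (msb - msw) / denominator


def cicchetti_label(icc):
    """Map an ICC to the rough Cicchetti (1994) bands listed above."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Made-up energy intakes (kcal): two recalls each for four respondents.
recalls = [[2100, 2050], [1800, 1900], [2500, 2450], [1600, 1650]]
icc = icc_oneway(recalls)
print(round(icc, 3), cicchetti_label(icc))
```

In this invented data the respondents differ from one another far more than each respondent's two recalls differ, so between-subject variance dominates and the ICC lands in the top band.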

Reliability vs validity

A reliable method is one that gives the same answer twice. A valid method is one that gives the right answer. The two are distinct properties. A kitchen scale that consistently reads 5 grams high is highly reliable (test-retest ICC ≈ 1.0) but invalid (systematically biased). A 24-hour dietary recall can estimate true average intake well across many subjects, so it is valid at the group level, yet show poor within-subject test-retest reliability because people eat different things on different days.
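
The kitchen-scale case can be made concrete with a short sketch. The reference weights and the `biased_scale` function are hypothetical; the point is only that repeat readings agree perfectly while every reading is off by the same amount.

```python
from statistics import mean

# Hypothetical reference weights in grams (assumed known from a calibrated scale).
true_weights = [120.0, 85.0, 240.0]


def biased_scale(weight_g):
    """A perfectly repeatable scale that always reads 5 g high (illustrative)."""
    return weight_g + 5.0


first_reading = [biased_scale(w) for w in true_weights]
second_reading = [biased_scale(w) for w in true_weights]

# Reliability check: largest disagreement between the two readings.
max_retest_gap = max(abs(a - b) for a, b in zip(first_reading, second_reading))

# Validity check: average error against the reference weights.
mean_bias = mean(r - t for r, t in zip(first_reading, true_weights))

print(max_retest_gap, mean_bias)
```

A retest gap of zero says nothing about the 5 g offset; only a comparison against a reference reveals it.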

The practical implication: before trying to fix poor test-retest reliability, identify where the variance comes from. Random instrument noise shrinks when repeated measurements are averaged; systematic bias does not. When the variance comes from the underlying reality (intake that genuinely differs from day to day), a single measurement is capturing that day's intake rather than the habitual intake the user thinks it is measuring.

Test-retest in dietary-recall instruments

Dietary self-report instruments (FFQs, 24-hour recalls, multi-day food records) are routinely benchmarked on test-retest reliability, and the results are informative. Within-day test-retest of a 24-hour recall (asking the same respondent to recall yesterday's intake twice within a short interval) typically shows ICCs in the 0.6 to 0.8 range for macronutrients and lower values for specific foods. Across-week test-retest of FFQs on habitual intake typically shows ICCs of 0.5 to 0.7 for major nutrients; because an FFQ asks about habitual intake, it averages over day-to-day noise. The methodology paper of record is the EPIC validation study from Bingham and colleagues in the early 1990s.
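
The effect of averaging over day-to-day noise can be illustrated with the Spearman-Brown formula, a standard psychometric result that the text above does not itself cite. It predicts the reliability of the mean of several repeated measurements from the reliability of a single one, under the assumption that errors are independent across repeats. The function name and the single-day ICC of 0.4 are hypothetical.

```python
def spearman_brown(single_icc, n_repeats):
    """Predicted reliability of the mean of `n_repeats` measurements.

    Assumes each repeat has reliability `single_icc` and that errors
    are independent between repeats (an idealisation).
    """
    if n_repeats < 1:
        raise ValueError("n_repeats must be at least 1")
    return n_repeats * single_icc / (1 + (n_repeats - 1) * single_icc)


# Hypothetical single-day ICC of 0.4, averaged over 1, 3 and 7 days.
for days in (1, 3, 7):
    print(days, round(spearman_brown(0.4, days), 2))
```

Under these assumptions the predicted reliability rises with every added day, which is why multi-day averages tend to be more stable than any single day. Averaging does nothing for systematic bias, which is identical in every repeat.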

Test-retest in consumer apps

Consumer calorie-tracking apps vary widely on test-retest reliability when a user logs the same meal multiple times:

Barcode scanning is near-perfect (ICC ≈ 1.0) on the same product — scan the same UPC twice, get the same answer.
Manual entry, with the same kitchen scale and the same food, has an ICC that depends on how consistently the user looks up the same recipe or database entry each time.
Photo-logging with computer vision varies by image: the same photo run through the same deterministic model will produce the same result, but two photos of the same meal can produce different results because the input is different.

References

Cicchetti DV. "Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology". Psychological Assessment, 1994. doi:10.1037/1040-3590.6.4.284.
Bingham SA, Gill C, Welch A, Day K, Cassidy A, Khaw KT, Sneyd MJ, Key TJ, Roe L, Day NE. "Comparison of dietary assessment methods in nutritional epidemiology". British Journal of Nutrition, 1994. doi:10.1079/bjn19940064.
Willett W. "Nutritional Epidemiology, 3rd Edition". Oxford University Press, 2013.

Related terms