Eight Knowledge Foundations
Construct-irrelevant variance — score variation driven by factors extraneous to the trait being measured — is one of the two classic threats to validity. Diagnosing and removing it draws on several interconnected bodies of knowledge. The strongest assessment work holds all of these in view at once.
Educational & Psychological Measurement
The home discipline. Validity is treated as an ongoing, evidence-based judgment about a particular interpretation and use of scores, not a fixed property of a test. Within this frame, construct-irrelevant variance (CIV) is a central threat: when scores are shaped by factors outside the construct, the inference drawn from them is weakened.
Construct Theory & Test Design
CIV can only be diagnosed against a clearly specified construct. Defining the trait — reading comprehension, clinical reasoning, statistical literacy — before building items is what makes it possible to say which demands legitimately belong to the measurement and which are incidental. The working rule: define the construct first, operationalize second, validate the interpretation third.
Psychometrics & DIF
Reliability, item analysis, and Differential Item Functioning (DIF) supply the empirical machinery for detecting when an item behaves differently across groups for reasons unrelated to ability. Methods including the Mantel–Haenszel procedure, logistic regression, and item response theory flag items for review — a key source of evidence that an assessment is functioning fairly.
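As a rough illustration of the item-analysis and reliability side of this machinery, the sketch below computes classical item difficulty (proportion correct) and Cronbach's alpha using only the Python standard library. The function names and the `pilot` data are hypothetical and chosen for illustration; an operational analysis would normally rely on a dedicated psychometrics package and a far larger sample.

```python
from statistics import mean, pvariance

def item_difficulty(responses):
    """Proportion of examinees answering each item correctly (classical p-value)."""
    n_items = len(responses[0])
    return [mean([row[i] for row in responses]) for i in range(n_items)]

def cronbach_alpha(responses):
    """Cronbach's alpha from a list of per-examinee item-score lists."""
    k = len(responses[0])
    # Variance of each item across examinees.
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of examinees' total scores.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical pilot data: 5 examinees x 4 dichotomous items (1 = correct).
pilot = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

difficulty = item_difficulty(pilot)
alpha = cronbach_alpha(pilot)
print(difficulty, round(alpha, 3))
```

Neither statistic identifies CIV on its own. They describe how items and the total score behave, which is the background against which DIF results are then interpreted.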
Universal Design for Learning
The parent framework. Multiple means of action and expression directly counter CIV by separating the skill being assessed from the medium used to demonstrate it — so that an assessment of reasoning does not inadvertently become an assessment of writing, typing, or timed recall. Barriers are anticipated in the design rather than addressed after the fact.
Disability Studies & Accessibility
This area covers assistive technology, the distinction between reactive accommodation and proactive design, and standards such as WCAG. Many classic sources of CIV (timing pressure, rigid format, unnecessary sensory or motor demands) are accessibility barriers in another guise. Designing them out benefits far more learners than those who would have requested an accommodation.
Authentic & Performance Assessment
Real-world, application-based tasks reduce some of the artificial barriers that standardized formats introduce, while keeping the focus on what is actually meant to be measured. Their design requires equal rigour: authenticity does not relax the need for clear constructs, consistent scoring, and defensible inferences.
Cognitive Load Theory
A practical lens for separating load that is intrinsic to the construct from extraneous load imposed by confusing instructions, cluttered stimuli, or irrelevant complexity. Extraneous load is a frequent and largely invisible source of CIV: it lowers scores for reasons that have nothing to do with the trait under measurement.
Culturally Responsive & Equity-Centered Assessment
This body of work covers language demands, cultural assumptions embedded in prompts, and structured bias review. It asks whether a task measures the intended construct fairly across linguistic and cultural backgrounds, or whether it quietly rewards familiarity with the test-maker's world.
Removing Barriers Is Not Lowering the Bar
The most common misconception in this area is that removing construct-irrelevant barriers makes an assessment easier or less rigorous. The opposite is true. When a barrier unrelated to the construct is removed — a needless reading load on a mathematics item, a time limit that measures speed rather than competence, a culturally specific reference that advantages some test-takers — the score becomes a more accurate reflection of the trait being measured. Validity increases. What changes is precision, not standards.
The construct is the boundary that makes the whole judgment possible. Without a clear definition of what is and is not part of what is being measured, there is no principled way to distinguish a legitimate demand from an incidental one. This is why construct specification is treated here as the first and most consequential step, not a formality. Diagnosis of CIV is only as good as the construct definition it is measured against.
Why the Stakes Sharpen the Work
The cost of construct-irrelevant variance scales with the consequence of the decision it informs. In a low-stakes formative check, an imprecise score is a minor inconvenience. In a high-stakes or credentialing context, construct-irrelevant variance becomes a threat to the validity of the credential itself — the very thing the assessment exists to protect. In that setting, psychometric analysis, DIF review, and accommodation-law fluency move from useful refinements to defensibility-critical practice. The same principles apply across contexts; the tolerance for error is what differs.
How the Work Is Done
Identifying and removing construct-irrelevant variance is a structured, evidence-building process rather than a single review step. The stages below represent a comprehensive approach; in practice, depth is calibrated to the stakes of the assessment. Understanding the full process clarifies what is gained or lost when a stage is abbreviated.
Before any item is written, the construct is defined with enough precision that its boundary is visible: which knowledge, skills, or abilities belong to it, and which adjacent demands do not. A test of statistical reasoning, for example, must decide whether reading load, arithmetic fluency, or software proficiency are part of the target — or incidental to it.
This definition is the reference point against which every later judgment about construct-irrelevant variance is made. Where the construct is left implicit, CIV cannot be diagnosed reliably, because there is no agreed standard for what counts as extraneous.
- A written construct definition with explicit inclusions and exclusions
- A statement of the intended interpretation and use of scores
- Shared agreement among stakeholders on the boundary of the construct
Every task imposes demands. This stage inventories them and sorts each one as either intrinsic to the construct or incidental to it. Incidental demands are the candidate sources of construct-irrelevant variance: reading level above the construct's requirement, time pressure unrelated to the skill, fine-motor or sensory demands, cultural or contextual references, and extraneous cognitive load created by item design.
Cognitive load theory is useful here for distinguishing the complexity that genuinely belongs to the construct from complexity introduced by presentation. The aim is not to eliminate difficulty, but to ensure that difficulty comes from the trait being measured.
- A demand map classifying each task requirement as intended or incidental
- A prioritized list of likely CIV sources to address in design
- Design constraints carried forward into item and task construction
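One way to picture the demand map is as a small data structure: each task demand is recorded, marked as intrinsic or incidental against the written construct definition, and the incidental ones are ranked for attention. The sketch below assumes a hypothetical statistical-reasoning item; the `Demand` class, the 1-to-3 `severity` rating, and the example entries are all illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass

@dataclass
class Demand:
    description: str
    intrinsic: bool   # True if the written construct definition includes this demand
    severity: int     # hypothetical 1-3 rating of likely impact on scores

def prioritized_civ_sources(demands):
    """Return incidental demands only, most severe first."""
    incidental = [d for d in demands if not d.intrinsic]
    return sorted(incidental, key=lambda d: d.severity, reverse=True)

# Hypothetical demand map for one statistical-reasoning item.
demand_map = [
    Demand("Interpret a confidence interval", intrinsic=True, severity=3),
    Demand("Read a long, dense scenario", intrinsic=False, severity=3),
    Demand("Respond under tight time pressure", intrinsic=False, severity=2),
    Demand("Recognise a culture-specific sports reference", intrinsic=False, severity=1),
]

for demand in prioritized_civ_sources(demand_map):
    print(demand.severity, demand.description)
```

The intrinsic demand is deliberately excluded from the output: difficulty that belongs to the construct is kept, and only incidental demands become design constraints.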
Items and tasks are constructed to measure the construct while minimizing the incidental demands identified in the previous stage. Universal Design for Learning's principle of multiple means of action and expression is applied directly: where the medium of response is not part of the construct, learners are offered more than one way to demonstrate the same competence, with all options mapped to a single construct-aligned rubric.
Plain-language construction, uncluttered stimuli, and flexible timing reduce extraneous load. Authentic and performance tasks are introduced where they measure the construct more directly than a standardized format would.
- Draft items and tasks with incidental demands designed out
- Where appropriate, multiple response options mapped to one rubric
- A scoring approach aligned to the construct rather than to format
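The idea of several response options mapped to one rubric can be sketched as a scoring function that sees only rubric criteria and never the response mode. The criterion names, the rating scale, and the `Submission` class below are hypothetical; the sketch assumes the medium of response is not part of the construct.

```python
from dataclasses import dataclass

# Hypothetical rubric criteria for a reasoning construct; each is rated 0-4.
RUBRIC = ("identifies_claim", "weighs_evidence", "justifies_conclusion")

@dataclass
class Submission:
    learner: str
    mode: str      # e.g. "written", "oral", "diagram"; recorded but never scored
    ratings: dict  # criterion name -> rating on the shared rubric

def rubric_score(submission):
    """Total score from rubric criteria only; response mode plays no part."""
    missing = [c for c in RUBRIC if c not in submission.ratings]
    if missing:
        raise ValueError(f"unrated criteria: {missing}")
    return sum(submission.ratings[c] for c in RUBRIC)

ratings = {"identifies_claim": 4, "weighs_evidence": 3, "justifies_conclusion": 3}
written = Submission("learner_a", "written", dict(ratings))
oral = Submission("learner_b", "oral", dict(ratings))

# Equal ratings give equal scores regardless of how the work was presented.
print(rubric_score(written), rubric_score(oral))
```

Keeping `mode` out of the scoring function is the design choice that matters here: format can be logged for later analysis without being able to leak into the score.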
Draft materials undergo structured review for cultural assumptions, unnecessary language complexity, and accessibility barriers. Sensitivity and fairness review examines whether prompts assume background knowledge unrelated to the construct. Accessibility review checks digital materials against established standards and confirms compatibility with assistive technology.
This review is qualitative and anticipatory. It catches barriers that statistical analysis alone will not surface — particularly those affecting groups too small to register reliably in later item statistics.
- A documented bias and sensitivity review with recommended revisions
- An accessibility audit covering format, language, and assistive-technology compatibility
- Revised materials ready for piloting
Items are piloted with a representative sample and examined through item analysis, reliability estimation, and Differential Item Functioning analysis. DIF compares performance across groups matched on overall ability, flagging items that function differently for reasons unrelated to the trait being measured — the empirical signature of construct-irrelevant variance at the item level.
A flagged item is not automatically discarded. Statistical flags are reviewed qualitatively to determine whether the difference reflects genuine bias or a defensible feature of the construct, and items are revised or removed on that combined evidence.
- Item statistics, reliability estimates, and DIF results
- A documented adjudication of each flagged item
- A revised item set with sources of CIV removed or justified
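To make the matched-group comparison concrete, the following is a minimal sketch of the Mantel–Haenszel common odds ratio for a single dichotomous item, with examinees stratified by a matching score. The record layout, the group labels `"ref"` and `"focal"`, and the example counts are assumptions for illustration. The conversion to the delta scale (multiplying the natural log of the odds ratio by -2.35) follows the convention commonly associated with ETS reporting; significance testing and the handling of sparse strata are omitted.

```python
import math
from collections import defaultdict

def mantel_haenszel(records):
    """records: iterable of (group, matching_score, correct), group in {"ref", "focal"}.

    Returns (common odds ratio, delta-scale value). An odds ratio above 1
    (negative delta) means the item favours the reference group among
    examinees with the same matching score.
    """
    # One 2x2 table per score stratum: [n_correct, n_incorrect] for each group.
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, score, correct in records:
        tables[score][group][0 if correct else 1] += 1

    num = den = 0.0
    for table in tables.values():
        a, b = table["ref"]    # reference group: correct, incorrect
        c, d = table["focal"]  # focal group: correct, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio undefined for these data")
    odds_ratio = num / den
    return odds_ratio, -2.35 * math.log(odds_ratio)

def make(group, score, n_correct, n_incorrect):
    """Helper to build hypothetical records from cell counts."""
    return [(group, score, True)] * n_correct + [(group, score, False)] * n_incorrect

# Hypothetical counts for one item in two score strata.
records = (
    make("ref", 3, 8, 2) + make("focal", 3, 5, 5)
    + make("ref", 5, 4, 1) + make("focal", 5, 2, 2)
)
odds_ratio, delta = mantel_haenszel(records)
print(round(odds_ratio, 3), round(delta, 3))
```

A value like this is a flag for review, not a verdict. As described above, the item still goes to qualitative adjudication before it is revised, removed, or retained with a justification.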
Proactive design removes the need for many accommodations, but not all. Remaining needs are met through a documented accommodation process aligned with the applicable legal floor: in the United States, the ADA and Section 508; in Canada, the Accessible Canada Act and provincial statutes such as the Accessible British Columbia Act and Ontario's accessibility legislation.
The distinction matters: accommodations adjust the conditions of administration without altering the construct, whereas a modification that changes the construct changes the meaning of the score. Keeping that line clear is what makes both the design and the accommodation defensible.
- A documented accommodation process consistent with applicable law
- A clear record of which adjustments preserve the construct and which would alter it
- Policy alignment suitable for review by an accrediting or governing body
The evidence gathered across the process is assembled into an argument for the specific interpretation and use of scores. Following the contemporary, argument-based view of validity, this draws on multiple sources: content relevance, response processes, internal structure, relationships to other variables, and the consequences of use. The treatment of construct-irrelevant variance is documented as part of that argument.
Validation is treated as ongoing rather than a one-time certification. As populations, contexts, and uses change, the argument is revisited — which is also what allows an accrediting body to use the same evidence for its own quality assurance.
- A validity argument linking evidence to the intended interpretation and use
- Documentation of how CIV was identified, addressed, and monitored
- A basis for periodic revalidation as conditions change
What Shapes the Work
No two assessments call for identical treatment. Decisions about purpose, construct boundary, format, and evidence determine how much construct-irrelevant variance can be tolerated — and how much effort its removal warrants. These dimensions are best settled collaboratively, early.
Purpose & Stakes
What decision will this score inform?
A formative check and a credentialing exam tolerate very different levels of imprecision. The stakes of the decision set the standard for how rigorously construct-irrelevant variance must be identified and removed, and for how much validity evidence the use requires.
Construct Boundary
What exactly is, and isn't, being measured?
This is the decision that makes every other one possible. Until the construct's boundary is explicit, there is no principled way to call any demand extraneous. Time invested here prevents the most expensive errors later.
Format & Medium
Does the format add demands unrelated to the construct?
Multiple-choice, essay, oral, portfolio, and performance formats each carry their own incidental demands. The right format is the one that measures the construct most directly for this population, not the one that is most convenient to score.
Single vs. Multiple Means
Can learners demonstrate the construct in more than one way?
Where the medium of response is not part of the construct, offering multiple means of expression, all mapped to one rubric, removes a major source of CIV. Where the medium is the construct, flexibility would change what is being measured.
Population & Context
Who takes this, and what variability do they bring?
Linguistic, cultural, and access variability in the test-taking population determines which barriers are most consequential. The wider and less knowable the population, the more proactive design must do the work that reactive accommodation cannot.
Evidence & Defensibility
How much validity evidence does this use require?
Higher-stakes uses call for deeper evidence: DIF analysis, documented bias review, and a complete validity argument. Deciding this early shapes the pilot design and determines whether the assessment can withstand external scrutiny.
Core Competencies
The competencies below form the professional foundation of this work — spanning definition, design, analysis, and defensibility across formative, high-stakes, and credentialing contexts.
Construct Specification
Defining a construct with enough precision that its boundary is operationally clear — distinguishing the trait to be measured from adjacent demands. This is the competence on which the reliable diagnosis of construct-irrelevant variance depends.
Validity Argumentation
Treating validity as an evidence-based judgment about a specific interpretation and use, and assembling content, response-process, internal-structure, external, and consequential evidence into a coherent, defensible argument.
Item & Task Design
Constructing items and performance tasks that elicit the intended construct while designing out incidental demands — applying plain-language principles, uncluttered presentation, and, where appropriate, multiple means of expression mapped to a single rubric.
Psychometric Analysis
Conducting item analysis, estimating reliability, and applying Differential Item Functioning methods — Mantel–Haenszel, logistic regression, and item-response-theory approaches — together with the qualitative review needed to interpret flagged items responsibly.
Accessibility & UDL Application
Anticipating learner variability in the design itself: applying Universal Design for Learning, meeting recognized accessibility standards, and ensuring compatibility with assistive technology so that barriers are removed before they require accommodation.
Bias & Sensitivity Review
Leading structured review of cultural assumptions, language demands, and contextual references to ensure tasks measure the construct fairly across linguistic and cultural backgrounds rather than rewarding familiarity with a particular world.
Accommodation Law & Policy Fluency
Working knowledge of the applicable legal floor — the ADA and Section 508, and Canadian frameworks including the Accessible Canada Act and provincial statutes — and the judgment to distinguish accommodations that preserve the construct from modifications that alter it.
Curricular Alignment
Using backward design and constructive alignment to ensure the assessment actually maps to the stated learning goal. Misalignment is itself a structural form of construct-irrelevant variance, and correcting it is often the highest-value intervention available.
Equity-Centered Practice & Collaboration
Designing assessment to surface and remove barriers rather than to sort, and working alongside subject-matter experts, faculty, and the communities being assessed — including co-design and the disaggregation of results to detect disparities and guide revision.
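Disaggregation of results can be sketched very simply: scores are grouped by a subgroup label and summarised, with small groups marked so that their averages are read cautiously. The function name, the record layout, and the `min_n` threshold of 30 are hypothetical choices for illustration, not recommended values.

```python
from collections import defaultdict
from statistics import mean

def disaggregate(results, min_n=30):
    """results: iterable of (group_label, score).

    Returns {group: {"n": ..., "mean": ..., "small_n": ...}}, where small_n
    marks groups whose size falls below the (hypothetical) min_n threshold.
    """
    by_group = defaultdict(list)
    for group, score in results:
        by_group[group].append(score)
    return {
        group: {"n": len(scores), "mean": mean(scores), "small_n": len(scores) < min_n}
        for group, scores in by_group.items()
    }

# Hypothetical results; min_n is lowered so the tiny example is readable.
sample = [("A", 70), ("A", 80), ("A", 90), ("B", 60), ("B", 80)]
summary = disaggregate(sample, min_n=3)
print(summary)
```

A gap between group means is a prompt for closer review rather than evidence of bias in itself, since groups may differ for reasons the assessment is legitimately measuring. The small-group flag also signals where qualitative review and co-design have to carry more of the weight than statistics can.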
What Barrier-Free Assessment Produces
Removing construct-irrelevant variance creates value at two horizons. The shorter-term outcomes are concrete and measurable. The longer-term shift affects how an institution's results — and its credentials — hold up under scrutiny.
The construct is clearly bounded
A written definition makes explicit what is and is not being measured — the reference point that turns "this feels unfair" into a specific, addressable claim.
Barriers are named and removed
Incidental demands — reading load, time pressure, format, cultural reference — are identified and designed out, so scores reflect the trait rather than the delivery.
Scores mean what they claim
With extraneous variance reduced, the inference drawn from a score is better supported. Validity improves precisely because precision improves.
Flagged items are resolved on evidence
Items that function differently across groups are surfaced through DIF, reviewed qualitatively, and revised or justified — not left to chance or to complaint.
Decisions and credentials are defensible
A documented validity argument allows results to withstand external scrutiny — from learners, from accreditors, and from the bodies whose credentials depend on the assessment's meaning.
Equity is built in, not bolted on
Proactive design removes barriers for the whole population rather than routing individuals through a reactive accommodation process that cannot scale.
A standing capacity to review develops
Construct definitions, demand maps, and analysis routines become reusable infrastructure, so each new assessment starts further along than the last.
Results earn trust across stakeholders
When learners, faculty, and governing bodies can see that scores measure the intended construct fairly, the assessment's authority rests on evidence rather than assertion.
Academic & Professional Citations
The knowledge claims, frameworks, and evidence in this resource draw on established scholarship, professional standards, and law. Sources are grouped by the area of the resource they primarily support, and are listed so that each can be independently verified.
Messick, S. "Validity." In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13–103). American Council on Education / Macmillan, 1989. Origin of construct-irrelevant variance and construct underrepresentation as the two principal threats to construct validity.
Messick, S. "Validity of Psychological Assessment." American Psychologist, 50(9), 741–749, 1995. doi.org/10.1037/0003-066X.50.9.741
AERA, APA, & NCME. Standards for Educational and Psychological Testing. American Educational Research Association, 2014. Authoritative reference for validity, fairness, and the treatment of construct-irrelevant variance.
Kane, M. T. "Validating the Interpretations and Uses of Test Scores." Journal of Educational Measurement, 50(1), 1–73, 2013. doi.org/10.1111/jedm.12000
Cronbach, L. J., & Meehl, P. E. "Construct Validity in Psychological Tests." Psychological Bulletin, 52(4), 281–302, 1955. doi.org/10.1037/h0040957
Holland, P. W., & Thayer, D. T. "Differential Item Performance and the Mantel–Haenszel Procedure." In H. Wainer & H. I. Braun (Eds.), Test Validity (pp. 129–145). Lawrence Erlbaum, 1988.
Holland, P. W., & Wainer, H. (Eds.). Differential Item Functioning. Lawrence Erlbaum, 1993. Standard reference for DIF theory and methods.
CAST. Universal Design for Learning Guidelines, Version 2.2. CAST, 2018. Source for multiple means of action and expression as applied to assessment. udlguidelines.cast.org
World Wide Web Consortium (W3C). Web Content Accessibility Guidelines (WCAG) 2.1. W3C Recommendation, June 2018. w3.org/TR/WCAG21/
Sweller, J. "Cognitive Load During Problem Solving: Effects on Learning." Cognitive Science, 12(2), 257–285, 1988. doi.org/10.1207/s15516709cog1202_4
Sweller, J., Ayres, P., & Kalyuga, S. Cognitive Load Theory. Springer, 2011. Source for the intrinsic / extraneous / germane load distinction applied to item design.
Wiggins, G. Educative Assessment: Designing Assessments to Inform and Improve Student Performance. Jossey-Bass, 1998.
Montenegro, E., & Jankowski, N. A. Equity and Assessment: Moving Towards Culturally Responsive Assessment. Occasional Paper No. 29. National Institute for Learning Outcomes Assessment (NILOA), 2017. files.eric.ed.gov/fulltext/ED574461.pdf
Montenegro, E., & Jankowski, N. A. A New Decade for Assessment: Embedding Equity into Assessment Praxis. Occasional Paper No. 42. National Institute for Learning Outcomes Assessment (NILOA), 2020. files.eric.ed.gov/fulltext/ED608774.pdf
Randall, J. "Color-Neutral Is Not a Thing: Redefining Construct Definition and Representation Through a Justice-Oriented Critical Antiracist Lens." Educational Measurement: Issues and Practice, 40(4), 82–90, 2021. doi.org/10.1111/emip.12429
Inoue, A. B. Antiracist Writing Assessment Ecologies: Teaching and Assessing Writing for a Socially Just Future. The WAC Clearinghouse / Parlor Press, 2015. Open access. wac.colostate.edu/books/perspectives/inoue/
Parliament of Canada. Accessible Canada Act, S.C. 2019, c. 10. Establishes the goal of a barrier-free Canada and the federal duty to identify and remove barriers. laws-lois.justice.gc.ca/eng/acts/A-0.6/
Legislative Assembly of British Columbia. Accessible British Columbia Act, S.B.C. 2021, c. 19. bclaws.gov.bc.ca (SBC 2021 c. 19)
Americans with Disabilities Act of 1990, as amended (42 U.S.C. § 12101 et seq.), and Section 508 of the Rehabilitation Act (29 U.S.C. § 794d), governing accessibility of information and technology. ada.gov
Wiggins, G., & McTighe, J. Understanding by Design. 2nd ed. ASCD, 2005. Source for backward design and alignment of assessment to stated goals.
Biggs, J. "Enhancing Teaching Through Constructive Alignment." Higher Education, 32(3), 347–364, 1996. doi.org/10.1007/BF00138871