Learning Evaluation Framework Design
An exploration of the knowledge, considerations, and outcomes that characterise workplace learning evaluation — shaped by each organisation's learners, context, and culture.
Knowledge foundations
Five interconnected bodies of knowledge inform evaluation framework design. Together, they support evaluations that go well beyond what any single model or checklist can offer.
1. Evaluation theory and models
Kirkpatrick, Phillips ROI, Brinkerhoff's Success Case Method, and CIPP each reflect the context in which they were developed. Understanding this lineage helps practitioners recognise the assumptions embedded in any model — and consider how well those assumptions fit their own setting.
2. Measurement theory and psychometrics
Validity, reliability, and the distinction between statistical and practical significance separate data that is trustworthy from data that is merely abundant. These concepts determine how well evaluation instruments measure what they are intended to measure.
3. Research methods
Quantitative and qualitative methods answer different questions about what learning produces and why. The balance between them depends on the evaluation questions, the learner community, and what each approach can realistically offer in the given setting.
4. Adult learning and transfer research
Grounding in how adults learn and transfer knowledge shapes what evaluation looks for. Research on transfer conditions — including support structures, practice opportunities, and organisational climate — is particularly relevant when evaluation extends beyond immediate outcomes.
5. Data literacy and organisational context
Statistical fluency supports honest interpretation and communication of findings. Equally important is understanding the organisational context: who holds the data, who has a stake in the findings, and which decisions are genuinely on the table. All of these shape what evaluation is possible and appropriate.
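Reliability, one of the measurement concepts named above, can be made concrete with a small calculation. The sketch below computes Cronbach's alpha, a common index of internal consistency, for a multi-item scale. The ratings, the four-item confidence scale, and the function name `cronbach_alpha` are hypothetical and only illustrate the arithmetic.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items in the scale
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across respondents
    item_vars = [pvariance([r[i] for r in responses]) for i in range(k)]
    # Variance of each respondent's total score
    total_var = pvariance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point ratings from six learners on a four-item confidence scale
ratings = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
]

alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

A high alpha suggests the items move together. It does not show that the scale measures the intended construct, which is a question of validity rather than reliability.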
Applying an established model mechanically is not the same as designing a framework. Considered design requires understanding which distinctions matter in a specific context, which levels of evaluation are genuinely answerable given real constraints, and what is gained or lost when particular elements are adapted or set aside. That judgement belongs to the people closest to the learners and the work.
Evaluation frameworks carry embedded assumptions about whose knowledge counts as evidence and who defines what success looks like. Scholarship in decolonising research methods — including work on Indigenous data governance, culturally responsive design, and participatory approaches — offers frameworks for examining and challenging these assumptions. Central to this work is the recognition that community members are holders of knowledge, not simply providers of data.
Design considerations
Every evaluation framework is shaped by the organisation it serves — its learners, its culture, and its particular context. What follows describes considerations that practitioners commonly navigate. They are not a sequence or a prescription, but an orientation to the kinds of questions that thoughtful framework design tends to address.
A framework designed to demonstrate regulatory accountability looks quite different from one oriented toward program improvement, strategic workforce decisions, or building evaluation capacity. These purposes can coexist, but they generate different priorities — and clarity about which matters most shapes every subsequent design choice.
Scope decisions — which programs, which levels, which learner communities — carry real consequences for what a framework can and cannot address. Making these choices explicitly, rather than defaulting to what is easiest to measure, is itself a meaningful act of design.
Questions this consideration addresses
- Why is evaluation being undertaken, and for whom?
- Which programs or interventions are in scope — and which are not?
- Whose decisions will the findings inform?
- What resources and timelines are realistic?
Frameworks without clarity on purpose risk producing findings that no one quite knows how to use.
Established models offer different ways of organising evaluation levels. No structure fits every context. The levels a framework includes — and the methods used within each — reflect the programs being evaluated, the data available, and what the organisation can credibly claim to measure.
Method selection involves the same contextual judgement. Surveys, assessments, observations, interviews, and focus groups each suit different questions and different communities. The choice between more and less resource-intensive approaches involves real tradeoffs — frameworks that make those tradeoffs visible make it easier to revisit them when circumstances change.
Questions this consideration addresses
- Which distinctions between outcome types matter in this context?
- What methods are appropriate for the learner community and program type?
- What evidence quality is needed to support the decisions being made?
- What is actually measurable given available data and access?
Data collection spans the learning lifecycle and involves learners, managers, and sometimes wider communities. The design of instruments and processes reflects both technical choices — what to ask, how, and when — and relational ones: who is being asked to share something of themselves, under what conditions, and with what understanding of how that information will be used and protected.
When learners understand why evaluation is happening and trust that their input matters, the resulting data tends to be more honest and more useful. Instruments that do not attend to these relational dimensions — or that are not culturally appropriate for the communities involved — risk producing data that reflects the instrument, not the learner.
Questions this consideration addresses
- Who are the appropriate sources of data for each evaluation question?
- How will participants understand the purpose and use of evaluation data?
- Are instruments culturally appropriate and free of unexamined assumptions?
- What are the ethical obligations to people providing data?
Combining quantitative and qualitative approaches tends to produce more complete understanding than either alone — particularly when the question is not only whether outcomes occurred, but what enabled or constrained them.
Analysis methods are chosen in relation to the data type and the evaluation question. The more consequential work is interpretation: situating findings in context, distinguishing statistical from practical significance, and being clear about what the data does and does not support.
A shift in scores might reflect the instrument, the curriculum, the delivery, the learning environment, or the particular community of learners. Identifying the most plausible explanation requires judgement grounded in the specific setting. This is where knowledge of measurement, adult learning, and organisational context matters most — and where respect for the complexity of people's experiences is essential.
Questions this consideration addresses
- What analysis approach is appropriate for each data type and question?
- What do the findings actually support as a conclusion — and what do they not?
- What alternative explanations for observed patterns exist?
- How do quantitative and qualitative findings relate to each other?
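The distinction between statistical and practical significance can be shown with a short calculation. The sketch below compares two hypothetical groups of assessment scores and reports both a p-value and Cohen's d as an effect size. The scores and the function name `compare_groups` are assumptions for illustration, and the p-value uses a normal approximation that is reasonable only for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def compare_groups(a, b):
    """Return (Cohen's d, two-sided p-value) for two independent groups.

    The p-value uses a normal approximation; a t-test would be the usual
    choice for small samples.
    """
    na, nb = len(a), len(b)
    diff = mean(b) - mean(a)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    # Pooled standard deviation, used to express the difference in SD units
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = diff / pooled_sd
    # Standard error of the difference in means
    z = diff / sqrt(va / na + vb / nb)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, p


# Hypothetical scores: 2,000 learners per group, means one point apart
comparison = [50, 60, 70, 80, 90] * 400
trained = [51, 61, 71, 81, 91] * 400

d, p = compare_groups(comparison, trained)
print(f"Cohen's d = {d:.3f}, p = {p:.3f}")
```

With samples this large, a one-point difference falls below the conventional 0.05 threshold while the effect size stays well under the 0.2 that Cohen's convention labels small. Whether a difference of that size matters is a judgement about the setting, not something the p-value can answer.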
Findings serve different audiences with different needs. Those designing programs need insight into what is and is not working. Leadership needs strategic framing and clarity on what systemic questions the findings raise. A single report that tries to serve all audiences at once tends to serve none of them well.
Findings that are specific, contextualised, and honest about their limitations tend to be more trusted and more used. Connecting findings to decisions — rather than simply presenting data — makes it more likely that evaluation leads somewhere.
Questions this consideration addresses
- Who needs to receive findings, and what will they do with them?
- What format and level of detail serves each audience?
- What decisions do the findings bear on, and how directly?
- How are limitations and uncertainties communicated honestly?
Evaluation that does not inform anything is documentation. The value of evaluation lies in its integration — findings shaping program design, surfacing transfer barriers, or prompting more fundamental questions about what is being offered to learners and why. Frameworks that build in explicit connections between evaluation and design decisions tend to improve both over time.
Each organisation finds its own approach to this integration, shaped by its culture and governance. Brinkerhoff's Success Case Method offers one documented approach: identifying learners who achieved meaningful outcomes, understanding what conditions supported them, and sharing that understanding across the organisation in ways that honour their experience.
Questions this consideration addresses
- How do evaluation findings reach the people who design and deliver programs?
- What mechanisms exist for findings to inform decisions about program changes?
- How is the framework itself reviewed and adapted as the organisation learns?
- How is evaluation knowledge built and retained across the L&D team?
Key questions in framework design
Every evaluation framework reflects choices about scope, depth, timing, and whose perspectives are included. These choices are not universal — they are made in relation to each organisation's learners, programs, culture, and the decisions the framework is meant to serve.
- What is evaluation for? Purpose shapes every other design decision.
  - Demonstrating accountability to regulators or funders
  - Improving program quality through ongoing evidence
  - Supporting strategic workforce and capability decisions
  - Building evaluation capacity within the L&D function
- What is in scope? Coverage reflects organisational priorities.
  - Specific programs or program types
  - Particular learner communities
  - Certain outcome levels (immediate, behavioural, organisational)
  - Specific phases of the learning lifecycle
- How much evidence is enough? Evidence quality involves explicit tradeoffs.
  - What decisions will this evidence support?
  - What confidence level is sufficient?
  - What resources does more rigorous evidence require?
  - What is lost when less rigorous approaches are used?
- When does evaluation happen? Timing determines what questions are answerable.
  - Before development: informing design choices
  - During delivery: monitoring engagement and learning
  - Immediately after: assessing immediate outcomes
  - Over time: tracking transfer and performance change
- Whose perspectives are included? Inclusion is a design choice, not a default.
  - Learners, as data sources, participants, or co-designers
  - Managers and supervisors
  - Subject matter experts and facilitators
  - Community members affected by the learning
- Who uses the findings? Use determines what reporting needs to look like.
  - Instructional designers making program decisions
  - L&D and HR leadership overseeing programs
  - Executives and boards making investment decisions
  - Learners and communities whose outcomes are being evaluated
Established models offer useful vocabulary and structure. They were also developed in particular organisational and cultural contexts, and carry those contexts' assumptions. Organisations that adapt models thoughtfully — making explicit what they are keeping, changing, and setting aside, and why — tend to produce evaluation that is more relevant to their own learners and more credible to the communities they serve.
There is no universally correct number of evaluation levels, no single best method, and no template that carries unchanged from one context to another. The value of engaging with the field lies in developing the judgement to design something that genuinely fits — the learners, the work, and the organisation.
What well-designed evaluation enables
Contextually appropriate evaluation tends to produce change at two horizons — some visible relatively quickly, others that take hold over time as evaluation becomes part of how an organisation understands and supports its learners.
What becomes visible
Evidence replaces assumption
Evaluation surfaces what learners are experiencing, knowing, and able to do — offering a more grounded picture than assumption or inference alone.
Investment is directed purposefully
Understanding which programs produce which outcomes for which learners makes it possible to direct resources where they are most needed.
Transfer barriers become visible
Evaluation beyond immediate outcomes often reveals that learning occurred but did not carry into practice — pointing toward environmental, relational, or structural factors that programs alone cannot resolve.
Evaluation competence grows
Practitioners who engage seriously with evaluation develop expertise in measurement, analysis, and evidence communication — capacities that strengthen their broader professional practice.
What becomes embedded
Evaluation is part of design, not after it
Where evaluation is woven into program design from the beginning, learning from evidence becomes continuous. Programs tend to improve more quickly, and the cycle of design-deliver-evaluate becomes genuinely iterative.
L&D earns strategic credibility
Organisations able to speak credibly about learning impact — with evidence grounded in their own context — are better positioned to advocate for their learners in decisions about investment, policy, and workplace conditions.
Equity gaps become visible and addressable
Frameworks that disaggregate findings and centre diverse perspectives surface inequities that aggregate data tends to obscure — making it possible to address them with the specificity and care they deserve.
Organisational learning compounds
Careful documentation of what supports learning — for whom, and under what conditions — builds knowledge that accumulates across programs and cycles, informing future design with hard-won understanding.
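The point above about disaggregated findings can be illustrated with a small example. The sketch below compares an overall transfer rate with the rate for each learner group. The groups, counts, and the function name `rate_by_group` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (learner group, applied the skill at work: 1 = yes, 0 = no)
records = (
    [("day shift", 1)] * 45 + [("day shift", 0)] * 5
    + [("night shift", 1)] * 10 + [("night shift", 0)] * 15
)


def rate_by_group(rows):
    """Return the mean outcome for each group."""
    groups = defaultdict(list)
    for group, outcome in rows:
        groups[group].append(outcome)
    return {group: mean(values) for group, values in groups.items()}


overall = mean(outcome for _, outcome in records)
by_group = rate_by_group(records)

print(f"overall: {overall:.0%}")
for group, rate in by_group.items():
    print(f"{group}: {rate:.0%}")
```

In this made-up data, the overall rate looks reasonable while one group sits far below the other, which is the kind of pattern an aggregate figure tends to hide. With small groups, rates are unstable and individuals may be identifiable, so how finely to disaggregate remains a contextual judgement.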
Models that compress meaningfully different outcomes into a single category can obscure where a program is succeeding and where it is not. Finer distinctions — between, for example, knowledge acquired, competence demonstrated under realistic conditions, and behaviour applied in practice — make it possible to locate a gap more precisely and respond more thoughtfully.
A learner may report a positive experience without having acquired the intended knowledge. Acquired knowledge does not guarantee the ability to apply it under realistic conditions. Demonstrated competence in a controlled assessment does not guarantee that learning carries into everyday work. Each gap is meaningful, and each points toward a different kind of response. Evaluation that can distinguish between them is better placed to serve learners and organisations well.
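One way to picture locating a gap is to treat the outcomes described above as an ordered chain and find the first stage that falls short. The sketch below is a simplified illustration only: the stage labels paraphrase the text, while the scores, the 0.7 threshold, and the function name `first_gap` are hypothetical.

```python
# Ordered outcome stages paraphrased from the text
STAGES = ["experience", "knowledge", "competence", "behaviour"]


def first_gap(scores, threshold=0.7):
    """Return the earliest stage whose score is below threshold, or None."""
    for stage in STAGES:
        if scores[stage] < threshold:
            return stage
    return None


# Hypothetical proportions of learners meeting the standard at each stage
program = {
    "experience": 0.92,
    "knowledge": 0.81,
    "competence": 0.55,
    "behaviour": 0.40,
}

print(first_gap(program))
```

A single threshold is a simplification. In practice each stage would likely have its own standard and its own evidence, and a gap at one stage calls for a different response than a gap at another.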
Evaluation framework lineage
Workplace learning evaluation frameworks have developed over seven decades. Each model built on — and sometimes pushed back against — what came before. Understanding this lineage helps practitioners read the assumptions embedded in any model they work with.
| Model | Core contribution | Structure | Limitations |
|---|---|---|---|
| Kirkpatrick (1959): the foundational model | Established a four-level structure (reaction, learning, behaviour, results) and a shared vocabulary that has shaped the field ever since. | 4 levels | Conflates outcomes that may be meaningfully distinct. Does not address transfer conditions, financial return, or broader social impact. Causality between levels is assumed rather than demonstrated. |
| Phillips ROI (1990s): financial accountability | Added a fifth level expressing return on investment in monetary terms, making the financial case for L&D investment explicit. | 5 levels | Resource-intensive. Requires attribution assumptions that are often difficult to substantiate. Most appropriate where isolating the financial impact of a specific program is genuinely feasible. |
| Kaufman's Five Levels: societal accountability | Extended the framework to include societal and community impact, arguing that organisations exist in and affect broader social contexts. | 5 levels including society | Societal impact is difficult to attribute to specific programs with confidence. Few organisations have the data infrastructure to evaluate at this level reliably. |
| Brinkerhoff's Success Case Method: narrative evidence | Shifted attention from average outcomes to individual experience, using structured interviews to understand what conditions enabled notable success or contributed to limited impact. | 6 stages | Does not offer population-level data. Particularly valuable alongside quantitative methods when understanding enabling conditions matters as much as measuring outcomes. |
| CIPP Model (Stufflebeam): comprehensive program evaluation | Evaluated Context, Input, Process, and Product, attending to the conditions that shape outcomes, not only the outcomes themselves. | 4 types | Resource-intensive. Developed for educational program evaluation; less commonly adapted to workplace L&D settings. |
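The ROI level in the table rests on a simple formula: net program benefits divided by program costs, expressed as a percentage. The sketch below applies it and shows how sensitive the result is to the attribution assumption noted under limitations. All figures, the 50% attribution share, and the function names are hypothetical.

```python
def roi_percent(benefits, costs):
    """ROI (%) = (benefits - costs) / costs * 100."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) / costs * 100


def benefit_cost_ratio(benefits, costs):
    """Benefit-cost ratio = benefits / costs."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return benefits / costs


# Hypothetical figures for one program
costs = 80_000
measured_benefits = 120_000

# Assumed share of the measured benefit attributable to the program
attribution = 0.5
attributed_benefits = measured_benefits * attribution

print(f"ROI, full attribution: {roi_percent(measured_benefits, costs):.0f}%")
print(f"ROI, 50% attribution:  {roi_percent(attributed_benefits, costs):.0f}%")
print(f"Benefit-cost ratio:    {benefit_cost_ratio(measured_benefits, costs):.2f}")
```

The same program moves from a positive to a negative return when the attribution share changes, which is why the attribution assumption deserves as much scrutiny as the arithmetic.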
Kraiger, Ford, and Salas (1993) proposed a taxonomy of learning outcomes across three dimensions: cognitive (knowledge and understanding), skill-based (procedural competence), and affective (attitudes, motivation, confidence). Each dimension requires different measurement approaches and responds to different conditions in the learning environment. Affective outcomes are frequently underemphasised in published frameworks, despite evidence that shifts in learner confidence and motivation can be stronger predictors of on-the-job transfer than knowledge gains alone.
Citations
Sources informing the knowledge claims and frameworks discussed in this resource, grouped by area.
Evaluation theory and models
Kirkpatrick, D.L. & Kirkpatrick, J.D. Evaluating Training Programs: The Four Levels. Berrett-Koehler, 1994 (revised 2006).
Phillips, J.J. Return on Investment in Training and Performance Improvement Programs. Gulf Publishing, 1997.
Brinkerhoff, R.O. The Success Case Method. Berrett-Koehler, 2003.
Stufflebeam, D.L. & Shinkfield, A.J. Systematic Evaluation. Springer, 1985.
Measurement theory and psychometrics
Kraiger, K., Ford, J.K., & Salas, E. “Application of cognitive, skill-based, and affective theories of learning outcomes to new methods of training evaluation.” Journal of Applied Psychology, 78(2), 311–328, 1993.
DeVellis, R.F. Scale Development: Theory and Applications. SAGE Publications, 2016.
Research methods
Creswell, J.W. & Plano Clark, V.L. Designing and Conducting Mixed Methods Research. SAGE Publications, 2017.
Patton, M.Q. Qualitative Research and Evaluation Methods. SAGE Publications, 2014.
Adult learning and transfer
Baldwin, T.T. & Ford, J.K. “Transfer of training: A review and directions for future research.” Personnel Psychology, 41(1), 63–105, 1988.
Merriam, S.B., Caffarella, R.S., & Baumgartner, L.M. Learning in Adulthood: A Comprehensive Guide. Jossey-Bass, 2007.
Equity and decolonising data
Quinless, J.M. Decolonizing Data: Unsettling Conversations about Social Research Methods. University of Toronto Press, 2022. ISBN 9781487523336. utppublishing.com
Zheng, L. FAIR Framework for equitable data practice. lilyzheng.co/fair-framework
First Nations Information Governance Centre. OCAP Principles (Ownership, Control, Access, Possession). fnigc.ca
This resource draws on scholarship in evaluation theory, psychometrics, research methods, adult learning, and decolonising data. It does not constitute professional or legal advice.