Confusion - The Start of Learning
When Confusion Becomes a Measurement Problem
Discrimination Is Not About Tricking Learners
Training Item Writers for Better Assessment Design and Interpretation
Technology and the Future of Item Quality
“You know nothing, Jon Snow…”
This line may be familiar to fans of the television series Game of Thrones. Similarly, Socrates is reported to have said: “The only true wisdom is in knowing you know nothing” (Plato, 2002). Both expressions draw attention to the limits of human knowledge. In society, admitting “I do not know” carries a stigma, even though Socrates regarded it as the only true wisdom and the phrase continues to be used in many contexts.
On the other hand, accepting that we do not know everything brings a natural calm. Recognising what we do not know can also add value to what we already know. So, “I know” and “I do not know” are philosophically intertwined confessions. An honest admission of confusion is a sign of self-aware learning (Glaser, Chudowsky, & Pellegrino, 2001; Tanner, 2010).
After all, understanding what students do not yet know is often the first step towards helping them learn it. However, this immediately raises another question: how can we be confident that an incorrect answer genuinely reflects a lack of knowledge rather than confusion created by the assessment itself? To answer this, we first need to examine how confusion itself can become a measurement problem.
Students should not be confused when taking assessments. If they are, they lose the opportunity to demonstrate their true knowledge and skills. That lost opportunity can have consequences, affecting university admissions, access to scholarships, career pathways, and other important educational decisions. Therefore, high-stakes assessments should be supported by rigorous quality assurance processes that promote clarity and, ultimately, maintain the validity, reliability, and fairness of educational decisions.
In the educational assessment context, public confusion often arises from two sources: difficulty understanding what assessment scores actually mean and concerns about the clarity of the questions used to generate those scores. Let us unpack each of them below:
Score Interpretation Issue. In assessment, the concepts of knowing and not knowing are translated in a particular way. “I know” is typically reflected in a correct response, coded as 1, while “I do not know” is reflected in an incorrect response, coded as 0. These coded responses contribute to a numerical score that describes a student's level of proficiency. However, scores reported on their own are only abstract representations of learning, and a student may find them difficult to interpret. What happens next is a game of assumptions about what a particular score actually means.
In principle, every score band should represent a defined level of proficiency, describing what learners know and can do, similar to the CEFR levels A1, B2, and C1. When we see these CEFR levels, we begin to form expectations about a person's language proficiency. For instance, when someone holds a C1 certificate, we understand that they have the language skills necessary to undertake higher education in that language. In other words, the proficiency level is associated with a set of expected abilities and competencies that help us interpret what the certificate represents. Without appropriate assessment procedures (clear test specifications, piloting and pre-testing, standard setting, and psychometric analysis) to define each level of proficiency, such as low, medium, and high performance, these interpretations become difficult to justify and may not accurately reflect student achievement (AERA, APA, & NCME, 2014; Kane, 2013).
Ideally, assessment results should be reported not only as scores, but also with clear descriptions of what each score band represents and how student performance should be interpreted. This would help transform scores from abstract numbers into clear information that can support learning and improvement.
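As a rough illustration of this idea, the sketch below sums 1/0 responses into a raw score and reports it together with a band descriptor. The band labels, cut scores, and descriptor wording are hypothetical placeholders; in practice they would come out of a standard-setting process, not from code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreBand:
    label: str
    min_score: int   # lowest raw score in the band (hypothetical cut score)
    descriptor: str  # what learners in this band typically know and can do


# Hypothetical bands for a 10-item test, ordered from highest to lowest.
BANDS = [
    ScoreBand("High", 8, "Applies the concept reliably in unfamiliar problems."),
    ScoreBand("Medium", 5, "Applies the concept in familiar, routine problems."),
    ScoreBand("Low", 0, "Recognises the concept but applies it inconsistently."),
]


def raw_score(responses):
    """Sum dichotomous responses: 1 = correct, 0 = incorrect."""
    return sum(responses)


def report(responses):
    """Return the raw score together with its band label and descriptor."""
    score = raw_score(responses)
    for band in BANDS:
        if score >= band.min_score:
            return f"{score}/{len(responses)} ({band.label}): {band.descriptor}"
    raise ValueError("score is below the lowest band")


print(report([1, 1, 0, 1, 0, 1, 1, 0, 1, 0]))
```

The point of the sketch is the shape of the report: the number never travels without a description of what it represents.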
Question Clarity Issue. The clarity of assessment questions is often compromised by ambiguous wording, unnecessarily complex language, poorly structured distractors, or unclear instructions. All of these factors make students struggle for reasons unrelated to their actual knowledge or skills. Item writers may also deliberately try to make questions as difficult as possible. When they do, student performance may reflect either a misunderstanding of the question or skills and knowledge beyond those the assessment intends to measure. This introduces construct-irrelevant variance, which can lead to unfair conclusions about learner achievement and reduce the validity of assessment results (Messick, 1989; Kane, 2013).
Ideally, item writers should have a clear understanding of the target proficiency levels they are attempting to assess, such as low-, medium-, and high-performing learners. This familiarity would help them design items that appropriately reflect and differentiate performance across these groups, making sure that assessment results are driven by differences in proficiency rather than unnecessary confusion.
In both score interpretation and question clarity, public confusion is understandable, and assessment systems should be designed to address such risks. This naturally leads to an important question: how can we create a good assessment item that clearly distinguishes between the knower and the non-knower at each level of proficiency?
One of the most common misconceptions among inexperienced item writers is that highly discriminating questions must be confusing in order to be difficult. This thinking overlooks an important consideration in assessment design: the characteristics of lower- and higher-performing learners that should be reflected in the assessment.
In reality, item discrimination refers to an item's ability to distinguish between learners who are well prepared and those who are less prepared (Hambleton, Swaminathan, & Rogers, 1991). The process of defining these learner groups begins during standard setting, where performance levels are described in terms of what learners typically know and can do. Standard setting therefore serves as an important bridge between the scores reported at the end of an assessment and the interpretation of those scores.
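One common classical way to put a number on this idea is the upper-lower discrimination index: the proportion of high scorers who answer an item correctly minus the proportion of low scorers who do. The sketch below is a minimal illustration, not a method prescribed by the sources cited here. The 27% group size is a widely used convention, and the function name and response data are invented for the example.

```python
def discrimination_index(total_scores, item_responses, fraction=0.27):
    """Upper-lower index D = p_upper - p_lower for one dichotomous item.

    total_scores:   each learner's total test score
    item_responses: the same learners' 1/0 responses to the item
    fraction:       share of learners in each extreme group (assumed 27%)
    """
    if len(total_scores) != len(item_responses):
        raise ValueError("inputs must describe the same learners")
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Rank learners by total score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_responses[i] for i in upper) / k
    p_lower = sum(item_responses[i] for i in lower) / k
    return p_upper - p_lower


# Invented data for ten learners.
totals = [2, 3, 4, 5, 5, 6, 7, 8, 9, 10]
clear_item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]      # strong learners succeed
confusing_item = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # success unrelated to ability

print(discrimination_index(totals, clear_item))
print(discrimination_index(totals, confusing_item))
```

A value near 1 means the item separates the two groups well. A value near zero or below suggests that something other than proficiency, possibly confusion, is driving the responses.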
By understanding the characteristics of learners at each performance level, item writers can develop questions that accurately measure achievement and effectively discriminate between different levels of performance, without introducing unnecessary confusion.
For example, an item writer developing a question for lower-performing learners should have a clear understanding of the expected boundaries of knowledge and skills at that level. The table below contrasts confusing or ambiguous items with clear ones, written for lower- and higher-performing learners (easy and difficult items):
Learner Group | Confusing/Ambiguous Item | Clear and Discriminating Item
Lower-Performing Learners (Easy Item) | Which of the following fractional representations corresponds most accurately to a quantity constituting one of two equivalent divisions of a whole? A. 1/2 B. 1/3 C. 1/4 D. 1/5 | Which fraction is equal to one-half? A. 1/2 B. 1/3 C. 1/4 D. 1/5
Higher-Performing Learners (Difficult Item) | A school is planning a trip for 180 students. Each bus can carry 42 students. How many buses are needed? A. 4 B. About 4.3 C. 5 D. 6 | A school is planning a trip for 180 students. Each bus can carry 42 students. What is the minimum number of buses needed? A. 4 B. 5 C. 6 D. 7
Note: The assessment items presented above are illustrative examples created by the author to demonstrate the distinction between difficulty, discrimination, ambiguity, and confusion in assessment design. The principles discussed are informed by established literature on validity, item writing, and psychometrics (Messick, 1989; AERA, APA & NCME, 2014; Haladyna & Rodriguez, 2013).
Let us look at these examples as item reviewers would, and then consider the institutional quality assurance steps that follow.
In the lower-performing learner example, the item on the left is simple in terms of cognitive demand, but students must first decipher unnecessarily complex wording. Consequently, the item introduces construct-irrelevant difficulty and begins to measure reading ability rather than understanding of fractions. In contrast, the item on the right uses clear and direct language, allowing the assessment to focus on the intended construct, and it will likely discriminate well between learners who have mastered the concept and those who have not.
In the higher-performing learner example, the item on the left is appropriately difficult for the target group, but it contains an ambiguous stem asking, “How many buses are needed?” without explicitly stating that all students must be accommodated. As a result, some students may interpret the task differently. In this case, performance may reflect interpretation of the instruction rather than mathematical reasoning alone. In contrast, the item on the right uses clear wording while maintaining an appropriate level of challenge for higher-performing learners. The challenge comes from the intended mathematical reasoning rather than from interpreting the question. As a result, the item is more likely to discriminate effectively among higher-performing learners.
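The arithmetic behind the bus item can be sketched as follows. The variable names are illustrative, and the mapping of each calculation to an answer option is one plausible account of how students might read the ambiguous stem, not an empirical finding.

```python
import math

students, seats_per_bus = 180, 42

# Exact quotient: the reading behind the option "About 4.3".
exact = students / seats_per_bus

# Completely filled buses: the reading behind the option "4".
full_buses = students // seats_per_bus
left_behind = students - full_buses * seats_per_bus

# Minimum number of buses that seats every student: the intended answer.
min_buses = math.ceil(students / seats_per_bus)
spare_seats = min_buses * seats_per_bus - students

print(f"exact quotient: {exact:.2f}")
print(f"{full_buses} full buses leave {left_behind} students without a seat")
print(f"{min_buses} buses seat everyone with {spare_seats} seats to spare")
```

Each of the three calculations is defensible under some reading of “How many buses are needed?”, which is why the clearer stem asks for the minimum number of buses.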
Institutionally, the review of the item is the first stage of quality assurance. The second stage is field-testing or piloting, which provides evidence of how the item performs with real learners. The final stage is psychometric analysis, which confirms whether the item functions as intended and effectively differentiates between learners. Once these steps have been completed, assessment results can be interpreted and reported with greater confidence (Hambleton et al., 1991).
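As a hedged illustration of what the psychometric stage might compute from pilot data, the sketch below derives two classical statistics per item: difficulty (the proportion of correct responses) and a corrected point-biserial discrimination (the correlation between the item and the total score on the remaining items). The response matrix and function names are invented for the example, and operational programmes typically rely on dedicated psychometric software and much larger samples.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable is constant."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_analysis(matrix):
    """matrix: one row per learner, one column per item, entries 1 or 0."""
    totals = [sum(row) for row in matrix]
    results = []
    for j in range(len(matrix[0])):
        item = [row[j] for row in matrix]
        # Rest score excludes the item itself so it does not inflate the correlation.
        rest = [t - x for t, x in zip(totals, item)]
        results.append({
            "item": j + 1,
            "difficulty": mean(item),
            "discrimination": pearson(item, rest),
        })
    return results


# Invented pilot data: six learners, four items.
pilot = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

for row in item_analysis(pilot):
    print(row)
```

In this invented data set, item 4 has a negative discrimination value: learners who do well on the rest of the test tend to get it wrong. That pattern would send the item back to reviewers to check for ambiguity.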
Overall, discrimination means differentiating learners at each level of proficiency. Doing this well requires the item writer to first take a bird's-eye view of the student population and then shift to a detailed mindset, positioning each item as a stimulus through which the envisioned student can demonstrate knowledge and skills appropriate to their level.
At a fundamental level, high-quality assessment depends on the expertise of the professionals who develop, review, interpret, and use assessment results. Professional development helps assessment practitioners understand key principles such as validity and reliability while developing practical skills in standard setting, item design (including cognitive demand and item discrimination), and test result interpretation (AERA et al., 2014; Wiliam, 2011).
Training in standard setting helps educators and assessment specialists define what learners at different proficiency levels know and can do (Kane, 2013; AERA et al., 2014). It creates the bridge between assessment scores and their meaning, making sure that score bands are linked to clear descriptions of learner performance and not arbitrary numerical thresholds.
Training in item design helps item writers develop questions that accurately target the intended knowledge and skills (Haladyna & Rodriguez, 2013). It helps them to distinguish between productive challenge and unnecessary confusion, making sure that assessment items differentiate learners fairly without introducing construct-irrelevant factors that compromise validity.
Training in test result interpretation helps educators, policymakers, and other stakeholders move beyond simply reporting scores (Wiliam, 2011; Glaser et al., 2001). They learn how to interpret assessment evidence, identify learning strengths and gaps, and make informed decisions that support improvement. Without this understanding, assessment results can easily become a source of misunderstanding and assumptions.
Many of the concerns raised by policymakers, parents, and the public about whether assessments are designed to confuse or trick students can often be traced back to limited awareness of these assessment processes. When assessment systems invest in training and are transparent about how standards are set, how items are developed, and how results are interpreted, public confidence increases and misconceptions about assessment become less common.
In this discussion, we have seen that confusion can be a productive starting point for learning, but it becomes a problem when it distorts measurement and interpretation. The real challenge is not to eliminate confusion altogether, as encountering uncertainty is often part of both learning and assessment. Rather, it is to distinguish between productive confusion that supports learning and effective discrimination that allows assessments to differentiate fairly between levels of learner proficiency.
Technology can help us make this distinction, but only when it is guided by strong assessment literacy, professional judgment, and a clear understanding of validity, fairness, and learner proficiency. Item statistics, process data, response times, distractor analyses, and AI-assisted reviews can reveal patterns that may not be visible through expert judgment alone (Mislevy, 2018; OECD, 2019). Digital platforms can also support piloting and pre-testing, automate psychometric analyses, monitor item performance, and provide evidence for standard-setting decisions. Most importantly, technology can support quality assessment only if item writers, educators, and decision-makers understand the difference between confusion and discrimination. In the end, better assessment depends not only on better tools, but also on stronger assessment literacy across the entire education system.

Vali Huseyn is an educational assessment expert and quality auditor, recognized for promoting excellence and reform-driven scaling in assessment organizations. He mentors edtech & assessment firms on reform-aligned scaling by promoting measurement excellence, drawing on his field expertise, government experience, and regional network.
He holds a master’s degree in educational policy from Boston University (USA) and a Diploma of Educational Assessment from Durham University (UK). Vali has supported national reforms in Azerbaijan and, through his consultancy with AQA Global Assessment Services, works with Kazakhstan and the Kyrgyz Republic to align assessment systems with international benchmarks such as CEFR, PISA, and the UIS technical criteria. He also works as a quality auditor in partnership with RCEC and most recently audited CENEVAL in Mexico. In addition, he promotes awareness of the use of technology across the assessment cycle through his work with Vretta. Fluent in Azerbaijani, Russian, Turkish, and English, he brings a deep contextual understanding to cross-country projects.
If you would like to reflect on public confusion in high-stakes assessments and potential ways to avoid it, or to explore opportunities to discuss and showcase innovative practices in digital assessment, please feel free to contact Vali Huseyn at: vali@bu.edu | LinkedIn