Standard Setting for Assessments: The Art of Defining Proficiency Boundaries

May 31, 2024



The importance of standard setting has grown in recent years due to the central role that assessment plays in the educational accountability landscape. National assessment data have begun to play a significant role in educational policy-making, particularly in understanding the performance of a country's education system through the lens of student performance levels, such as those reflected in school graduation certificates. Assessment programs such as the Nation's Report Card (NAEP) in the USA and PISA internationally require defining proficiency levels. By implementing the standard setting process, we aim to maintain consistent difficulty in annual assessments over time despite changes in test forms and content, establish clear benchmarks for student achievement, and effectively communicate assessment results to the public.

The application of standard setting procedures varies across jurisdictions due to several factors: the level of transparency in educational accountability, political sensitivities such as the need to classify a "suitable" proportion of students by proficiency, the undermining of procedures by pre-established legal cut scores, and the technical difficulty of explaining standard setting methodologies. Consequently, awareness and practical familiarity with this methodology may be limited both within the assessment community and beyond. Despite these challenges, a clear understanding of the standard setting procedure enables anyone interested in education to maximize the benefits associated with this unique method in the assessment industry.

This article aims to simplify the understanding of the standard-setting process and help our readers in the education community see its role within the broader context of the assessment cycle and educational system.

101: Basics of Standard Setting

Standard setting methodology helps turn quantitative assessment data into qualitative judgments ahead of educational decision-making by reliably classifying students into categories. Such a classification brings clarity to interpreting student abilities across different proficiency levels. Setting standards thus establishes clear boundaries between student ability levels, bringing consistency and alignment with national educational policies and large-scale assessments. Understanding and valuing this process can therefore improve your dialogue and decision-making within your specific context and responsibilities.

According to the Standards for Educational and Psychological Testing by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (2014), standard setting is defined as "the process of establishing cut scores for academic achievement standards." Standard setting is integral to both learning and assessment, so we will explore these perspectives as types of standards to promote systematic thinking about the culture of standard setting. In fact, the content standards of the curriculum, referred to as achievement descriptors for each level in our context, are benchmarked against performance standards during the standard setting process. Since changes in the curriculum or assessments alter teaching practices, these changes must be reflected in the assessments, which necessitates new standard setting studies; validation studies are conducted at other times to support the claims made from the assessment data and their interpretation.

In language assessments, the Common European Framework of Reference for Languages (CEFR) serves as both a content and performance standard framework, outlining specific language learning goals and measurable proficiency outcomes at each level. Similarly, jurisdictions may set foreign language proficiency levels (such as B1) for school graduation, while exams like IELTS, TOEFL, or Duolingo’s English Test conduct standard setting studies to align their scoring with CEFR proficiency levels. Such alignment efforts facilitate the comparison of results across various exams through concordance studies.

Establishing Boundaries: Teacher and Policy-Maker Perspectives

I will explain standard setting from two perspectives familiar to practitioners: teachers and policymakers. From a teacher's perspective, standard setting can be illustrated with a daily classroom task. Teachers may view their classroom as a map, with each student positioned at a different point according to their learning needs. Through interactions, teachers assess the abilities and knowledge levels of their students, which helps them plan subsequent instruction or determine appropriate support services. Simply put, teachers categorize learning paths and establish standards for student performance within these categories, which in turn shapes the support system they implement for further learning.

Similarly, policymakers responsible for decisions affecting learners' outcomes first need to understand what constitutes adequate student performance at each level and where each student group stands on the educational map, aiming to categorize learners based on their abilities. This approach helps in the allocation of resources by identifying suitable support systems for various student groups in specific subjects, thereby improving each learner's progress journey. Analogously, we often describe adaptive assessments as well-suited to covering a broad range of learners' abilities, assessing each according to their proficiency level; in a like manner, standard setting facilitates adaptive decision-making.

Standard Setting Studies: Methods and Approaches

In the context of educational assessment, the most commonly used standard-setting methods are the Angoff method, initially introduced by Angoff in 1971, and the Bookmark method, developed and advanced by CTB/McGraw-Hill research scientists in 1996. Each of these methods is briefly described below:

Angoff

This method involves two main stages: item review and expert judgment:

Item Review: This step focuses on understanding each item as a standalone challenge for the test-taker. Each test item is presented to the panel without being ordered by difficulty. Additionally, the items are reviewed independently so that the experts' judgments are not influenced by the perceived difficulty of other items, thus avoiding "group bias".

Panel Judgment: Next, each expert evaluates the probability that a minimally competent test-taker (MCT)* would answer an item correctly. This is done by assigning a probability score, typically ranging from 0 (no chance the MCT would get it right) to 1 (certainty they would answer correctly).
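
To make the arithmetic concrete, here is a minimal Python sketch of the Angoff calculation; the panel ratings and item counts are invented for illustration, not taken from a real study:

```python
# A minimal sketch of the Angoff calculation (invented ratings, not real data).
# Rows are panel members; columns are test items. Each entry is the judged
# probability that a minimally competent test-taker (MCT) answers correctly.
ratings = [
    [0.30, 0.45, 0.55, 0.70],  # expert 1
    [0.35, 0.40, 0.60, 0.65],  # expert 2
    [0.25, 0.50, 0.50, 0.75],  # expert 3
]

n_items = len(ratings[0])

# Average the experts' judgments for each item.
item_means = [
    sum(expert[i] for expert in ratings) / len(ratings) for i in range(n_items)
]

# The MCT's expected raw score is the sum of the item means;
# rounding it yields a candidate cut score.
expected_score = sum(item_means)
print(f"Expected MCT score: {expected_score:.2f} -> cut score {round(expected_score)}")
```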

Bookmark

This method includes two stages: preparing items and holistic teamwork:

Preparing Items: The items of the test must first be calibrated using Item Response Theory (IRT) through pilot testing. The items are then ordered from easiest to hardest based on their IRT difficulty estimates, known as the "b-value". This ordering creates a ranked list of items that forms the basis for the subsequent bookmark placement.

Holistic Teamwork: Next, a panel is invited to review the ranked items in ascending order (from easiest to hardest) and to identify the item at which the experts believe an MCT would begin to answer incorrectly. That is precisely where the minimum proficiency level, the cut score, should be set, and a "bookmark" is inserted.
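
The Bookmark logic can be sketched just as briefly; the b-values below are hypothetical stand-ins for real IRT calibration output:

```python
# A minimal sketch of the Bookmark logic (hypothetical b-values; a real
# study would take these from IRT calibration of pilot-test data).
b_values = {"Q3": -1.2, "Q1": -0.4, "Q5": 0.1, "Q2": 0.8, "Q4": 1.5}

# Build the ordered item booklet, from easiest to hardest.
booklet = sorted(b_values, key=b_values.get)
print("Ordered item booklet:", booklet)

# Suppose the panel places its bookmark on the 4th item: the MCT is
# expected to answer the three easier items correctly, so the cut score
# is the number of items before the bookmark.
bookmark_position = 4  # 1-based position agreed by the panel
cut_score = bookmark_position - 1
print(f"Bookmark on {booklet[bookmark_position - 1]} -> cut score {cut_score}")
```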

* An individual just at the threshold of competence.

[Figure: Illustration of an ordered item booklet. Adapted from Mitzel et al. (2001), p. 256.]

There are additional methods used in standard setting practice that I have not included above, but I have briefly described two more below, as they are relevant for various contexts:

Modified Angoff: This variation of the traditional Angoff method involves adjusting the probability estimates to account for different complexities within the test items, providing a more nuanced approach to evaluating test-taker competence.

Ebel Method: The Ebel method categorizes each test item by relevance and difficulty before recommending cut scores, thus structuring the evaluation process to ensure that all aspects of an item's importance and challenge are considered.
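
As a rough illustration of how the Ebel method turns those category judgments into a cut score, here is a minimal sketch; the relevance-by-difficulty grid, the item counts, and the judged proportions are all hypothetical:

```python
# A minimal sketch of the Ebel method (hypothetical grid and judgments).
# Each (relevance, difficulty) cell holds the number of items in that cell
# and the panel's judged proportion a borderline candidate answers correctly.
grid = {
    ("essential",  "easy"):   (5, 0.90),
    ("essential",  "medium"): (6, 0.70),
    ("important",  "medium"): (8, 0.60),
    ("important",  "hard"):   (4, 0.40),
    ("acceptable", "hard"):   (2, 0.25),
}

# Cut score = sum over cells of (item count x judged proportion correct).
cut_score = sum(count * proportion for count, proportion in grid.values())
total_items = sum(count for count, _ in grid.values())
print(f"Recommended cut score: {cut_score:.1f} out of {total_items}")
```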

Given the various strengths and specific applications of each method, my final recommendation is to carefully assess the unique requirements and context of each educational assessment. For high-stakes assessments requiring objective evidence, the Bookmark method is highly recommended due to its comprehensive and systematic approach.

The Cycle of Standard Setting

The standard setting process can be viewed as a cycle of important steps: panel selection, training, review of questions, data analysis, and final recommendations on cut scores for either two proficiency levels ("pass" and "fail") or four (below basic, basic, proficient, and advanced). At the start of the cycle, the purpose of the assessment is identified and the standard setting method is decided.

Sample Calculation: Secondary Education Certification Assessment

(25 multiple-choice Math questions)

Panel Selection and Training (common to both methods)

Panel Composition:

A panel of 7-10 members is selected based on instructional expertise, content knowledge, and diverse social perspectives, along with a systemic overseer, such as a school board representative, who provides a big-picture view.

Panel composition may vary depending on the exam's purpose, potentially including members from the private sector or academia for university admissions standard setting exercises.

Training:

Face-to-face or virtual training is organized to align panel members on the purpose, methodological process, importance of standard setting, and definitions of performance levels;

Special emphasis is placed on training for those involved for the first time.

Typically, about a day or two is spent on preparation, the standard setting process itself, and making the final decision in a standard setting study with four proficiency levels.

Review of the Items

Angoff - Item Review: The panel independently reviews the 25 math questions, which are provided without any ordering by difficulty.

Bookmark - Item Ordering: The 25 math questions are calibrated using IRT to determine their difficulty levels and are then ordered from easiest to hardest.

Bookmark - Panel Consensus: The panel reviews the ordered items, proceeding from the easiest, and places a bookmark at question 16 for the Proficient level, where they believe an MCT would start to struggle.

Data Analysis

Angoff - Expert Judgment: Each expert panel member estimates the probability of a minimally competent test-taker correctly answering each of the 25 questions. For simplicity, let's assume these average probabilities for a few questions might look like this:

Q 1: 0.30 (30% chance of getting it right)

Q 2: 0.45 (45% chance of getting it right)

Q 3: 0.55 (55% chance of getting it right)

... remaining questions assumed similarly distributed. Averages of these probabilities are calculated across all expert estimations for each question. 

Angoff - Averaging Probabilities: Summing these per-question probabilities gives the expected number of correct answers for the MCT; dividing the sum by the number of questions gives the average per-question probability. Let's assume this average works out to 0.62, meaning that, on average, a minimally competent test-taker is expected to have a 62% chance of correctly answering any given question on this test. Since 0.62 of 25 questions is 15.5, the MCT is expected to correctly answer about 15.5 questions out of the 25-question test, based on the probability estimates provided by the experts (a short numeric check appears at the end of this sample calculation).

Bookmark - In this method, item-level data analysis comes earlier, where it provides the rationale for item ordering.

Final Recommendation: Cut Score

Angoff - Calculating the Cut Score: The value of 15.5 provides a basis for setting the cut score for different proficiency levels within the test. Rounding to the closest whole number, the Proficient cut score could be set at 16.

Bookmark - Setting the Cut Score: The question where the bookmark is placed serves as the cut score. Suppose the bookmark for the "Proficient" level is at question 16: this is the point where an MCT is expected to begin having difficulty answering correctly, so a test-taker must correctly answer the first 15 (easier) questions to meet the passing threshold.
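
As a quick numeric check of this sample calculation, assuming the 0.62 average probability and the bookmark on question 16 described above:

```python
# Numeric check of the 25-question sample calculation above.
N_QUESTIONS = 25

# Angoff: an average judged probability of 0.62 across all questions.
avg_probability = 0.62
expected_score = avg_probability * N_QUESTIONS  # 15.5
print(f"Angoff: expected score {expected_score:.1f} -> cut score {round(expected_score)}")

# Bookmark: a bookmark on question 16 means the first 15 (easier)
# questions must be answered correctly to reach the Proficient level.
bookmark_question = 16
print(f"Bookmark: cut score {bookmark_question - 1} of {N_QUESTIONS}")
```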

The sample calculation above illustrates the practical steps of a standard setting procedure at large. Here the cut score is a single absolute score, which is typically used in high-stakes certification exams, as opposed to a range of scores, which is often employed to provide feedback for ongoing learning purposes. Additionally, some steps, such as conducting iterative rounds to reconfirm the final decision, documenting the process, communicating final decisions to stakeholders, monitoring cut score implementation, and gathering evidence for ongoing validation, have been merged or abbreviated for clarity. It may be a good idea for the panel to take the test themselves, and to provide them with impact data, detailing how many students would be affected and at what level, to fully set the tone of the responsibility required for decisions resulting from the cut score recommendation. Regardless of any additional measures tailored to specific contextual needs, each step should adhere to the standards outlined in the "Standards for Educational and Psychological Testing" mentioned earlier in our introduction.

Overcoming Obstacles in Standard Setting: Practical Recommendations

In every step of the standard setting process, challenges related to human interaction or the working environment may arise. Below is a table that outlines some of these challenges and provides mitigation strategies to ensure the final outcome of the process serves its purpose in the best way possible:

Challenge: Panelist Bias
Best practice: Select diverse expert panels. Ensure the panel is diverse in terms of demographics, expertise, and perspectives to minimize individual biases.

Challenge: Defining Performance Levels
Best practice: Provide comprehensive training. An informative training session for panel members on the criteria and definitions of performance levels ensures consistency.

Challenge: Inconsistency Among Panelists
Best practice: Use structured consensus processes, such as the Delphi technique, to reach agreement among panel members.

Challenge: Technical and Methodological Complexity
Best practice: Employ multiple standard setting methods and cross-validate the results to improve reliability and validity.

Challenge: Resistance from Stakeholders
Best practice: Communicate transparently. Engage stakeholders throughout the process and clearly explain the rationale behind decisions.

Modernizing Assessment: AI in Standard Setting

The increased use of technology in educational assessment, particularly the integration of AI into the assessment cycle, encourages us to also consider the future role of AI in the standard setting process. Although standard setting, like the paper-based conduct of assessments, was predominantly a face-to-face exercise, current realities call for a different approach: evaluating how mature standard setting processes are when run as part of the solution pipeline offered by technology providers. In the journey of modernizing educational assessments, data management platforms are crucial for handling the flow of information and systematically aligning it with stakeholders' needs. In this context, the standard setting process could benefit from additional functionality in technology-driven assessment platforms, enabling the extraction of item-level data from a data management pipeline and facilitating the organization of standard setting activities with experts within the same platform.

As part of the evolutionary journey in data management, data warehouses (places where structured data is stored and easily queried) and data lakes (large pools of raw, unstructured data) have traditionally been used for operational reporting, analytics, and supporting advanced data exploration and innovation through big data and AI technologies. We are now entering a new era with the 'lakehouse', a modern data management architecture on cloud data platforms that combines the fast data retrieval of data warehouses with the large storage capacity of data lakes.

Future of Standard Setting: Collaboration and Technology

In the future, standard setting will increasingly shift to a digital format, fully integrating into the digital assessment cycle, while being supported with the advanced data reporting, analytics, and innovation through big data and AI technologies. To support the ongoing virtual implementation of standard setting, which lacks the personal interaction found in face-to-face settings, the collaborative functionality and advanced data exploration capabilities of assessment platforms could be enhanced. A systematic approach to both content and performance standards, accompanied by ongoing validation efforts between standard setting exercises, will support consistent implementation and promote a holistic culture around assessment.


About the Author

Vali Huseyn is an educational assessment specialist, recognized for his expertise in development projects across various aspects of the assessment cycle. His ability to advise on the improvement of assessment delivery models, the administration of assessments at different levels, innovation in data analytics, and the creation of quick, secure reporting techniques sets him apart in the field. His work, expanded by collaborations with leading assessment technology firms and certification bodies, has greatly advanced his community's assessment practices. At The State Examination Centre of Azerbaijan, Vali contributed significantly to the transformation of local assessments and led key regional projects, such as reviews of CEFR-aligned language assessments, PISA-supported assessment literacy trainings, and an institutional audit project, all aimed at improving the assessment culture across the country and the former USSR region.

Discover guided practices in modernizing assessments and gain insights into the future of educational assessments by connecting with Vali on LinkedIn.

