End of Rotation Exam Scale Score FAQs
Starting in 2018, End of Rotation™ exam scores are reported on a scale from 300 to 500. This standard metric allows for easy comparisons between versions of the exam over time and between forms of the exam delivered to a cohort. Faculty have asked a number of questions that deserve detailed, written answers, and this FAQ series will evolve as faculty feedback comes in.
- Topic 1: What does it mean? The basics of scale scores
- Topic 2: What do students need to know? Interpreting and grading
- Topic 3: How do we make the transition? Checklist of steps
- Topic 4: The technical details
Here is your introduction to scale scores and what they mean for the End of Rotation exams.
Topic 1: What does it mean? The basics of scale scores
What are scale scores?
PAEA End of Rotation exams are reported as scale scores rather than as the number correct. Scale scores are scores that have been mathematically transformed from one set of numbers (i.e., the raw score) to another set of numbers (i.e., the scale score) in order to make them more comparable. Doing so allows for a single performance report per specialty exam. The primary benefit of scale scores is that they allow all scores on all versions and forms of the End of Rotation exams to be compared across years and cohorts, because they all use the same scale metric. While many methods can be used to derive scale scores, PAEA chose a linear transformation of item response theory (IRT) scores.
Using the number-correct score for fair comparisons of students’ performance on different forms of the same test is not ideal: the forms have different questions that may be slightly more or less difficult. To make consistent and fair decisions based on assessment results, the scores reported from different forms need to be comparable, and reporting a scale score rather than a number-correct score allows PA programs to compare them more accurately. When looking at any individual specialty exam (e.g., Family Medicine or General Surgery), scores from forms within the same year and across years are comparable because they are reported on the same metric.
Once the scale scores are in place, can different specialty exam scores be compared?
Scale scoring allows scores from the same specialty exam to be compared between exam versions, forms, and cohorts. However, scores can only be compared within each specialty exam, not between specialty exams.
While you cannot directly compare total scale scores between specialty exams, you can look at subscores from similar content areas for program evaluation. Trends can be analyzed in content and task areas that appear on multiple exams, as in the sketch below. For example, if subscores in endocrinology are consistently low across many of the End of Rotation exams, you may want to reflect back on your curriculum and determine whether endocrinology is covered in sufficient detail during the didactic phase of the program. This provides valuable data to ensure your program is meeting its objectives.
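A quick way to look at such trends is to aggregate cohort-level subscores by content area across exams. The sketch below is a minimal illustration only; the records, exam names, and numbers are hypothetical and do not reflect the actual PAEA report format.

```python
# Hypothetical sketch: aggregating cohort-level subscores by content area
# across End of Rotation exams to spot curriculum-wide weak areas.
# Records, field names, and numbers are invented for illustration.
from collections import defaultdict

# Each record: (exam, content area, cohort mean subscore)
subscores = [
    ("Family Medicine", "Endocrinology", 62.0),
    ("Internal Medicine", "Endocrinology", 58.5),
    ("Emergency Medicine", "Endocrinology", 60.2),
    ("Family Medicine", "Cardiology", 81.3),
    ("Internal Medicine", "Cardiology", 79.8),
]

by_area = defaultdict(list)
for exam, area, cohort_mean in subscores:
    by_area[area].append(cohort_mean)

for area, means in by_area.items():
    print(f"{area}: average cohort mean across exams = {sum(means) / len(means):.1f}")
```

A consistently low average in one content area across several exams, as with endocrinology here, is the kind of signal worth bringing back to a curriculum review.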
Topic 2: What do students need to know? Interpreting and grading
What are some practices for converting scale scores into letter grades?
As has always been the case, it is important to reiterate that PAEA does not set standards for End of Rotation exams, nor do we advise programs on setting their own performance bars or passing score requirements. Additionally, the national mean score of any PAEA End of Rotation exam does not suggest a pass bar. These need to be established at the program level.
In making this change, our goal is to help programs understand how to use scale scores to make informed decisions about student performance. The following are a few general considerations and some methodological guidance to help programs determine how to convert a scale score into a letter grade.
Program faculty must determine how to translate scale scores into grades; depending on the program, this may mean pass/fail or letter grades. PAEA End of Rotation raw scores ranged from 0 to 100, making it appear relatively easy to translate raw scores into percentages and thus letter grades. But even when 0 to 100 raw scores were reported, a raw score of 80 did not translate to “80%” for every program. PA programs have used a number of different techniques, including z-scores, cohort-level means and standard deviations, and program trends over time, to determine performance bars and passing score requirements that fit their program’s grading scale.
End of Rotation scale scores range from 300 to 500, which is significantly different from a raw score. However, the process of translating a scale score to a grade should follow the same principles that were used when translating a raw score of 0–100 to a grade.
KEY CONCEPTS
A unified scale eliminates differences between forms. PAEA transitioned to scale scores to remove small differences in difficulty between exam forms. We do not recommend using the conversion tool to revert the scale score back to a raw score for purposes of assigning a grade. Not only is this a cumbersome extra step, but it also reintroduces differences in difficulty between forms that scale scores eliminate.
Bell-shaped curves. It is important to remember that only the score metric has changed. Scores will still follow the same bell-shaped curve, with means and standard deviations. The histogram of model-based performance data for the soon-to-be-released Emergency Medicine End of Rotation exam, Version 6, is a classic example.
Consistency is important. No matter which method your program chooses to convert the scale score into a grade, we encourage you to maintain consistency with practices you have followed in the past. As a starting point, look to how your program has historically set performance bars, passing scores, and grades.
A single decision point allows programs to make the strongest determination. When considering performance bars and passing scores, programs can make the strongest, most defensible statement about a student’s performance by setting just one pass/fail bar. The more performance bars set, the more difficult it is to differentiate student performance. For example:
- Programs can make the most reliable determinations about student performance if they set one performance bar. The grouping of all the students who passed or failed the exam into two categories, regardless of their exact score, is both fair and defensible. This clearly differentiates students who have appropriate knowledge (they passed) from those who have significant knowledge deficits (they failed).
- Programs can make reasonable determinations about student performance if they set a few performance bands, e.g., scoring quartiles or ranges. Looking at a traditional grading scale, we can say that all students who score between 90% and 100% are high achievers. The grouping of students into defined performance bands can stratify students into performance categories and is also fair and defensible.
- It becomes risky, however, to differentiate students in small range performance bands or point-by-point. Using that traditional grading scale as an example, we cannot make meaningful determinations about the difference between two students who scored a 94% and 95% on an exam.
SAMPLE PRACTICES
We encourage program faculty to meet, discuss, and agree upon the following:
- Performance bars and/or passing scores for the new scale score metric that are best for your program, as well as the rationale for those choices so you can defend the grades, if necessary (e.g., in the setting of an academic action)
- The method your program will use to convert scale scores to recordable grades
- Which student policies need to be updated, as well as how and when those updates will be communicated
The options below are presented to provide considerations and methodological guidance. This guidance will help your program consider overall supervised clinical practice experience (SCPE) and/or course grades in the context of how to convert scale scores to best meet your grading criteria. This list is not all-inclusive, and order does not indicate an endorsement or preference.
Pass/Fail
Pass/fail — compensatory. In this model, programs use the PAEA End of Rotation exams as a pass/fail exam, setting a single pass bar. If a student passes the exam, a specific point value will be assigned that contributes to the overall course or SCPE grade. If a student fails the exam, it is associated with a lower point value that contributes to their overall grade. This is called the compensatory model because it allows students who fail a PAEA End of Rotation exam to “compensate” and still potentially receive a passing course or SCPE grade if they were strong in other graded areas (e.g., preceptor evaluation, OSCE, or other course assignments). Remediation, if required by the program, should be considered for students who do not achieve a passing grade.
Pass/fail — non-compensatory. Also known as pass/fail — conjunctive hurdle. In this model, programs also use the PAEA End of Rotation exams as a pass/fail exam, setting a single pass bar. However, a student cannot fail the exam and still pass the course or SCPE. It’s called a conjunctive hurdle because students are required to pass multiple hurdles to pass the course or SCPE. They must pass the exam and get a passing course grade that is made up of other required course components. In this model, the PAEA End of Rotation exams become a standard for progression.
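To make the contrast between the two models concrete, here is a minimal sketch. The pass bar, point values, and component weights below are hypothetical and purely illustrative; each program sets its own.

```python
# Hypothetical sketch of the two pass/fail models described above.
# The pass bar, point values, and weights are illustrative only;
# each program sets its own.

PASS_BAR = 390  # hypothetical program-set scale-score pass bar

def compensatory_grade(exam_scale_score: float, other_components_pct: float) -> float:
    """Compensatory model: the exam contributes points toward the overall
    grade, so a failed exam can be offset by strength elsewhere
    (e.g., preceptor evaluation, OSCE)."""
    exam_points = 20.0 if exam_scale_score >= PASS_BAR else 10.0  # exam worth 20 points
    return exam_points + 0.8 * other_components_pct  # other work worth 80 points

def conjunctive_grade(exam_scale_score: float, other_components_pct: float) -> str:
    """Non-compensatory (conjunctive hurdle) model: the student must clear
    both hurdles -- pass the exam AND pass the remaining course components."""
    passed_exam = exam_scale_score >= PASS_BAR
    passed_rest = other_components_pct >= 70.0  # hypothetical course pass line
    return "PASS" if passed_exam and passed_rest else "FAIL"

print(compensatory_grade(380, 95.0))  # 86.0 -- failed exam, but may still pass overall
print(conjunctive_grade(380, 95.0))   # FAIL -- a failed exam cannot be compensated
```

The same student, with the same scores, passes under the compensatory model and fails under the conjunctive hurdle, which is exactly the policy distinction programs need to decide on and document.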
Letter Grades
Percentage-based performance bands. In this model, programs convert a scale score range to percentage-based performance bands, choose a single midpoint for each band, and assign that midpoint as a grade. Any student who performed within that range would be assigned the same percentage-based grade. For example, your program may decide that a scale score range of 400–450 converts to 80–90% and that all students falling in this performance band are assigned an 85% as an exam grade. That 85% then becomes a single component of the course or SCPE grade.

With this model, students may be frustrated if their score was on the higher end of the band but they were still assigned an 85%. However, we know that it is difficult to make any substantial determination about the difference in knowledge between a student who got an 85% versus an 89%, making this model defensible. This model is similar to the way final letter grades are assigned at many institutions: an A may equate to any grade of 93–100%, and the A is recorded on the transcript, not the specific percentage. If more precision is required, smaller bands can be developed, and students can further distinguish themselves via other components of the overall evaluation.

It is important to note that programs should not convert the scale score back to a raw number to implement this model. The correct way to implement it is to look at the scale score means and standard deviations and set scoring ranges that can be converted to percentage-based performance bands.
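A band lookup like the one in the example above can be expressed in a few lines. In this sketch, only the 400–450 band assigned an 85% comes from the example in the text; the other band edges and midpoints are invented, and programs should set their own from the published scale-score means and standard deviations.

```python
# Hypothetical sketch: assigning a band-midpoint percentage from a scale score.
# Only the 400-450 -> 85% band comes from the example in the text; the other
# edges and midpoints are invented for illustration.

BANDS = [
    (451, 500, 95.0),  # (low, high, midpoint percentage recorded as the grade)
    (400, 450, 85.0),  # matches the example above: 400-450 converts to 85%
    (350, 399, 75.0),
    (300, 349, 65.0),
]

def band_grade(scale_score: int) -> float:
    """Return the midpoint percentage for the band containing the score."""
    for low, high, midpoint in BANDS:
        if low <= scale_score <= high:
            return midpoint
    raise ValueError("scale score outside the 300-500 reporting range")

print(band_grade(447))  # 85.0 -- every student in the 400-450 band records 85%
```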
Z-score. Using the z-score model is, in essence, applying a second scale to the PAEA scale score; it may be a reasonable option for programs, especially those already using z-scores. We can use the example provided above (PAEA Emergency Medicine End of Rotation exam, Version 6), where the mean scale score is 402.99 and the standard deviation is 22.55, and calculate a z-score for a hypothetical test score of 450:

z = (450 − 402.99) / 22.55 ≈ 2.08

The student’s score is 2.08 standard deviations above the mean. Refer to a statistics textbook or online program for a z-table.
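The calculation is simple enough to automate. A minimal sketch, reusing the Version 6 mean and standard deviation quoted above:

```python
# Reproduces the worked example: z = (score - mean) / sd, using the
# Emergency Medicine Version 6 statistics quoted above.

def z_score(score: float, mean: float, sd: float) -> float:
    """Number of standard deviations a score sits above (+) or below (-) the mean."""
    return (score - mean) / sd

z = z_score(450, mean=402.99, sd=22.55)
print(f"z = {z:.2f}")  # z = 2.08, i.e., 2.08 standard deviations above the mean
```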
Note: All of these models should only be applied to total scores, not subscale scores. Trying to convert subscales for an individual student in a small content area to an alternative metric limits your ability to make meaningful inferences or determinations.
What have programs done to manage the change to scale scores?
Three programs describe their analysis, decision-making, and communication processes to PAEA in the form of short interviews. Read their stories here.
How can we help communicate the changes to the students?
Once these decisions have been made, programs will need to determine which student policies need to be updated, as well as how and when those updates will be communicated to students. Programs that experience this change during a clinical year, particularly toward the beginning of the clinical year, may want to create a policy revision (addendum) that is provided to students in writing. Other programs near the end of the clinical year may simply want to use the raw score conversion tool to create raw scores for the final few rotations and not change their policy for the current class. Either way, changes should be communicated in course or SCPE syllabi and in appropriate handbooks or other places where policies may be outlined. In addition to making formal policy updates, programs should be prepared to introduce the scale score to students in-person — perhaps during a call back and/or during faculty advisor meetings. Ideally, this should be done before releasing grades for the first scale-score exam.
Explaining to students that only the score metric has changed is imperative. A student and program can still use the national comparative data along with keyword feedback to help develop an individually tailored remediation or study plan. As always, students should be cautioned against over-interpreting sub-scores, because while a student may have achieved a perfect score in an individual content area, it’s possible there were only three questions in that section.
The release of scale scores for PAEA End of Rotation exams represents an important enhancement that removes small differences in difficulty between exam forms. This shift may change the way grade calculations are done in programs using PAEA End of Rotation exam scores. Faculty should discuss the various options for assignment of course and SCPE grades and determine what makes the most sense for the program. This decision then should be communicated to students through policy updates and student meetings.
Topic 3: How do we make the transition? Checklist of steps
What steps should I take to prepare for the transition?
We have compiled this checklist of considerations to help you prepare for the upcoming enhancement. While the process may differ from institution to institution, the general components are the same.
Step 1 – Converting scale scores into letter grades
Meet, discuss, and agree upon the following:
▢ Performance bars and/or passing scores for the new scale score metric that are best for your program, as well as the rationale for those choices so you can defend the grades if necessary (e.g., in the setting of an academic action).
▢ The method your program will use to convert scale scores to recordable grades; these FAQ items will help.
If you need to continue using raw scores to finish out a cohort in the middle of their clinical training, a few conversion processes will need to happen:
▢ Determine who will convert student scale scores to raw scores.
▢ Ensure this person/team has access to scores and the Conversion Tool through the Reports tab. This person/team must be assigned as a Faculty user in ExamDriver. Instructions for accessing the tool are here.
▢ Establish a time frame to share grades with students. (You can still release results right away, but they will be reported on the 300–500 scale.)
Step 2 – Updating policies
▢ Determine which student policies need to be updated, as well as how and when those updates will be communicated to students. You will likely need to update the policy in your student handbook, SCPE syllabi, or other places where policies or grading scales may be communicated.
▢ Make sure you remove distinctions between the forms on scale score reporting. We are only reporting one set of comparative statistics since the scale equates the forms (i.e., instead of reporting Family Medicine, Versions 6.1, 6.2, and 6.3, we will now only report Family Medicine, Version 6).
Step 3 – Communicating changes
▢ Ensure all faculty and staff working with clinical-year students know about the change.
▢ Ensure the leadership knows about the change.
▢ Ensure the students know about the change.
▢ Ensure PAEA knows about any issues that arise from implementing the change (email us at exams@PAEAonline.org).
Though this conversion is a significant change, scale scores are the strongest methodological way to evaluate students on exams with multiple forms that change year-over-year. Equating forms onto a single metric allows you to make more robust determinations about student performance and program performance over time. If you get stuck on any part of this process, please reach out to us so we can help you understand potential solutions.
Topic 4: The technical details
What process did you use to determine the scale scores?
Our psychometric team analyzed End of Rotation exam performance data following best statistical practices and presented the results to the PAEA Assessment team. The final process to determine the scale scores was a linear transformation of item response theory (IRT) scores.
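This FAQ does not publish the transformation constants, but the general form of a linear IRT transformation is straightforward: scaled score = A × theta + B, where theta is the IRT ability estimate. The sketch below uses invented constants purely to illustrate the mechanics; they are not PAEA's actual values.

```python
# Hedged sketch of a linear transformation of IRT ability estimates (theta)
# onto a 300-500 reporting scale. The slope and intercept below are invented
# for illustration; they are not PAEA's actual constants.

A = 40.0   # hypothetical slope: scale points per unit of theta
B = 400.0  # hypothetical intercept centering the scale

def to_scale_score(theta: float) -> int:
    """Map an IRT ability estimate to the reporting scale, clamped to
    300-500 and rounded to the one-point increments used for reporting."""
    raw = A * theta + B
    return round(min(500.0, max(300.0, raw)))

print(to_scale_score(0.0))  # 400 -- a student at the reference-group mean
print(to_scale_score(1.5))  # 460 -- 1.5 theta units above the mean
```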
What did you use as the mean to determine the scale scores, and why?
The statistical analysis of the means and equating of scale scores was completed by our psychometricians. PAEA used two primary criteria to select the best option: 1) the mean used should best represent a normally distributed base sample, and 2) the sample should be most representative of the current student population. The mean selected to determine the scale score was the mean score across the most recent year of complete data for each End of Rotation exam. These data provide an even distribution of scores and offer the most recent representative picture of our current student population.
This standard will be reviewed every three to five years per best psychometric practices.
In what increments will scores be reported?
To allow for the greatest precision of scores, one-point increments are used for the scale scores.
Does the equating normalize raw scores into scale scores, or can we still see non-normal distributions in scale scores?
Scale scoring creates a fixed center point and, due to the large number of national student test-takers, the national distribution is normalized. However, scale scoring will not remove score differences within individual programs, if they exist: depending on the variation in your cohort, your program may still see uneven distributions of high and low scores.
How are the national comparative data available on the first day of administration calculated?
Since publication of the End of Rotation exams in 2014, PAEA has provided members with projected national comparative data (means and standard deviations) on the first day of administration. These comparative data were based on the projected performance of a reference population of PA students and the historical performance of individual items included on the new forms. When the exam was retired in mid-July, we then were able to report the final statistics for that version of the exam. The projected mean and standard deviations were extremely close to the final numbers that were reported.
For scale scores starting in 2018, the method we use to make national comparative data available on the first day of administration differs from the historical method described above. Scale scores are developed based on the actual performance of a reference population of students taking each specialty exam. Using empirical rather than projected data in the statistical analysis yields slightly different results but ensures that data are stable across years. To best represent our student test-takers, we felt it was important to transform raw scores into scale scores using empirical data.