When Algorithms Are Too Accurate

When Algorithms Are Too Accurate
By Jill Cheney, October 16, 2020

An annual rite of passage every Spring for innumerable students is college entrance exams. Regardless of their name, the end result is the same: to influence admission applications. When the Covid-19 pandemic swept the globe in 2020, this milestone changed overnight. Examinations were cancelled, leaving students and universities with no traditional way to evaluate admission. Alternative solutions emerged with varying degrees of veracity.

In England, the solution used to replace their A-level exams involved developing a computer algorithm to predict student performance. In the spirit of a parsimonious model, two parameters were used: the student’s current grades and the historical test record of the attending school. The outcome elicited nationwide ire by highlighting inherent testing realities.

Overall, the predicted exam scores were higher – more students did better than on any previous resident exam with 28% getting top scores in England, Wales and Northern Ireland. However, incorporating the school’s previous test performance into the algorithm created a self-fulfilling reality. Students at historically high performing schools had inflated scores; conversely, students from less performing schools had deflated ones. Immediate cries of AI bias erupted. However, the data wasn’t wrong – the algorithm simply highlighted the inherent biases and disparity in the actual data modeled.

Reference points did exist for the predicted exam scores. One was from teachers since they provide a prediction on student performance. The other was from student scores on previous ‘mock’ exams. Around 40 percent of students received a predicted score that was one step lower than their teachers’ predictions. Not surprisingly, the largest downturn in predictions occurred amongst poorer students. Many others had predicted scores below their ‘mock’ exam scores. Mock exam results support initial university acceptance; however, they must be followed-up with commensurate official exam scores. For many
students, the disparity between their predicted and ‘mock’ exam scores jeopardized their university admission.

Attempting to rectify the disparities came with its own challenges. Opting to use teacher predicted scores required accepting that not all teachers provided meticulous student predictions. Based on teacher predictions alone, 38% of predicted scores would have been at the highest levels: A*s and As. Other alternatives included permitting students to retake the exam in the Fall or allowing the ‘mock’ exam scores to stand-in should they be higher than the predicted ones. No easy answers existed when attempting to navigate an equitable national response.

As designed, the computer model assessed the past performance of a school over student performance. Individual grades could not offset the influence of a school’s testing record. It also clearly discounted more qualitative variables, such as test performance skills. In the face of a computer-generated scoring model, a feeling of powerlessness emerged. No longer did students feel they possessed control over their future and schooling opportunities.

Ultimately, the predictive model simply exposed the underlying societal realities and quantified how wide the gap actually is. In the absence of the pandemic, testing would have continued on the status quo. Affluent schools would have received higher scores on average than fiscally limited schools. Many students from disadvantaged schools would have individually succeeded and gained university admission. The public outcry this predictive algorithm generated underscores how the guise of traditional test conditions assuages our concerns about the realities of standardized testing.