Pat has tremendous potential but is clearly coasting. Horst used to be so steady but something must be up. Steve never really got going – this job is just not for him….
W. Edwards Deming’s ‘red bead experiment’ reveals the pitfalls we fall into when judging performance with a precision that exceeds the validity of the inferences we can actually make from the measures available.
A group of ‘workers’ are all given a specialised paddle – a board with 50 circular depressions in it. Their job is to put it into a container of red and white beads and take it out. Beads settle into each of the depressions. Successful outcomes are white beads, red beads are to be avoided.
There are 4 white to every 1 red bead in the container. On average, with each dip of the paddle, workers pick out 10 red beads. But there is considerable variation offering alluring but spurious trends and differences in workers’ performances. There are apparently fantastic workers (who get promoted) and terrible workers (who get retrained and eventually sacked if performance doesn’t improve). It’s tempting to infer a narrative about each worker. See an example set of results (from a rather less diverse era!) below.
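The spread that pure chance produces here is easy to underestimate. A minimal simulation (a sketch in Python, not part of Deming’s original exercise; the worker names are invented) shows how wide the gap between the apparent ‘best’ and ‘worst’ worker can be when every dip is identical in expectation:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def dip(n_holes=50, p_red=0.2):
    """One dip of the paddle: each of the 50 depressions independently
    holds a red bead with probability 0.2 (4 white to every 1 red)."""
    return sum(random.random() < p_red for _ in range(n_holes))

workers = ["Pat", "Horst", "Steve", "Audrey", "John", "Carol"]
days = 5
for w in workers:
    reds = [dip() for _ in range(days)]  # identical process for everyone
    print(f"{w:>7}: {reds}  (mean {sum(reds)/days:.1f})")
```

Every worker draws from exactly the same process, with an expected 10 red beads per dip, yet individual dips routinely land anywhere from the mid-single digits to the mid-teens – more than enough spread to manufacture ‘stars’ and ‘strugglers’.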
Despite our human tendency to infer narratives emphasising individual agency, the variation in outcomes here is entirely random. The inferences made by the managers in this exercise are spurious. But they do have a special power: what J. L. Austin would call illocutionary force. The speech-act of grading a worker brings into being a social fact. (Dylan Wiliam explores this idea in relation to pupil assessments here.)
One of the messages which Deming drives home with this unsubtle parable is that variations in performance levels are more attributable to systemic than individual factors. A second is that in real-world settings we cannot expect random variation to be evenly distributed. Martin (Professor Emeritus, University of South Florida) summarises these as follows:
‘The misconception that workers can be meaningfully ranked is based on two faulty assumptions. The first assumption is that each worker can control his or her performance. Deming (1986, 315) estimated that 94 percent of the variation in any system is attributable to the system, not to the people working in the system. The second assumption is that any system variation will be equally distributed across workers. Deming (1986, 353) taught that there is no basis for this assumption in real life experiences. The source of the confusion comes from statistical (probability) theory where random numbers are used to obtain samples from a known population. When random numbers are used in an experiment, there is only one source of variation, so the randomness tends to be equally distributed. This is because samples based on random numbers are not influenced by such things as the characteristics of the inputs and tools (e.g., size of the beads and depressions in the paddles) and other real world phenomena. However, in real life experiences, there are many identifiable causes of variation, as well as a great many others that are unknown. The interaction of these forces will produce unbelievably large differences between people (Deming 1986, 110) and there is no logical basis for assuming that these differences will be equally distributed.’
Results have gone up or down. What does this mean? The perils of best-practice carousels
In schools an increase or decrease in outcomes is often assumed to demonstrate a palpable difference in the quality of the teaching process – especially as regards the year-on-year exam performance of teachers’ (or departments’) classes. A drop in results is thought to be the product of inferior teaching, and higher results the product of improvements. This is probably often wrong.
I experienced this personally when working as a GCSE RE teacher. We were a dept of 4, using the same resources and assessments with our classes. Over a 4 year period each of us was at some point apparently the ‘best’ at getting positive residuals, and each of us was at some point also the worst. One year (because we had two classes each) I was both simultaneously.
The same can occur between schools. In a town near mine there’s an English dept best practice carousel. Each year HoDs are told to confer with the dept with the best results (for a similar non-selective school). Twice in the past three years the depts sharing best practice have fallen foul of an inevitable regression to the mean – embarrassingly swapping seats from workshop leader to delegate.
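Regression to the mean is easy to demonstrate. In the sketch below (an illustration with invented numbers, not data from any real school) eight departments have exactly the same underlying quality; the only thing separating them is year-to-year noise. The ‘best’ department in year one is, on average, merely average the following year:

```python
import random

random.seed(1)  # fixed seed for reproducibility

def exam_score(true_quality=60.0, noise_sd=2.0):
    """Observed result = identical underlying quality + year-to-year noise.
    Both figures are hypothetical, chosen only for illustration."""
    return true_quality + random.gauss(0, noise_sd)

trials = 5000
winner_year1, winner_year2 = [], []
for _ in range(trials):
    year1 = [exam_score() for _ in range(8)]  # eight identical depts
    winner_year1.append(max(year1))           # the 'best practice' dept
    winner_year2.append(exam_score())         # that same dept, next year

print(f"winning dept, year 1: {sum(winner_year1)/trials:.1f}")
print(f"same dept, year 2:    {sum(winner_year2)/trials:.1f}")
```

Selecting the top scorer guarantees an above-average year-one figure (here a couple of points above the true mean of 60), but since nothing real distinguishes the departments, the workshop leader’s next result simply falls back to the mean.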
Of course, learning is not entirely random. There is a link between teaching and the learning elicited in students. It’s just rather more complex and murky than we often assume. Different teachers and different methods will likely have a greater or lesser effect upon learners. But this effect is often eclipsed by other variation.
Common and special cause variation
To consider things in a more nuanced and statistically literate way, we need to understand two causes of variation: common and special causes. Common cause variation is the variation which is always present in a process. Its effect is somewhat predictable. Special cause variations are sporadic and unpredictable – driven by events or processes which are in themselves unusual. Bill McNeese uses commuting times to illustrate the difference. On a normal day we might not be able to predict exact commuting time – but know it’s likely to be between 20 and 30 minutes. Common cause variations in traffic flow, red lights, pedestrians etc interact to produce travel times commonly in this range. Occasionally there will be special cause variations – a flat tyre for example. This could add hours to travel time.
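The commute example can be sketched numerically. The figures below are hypothetical (a 25-minute average commute with modest day-to-day noise), and the three-sigma limits are the conventional Shewhart-style way of drawing the boundary of common cause variation: anything inside the limits is ‘just the process’, anything outside signals a special cause worth investigating.

```python
import random
import statistics

random.seed(7)  # fixed seed for reproducibility

def commute():
    """Common-cause variation: many small influences (lights, traffic,
    pedestrians) sum to a commute unpredictable in detail but stable in range.
    25-minute mean and 1.5-minute spread are invented for illustration."""
    return 25 + random.gauss(0, 1.5)

times = [commute() for _ in range(30)]          # a month of normal commutes
mean = statistics.mean(times)
sd = statistics.stdev(times)
lower, upper = mean - 3 * sd, mean + 3 * sd     # Shewhart-style control limits

flat_tyre_day = 95.0  # a special cause: measured in hours, not minutes
print(f"control limits: {lower:.1f} to {upper:.1f} min")
print("special cause?", not (lower <= flat_tyre_day <= upper))
```

The point of the limits is the one made below: no amount of tweaking within the process shifts them much, because they describe the system itself.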
In education it is very common to mistake the former for the latter, and to assume a degree of individual volitional responsibility for outcomes which belies the complex dynamics behind variations between relatively small sets of exam grades. It’s all too easy to conclude that great results are largely the product of a deliberate and transferable strategy.
What is important here is to recognise that no amount of tweaking will radically change the common cause variation. To shift things significantly we would need to move house, change job, change route or change the means of transport. The conflation of the two types of variation is probably what lies behind Prof Coe’s observation that in order to improve a school, remarketing and changing pupil intake is a plausible (but empty) route to making it look like the ‘improvement’ has worked. http://www.cem.org/attachments/publications/ImprovingEducation2013.pdf
However, we should probably expect variable and somewhat unpredictable outcomes. Variation in learning was famously captured in Nuthall’s seminal work The Hidden Lives of Learners. A summary of which can be found here. My talk on the same theme is available here.
Pointing out the inevitability of variation – of results likely going down as often as they go up – is hard to do without sounding like an apologist for failure. However, that’s the reality for many of us. It is possible to accept the inevitability of variation whilst also committing to focussing additional resources and attention upon those who seem most disadvantaged.
Special causes of variation may need direct intervention and *may* be the direct responsibility of the person closest to the variant. This is an unforgivably cold way of pointing out that from time to time there will be significant issues and incidents which demand our attention and care. As a classroom teacher these issues may pertain to the life experiences of individual students. As a school leader there may be ‘special causes of variation’ in the lives and practice of teachers. In either case, whilst these may impact upon that year’s exam results, they may not. Either way they may represent needs which it is within our role to respond to.
Equally there may be special causes of variation which are not transferable or relevant to others. In one school, for example, the Science dept were promoted as bastions of ‘best practice’ because of a significant upturn in results. Their results’ improvement exceeded the undulating variation we might expect from common causes of variation. However, it was also the case that in that year there was a ‘special’ change – the switch from all students taking triple science to mixed provision where some students dropped down to dual science. As a result overall grades improved.
Regardless of the specific causes behind changes, exam results are a lagging measure – coming at the end of a course rather than triggering action, questions or support in the moment.
The problem is, we’re not very good at reliably identifying good teaching. We often downplay the unknowledge which permeates learning. Acknowledging most or all of the following but perhaps not quite realising their full implications.
- Lesson observations have limited reliability. Often trained observers grade less successfully (against longer-term learning measured in tests) than untrained ones.
- Exam result variation is shaped significantly by cohort and class variation, as well as by the limited reliability of standardised assessments.
- Variation in pupil aptitude/prior knowledge/ability is far from randomly distributed between (and often within) schools.
- There is less than we might assume which passes as ‘research-based’ practice when it comes to specific ways of teaching specific ideas to specific age-groups – beyond ‘time on task’ being associated with more learning.
The variation in performance of those teaching different classes is not as random as in Deming’s set-piece. However, for those of us judging a teacher and their class, in the moment or even at the end of the year – given our unknowledge of the domain, the inner workings of students’ minds, and the long-term impact of any given lesson or sequence of them – we are often essentially none the wiser.
But workers need managing. Teachers need to be performance managed. Schools need to improve their practice (or so the prevailing narrative mandates). And so, performance is often managed via exam targets. And professional development often entails the sharing of best practices which may or may not actually be better than the status quo, and which may or may not be applicable to our particular context.
What about the wider system?
It can be helpful to use Deming’s Red Bead Experiment (or other devices) to prompt a change of thinking. We can get caught up with specific processes with which we have become so familiar that we can’t imagine anything outside of them.
We imbue them with a solidity and special efficacy which obscures their contingent nature. Taking a step back, considering the complex system of which individual actions are a part, and exploring more strategic changes can bring improvements which would never result from a narrow emphasis upon individual performances within a rigid system.
To return to Deming’s parable: we could buy bigger paddles; consider optical sorting technology; increase the frequency of dips; or obtain a different bead mix. We may also identify practices which can be cut without any particular loss. In Deming’s case he included a step in which beads were tipped back and forth between containers. This was taught to new workers as integral to the process but in reality had no impact upon outcomes.
In schools, iconoclastic thinking can be helpful. Encouraging us to rethink the status quo. Or, more specifically to think outside of it. We will likely find habituated practices which we can stop without any particular loss. Focussing on improving the efficiency of dominant practices and processes might help, but it’s likely that we can make bigger gains (or merely cut out inefficiencies) by changing or eliminating some altogether.
In recent years the cutting of the ubiquitous high frequency ‘data drops’ is a good example of the latter change. For a time, improving a school involved demanding that teachers and leaders did data drops (and analysis, and interventions) more often. It took a long time for the penny to drop that we could cut these in half (or more), gaining time and losing nothing.
More recently the cessation of apparently essential PD monitoring and evaluation processes, rendered irrelevant by Covid-related disruption and CAGs, is another. Often, there are alternative (and perhaps better) ways to meet particular ends than the most common ones. The disruption caused by Covid-19 may have led to some positive changes (as well as the pruning of unnecessary practices). Digital parents’ evenings, self-marking quizzes, the judicious use of recorded explanations etc may all continue in future.
This year, in comparison to other years, has been marked by special cause variation! Systems that coped well in more normal times are under strain – or overwhelmed. This is, of course, hugely challenging for students, leaders and teachers who have come to expect a degree of predictability and stability from a school year.
The temptation is to argue for sweeping changes now as we rebuild. Perhaps the provisional socially constructed nature of what went unquestioned as ‘business as usual’ has been made clearer to us in its absence.
In particular, attempts to slip back into a management process which leans heavily on the last three years’ exam results to inform intervention, improvement, support and development will necessarily need rethinking. It is likely the system was less reliable than we assumed. It is clear that such an approach is simply not viable – for three more years at least!
However, in the complex system of schooling in which we work, we do well to be sceptical of urges we might have to leap on obvious solutions and wholesale changes. This issue is explored here by Matthew Evans.
We will benefit from a return to the order and regularity of more normal practice, despite its inherent messiness – hopefully soon. However, I hope that the exceptional moment we are living through may help us engage more critically with business as usual in future.