Calibration Training App
1st interview
Overview of Tetlock findings
The headline result of the tournaments was the chimp sound-bite, but EPJ’s central findings were more nuanced. It is hard to condense them into fewer than five propositions, each a mouthful in itself:
- Overall, EPJ found over-confidence: experts thought they knew more about the future than they did. The subjective probabilities they attached to possible futures they deemed to be most likely exceeded, by statistically and substantively significant margins, the objective frequency with which those futures materialized. When experts judged events to be 100 percent slam-dunks, those events occurred, roughly, 80 percent of the time, and events assigned 80 percent probabilities materialized, on average, roughly 65 percent of the time.
- In aggregate, experts edged out the dart-tossing chimp but their margins of victory were narrow. And they failed to beat: (a) sophisticated dilettantes (experts making predictions outside their specialty, whom I labeled “attentive readers of the New York Times”—a label almost as unpopular as the dart-tossing chimp); (b) extrapolation algorithms which mechanically predicted that the future would be a continuation of the present. Experts’ most decisive victory was over Berkeley undergraduates, who pulled off the improbable feat of doing worse than chance.
- But we should not let terms like “overall” and “in aggregate” obscure key variations in performance. The experts surest of their big-picture grasp of the deep drivers of history, the Isaiah Berlin–style “hedgehogs”, performed worse than their more diffident colleagues, or “foxes”, who stuck closer to the data at hand and saw merit in clashing schools of thought. That differential was particularly pronounced for long-range forecasts inside experts’ domains of expertise. The more remote the day of reckoning with reality, the freer the well-informed hedgehogs felt to embellish their theory-driven portraits of the future, and the more embellishments there were, the steeper the price they eventually paid in accuracy. Foxes seemed more attuned to how rapidly uncertainty compounds over time—and more resigned to the eventual appearance of inherently unpredictable events, Black Swans, that will humble even the most formidable forecasters.
- A tentative composite portrait of good judgment emerged in which a blend of curiosity, open-mindedness, and unusual tolerance for dissonance were linked both to forecasting accuracy and to an awareness of the fragility of forecasting achievements. For instance, better forecasters were more aware of how much our analyses of the present depend on educated guesswork about alternative histories, about what would have happened if we had gone down one policy path rather than another (chapter 5). This awareness translated into openness to ideologically discomfiting counterfactuals. So, better forecasters among liberals were more open to the possibility that the policies of a second Carter administration could have prolonged the Cold War, whereas better forecasters among conservatives were more open to the possibility that the Cold War could have ended just as swiftly under Carter as it did under Reagan. Greater open-mindedness also protected foxier forecasters from the more virulent strains of cognitive bias that handicapped hedgehogs in recalling their inaccurate forecasts (hindsight bias) and in updating their beliefs in response to failed predictions (cognitive conservatism).
- Most important, beware of sweeping generalizations. Hedgehogs were not always the worst forecasters. Tempting though it is to mock their belief-system defenses for their often too-bold forecasts—like “off-on-timing” (the outcome I predicted hasn’t happened yet, but it will) or the close-call counterfactual (the outcome I predicted would have happened but for a fluky exogenous shock)—some of these defenses proved quite defensible. And, though less opinionated, foxes were not always the best forecasters. Some were so open to alternative scenarios (in chapter 7) that their probability estimates of exclusive and exhaustive sets of possible futures summed to well over 1.0. Good judgment requires balancing opposing biases. Over-confidence and belief perseverance may be the more common errors in human judgment but we set the stage for over-correction if we focus solely on these errors and ignore the mirror image mistakes, of under-confidence and excessive volatility.
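For the app, the over-confidence finding in the first bullet could be checked with something like the sketch below. It groups forecasts by stated probability and compares each group with how often the events actually happened. The function names and the sample data are my own assumptions; the data is made up to mirror the quoted 100 percent/80 percent and 80 percent/65 percent figures and is not EPJ's dataset.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes):
    """Map each stated probability to the observed frequency of the event."""
    buckets = defaultdict(list)
    for stated, happened in zip(forecasts, outcomes):
        # Round to one decimal so 0.8 and 0.80 land in the same bucket.
        buckets[round(stated, 1)].append(1 if happened else 0)
    return {p: sum(hits) / len(hits) for p, hits in sorted(buckets.items())}


def overconfidence_gaps(table):
    """Stated probability minus observed frequency; positive means over-confident."""
    return {p: round(p - observed, 2) for p, observed in table.items()}


# Made-up data (an assumption for illustration, not taken from EPJ):
# ten "100 percent" forecasts of which 8 came true,
# twenty "80 percent" forecasts of which 13 came true.
forecasts = [1.0] * 10 + [0.8] * 20
outcomes = [True] * 8 + [False] * 2 + [True] * 13 + [False] * 7

table = calibration_table(forecasts, outcomes)
gaps = overconfidence_gaps(table)
print(table)
print(gaps)
```

A perfectly calibrated forecaster would show a gap of zero in every bucket. Negative gaps would indicate the mirror-image mistake of under-confidence that the last bullet warns about.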
Extremizing (explained again)
The example I used in the Superforecasting book was the example from the advisors to President Obama when he was making the decision about whether to launch the Navy SEALs at a large house in the Pakistani city of Abbottabad.
The thought experiment runs like this, that if when the President went around the room and he asked his advisors how likely is Osama to be in this compound, this mystery compound, if each advisor had said 0.7, what probability should the President conclude is the correct probability? Most people sort of look at you and say well, it’s kind of obvious, the answer is 0.7, but the answer is only obvious if the advisors are clones of each other. If the advisors all share the same information and are reaching the same conclusion from the same information, the answer is probably very close to 0.7.
Imagine that one of the advisors reaches the 0.7 conclusion because she has access to satellite intelligence. Another reaches that conclusion because he has access to human intelligence. Another one reaches that conclusion because of code breaking, and so forth. So the advisors are reaching the same conclusion, 0.7, but are basing it on quite different data sets processed in different ways. What’s the probability now? Most people have the intuition that the probability should be more extreme than 0.7, and the question then becomes how much more extreme?
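To get a feel for "how much more extreme", here is a rough sketch of the two ends of the thought experiment. The passage gives no formula, so everything here is an assumption: `extremize` uses one commonly seen power transform with a hypothetical tuning exponent `a`, and `combine_independent` is the idealized case where each advisor's evidence is fully independent and everyone starts from a shared prior (0.5 by default, my choice).

```python
def extremize(p, a):
    """Push p away from 0.5. a == 1 leaves p unchanged; a > 1 makes it more extreme."""
    return p**a / (p**a + (1 - p) ** a)


def combine_independent(probs, prior=0.5):
    """Pool forecasts as if each rested on independent evidence and a shared prior.

    Each advisor's odds, relative to the prior odds, are treated as a
    likelihood ratio, and the ratios are multiplied together.
    """
    prior_odds = prior / (1 - prior)
    odds = prior_odds
    for p in probs:
        odds *= (p / (1 - p)) / prior_odds
    return odds / (1 + odds)


advisors = [0.7, 0.7, 0.7]

# Clones: everyone saw the same information, so the plain average is the answer.
clone_answer = sum(advisors) / len(advisors)

# Fully independent sources: the upper end of how far extremizing could go.
independent_answer = combine_independent(advisors)

# Something in between, with a hypothetical exponent of 2.
partial_answer = extremize(clone_answer, a=2)

print(clone_answer, partial_answer, independent_answer)
```

With three advisors at 0.7, the clone case stays at about 0.7, the fully independent case comes out near 0.93, and the exponent of 2 lands in between at roughly 0.84. Real advisors share some information, so the right amount of extremizing presumably sits somewhere between the two ends and would have to be estimated from data.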
More notes on that
At the Centre for Effective Altruism, where I’ve been working the last few years, we often get people to independently come up with probability estimates for different things before we discuss something, and then after we discuss it.
We’ve never done this thing of then combining them and then saying well, if we’re all on one side, then that should make us even more confident than the average of our answers. But perhaps we shouldn’t, anyway, because we’re all clones of one another or something like that or we all have access to too similar information, but that’s maybe something we should consider doing.
Philip Tetlock: Well, well-functioning groups that are very good at overcoming biases like failing to share distinctive information, groups that are effective at that, you want to be careful about extremising. For example, it wasn’t a good idea to extremise the judgements of superforecasting teams.
How to solve – does your research mean that we shouldn’t trust experts?
skeptics are over-claiming
It’s very hard to strike the right balance between justified skepticism of pseudo-expertise, and there’s a lot of pseudo-expertise out there and there’s a lot of over-claiming by legitimate experts, even. So justified skepticism is very appropriate, obviously, but then you have this kind of know-nothingism, which you don’t want to blur over into that. So you have to strike some kind of balance between the two, and that’s what the preface is about in large measure.
Experts good and bad
Bad - Laws of diminishing returns in