I worked for 4 years in modeling and simulation analysis, and I’d like to weigh in on the models currently being used to set policy, says Lawton Clites, a Fredericksburg, VA businessman and SRC reader.
1. All models are wrong. Every one. Look at how accurate a weather forecast is for next week. Weather is something meteorologists have been modeling every day for decades, and on which they have immediate and regular feedback. Yet our ability to predict rain more than a day or two out is iffy at best. Now imagine you’re modeling the spread of a new and poorly understood virus among 330,000,000 people (or even just the 8,000,000 people in VA). A virus we’ve never seen before, and one that's new in more ways than one. And you’re not trying to guess if it’ll rain or not tomorrow – you’re trying to estimate how many people it will put in the hospital 4 months from now.
“All models are wrong. Some are useful.” – Dr. George Box
2. Models are only as good as the data we put in them. It is quite likely that over 90% of coronavirus cases in Virginia (where I live) – both current cases and cases up to this point – are unreported. But is it 90% or 99% unreported? We don’t know. Data from other countries isn’t much better. For some, it’s much, much worse. We don’t even know when the virus got here.
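And that 90% vs. 99% question is not a rounding error. A back-of-the-envelope sketch (the reported count below is made up for illustration, not real Virginia data) shows how much the implied true caseload swings on that one assumption:

```python
# Illustrative only: how the assumed reporting rate changes the implied
# true case count. The reported count is hypothetical, not real data.

def implied_true_cases(reported: int, fraction_unreported: float) -> int:
    """Scale reported cases up by the assumed unreported fraction."""
    reporting_rate = 1.0 - fraction_unreported
    return round(reported / reporting_rate)

reported = 5_000  # hypothetical reported case count
for unreported in (0.90, 0.99):
    print(f"{unreported:.0%} unreported -> "
          f"~{implied_true_cases(reported, unreported):,} true cases")
# 90% unreported -> ~50,000 true cases
# 99% unreported -> ~500,000 true cases
```

Same reported numbers, a tenfold difference in the starting conditions you feed the model.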
3. Think of all the variables that might impact spread. We think the average person with this passes it to two others at some point during a two to three week infection. But we’re not sure. “Experts” used to say masks didn’t help. Now they say they do. Some say it can travel 6 ft. Others, 13 ft. We have an idea how long it can remain detectable on various surfaces, but not how long it remains infectious on those surfaces. Or what the likelihood is of infection spreading through surface contact. What is the minimum viral load for infectiousness? Do infections caused by different viral loads progress differently in patients? It appears to be affected by temperature and humidity, but we don’t know by how much. There is far more about this virus and how it spreads that we *don’t* know than there is that we do.
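Even the first uncertainty above – how many people each case infects – swamps the rest. A deterministic toy calculation (compounding only, not a real epidemiological model; all numbers invented) shows how small changes in that one input explode over a few generations of spread:

```python
# Toy compounding exercise, not an epidemiological model: each infected
# person passes the virus to r0 others per generation of spread.

def total_infections(r0: float, generations: int) -> int:
    """Cumulative infections after a number of generations, starting from 1."""
    current, total = 1.0, 1.0
    for _ in range(generations):
        current *= r0
        total += current
    return round(total)

for r0 in (1.5, 2.0, 2.5):
    print(f"r0={r0}: ~{total_infections(r0, 10):,} infections after 10 generations")
# r0=1.5: ~171 infections after 10 generations
# r0=2.0: ~2,047 infections after 10 generations
# r0=2.5: ~15,894 infections after 10 generations
```

Nudge one uncertain parameter by half a person and the outcome changes by two orders of magnitude. Now stack a dozen such parameters on top of each other.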
4. Models are basically experiments conducted in a computer. And like experiments, they are of little value unless they are repeatable. Not just by the people who designed them, but by other scientists.
5. A single run of a model is like a single participant in a study. Unless you’re running your model 10,000 times over a variety of values for each variable and random seeds, it’s not worth the time it took to make it. I once ran a model I designed 1,000,000 times in an afternoon. 1,000,000 runs is a dataset. One run is not. 10 isn’t, either.
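Here is a minimal sketch of what running a model many times actually looks like, using a toy stochastic outbreak model. Every parameter (population, contacts, transmission probability) is invented for illustration; the point is the *spread* of outcomes across seeds, not the specific numbers:

```python
import random
import statistics

# Toy generation-based outbreak model with invented parameters.
# Run it hundreds of times with different seeds and look at the
# distribution of outcomes -- never at a single run.

def one_run(seed: int, population: int = 2_000, initial: int = 5,
            contacts: int = 10, p_transmit: float = 0.2) -> int:
    """One stochastic run; returns the peak number of simultaneous infections."""
    rng = random.Random(seed)
    susceptible, infected = population - initial, initial
    peak = infected
    while infected > 0 and susceptible > 0:
        new = 0
        for _ in range(infected * contacts):          # each contact is a coin flip
            if rng.random() < p_transmit * susceptible / population:
                new += 1
        new = min(new, susceptible)                   # can't infect more than remain
        susceptible -= new
        infected = new                                # toy: recover after one generation
        peak = max(peak, infected)
    return peak

peaks = [one_run(seed) for seed in range(500)]
print(f"min/median/max peak over 500 runs: "
      f"{min(peaks)} / {statistics.median(peaks)} / {max(peaks)}")
```

Identical parameters, different seeds, wildly different peaks – some runs fizzle out, some explode. Quoting any one of those runs as "the" projection would be meaningless, which is exactly the problem with reporting a single trajectory from a single model.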
So we have ONE model from ONE university, a model that has not been publicly released for scrutiny, that has been populated with incomplete and suspect data, and that has been run an undisclosed number of times, but which says we’ll hit our peak in mid-August. That is basically worthless to any actual data scientist. It’s interesting, but that’s it. It’s not predictive.