Angry statisticians dispute research that found high infection rates in Santa Clara County

SAN JOSE, Calif. — Researchers are engaged in a fierce debate over the startling estimates in a Stanford study that suggested as many as 81,000 people could already have been infected with coronavirus in Santa Clara County, with some of the world’s top number crunchers calling the study sloppy, biased and an example of “how NOT to do statistics.”

“I think the authors owe us all an apology … not just to us, but to Stanford,” wrote Andrew Gelman, a professor of statistics and political science and director of the Applied Statistics Center at Columbia University.

Yet after a weekend of attacks on the paper, a study announced Monday out of the University of Southern California on a sampling of residents in Los Angeles reached a very similar conclusion: It found hundreds of thousands of adults there may have already been infected. As of Monday, Los Angeles County had recorded fewer than 13,000 cases.

The Santa Clara County study concluded that the virus had infected 2.5% to 4.2% of residents here; in LA, the estimated infection rate ranged from 2.8% to 5.6%.

The early studies set off a firestorm, not only among academics taking to Twitter to debate sampling methods, false positives and Bayesian inferences with a furor reminiscent of the banning of @BabyYodaBaby, but also among critics who believe the numbers show that COVID-19 is merely a partisan-driven flu hoax.

The showdown over a few percentage points has captured the cultural zeitgeist of a public sheltering at home in fear of both a virus and an economic meltdown.

In response, on Sunday, the Stanford study’s authors said they are planning to soon release a detailed appendix that addresses many of the “constructive comments and suggestions” the team has received.

“This is exactly the way peer-review should work in scientific work, and we are looking forward to engaging with other scholars as we proceed in this important work,” said Dr. Jayanta Bhattacharya, professor of medicine at Stanford University, who, along with colleague Dr. Eran Bendavid, also assisted with the University of Southern California study.

The estimate, posted on the website medRxiv, comes from a first-in-the-nation community study of newly available antibody tests of 3,300 Santa Clara County residents in early April. Like all other emerging COVID-19 research papers, the work had not been peer reviewed prior to its release.

Based on those tests, the authors contend that between 48,000 and 81,000 of the county’s 1.9 million residents had been infected with the virus as of the first week of April. That’s 50 to 85 times more than the official count of cases at the time.

If true, it suggests that the large majority of people who contract COVID-19 recover without ever knowing they were infected. If undetected infections are that widespread, then the death rate in the county could be less than 0.2%, making the virus far less lethal than authorities have assumed. Los Angeles authorities also peg their death rate at 0.2% based on the USC study.

Santa Clara County Executive Dr. Jeff Smith remains steadfast in his interpretation of the study’s findings: It suggests that asymptomatic people spread the virus, and that more than 95% of the population remains susceptible to infection.

“That all means that there is more risk than we initially were aware of,” said Smith, lamenting how some are using the study to challenge Bay Area health officials’ unprecedented stay-home orders.

Similarly, Los Angeles Department of Public Health’s chief science officer Dr. Paul Simon said Monday that the LA study “suggests that many folks out there have infections and aren’t aware of it, or have mild symptoms. … I think it is really important to continue the social distancing at least for the next month.”

The Los Angeles study was smaller than the Santa Clara County study, testing fewer than 1,000 people. This puts it at greater risk of distorted results. Researchers there plan to repeat the study to improve the accuracy of their results and track the virus’ spread.

However, researchers in LA took a more representative sample of residents than the Stanford team, using a market research firm rather than recruiting study subjects through Facebook, and including more minority groups.

Both studies used the same test kit, which is not FDA-approved and has a 90% to 95% accuracy rate.

The Stanford study’s authors said they adjusted for the test kit’s performance and their limited sampling techniques to estimate the prevalence of the virus in Santa Clara County.

But over the weekend, some of the nation’s top number crunchers said the authors’ extrapolation of the results rests on a flimsy foundation.

They contended the Stanford analysis is troubled because it draws sweeping conclusions based on statistically rare events, and is rife with sampling and statistical imperfections.

Gelman of Columbia University called the conclusions “some numbers that were essentially the product of a statistical error.”

“They’re the kind of screw-ups that happen if you want to leap out with an exciting finding,” he wrote, “and you don’t look too carefully at what you might have done wrong.”

From the lab of Erik van Nimwegen of the University of Basel came this: “Loud sobbing reported from under Reverend Bayes’ grave stone,” referring to a famed statistician. “Seriously, I might use this as an example in my class to show how NOT to do statistics.”

“Do NOT interpret this study as an accurate estimate of the fraction of population exposed,” wrote Marm Kilpatrick, an infectious disease researcher at the University of California Santa Cruz. “Authors have made no efforts to deal with clearly known biases and whole study design is problematic.”

Others accused the authors of having agendas before going into the study. Back in March, Bhattacharya and Bendavid wrote an editorial in the Wall Street Journal arguing that a universal quarantine may not be worth the costs. Their colleague John Ioannidis has written that we lack the data to make such drastic economic sacrifices.

One major problem with the Santa Clara County study relates to test specificity. It used a kit purchased from Minneapolis-based Premier Biotech, whose performance data showed two “false positives” out of 371 known negative samples. Although it was the best test available at the time of the study, that’s a high “false positive” rate that can skew results, critics say, especially with such a small sample size.

With that ratio of false positives, a large number of the positive cases reported in the study, 50 out of 3,330 tests, could be false positives, critics note. To ensure a test is specific enough to flag only true SARS-CoV-2 infections, it needs to be evaluated on hundreds of positive cases of COVID-19 among thousands of negative ones.

This potential error in the test can easily dominate the results, they said.
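The critics' arithmetic can be sketched in a few lines of Python. This is an illustrative back-of-the-envelope calculation, not the method used by the Stanford authors or by any of the critics quoted here. It takes only the figures reported above (2 false positives in 371 known negatives, 50 positives in 3,330 tests). The helper names and the choice of an exact (Clopper-Pearson style) 95% upper bound are assumptions made for the example.

```python
from math import comb

# Figures reported in the article.
FALSE_POSITIVES = 2      # false positives seen in validation
KNOWN_NEGATIVES = 371    # known negative samples in validation
STUDY_TESTS = 3330       # people tested in the Santa Clara County study
STUDY_POSITIVES = 50     # positive results in the study


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_upper_bound(k, n, tail=0.025):
    """Largest rate p still consistent with seeing only k events in n trials.

    Solves P(X <= k | p) = tail by bisection (hypothetical helper; a
    two-sided 95% interval corresponds to tail=0.025).
    """
    lo, hi = k / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > tail:
            lo = mid   # mid is still plausible, so the bound lies higher
        else:
            hi = mid
    return (lo + hi) / 2


point_rate = FALSE_POSITIVES / KNOWN_NEGATIVES
upper_rate = exact_upper_bound(FALSE_POSITIVES, KNOWN_NEGATIVES)

expected_fp_point = point_rate * STUDY_TESTS
expected_fp_upper = upper_rate * STUDY_TESTS

print(f"False positive rate, point estimate: {point_rate:.2%}")
print(f"False positive rate, 95% upper bound: {upper_rate:.2%}")
print(f"Expected false positives at point estimate: {expected_fp_point:.0f}")
print(f"Expected false positives at upper bound: {expected_fp_upper:.0f}")
print(f"Positives actually observed: {STUDY_POSITIVES}")
```

At the point estimate, roughly 18 of the 50 positives would be expected to be false. At the upper end of the plausible range for the false positive rate, the expected number of false positives exceeds the 50 observed, which is the sense in which the test error "can easily dominate the results."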

Statistician John Cherian of D. E. Shaw Research, a computational biochemistry company, made his own calculations given the test’s sensitivity and specificity — and conservatively estimated the proportion of truly positive people in the Stanford study to range from 0.2% to 2.4%.

Adjusting for demographics, Cherian’s calculations suggest that county prevalence could plausibly be under 1% and the mortality rate could be over 1%.

The “confidence intervals” in the paper (that is, the range around a measurement that conveys how precise the measurement is) “are nowhere close to what you’d get with a more careful approach,” he noted.

[Figure: histogram of possible true positive rates, assuming a test sensitivity of 72%, according to statistician John Cherian.]
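To see how strongly the estimate depends on the test's error rates, here is a minimal sketch using the standard Rogan-Gladen correction for an imperfect test. It is not Cherian's analysis and not the Stanford team's adjustment. The 72% sensitivity and the 50-in-3,330 raw positive rate come from the article; the alternative specificity values and the function name are assumptions chosen for illustration.

```python
def rogan_gladen(raw_positive_rate, sensitivity, specificity):
    """Correct a raw positive rate for test error (Rogan-Gladen estimator).

    true prevalence = (raw rate + specificity - 1) / (sensitivity + specificity - 1)
    The result is clipped to [0, 1], since noise can push it out of range.
    """
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("test is no better than chance")
    estimate = (raw_positive_rate + specificity - 1) / denom
    return min(1.0, max(0.0, estimate))


raw_rate = 50 / 3330      # positives / tests, as reported in the article
sensitivity = 0.72        # value cited for Cherian's histogram

# 369/371 is the validation point estimate; the others are hypothetical
# lower values used only to show how quickly the estimate shrinks.
for specificity in (369 / 371, 0.990, 0.985, 0.980):
    prevalence = rogan_gladen(raw_rate, sensitivity, specificity)
    print(f"specificity {specificity:.4f} -> corrected prevalence {prevalence:.2%}")
```

With these inputs, a drop in specificity of about one percentage point is enough to move the corrected prevalence from above 1% to essentially zero, which is why the critics focus on how few known negative samples the test was validated against.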

Even if the test were completely accurate, there would still be sampling problems in the Stanford study, critics said.

Biostatistician Natalie E. Dean of the University of Florida called it a “consent problem.” The Facebook ad might have attracted people who thought they were exposed to the virus and wanted testing.

“The prevalence drops off quickly when adjusted for even a small self-selection bias,” wrote Lonnie Chrisman, chief technical officer at the Los Gatos data software company Lumina Decision Systems.
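Chrisman's point can be illustrated with a simple toy model. This is a sketch under stated assumptions, not his actual calculation: suppose infected people were `k` times as likely as uninfected people to answer the Facebook ad. The function name and the values of `k` are hypothetical; the 50-in-3,330 raw rate is from the article, and test error is ignored here to isolate the selection effect.

```python
def adjust_for_self_selection(sample_rate, k):
    """Population prevalence implied by a sample in which infected people
    were k times as likely to volunteer as uninfected people.

    If p is the true prevalence, the sample rate is q = k*p / (k*p + 1 - p).
    Solving for p gives p = q / (k * (1 - q) + q).
    """
    q = sample_rate
    return q / (k * (1 - q) + q)


raw_rate = 50 / 3330  # raw positive rate in the sample

# k = 1 means no self-selection; larger k values are hypothetical.
for k in (1.0, 1.5, 2.0, 3.0):
    implied = adjust_for_self_selection(raw_rate, k)
    print(f"infected {k:.1f}x as likely to volunteer -> implied prevalence {implied:.2%}")
```

Under this model, if infected residents were only twice as likely to sign up, the implied prevalence would be roughly half the raw rate.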

Addressing the critics, Ioannidis, a professor of medicine and biomedical data science at Stanford University, promised that an expanded version of the study will be posted soon. “The results remain very robust,” he said.

In the end, no single study is going to answer the question of how prevalent COVID-19 is in our communities, scientists said. More studies with different technologies and analytic approaches are needed.

That’s coming. A UC Berkeley project, which will begin in May, will test a large and representative swath of 5,000 East Bay residents. Scientists will take saliva, swab and blood samples from volunteers between the ages of 18 and 60 around the region.

UC San Francisco and a privately funded operation will test all 1,680 residents of rural Bolinas for evidence of the virus. UCSF will launch a similar effort Saturday in San Francisco’s densely populated and largely Latino Mission District, where it hopes to test 5,700 people.

Results are expected soon from seroprevalence surveys run by other groups around the world, including teams in China, Australia, Iceland, Italy and Germany.

“This pandemic,” wrote research scientist Ganesh Kadamur, “has been one giant Stats class for everyone.”

———

Visit The Mercury News (San Jose, Calif.) at www.mercurynews.com

Distributed by Tribune Content Agency, LLC.