In this episode of
Expert Insights for the Research Training Community, Dr. Wilbert
van Panhuis, director of the Coordination Center of the Models of
Infectious Disease Agents Study, or MIDAS, explains how to become an
infectious disease modeler and describes the types of questions that
modelers study. Hear him talk about training opportunities provided by the
The original recording of this episode took place as a webinar on May 11,
2020, with NIGMS host Dr. Dorit Zuk. A Q&A session with webinar
attendees followed Dr. van Panhuis’ talk.
Recorded on May 11, 2020
Welcome to Expert Insights for the Research Training Community—A podcast
from the National Institute of General Medical Sciences. Adapted from
our webinar series, this is where the biomedical research community can
connect with fellow scientists to gain valuable insights.
Dr. Dorit Zuk:
Good afternoon, everyone. I am Dorit Zuk from NIGMS, and I hope everyone
is staying well and safe. First, I’d like to extend our thanks to
everyone who is doing all they can do to help with this pandemic,
particularly all those working on the front lines, whether it be your
local hospital, the research lab next door, or the grocery store down
This is one in a webinar series for students, post-docs, and faculty.
We’ve had a couple last week and more to come. Each hour-long webinar
will include a 10- to 15-minute presentation by the speaker, followed by
a moderated question-and-answer session.
It’s my pleasure today to introduce our speaker. Wilbert van Panhuis has
an MD and a PhD. He is an assistant professor in the departments of
epidemiology and biomedical informatics at the University of Pittsburgh.
He works in the field of computational epidemiology and population
health informatics. And among many, many other things he does, he is the
PI of the coordinating center of the Models of Infectious Disease Agents
Study, fondly known as MIDAS, which is a network of more than 300
scientists interested in infectious disease modeling.
Wilbert has been very busy coordinating research and data access related
to the COVID-19 pandemic and hopefully he’ll tell us a little bit about
that. He will also tell us about what it’s like to be an infectious
disease modeler and how you can become one. So Wilbert, over to you.
Dr. Wilbert van Panhuis:
Thanks, Dorit. Thanks for inviting me to do the webinar, and I’m very
happy to talk here about the work that I do as an infectious disease
epidemiologist and the work that our network is doing. As mentioned by
Dorit, I will talk about work in the context of the infectious disease
modeling research and also the MIDAS network.
So a little bit of context first. As Dorit mentioned, I’m the PI of the
MIDAS Coordination Center. The MIDAS Coordination Center is a
collaboration between multiple institutions. It’s led by the University
of Pittsburgh, but also includes the Fred Hutchinson in Seattle,
University of Texas at Austin, University of Massachusetts at Amherst,
and University of Florida. MIDAS stands for Models of Infectious Disease
Agents Study, and it’s a global network of 400 scientists and students,
and many others as well, for infectious disease modeling.
About 100 members of the network are modeling COVID-19 right now, and so
this has been a very big focal point for our modelers. The website is
mentioned there as well. Another project here at Pitt is Project Tycho,
which is our longstanding data repository for infectious diseases. It
improves access to infectious disease data in an easy-to-use format and
ties in very nicely with the work that we’re doing on modeling, as well
as on data science, and trying to combine the two, and I’ll talk a
little bit about that.
What we’ll talk about today in the first 10 to 15 minutes: what are the
scientific problems that are studied by infectious disease modelers?
What are infectious disease modelers?
And also a little bit more about the career trajectories. What could you
do and what have people done to become infectious disease modelers? And
then more specifically, what are the training opportunities for
infectious disease modeling in the MIDAS network? And a little bit,
then, about what the MIDAS network is doing in terms of modeling
research on COVID-19. And after that, we’ll go back to the questions.
So what is modeling? These are just some very basic concepts that do not
just apply to infectious diseases but to many different modeling
endeavors and research, but just to get us all on the same page, because
different people can have different things in mind when we talk about
So here is a quote that I actually took from Wikipedia that was very
pointed, actually was originally from John von Neumann, a famous
mathematician, that a model is a mathematical construct which, with
additional verbal interpretations, describes observed phenomena in the
And so if you think about a model, and here I’ve put some kind of
additional information here, there are observations about the world. So
here you see A’s and B’s and C’s, and this is input data, if you wish.
And modelers make a model out of that data about the world, about a
phenomenon. And so this becomes a mathematical or computational
representation of a phenomenon.
And then what do you do with a model? Once you have a model
representation of a real-world phenomenon, what do you do with this?
Then you can use that to gain insight about biological mechanisms. How
do things work? How does an infectious disease work? Or how does
something else work? You can also use it to evaluate strategies to
change something, the control of an epidemic, for example. Once you have
a model of an epidemic, you can evaluate how you can control it. And
then there is forecasting that you can also do with your model.
So these are the three application areas most relevant for infectious
disease modeling, but the general concept applies. The difference
between modeling research and other research, like randomized clinical
trials or laboratory experiments, is that here we make a
computational representation of something in the real world, and we use
that representation to gain insights, evaluate control strategies, or do
forecasting. So that’s the general framework in which we work.
What are then some of the problems studied by infectious disease
modelers using this kind of methodology? First, there is a large
variety of modeling methods used. So in the context of what this model
is that we just talked about, there are many different types of models.
So you’ve heard about statistical modeling, mathematical modeling,
compartment models, agent-based models, network models, probabilistic
models, decision-analytic models, and others.
So there are lots of different types of models. There is also a lot of
overlapping terminology. When I say mathematical model here, a
compartment model also uses mathematical equations, so it could be
considered a mathematical model. And we don’t need to get caught up in terminology,
but just to say that there are a wide variety of models used for
infectious disease modeling. The most common ones that you will see in
the literature are mathematical compartment models, agent-based models,
or network models.
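To make the term concrete, here is a minimal sketch of a compartment model, the classic susceptible-infected-recovered (SIR) model, in Python. It is not taken from the talk or from any MIDAS model. The function name simulate_sir, the simple fixed-step update, and all parameter values are illustrative assumptions.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Return a list of daily (S, I, R) tuples for a basic SIR compartment model.

    beta  : assumed transmission rate per day (hypothetical value supplied by caller)
    gamma : assumed recovery rate per day (1 / average infectious period)
    """
    dt = 1.0 / steps_per_day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # People move from S to I through contact, and from I to R by recovering.
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative values only, not estimates for any real disease.
history = simulate_sir(population=10_000, initial_infected=10,
                       beta=0.3, gamma=0.1, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"peak on day {peak_day} with about {history[peak_day][1]:.0f} people infected")
```

Each person is counted in exactly one compartment at a time, so the three numbers always add up to the population. Agent-based and network models track individuals and their contacts instead of these aggregate counts.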
What are the problems that are studied in this field? I think there are
about three main categories that you could list. One is clarifying
biological mechanisms of infectious diseases. These are more
hypothesis-driven questions about how things work. So examples are, does
infection with measles virus reduce the immunity against other
infections? And we’ll get into that a little bit later. What is the
incubation period for COVID-19? These are questions that have been
addressed by models.
A second category is estimating the impact of interventions. So what if
we want to do something against an infectious disease? How can we
evaluate what the potential use would be in a computer model before we
do it in the population? So one question is how many deaths
are prevented by influenza vaccination every year? Or how much testing
is needed in order to stop the COVID-19 epidemic?
We all know testing is important and tracing is important and other
methods. So how much would you need to do in order to contain the virus?
That’s a question that a model can answer. And then in terms of
forecasting, infectious disease forecasting is an emerging field that
is very similar to weather forecasting. So the question
there is very simple: How many cases of COVID-19 can we expect in the
next two weeks? How many cases of influenza can we expect this year?
These are clear forecasting questions.
So two examples of scientific problems here. One, on clarifying
biological mechanisms, is this study showing that long-term,
measles-induced immunomodulation increases overall childhood infectious
disease mortality.
So this study shows, using modeling methods, that here on the right-hand
side, for example, you see the yearly prevalence of immunomodulation by
measles, which is really related to the number of measles cases. And
here on the Y-axis, the non-measles-related mortality. So this is how
much the non-measles mortality is related to measles itself, and this
study, using a variety of methods, shows very nicely that if you have
measles you have more chance of dying from other diseases as well. And
that’s a mechanism question that’s been answered here by the
computational model. The other end is the impact of interventions.
So here I show a study that was already published in 2006 looking at
strategies for mitigating an influenza pandemic. So here are, for
example, on the top right, BC is border control, and RMR is reducing
mobility. You can see if you do these things, if you reduce mobility
and do border control at certain lengths, you can delay the peak in the
epidemic by 42 days, 56 days, etc. And you can also reduce the total
number of cases if you take these measures.
So these models can help to quantify exactly the gain that we could
achieve by certain interventions for stopping a pandemic, and that’s
helpful before you decide to implement them. So these are ways that can
help policymakers with their decisions, and also experiment in a
computer setting before it’s in the population.
I don’t show an example here of forecasting because it’s very
straightforward, very similar to weather forecasting. So that is a
little bit about what kinds of methods modelers use and what kinds of
questions they answer, and we can go more into that in the questions if
you’re interested. How do you become an infectious disease modeler?
I think a lot of people recently, especially with the COVID-19 pandemic,
are interested in infectious disease modeling. How can you become one?
How can you make your career go this way?
So first I would say that infectious disease modeling is
interdisciplinary. Even in our MIDAS network we have people from many
different disciplines. There are people from quantitative sciences and
biological sciences, and even more and more social sciences such as
history or psychology or economics are becoming very relevant for
infectious disease modeling because we want to know more about people’s
behavior and how people will react to include that in the model. We want
to know about economic consequences. We want to include that in the
model, so we work with economists.
So I put this kind of on the side here because in current practice it’s
not highly integrated yet, not as much as these two guys here. So the
quantitative sciences are mathematics, computer science, computational
biology, engineering. And then on the biological sciences, the biology,
zoology, ecology, and medicine. And there may be other sciences as well
here. And that’s how modeling kind of works.
So you can approach your modeling career from different ways. One is to
say, I guess the most common route that I’ve seen at least is to do a
major in a quantitative science, and then learn some biology for the
disease that is interesting to you, and then use your quantitative
background to model that disease. And many people have gone that route.
On the other hand, you can also start with a biological science and then
learn quantitative methods to represent computational models and become
an infectious disease modeler. That is, I think, a bit of a less common
route.
I guess it’s not easy for people that come from a biology background (or
for myself from a medical background) to learn quantitative methods. On
the other hand, it seems possible for people that have a strong
quantitative background to learn some of the biology. But there are lots
of different routes to get into this field, and it’s definitely not
prescriptive. Many people find their own ways.
Here I show a couple of examples of members of the MIDAS network. Some
of these are very prestigious modelers that have been very prominently
in the news for COVID-19. You can also go to the link here. MIDAS
network has people in it, and the people have all their links to their
websites and their profiles, so you can see what people have done to see
what routes they have taken.
Here is Betz Halloran. She’s part of our center as well. She did an
undergrad in physics and philosophy of mathematics and then graduate
training in medicine, Master of Public Health, and a doctorate in
population sciences. And so she did have the quantitative undergrad and
then moved into more biological sciences.
Neil Ferguson at Imperial College London, whom you may have heard of,
did a pure physics undergrad and also did a doctorate in physics, and
then during his post-doctorate work started to apply his skills to
biology and to infectious disease questions.
Elaine Nsoesie here at Boston University did her undergrad in
mathematics, again very quantitative, and then did her graduate work in
statistics, genetics, bioinformatics, and computational biology and is
now modeling many infectious diseases as well. So she has done that
biology part later in her career.
And then here Micaela Martinez actually comes much more from a biology
background. Throughout her whole career she has worked on both mathematics
and biology at the same time. And so there are many different ways to
get into this field, and you can see from each member how they are
moving their way, which is actually very, very interesting.
But the MIDAS network can help as well. You don’t have to figure out
everything on your own. People who are interested in infectious disease
modeling, including students, even undergrad or high school students
that we’ve interacted with, can go to our network for training. MIDAS
network aims to be a very student-friendly network. Students are
considered full members of the network, and I’ll show later how you can
become a MIDAS member if that’s of interest. We have students in our
steering committee represented and students present their work during
our conferences and events.
There are also a fair amount of training opportunities for students
interested in learning about infectious disease modeling, often with the
possibility for travel support to go to those training opportunities.
When COVID-19 has subsided and we can go to physical, in-person events
again, then there is often travel support either from the organizing
institute or from our center.
Here you can see some examples. Short courses in infectious disease
modeling have become increasingly available. I just list a couple of
examples here. There’s the annual summer institutes in Statistics and
Modeling of Infectious Diseases, SISMID, very well known at the Fred
Hutchinson. There is the clinic on Meaningful Modeling of
Epidemiological Data, MMED, in South Africa. There’s also Mathematical
Modeling for the Control of Infectious Diseases at Imperial College
London, and the Introduction to Infectious Disease Modeling and Its
Application at the London School in London as well. But there are many
other ones, both in the U.S. and abroad.
Other opportunities are the MIDAS webinar series. We are organizing a
webinar every last Friday of the month, and on the website you can find
information about that. There are also increasingly online courses. For
example, through Coursera. This is an example, Epidemics: The Dynamics
of Infectious Diseases, a really high-quality course done by Penn State
University. And also we have an annual conference to increase diversity:
Mathematical Modeling and Public Health.
This is geared toward undergraduate students who’d like to learn more
about modeling, and we organized that. And then we have various other
workshops, ad hoc workshops organized by MIDAS and others, and we always
ask for participation not only by faculty and staff but also by
So following the MIDAS network developments here will expose you to
training events that are available through the network.
Then a little bit about COVID-19 of course. Our network is highly active
on this, as is our Coordination Center. Like I said earlier, about
100 members, a quarter of the MIDAS network, are working on modeling research for
COVID-19. Many members work with international, national, state, or
local health agencies. As I said, models can help decision makers decide
about what to do with the epidemic, what interventions to use, and there
is a lot of interaction between modeling groups and health departments.
There are five modeling working groups between MIDAS and U.S. CDC. We
have a mailing list. Anybody can subscribe to the COVID-19 mailing list
to stay updated on COVID-19 modeling work by MIDAS. And also we, as a
coordination center, do a lot of work to centrally disseminate data,
information, and model results to MIDAS members and to the general
public, and there has been a massive engagement, really, of the modeling
community for COVID-19 research enabled by the MIDAS Coordination Center
and by members of the MIDAS network.
A little bit of research here. You can see some examples. You’ve
undoubtedly heard a lot about modeling of COVID-19 in the popular media.
The New York Times has done a lot of work on showing research by MIDAS
members and also others. Here is work by Alessandro Vespignani on
social network analysis and how the mobility patterns affect the spread
of the coronavirus epidemic.
Here’s also very interesting work done by Nick Reich and his colleagues,
shown on the 538 website, where he displays six or so models that forecast
where we’re going to go with the epidemic and shows the actual data as
well. And so week by week, you can see what the data do and what the
models have forecasted, and you can see how good they were, and which
model was the best model in forecasting COVID-19. Very data-driven
comparative work to improve modeling and to also inform people about
what the different models are saying in a way that is easy to
Then here is a little work done by Micaela Martinez about the
seasonality, why dozens of diseases wax and wane with the seasons and
about COVID-19. So a lot of this modeling work is featured highly during
this pandemic and also we’re doing our best to put high-quality
information out there.
The role of the coordination center is in the background to help the
modeling research. And so the big overarching data objective, you could
say, is to make data FAIR, following the FAIR guiding principles. So
that means to make data Findable, Accessible, Interoperable, and
Reusable. A lot of the information coming out during this outbreak is in
a very ad hoc way.
You’ve seen there are numerous dashboards, numerous COVID-19 case
trackers, numerous articles coming out, and we want to make these data
well-annotated and easy to discover and use by researchers using data
science principles. And so the MIDAS Coordination Center brings together
data science principles and practices, such as the FAIR guiding
principles and concepts such as good metadata and good discoverability,
with what comes out of the research domain, and that helps
to accelerate the modeling work and to provide high-quality modeling
So how to get involved. There are lots of ways to start to connect to
the network through the MIDAS website. It gives you all the information
about MIDAS people, MIDAS members, MIDAS projects, etc. You can request
membership online, subscribe to the mailing list, etc. So that’s the
first place to go. There is the webinar series as well every last Friday
of the month. And so I would say go there if you’re interested to see
the presentations by MIDAS researchers and also students for their work.
There are emails that you can use and also follow us on social media. So
there are lots of ways to get connected to research, as well as
training, by the MIDAS network.
And then I’d like to acknowledge that, as with anything, this is not an individual
project. There are lots of people involved behind the scenes. We have a
large team of programmers, coordinators, curators, and students, all
working to make this happen. And also, of course, I’d like to
acknowledge funding by NIGMS and by NIH Big Data to Knowledge and the
Research Data Alliance to help move forward this data and modeling
Thank you very much, and back to Dorit for any questions that people may
Dr. Dorit Zuk:
Thanks, Wilbert, for a really interesting talk. So many things to talk
about, but let’s stick to the training for a minute. The first question
I had for you is: You talked about how to become an ID modeler and also
what kind of questions ID modelers study, but perhaps you can talk a
little bit about what sectors you would find infectious disease modelers
in beyond academia, which of course you work in.
Dr. Wilbert van Panhuis:
Many people may think that infectious disease modeling is something that
only happens in universities, but that’s really not necessarily the
case. Lots of people come from a research background and they are doing
research. But lots of people work in different settings. So modeling is
not only done in universities for research questions. It’s done by
government agencies that are working on infectious disease planning and
infectious disease research as well.
Think about the U.S. Centers for Disease Control, but also some state health
departments, World Health Organization, many places use modeling
already. You can also think about industry. Lots of students go to the
pharmaceutical industry or insurance industry to model infectious
diseases, and also to model the potential impact of vaccinations and
profiles of products.
And then also insurance industries, of course, they need to know the
risk of events, including infectious diseases, and they would also do a
lot of modeling. So there is increasingly a use of modeling also in all
the sectors that people go to.
Dr. Dorit Zuk:
Great, thanks. A question from the chat: you talked about how you become
an infectious disease modeler or other type of modeler from a quantitative
background or from a biological background, but what if you’ve got a
social science background, which is slightly less integrated than those
two seem to be? Do you think the fellowships and the courses you
mentioned are enough, or do you think further study is needed?
Dr. Wilbert van Panhuis:
That’s a really interesting question. As I said, more and more people
from social sciences are also interested in modeling. At the University
of Pittsburgh, we’re actually working with a variety of departments where
people want to use modeling in their own domain.
So within social sciences there is an increasing amount of modeling and
modeling training that people can do. For infectious disease modeling
specifically, I think people from social science backgrounds could as
easily join and do the training programs, post-doctoral training, or
even doctoral training, that is available with people in the MIDAS
network or others.
Although I have to say that for most of the modeling work, you do need
quantitative training and you do need quantitative background, so if
you have a social science background and you would like to do modeling
and you don’t have the mathematics or the programming or computer
science skills, then you could supplement your existing knowledge with
coursework in some of those quantitative sciences and then go into
doctoral work in infectious disease modeling, or go through some
of the summer programs or short courses that are being offered.
I do have students, for example, that are coming in with a more
non-quantitative background for a doctoral program, and what we do is
they take courses and electives in quantitative sciences to help them
gear their career towards infectious disease modeling.
Dr. Dorit Zuk:
Did you need to do that when you came with an MD to this field?
Dr. Wilbert van Panhuis:
Yeah, I had to. I’m not sure that I did it, but maybe I should have. I
came in with my MD, my medical training. Of course, during medicine you
don’t learn computer programming or statistics all that much, so I did
have to do a PhD in epidemiology and then from there on in my
post-graduate work and also as junior faculty, I learned a lot on
modeling by collaborating with modelers. And in the end actually, I even
did a K01, a training grant with NIH, with the Big Data to Knowledge
program, to cross-train, even as a junior faculty, from medical sciences
to computational sciences.
I spent about five years before I did the MIDAS Coordination Center to
learn about data science and informatics before I made this switch. So
yes, it did take a fair amount of investment to get that kind of
Dr. Dorit Zuk:
So lots of ways to do it. On that note, and then I’m going to switch to
something else, those courses you mentioned in the slides. Are more
details available on the MIDAS website or where can one find
Dr. Wilbert van Panhuis:
We have a training page on the MIDAS website, midasnetwork.us/training.
People can go to the website if they are interested. What we’re doing
right now is compiling information about all the short courses, modeling
courses, seminars, webinars, etc., that are being done on infectious
Currently it’s a challenge because lots of people are doing webinars,
and they pop up very rapidly, so it’s hard to keep track of what’s going
on, but we’re doing our best to post the events that are taking place on
our website as quickly as possible. And then as we move forward, we
will post things like modeling textbooks, modeling short courses,
curriculum, people in the MIDAS network that are doing training, and so
our website is a good place to start your explorations.
Dr. Dorit Zuk:
I will say there is a public mailing list for people like me who are not
modelers, but you get notices about the seminars, so that’s good.
Dr. Wilbert van Panhuis:
If you want to just follow MIDAS and what goes on, then you can join the
public mailing list. All the announcements that are for broad
dissemination will go through that. So there are multiple ways you can
join, or be a part, or be notified of what’s going on.
Dr. Dorit Zuk:
Switching gears a little bit, but not completely, we’ve seen lots and
lots of reports about models in the media, and I wonder what you think
about communicating complex mathematical models to non-scientists,
whether it be just the public or any other stakeholder? And as a second
part to that question, just to make it complicated, how do you think
trainees should think about that?
Dr. Wilbert van Panhuis:
There is a lot to unpack in that question. I would say there are a
couple of aspects there. One is the communication part and then one is
how is a trainee to do that. So the communication has changed, I would
say, over the last…especially during this pandemic. So in the past,
when you wanted to publish your work as a scientist, you would
publish your paper in a peer-reviewed journal, and the system of peer
review helps to review the scientific quality and rigor of the work,
theoretically, and then when people publish papers it has been vetted by
the community in some way.
Now what’s happened recently is scientists are putting their research
out more and more early in the process because during the pandemic, of
course, there is a need for early information to make your decisions. So
for example, the first time we heard about the COVID-19 outbreak we
wanted to know about the incubation period. How long does it take for
this virus from infection to symptoms showing?
So if you see someone with symptoms, how long ago were they infected?
That’s a really important piece of information. And so researchers start
to study this question, putting information up as pre-prints. And
pre-print servers are sites where you can post your research and it’s tagged
with your name, and you can show it’s your work, so nobody else can claim
it was their work, but it has not been peer reviewed yet.
And so that’s a complicated process, because it does release early
information, but it does not quality-check this information. So what
we’ve seen is an incredible amount, almost 100 to 200 papers per day,
coming out as pre-prints, and so it’s a deluge of information, and now
people have to sort out the good research from not-so-good research. And
the complicated part is that the methods are very complicated.
So modeling methods are typically not easy to understand if you don’t
have the training, and so it’s hard for journalists or government
decision makers or even people in government to decide, “Oh wait, this
model I can use or this model I should not use.” So as a scientific
community we’re trying, and our coordination center is trying as well,
to find ways to catalog good models, good modeling studies, and models
that are not so good.
One nice example for that is this forecasting study that I showed where
you can actually see eight different models and the data, and you can
see every day what the model is forecasting, and you can see which one
was the right one. And so there are ways to kind of start to get at that
question of what is good-quality modeling and what is modeling work
that is too quick and dirty to be used for some important decisions.
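As a rough illustration of comparing forecasts against what was later observed, the Python sketch below ranks several models by mean absolute error. The model names and case counts are made up for illustration, and mean absolute error is only one possible score, chosen here for simplicity. It is not described in the talk as the method used in the forecasting study, and other measures, including ones that account for forecast uncertainty, can be used instead.

```python
def mean_absolute_error(forecast, observed):
    """Average absolute gap between forecast and observed counts."""
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have the same length")
    if not observed:
        raise ValueError("need at least one observation")
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(observed)


def rank_models(forecasts, observed):
    """Return (model_name, error) pairs sorted from lowest to highest error."""
    scores = {name: mean_absolute_error(values, observed)
              for name, values in forecasts.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up weekly case counts and hypothetical model names, for illustration only.
observed = [120, 150, 190, 240]
forecasts = {
    "model_a": [110, 160, 200, 230],
    "model_b": [150, 150, 150, 150],
    "model_c": [100, 130, 170, 260],
}

for name, error in rank_models(forecasts, observed):
    print(f"{name}: mean absolute error {error:.1f}")
```

Scoring every model against the same observed data is what makes the comparison transparent: a reader does not need to understand each method to see which forecasts tracked the data most closely.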
As a community we need to make that distinction. So that was kind of the
first part of the question. How can trainees work on this and learn
this? I think there are a couple of things. One is to first learn how to
do rigorous research and how to do good research. And so doing research
and doing modeling training with good modeling groups will help with
that, because through mentorship and through training, you learn how to
not only make good models but also how to behave and how to be a
diligent, accountable scientist moving forward.
And the other thing is to think about how you can use data for your
models. There is a lot of modeling that can be done with a lot of
assumptions and shortcuts, but you can also try to make the model based
on actual data sources, which can sometimes be a lot of work to go after
those data sources and standardize them and make them work in your
model, but it can also be an additional validation of your model.
And then lastly, I would say the way the models are described in the
papers also. You can decide at any point in your research whether you’re
going to publish a model or whether you’re going to not yet publish the
model. When is it good enough? And that’s a judgment that you can
learn yourself, but you can also learn it by actively
seeking out this kind of training through professional development
opportunities, or through workshops on reproducibility of research or on
ways that models should be described in literature before you can
So I think more and more there is a part of our community that’s
trying to move forward with the best practices and consensus about how
to approach reproducibility and quality of modeling and communication of
modeling to the public, but there is also more and more responsibility
with the individual researchers to take their work seriously and their
communication seriously. And they can seek out training, and they should
seek out training to develop themselves in those areas.
Dr. Dorit Zuk:
Thanks. As someone who is not a modeler but who is interested in
communication, one other thing that I would add and ask you is, how
important is it to be really clear about the caveats of your work and
what it can and cannot definitively say?
We learn when we do statistics or other types of research that you make
your model and you do your research experiments, whether it’s a
computational experiment or a real experiment, and then in your results
description, your discussion, you describe the inference that you can
make based on your experiment, and we have to be careful about making
the correct inference.
So if I use data from a nursing home and I publish a COVID-19 model,
then those results are about people in nursing homes. You cannot say
that this is true for the general population. So those kinds of things
have to be described carefully.
And I think it means looking carefully at how you describe your models
and also, like you’re saying, putting in what the model cannot do. So in
your discussion section, you delineate the limitations of your model, and
hopefully that will be picked up by the readers and the users.
One other interesting thing that we’re doing as a center in terms of the
data science aspect is that we more and more want to not only have
human-readable PDF narratives of model results, which are scientific
papers, but we also want machine-interpretable information in a
structured format that describes what the model did and what the model
can be used for or not.
And so as we’re putting disclaimers and use limitations in the research
papers, we can also put that in computational representations of the
information so that later on we can have computer algorithms determine
as well and learn what you can do with this model and what you cannot do
with this model. But there is a lot of research that we can do in that
kind of aspect, in science communication, especially now with the
pre-prints and with the high amount of attention from the public, as
well as from the media in infectious disease modeling.
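To illustrate what a machine-interpretable description of a model might look like, here is a minimal sketch in Python. It reuses the nursing home example from earlier in the answer. The field names (`study_population`, `valid_uses`, `limitations`) and the model name are hypothetical and chosen only for illustration; they are not a MIDAS schema.

```python
import json

# Hypothetical structured description of a model. The field names are
# illustrative assumptions, not an actual MIDAS or community schema.
model_card = {
    "name": "example-nursing-home-covid-model",
    "pathogen": "SARS-CoV-2",
    "study_population": "nursing home residents",
    "valid_uses": ["nursing home residents"],
    "limitations": ["results do not generalize to the general population"],
}


def can_apply(card, target_population):
    """Return True only if the target population is listed as a valid use."""
    return target_population in card["valid_uses"]


# The description survives a round trip through JSON, so a program can read
# the same use limitations that a person would read in the paper.
serialized = json.dumps(model_card, indent=2)
restored = json.loads(serialized)

print(can_apply(restored, "nursing home residents"))
print(can_apply(restored, "general population"))
```

The point of the sketch is only that a disclaimer stored as structured data can be checked by an algorithm, which free text in a PDF cannot easily support.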
And I do think that in that sense, the current time period will really
help accelerate some of that work just because of the need of the
Thank you. So switching gears just a little bit, and again complicated
question, but in general, what types of data do you need to construct an
accurate model for forecasting the course of an infectious disease
outbreak or the results of an intervention strategy?
There again, there is a dichotomy between what data you use and what you
want to do with the model. So a lot of modelers, I’ve been in many
conversations with health departments where health departments would say
or health agencies would say, “What can you model for us?”
And then modelers would say, “Well, what data do you have?”
And then they would say, “Well, that depends on what you can model.”
But then the modelers would say, “Well, the model depends on what data
So we have this chicken-and-egg thing going back and forth a lot. And so
the typical answer is that, depending on the question that you want to
answer with the model, you need certain data. And it’s typically a good idea to
approach modeling just like any other research question in saying, “I
have a specific biological phenomenon that I want to model,” whether
it’s vaccination, influenza, or whether it’s mosquito control for
malaria, or whether it’s business closures for COVID-19, these are all
similar questions. Once you’ve defined the question.
So let’s take, for example, what is the effect of transport reductions
and mobility reductions on COVID-19 transmission? If that’s the
question, then you need information about COVID-19, about the virus,
about the biological behavior of the virus, but you do need the mobility
data and the transportation data to look at that effect. And you need
the patterns of transport and mobility. So for that question that’s set
in that framework, you get those data sets and then model that question.
If it’s about vaccination, potentially, you would need information about
immunity against COVID-19 and how well immunity works. And then you need
to unveil that part of the biology of the virus to be able to do that
So there are lots of different layers to what you…And also the level of
detail. If you’re okay with doing a model for the whole of Pennsylvania
or the whole of the U.S., your data can be extremely kind of coarse. But
if you want to model local transmission, then you need to go local and
then you need to go to county-level and ZIP code-level data.
So I think it’s like with any research question. We can pay attention to
what the questions are that we want to model, what components we want to
include, and that will then define what type of model that you want to
make and what type of data you need for that. But I would say that, of
course, because it’s COVID-19 or because it’s certain infectious
diseases, you always need data about the virus or the bacteria, the
pathogen, as well as on the host population, and then the information
about whatever particular thing you want to study about that disease and
And then the question is the quality of the data. You want to be sure
that you understand, at least, the quality of the data. Just like some
people don’t understand modeling methods very well and should be careful
to interpret what’s being written if they don’t understand it, it’s also
important for a modeler to understand where the data come from, so that
you can assess carefully whether the data are good enough to use in your
model, and that it doesn’t have certain biases or other problems with it
that you’re not aware of.
Are people using machine learning in this context to make predictive
models? And how much is that being used?
Machine learning is used not as a model in itself, because it doesn’t
include the mechanistic representation of disease transmission
necessarily, but it is used in context of forecasting and in training
models on data. And so we’ve seen various groups in the MIDAS network
that use machine learning to train models to forecast epidemics.
So use historical data to learn how the data relates to, for example,
how data on population or historical case counts relate to future case
counts. You can even tie that in with climate data or other types of
data, and then machine learning is used to find the patterns in these
data and then project forward what you can expect in the future, and so
in that context machine learning is used.
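The idea of learning from historical case counts and projecting forward can be sketched with the simplest possible stand-in: a least-squares line that predicts each week's count from the previous week's count. This is an assumption made for illustration, not the method used by any MIDAS group, and the weekly counts below are made-up numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(counts, steps):
    """Learn how each count relates to the previous one, then project forward."""
    intercept, slope = fit_line(counts[:-1], counts[1:])
    history = list(counts)
    for _ in range(steps):
        history.append(intercept + slope * history[-1])
    return history[len(counts):]


# Made-up weekly case counts that double every week.
weekly_cases = [10, 20, 40, 80, 160, 320]
predictions = forecast(weekly_cases, steps=3)
print(predictions)
```

Real forecasting models would add more inputs, such as population or climate data, and would report uncertainty. The sketch also shows the limitation mentioned above: nothing in it represents the mechanism of transmission, so it simply continues the pattern it saw.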
And of course in our case, in terms of making data accessible and
standardized, machine learning is also used in the data preparation part
where we want to extract data from literature and from tables and from
existing data sources and do that automatically. And so you can use
machine learning in that context as well.
So yeah, I think both machine learning and AI are increasingly used in
infectious disease modeling as well, especially since we’re starting to
deal with very large data sets and the increasing availability of
computational power. During the pandemic also there has been an enormous
amount of computational resources being made available by the tech
community, at almost no cost, to infectious disease researchers, and as that
continues, more and more computation-intensive methods will be used to
represent the disease as well.
So you actually answered a couple of the questions that I had about how
much computer resources modeling needs and whether they can use large
data sets. But can MIDAS help with that when it comes to computer
resources and access to the big data sources?
You’re right that a lot of the models can use computational power. There
are different kinds of modeling methods, as I described, and different
methods need more or less computational resources. So first, I would say
it’s not necessary to have access to large-scale computing to do
infectious disease modeling. Anybody can do infectious disease modeling
even with simple programs like R or MATLAB or other programs and run
small-scale models or models that make generalizations and then the
computational resources are less.
And we’ve seen excellent research being published with very limited
computational resources. That being said, modeling detailed data, for
example, mobility patterns now at very high spatial resolution based on
cell phone data, tracking COVID-19 transmission like that, and also
implementing behavior so that people do different things based on their
circumstances, that is very computationally intensive. Imagine making a
model for the whole U.S. It’s 300 million people, and the model would
have to track every person and what they do. That’s an incredibly
computationally intensive system.
And so, yes, the MIDAS Coordination Center is set up to make
computational resources available to modelers. So if any student or
faculty is joining the MIDAS network for their modeling research, they
have access to computational resources provided by the coordination
center, and that is funded by NIGMS. And NIH has a program called
STRIDES that is making large-scale computing from the big tech
companies available at a lower cost for research.
As a coordination center, we have set up all the contracting and the
infrastructure to make high-performance computing available to MIDAS
members so that they don’t have to make their own contracts or they
don’t have to necessarily even learn how to do it because there is staff
to train also in the use of high-performance computing. And so as the
field grows and the developments in computing increase, we make sure as
much as we can to also give that and make those resources accessible to
researchers and students in the network.
Cool, thanks. So specific interesting question, interesting to me. Do
you have any insight on how one reconciles data findings that are not
statistically significant but may be biologically significant, or
vice-versa, and how you sort of figure that out?
That’s an interesting question. As you’re doing statistics training, you
learn more and more about what the relevance is of statistical
significance and biological significance.
On one end of the spectrum, you can make anything statistically
significant if you grow your sample size because significance is a
function of the variability in the sample and the size of the sample. So
you grow your sample; you can at some point make that significant.
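The dependence on sample size can be made concrete with a small sketch. It assumes a two-sided z-test with a known standard deviation, which is a simplification chosen for illustration, and the effect size of 0.01 is an arbitrary made-up value.

```python
import math


def two_sided_p_value(mean_difference, std_dev, n):
    """Two-sided p-value of a z-test for a mean difference, std_dev known."""
    standard_error = std_dev / math.sqrt(n)
    z = mean_difference / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# The same tiny difference (0.01 standard deviations) at growing sample sizes.
p_values = {
    n: two_sided_p_value(mean_difference=0.01, std_dev=1.0, n=n)
    for n in (100, 10_000, 1_000_000)
}
for n, p in p_values.items():
    print(n, p)
```

The difference never changes, but the p-value falls as the sample grows, so at a large enough sample the same tiny difference crosses the conventional 0.05 threshold.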
On the other hand, biological significance may be very small. So just
maybe for the other people as well, statistical significance is defined
by how consistently, I’ll say it a little bit in lay terms, how
consistent your results are and how varied the results are. If all the
results, all your measurements, are close together and they point to the
same outcome, that can be considered statistically significant. It’s
very unlikely that this result is a false positive, that this result is
actually not really a result, it’s a coincidence. That’s statistical
If you look at biological significance, it means is this relevant to
understand the biology? And so there are lots of instances where studies
are done, including in infectious diseases, where you see very
interesting results that could have major biological implications. So
whether the basic reproduction number for COVID-19 is two or three or
four has massive implications for what we need to do in terms of
vaccination and what we need to do in other areas. So a shift in R
naught of one or two points makes a very big difference
biologically. But we may find something else that’s statistically
significant but that has almost no biological relevance.
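One way to see why a shift of one or two points in R naught matters for vaccination is the textbook herd immunity threshold, 1 - 1/R0. This relation is not stated in the talk; the sketch assumes a simple model with homogeneous mixing and a perfectly effective vaccine.

```python
def herd_immunity_threshold(r0):
    """Fraction of the population that must be immune so that each case
    infects fewer than one other person on average (simple 1 - 1/R0 rule)."""
    if r0 <= 1:
        return 0.0  # an outbreak with R0 <= 1 does not sustain itself
    return 1.0 - 1.0 / r0


thresholds = {r0: herd_immunity_threshold(r0) for r0 in (2, 3, 4)}
for r0, fraction in thresholds.items():
    print(f"R0 = {r0}: about {fraction:.0%} immune")
```

Under these assumptions, moving from an R naught of two to four raises the required immune fraction from one half to three quarters of the population.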
And so how to reconcile this is that when you build a computational
model and you’re using lots of factors together to build your model, so
you use the incubation period, and you use the basic reproduction
number, and you use the information about the vaccine, and you use the
information about other components of society to build your model, you
may have to use pieces of information that are not statistically
significant, that have not been found to be statistically significant in
an empirical test with lots of sample size.
But then what you can do is build your model to test not the
significance with a numeric approach, but you can take your model and
then see if the biological mechanism is actually working. And so we’ve
seen studies, for example, that say, OK, some of the literature shows
that having influenza can increase the chances of dying from pneumonia.
So influenza can worsen pneumonia. So there can be studies on that, but
you can also say, “I’m now going to run a clinical trial,” but you can
also say, “Listen, I’m going to build a model, which people have done,
and I’m going to represent influenza and pneumonia, and I’m going to
look at different mechanisms through which influenza can worsen
And then you look at all these different model results with your
different hypotheses represented, and you are comparing those results
with the data to see which one is the most likely to be the case. And so
you can use models in that sense to explore mechanisms for which you may
not have had statistical power and you may never get the statistical
power, but you can also look at the mechanism involved there. So those
are the different ways that models can include information that is
statistically significant or not, and biologically relevant or not,
combining these different components to see how it may work in society.
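Comparing model results for different hypotheses against data can be sketched as follows. The two hypothesis names, the predicted series, and the observed series are all made-up values for illustration, and the sum of squared errors is just one simple choice of score; a real study would typically use a likelihood-based comparison.

```python
def sum_squared_error(predicted, observed):
    """Total squared distance between a model's output and the data."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def best_hypothesis(predictions_by_hypothesis, observed):
    """Score every hypothesis against the data and return the closest one."""
    scores = {
        name: sum_squared_error(predicted, observed)
        for name, predicted in predictions_by_hypothesis.items()
    }
    return min(scores, key=scores.get), scores


# Made-up weekly pneumonia deaths and two made-up model outputs.
observed_deaths = [5, 9, 14, 8]
model_outputs = {
    "no_interaction": [5, 5, 5, 5],
    "influenza_worsens_pneumonia": [5, 9, 13, 8],
}

winner, scores = best_hypothesis(model_outputs, observed_deaths)
print(winner, scores)
```

Each entry in `model_outputs` stands for a full model run under one mechanism, and the comparison asks which mechanism reproduces the data best.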
Thanks. I have a question here. How different are the models of
transmission of different viruses, say for example COVID-19 versus flu
versus measles or anything else you can think of?
They can be extremely similar in terms of the mathematical constructs.
It’s absolutely amazing, if you look at the early modeling books, for
example, those by Roy Anderson and Robert May (that was in the ’80s,
there was a book published), on the mathematical equations that govern
infectious disease transmission. If you look at the set of equations,
they can reproduce an epidemic pattern almost exactly. So these basic
equations govern a lot of the process.
So from disease to disease, the models don’t need to be very different,
but what is different is the parameter values that you select. So you
have to put in the incubation period, which may be very different from
one virus to another, and you have to put in the basic reproduction
number, and you have the transmission rate, and the contact patterns.
Some diseases transmit more in children. Others transmit more in older
people. So those virus-specific values have to be included, but the
model itself, the structure of the model, can be very, very similar
between different diseases.
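The point that the structure stays the same while the parameter values change can be sketched with the simplest compartmental model, the SIR model (susceptible, infectious, recovered). This is a deliberately reduced example: it leaves out the incubation period and age-specific contact patterns mentioned above, and the two parameter sets are illustrative values, not estimates for any real disease.

```python
def simulate_sir(r0, infectious_days, population=1_000_000,
                 initial_infected=10, days=365, dt=0.1):
    """Integrate a basic SIR model with small Euler time steps."""
    gamma = 1.0 / infectious_days   # recovery rate per day
    beta = r0 * gamma               # transmission rate per day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    peak_infected = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak_infected = max(peak_infected, i)
    return {"susceptible": s, "infected": i, "recovered": r,
            "peak_infected": peak_infected}


# Same equations, two hypothetical pathogens with different parameter values.
milder = simulate_sir(r0=1.5, infectious_days=4)
more_transmissible = simulate_sir(r0=3.0, infectious_days=7)
print(round(milder["peak_infected"]), round(more_transmissible["peak_infected"]))
```

Only the arguments to `simulate_sir` differ between the two runs, which is the sense in which one model can be adapted from one virus to another.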
In fact, the big model that got a lot of publicity from Imperial College
that was published a couple of weeks ago or months ago about COVID-19
used the same methodology that I showed here in the slides about the
influenza pandemic in 2006. And so the model was almost the same, but it
was adapted to represent a different virus.
Thanks. Back to data, because that seems to be underlying everything
we’re talking about today. So I wanted to talk a bit about the FAIR
principles, and you’re going to have to help me remember: Findable,
Accessible, Interoperable, and Reusable. I know you’ve been, over the
last few weeks, months, however much time we’ve been doing this, you’ve
been looking at many, many types and sets of data that people have been
putting out there. How difficult is it, or what do you need to do to try
and make them as FAIR as possible?
That’s a really excellent question.
As I mentioned in the slides, a lot of information is being produced by
researchers, by journalists, by volunteers, by anybody, to model and to
inform about COVID-19. But a very, very, very small proportion of that
follows the FAIR guiding principles. The FAIR guiding principles were
published in I think 2014 or early ’15, about five years ago, about how
we can make data ready for automatic interpretation. Some people, even
my colleague Barend Mons, who is part of the FAIR originators, said FAIR
stands for Fully AI Ready, FAIR. So there was another term for that.
So to use AI, you need data in a format that computers can understand.
And we are miles and miles behind in global health to make that happen.
So a lot of the work that we’re going to do now, and what we are doing
right now as a center and what needs to be done, is to create metadata.
So metadata is information about a data set in a structured form that a
computer can understand. And we use a standard schema for that, and we
use standard vocabularies for that, that computers have learned how to
understand. And so we annotate data sets with a very detailed set of
metadata, which takes an incredible amount of work to create, but once
that’s created, then everything will be accelerated because we can
search through that data, we can edit, we can reason over that data, we
can make computer algorithms, etc.
So what’s required is to use metadata and give data standard
identifiers, so that each data set has its own unique identifier and we
can find it through that identifier, and there is good metadata and
there is versioning, and the data are stored in a repository that can be
accessed at some point. It doesn’t need to be openly accessible, but it
does need to have some way to access it. And so as a center, we’re
looking at all the COVID-19 data coming out, which is a very large task,
and then we’re picking the high-priority data sets, moving that into the
FAIR process, and then getting, hopefully, an infrastructure out of this
pandemic that will serve effective modeling for many decades to come.
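The elements listed above (structured metadata, a unique identifier, versioning, and a repository location) can be sketched as a small record type. The field names, identifiers, and URLs below are hypothetical and exist only for illustration; they do not represent the schema or vocabularies the center actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical minimal metadata record for one data set."""
    identifier: str   # unique, persistent identifier
    title: str
    version: str
    pathogen: str
    location: str
    access_url: str   # where the data can be requested or downloaded


def find_datasets(catalog, **criteria):
    """Return the records whose fields match every given criterion."""
    return [
        record for record in catalog
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# A made-up catalog with placeholder identifiers and URLs.
catalog = [
    DatasetMetadata("example:ds-001", "County case counts", "1.2",
                    "SARS-CoV-2", "Pennsylvania", "https://example.org/ds-001"),
    DatasetMetadata("example:ds-002", "Weekly influenza reports", "3.0",
                    "influenza", "Pennsylvania", "https://example.org/ds-002"),
]

matches = find_datasets(catalog, pathogen="SARS-CoV-2", location="Pennsylvania")
print([record.identifier for record in matches])
```

Once every data set carries a record like this, searching becomes a query over fields rather than a manual reading of each source.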
Great. Thinking about that data coming from all different sources, you
mentioned there are researchers, but there are volunteers. There are
electronic health records. And if data don’t have that metadata that you
need at the source, is it possible to work with it to add some of it at
least? Or is it a lost cause?
There is definitely a very interesting balance there. At some point, and
we’re doing some separate work here with the Research Data Alliance to
see what is the minimum level of metadata required for a researcher to
decide, can I use this data or not? Whether it’s in a good format or not
in a good format.
But as a researcher, if I give anybody data, the person would look at
the data, ask me some questions, and then decide, I can use this or not.
And so if we see data sets from whatever source and there is no
metadata, what we do as a center is we contact the data creators, the
people who have created the data, and we ask them the questions, or we
look up information.
Can we find the information? Then we put it in the metadata schema. Now
if there is a lot of missing data in the metadata schema, a researcher
may decide that this is not enough for them to use, which would be a
pity. That is why we want creators to think about this in advance for
their data to be useful.
If there is enough metadata, we also represent that. There will be very
complete metadata, and then researchers can use it. So in the end, we
represent what we can. We have some internal standards of what we’re
willing to put out on our website and publish or not.
And then there are community standards that people can decide on their
own whether certain data are good enough for them or not. And a lot
depends on the questions that they’re trying to ask with their models.
Thank you. We have a couple more minutes, and I’m going to switch gears
completely and ask the question that came to me through the chat, and
that is: K through 12 teachers and students are probably very interested
in modeling now as well, since they see it in every newspaper. So how
can we, and MIDAS in particular, the MIDAS Coordinating Center in
particular, help with that?
Very interesting question. And we definitely want to benefit from that
increased interest in modeling and in quantitative sciences. So, as I
said before, lots of people coming into modeling need at least
some quantitative background and quantitative training. Even when they
are medical doctors like myself, you still need to train yourself in
So I think the best thing that people can do at that level, at the early
level, is to gear as much as possible towards quantitative and science,
STEM training for their students. And there
are lots of programs already in place by NIH and others that are
improving and trying to increase the number of students going into STEM
training, and I think that’s something that can happen at that level.
We are also doing a lot. For example, here at Pitt through the DBMI
program, there are summer programs for students from high schools and
even middle schools, who do modeling research with modeling groups
during their summers and then go on towards undergrad and graduate
school in modeling.
We also don’t want to pigeonhole students too early that may have very
broad careers, and at a certain age it’s very difficult to say what
their careers are going to look like, so having broad quantitative
training is always a good idea in that sense, and it leaves all the
doors open to students moving forward from there to start specializing
in certain parts of modeling work, and that would set them up to have
that opportunity. If they would choose not to go further into infectious
disease modeling but do something else, that quantitative training would
still benefit them many other ways. So I think at that level I would say
that would be the best way to go.
Twelve-year-old me would have loved that.
Thanks very much. I see our time is up. Thanks for a really interesting
discussion, Wilbert. And I remind everybody that this will be posted on
the NIGMS website, and check out the MIDAS Coordination Center website
Great. Thanks very much and bye everyone.
Thanks for having me. Bye.