In this episode of Expert Insights for the Research Training Community, Dr. Wilbert van Panhuis, director of the Coordination Center of the Models of Infectious Disease Agents Study, or MIDAS, explains how to become an infectious disease modeler and describes the types of questions that modelers study. Hear him talk about training opportunities provided by the MIDAS Network.
The original recording of this episode took place as a webinar on May 11, 2020, with NIGMS host Dr. Dorit Zuk. A Q&A session with webinar attendees followed Dr. van Panhuis’ talk.
Recorded on May 11, 2020
Welcome to Expert Insights for the Research Training Community—A podcast from the National Institute of General Medical Sciences. Adapted from our webinar series, this is where the biomedical research community can connect with fellow scientists to gain valuable insights.
Dr. Dorit Zuk:
Good afternoon, everyone. I am Dorit Zuk from NIGMS, and I hope everyone is staying well and safe. First, I’d like to extend our thanks to everyone who is doing all they can do to help with this pandemic, particularly all those working on the front lines, whether it be your local hospital, the research lab next door, or the grocery store down the street.
This is one in a webinar series for students, post-docs, and faculty. We’ve had a couple last week and more to come. Each hour-long webinar will include a 10- to 15-minute presentation by the speaker, followed by a moderated question-and-answer session.
It’s my pleasure today to introduce our speaker. Wilbert van Panhuis has an MD and a PhD. He is an assistant professor in the departments of epidemiology and biomedical informatics at the University of Pittsburgh. He works in the field of computational epidemiology and population health informatics. And among many, many other things he does, he is the PI of the coordinating center of the Models of Infectious Disease Agents Study, fondly known as MIDAS, which is a network of more than 300 scientists interested in infectious disease modeling.
Wilbert has been very busy coordinating research and data access related to the COVID-19 pandemic and hopefully he’ll tell us a little bit about that. He will also tell us about what it’s like to be an infectious disease modeler and how you can become one. So Wilbert, over to you.
Dr. Wilbert van Panhuis:
Thanks, Dorit. Thanks for inviting me to do the webinar, and I’m very happy to talk here about the work that I do as an infectious disease epidemiologist and the work that our network is doing. As mentioned by Dorit, I will talk about work in the context of the infectious disease modeling research and also the MIDAS network.
So a little bit of context first. As Dorit mentioned, I’m the PI of the MIDAS Coordination Center. The MIDAS Coordination Center is a collaboration between multiple institutions. It’s led by the University of Pittsburgh, but also includes the Fred Hutchinson in Seattle, University of Texas at Austin, University of Massachusetts at Amherst, and University of Florida. MIDAS stands for Models of Infectious Disease Agents Study, and it’s a global network of 400 scientists and students, and many others as well, for infectious disease modeling.
About 100 members of the network are modeling COVID-19 right now, and so this has been a very big focal point for our modelers. The website is mentioned there as well. Another project, just for context, here at Pitt is Project Tycho, which is our longstanding data repository for infectious diseases. It improves access to infectious disease data in an easy-to-use format and ties in very nicely with the work that we’re doing on modeling, as well as on data science, and trying to combine the two, and I’ll talk a little bit about that.
What we’ll talk about today in the first 10 to 15 minutes: what are the scientific problems that are studied by infectious disease modelers? What are infectious disease modelers?
And also a little bit more about the career trajectories. What could you do and what have people done to become infectious disease modelers? And then more specifically, what are the training opportunities for infectious disease modeling in the MIDAS network? And a little bit, then, about what the MIDAS network is doing in terms of modeling research on COVID-19. And after that, we’ll go back to the questions.
So what is modeling? These are just some very basic concepts that do not just apply to infectious diseases but to many different modeling endeavors and research, but just to get us all on the same page, because different people can have different things in mind when we talk about modeling.
So here is a quote that I actually took from Wikipedia that was very pointed, actually was originally from John von Neumann, a famous mathematician, that a model is a mathematical construct which, with additional verbal interpretations, describes observed phenomena in the world.
And so if you think about a model, and here I’ve put some kind of additional information here, there are observations about the world. So here you see A’s and B’s and C’s, and this is input data, if you wish. And modelers make a model out of that data about the world, about a phenomenon. And so this becomes a mathematical or computational representation of a phenomenon.
And then what do you do with a model? Once you have a model representation of a real-world phenomenon, what do you do with this? Then you can use that to gain insight about biological mechanisms. How do things work? How does an infectious disease work? Or how does something else work? You can also use it to evaluate strategies to change something, the control of an epidemic, for example. Once you have a model of an epidemic, you can evaluate how you can control it. And then there is forecasting that you can also do with your model.
So these are the three application areas most relevant for infectious disease modeling, but the general concept applies. The difference between modeling research and other research like randomized clinical trials or laboratory experiments is that here we make a computational representation of something in the real world, and we use that representation to gain insights, evaluate control strategies, or do forecasting. So that’s the general framework in which we work.
What are then some of the problems studied by infectious disease modelers using this kind of methodology? First, there are a large variety of modeling methods used. So in the context of what this model is that we just talked about, there are many different types of models. So you’ve heard about statistical modeling, mathematical modeling, compartment models, agent-based models, network models, probabilistic models, decision-analytic models, and others.
So there are lots of different types of models. There is also a lot of overlapping terminology. For example, a compartment model uses mathematical equations, so it could also be considered a mathematical model. We don’t need to get caught up in terminology, but just to say that there are a wide variety of models used for infectious disease modeling. The most common ones that you will see in the literature are mathematical compartment models, agent-based models, or network models.
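To make the idea of a compartment model concrete, here is a minimal sketch of the classic susceptible-infectious-recovered (SIR) model, stepped forward with simple Euler integration. The function name `simulate_sir` and all parameter values are assumptions chosen for illustration; they are not taken from the talk and are not fitted to any real disease.

```python
def simulate_sir(beta, gamma, population, initial_infected, days, steps_per_day=10):
    """Integrate a basic SIR compartment model with small Euler steps.

    beta  : transmission rate per day (illustrative value, not fitted to data)
    gamma : recovery rate per day (1 / average infectious period)
    Returns a list of (S, I, R) tuples, one per day, starting at day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # People move from S to I in proportion to contacts between S and I.
            new_infections = beta * s * i / population * dt
            # People move from I to R at a constant per-person rate.
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical parameters: basic reproduction number beta / gamma = 3.
    trajectory = simulate_sir(beta=0.3, gamma=0.1, population=1_000_000,
                              initial_infected=10, days=200)
    peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
    print(f"Peak day: {peak_day}, infectious at peak: {trajectory[peak_day][1]:,.0f}")
```

Agent-based and network models track individuals or contacts rather than whole compartments, so they would need a different structure than this sketch.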
What are the problems that are studied in this field? I think there are about three main categories that you could list. One is clarifying biological mechanisms of infectious diseases. These are more hypothesis-driven questions about how things work. So examples are, does infection with measles virus reduce the immunity against other infections? And we’ll get into that a little bit later. What is the incubation period for COVID-19? These are questions that have been addressed by models.
A second category is estimating the impact of interventions. So what if we want to do something against an infectious disease? How can we evaluate what the potential benefit would be in a computer model before we do it in the population? So one question is how many deaths are prevented by influenza vaccination every year? Or how much testing is needed in order to stop the COVID-19 epidemic?
We all know testing is important and tracing is important and other methods. So how much would you need to do in order to contain the virus? That’s a question that a model can answer. And then in terms of forecasting, infectious disease forecasting is an emerging field that is very similar to weather forecasting. So the question there is very simple: How many cases of COVID-19 can we expect in the next two weeks? How many cases of influenza can we expect this year? These are clear forecasting questions.
So two examples of scientific problems here. One, on clarifying biological mechanisms, is this study showing that long-term, measles-induced immunomodulation increases overall childhood infectious disease mortality.
So this study shows, using modeling methods, that here on the right-hand side, for example, you see the yearly prevalence of immunomodulation by measles, which is really related to the number of measles cases. And here on the Y-axis, the non-measles-related mortality. So this is how much the non-measles mortality is related to measles itself, and this study, using a variety of methods, shows very nicely that if you have measles you have more chance of dying from other diseases as well. And that’s a mechanism question that’s been answered here by the computational model. The other end is the impact of interventions.
So here I show a study that was already published in 2006 looking at strategies for mitigating an influenza pandemic. So here, for example, on the top right, BC is border control, and RMR is reducing mobility. You can see if you do these things, if you do mobility reduction and border control at certain lengths, you can delay the peak of the epidemic by 42 days, 56 days, etc. And you can also reduce the total number of cases if you do these measures.
So these models can help to quantify exactly the gain that we could achieve by certain interventions for stopping a pandemic, and that’s helpful before you decide to implement them. So these are ways that can help policymakers with their decisions, and also experiment in a computer setting before it’s in the population.
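The kind of computer experiment described here can be sketched in a few lines. The example below is not the 2006 study's model; it is a much simpler SIR-style simulation in which a hypothetical intervention lowers the transmission rate by a fixed fraction, so that the peak timing and the total number of cases can be compared with and without it. The function name `run_epidemic`, the parameter `transmission_reduction`, and all values are assumptions for illustration.

```python
def run_epidemic(beta, gamma, population, initial_infected, days,
                 transmission_reduction=0.0, steps_per_day=10):
    """Simulate an SIR-style epidemic and report (peak_day, total_cases).

    transmission_reduction is a hypothetical stand-in for any intervention
    that lowers transmission (0.0 = no intervention, 0.3 = 30% lower).
    """
    effective_beta = beta * (1.0 - transmission_reduction)
    s = float(population - initial_infected)
    i = float(initial_infected)
    dt = 1.0 / steps_per_day
    peak_day, peak_infectious = 0, i
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infections = effective_beta * s * i / population * dt
            s -= new_infections
            i += new_infections - gamma * i * dt
        if i > peak_infectious:
            peak_day, peak_infectious = day, i
    # Everyone who has left the susceptible compartment counts as a case.
    total_cases = population - s
    return peak_day, total_cases


if __name__ == "__main__":
    # Illustrative parameters only, not estimates for any real pathogen.
    baseline = run_epidemic(0.3, 0.1, 1_000_000, 10, days=400)
    mitigated = run_epidemic(0.3, 0.1, 1_000_000, 10, days=400,
                             transmission_reduction=0.3)
    print(f"Peak delayed by {mitigated[0] - baseline[0]} days")
    print(f"Cases averted: {baseline[1] - mitigated[1]:,.0f}")
```

Running both scenarios side by side is the basic pattern: the model stays the same, only the intervention parameter changes, and the difference in outcomes is the estimated gain.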
I don’t show an example here of forecasting because it’s very straightforward, very similar to weather forecasting. So these are a little bit about what kind of methods and what kind of questions do modelers answer, and we can go more into that in the questions if you’re interested. How do you become an infectious disease modeler?
I think a lot of people recently, especially with the COVID-19 pandemic, are interested in infectious disease modeling. How can you become one? How can you make your career go this way?
So first I would say that infectious disease modeling is interdisciplinary. Even in our MIDAS network we have people from many different disciplines. There are people from quantitative sciences and biological sciences, and more and more, social sciences such as history or psychology or economics are becoming very relevant for infectious disease modeling because we want to know more about people’s behavior and how people will react, to include that in the model. We want to know about economic consequences. We want to include that in the model, so we work with economists.
So I put this kind of on the side here because in current practice it’s not highly integrated yet, not as much as these two guys here. So the quantitative sciences are mathematics, computer science, computational biology, engineering. And then on the biological sciences, the biology, zoology, ecology, and medicine. And there may be other sciences as well here. And that’s how modeling kind of works.
So you can approach your modeling career from different ways. One is to say, I guess the most common route that I’ve seen at least is to do a major in a quantitative science, and then learn some biology for the disease that is interesting to you, and then use your quantitative background to model that disease. And many people have gone that route. On the other hand, you can also start with a biological science and then learn quantitative methods to represent computational models and become an infectious disease modeler. That is, I think, a bit of a less common method.
I guess it’s not easy for people that come from a biology background (or for myself from a medical background) to learn quantitative methods. On the other hand, it seems possible for people that have a strong quantitative background to learn some of the biology. But there are lots of different routes to get into this field, and it’s definitely not prescriptive. Many people find their own ways.
Here I show a couple of examples of members of the MIDAS network. Some of these are very prestigious modelers that have been very prominently in the news for COVID-19. You can also go to the link here. MIDAS network has people in it, and the people have all their links to their websites and their profiles, so you can see what people have done to see what routes they have taken.
Here is Betz Halloran. She’s part of our center as well. She did an undergrad in physics and philosophy of mathematics and then graduate training in medicine, Master of Public Health, and a doctorate in population sciences. And so she did have the quantitative undergrad and then moved into more biological sciences.
Neil Ferguson here, at Imperial College London, whom you may have heard of, did a pure physics undergrad and also did a doctorate in physics, and then during his post-doctoral work started to apply his skills to biology and to infectious disease questions.
Elaine Nsoesie here at Boston University did her undergrad in mathematics, again very quantitative, and then did her graduate work in statistics, genetics, bioinformatics, and computational biology and is now modeling many infectious diseases as well. So she has done that biology part later in her career.
And then here Micaela Martinez actually comes much more from a biology background. Throughout her whole career she has worked on both mathematics and biology at the same time. And so there are many different ways to get into this field, and you can see from each member how they are moving their way, which is actually very, very interesting.
But the MIDAS network can help as well. You don’t have to figure out everything on your own. If people are interested in infectious disease modeling, including students, even undergrad or high school students that we’ve interacted with, they can go to our network for training. The MIDAS network aims to be a very student-friendly network. Students are considered full members of the network, and I’ll show later how you can become a MIDAS member if that’s of interest. We have students represented in our steering committee, and students present their work during our conferences and events.
There are also a fair amount of training opportunities for students interested in learning about infectious disease modeling, often with the possibility for travel support to go to those training opportunities. When COVID-19 has reduced and we can go to physical, in-person events again, then there is often travel support either from the organizing institute or from our center.
Here you can see some examples. Short courses in infectious disease modeling have become increasingly available. I just list a couple of examples here. There’s the annual summer institutes in Statistics and Modeling of Infectious Diseases, SISMID, very well known at the Fred Hutchinson. There is the clinic on Meaningful Modeling of Epidemiological Data, MMED, in South Africa. There’s also Mathematical Modeling for the Control of Infectious Diseases at Imperial College London, and the Introduction to Infectious Disease Modeling and Its Application at the London School in London as well. But there are many other ones, both in the U.S. and abroad.
Other opportunities are the MIDAS webinar series. We are organizing a webinar every last Friday of the month, and on the website you can find information about that. There are also increasingly online courses. For example, through Coursera. This is an example, Epidemics: The Dynamics of Infectious Diseases, a really high-quality course done by Penn State University. And also we have an annual conference to increase diversity: Mathematical Modeling and Public Health.
This is geared toward undergraduate students who’d like to learn more about modeling, and we organized that. And then we have various other workshops, ad hoc workshops organized by MIDAS and others, and we always ask for participation not only by faculty and staff but also by students.
So following the MIDAS network developments here will expose you to training events that are available through the network.
Then a little bit about COVID-19 of course. Our network is highly active on this, as is our Coordination Center. But like I said earlier, about 100, a quarter of the MIDAS network, is working on modeling research for COVID-19. Many members work with international, national, state, or local health agencies. As I said, models can help decision makers decide about what to do with the epidemic, what interventions to use, and there is a lot of interaction between modeling groups and health departments.
There are five modeling working groups between MIDAS and U.S. CDC. We have a mailing list. Anybody can subscribe to the COVID-19 mailing list to stay updated on COVID-19 modeling work by MIDAS. And also we, as a coordination center, do a lot of work to centrally disseminate data information and model results to MIDAS members and to the general public, and there has been a massive engagement, really, of the modeling community for COVID-19 research enabled by the MIDAS Coordination Center and by members of the MIDAS network.
A little bit of research here. You can see some examples. You’ve undoubtedly heard a lot about modeling of COVID-19 in the popular media. The New York Times has done a lot of work on showing research by MIDAS members and also others. Here is work by Alessandro Vespignani on social network analysis and how the mobility patterns affect the spread of the coronavirus epidemic.
Here’s also very interesting work done by Nick Reich and his colleagues, shown on the 538 website, where he displays six or so models that forecast where we’re going to go with the epidemic and shows the actual data as well. And so week by week, you can see what the data do and what the models have forecasted, and you can see how good they were, and which model was the best model in forecasting COVID-19. Very data-driven comparative work to improve modeling and to also inform people about what the different models are saying in a way that is easy to understand.
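The week-by-week comparison of forecasts against observed data can be illustrated with a small scoring sketch. This is not how the forecasts shown on the website were evaluated; it uses mean absolute error as a simple stand-in metric, and the model names and case counts below are made-up numbers for illustration only.

```python
def mean_absolute_error(forecasts, observed):
    """Average absolute gap between forecast and observed weekly counts."""
    if len(forecasts) != len(observed):
        raise ValueError("forecasts and observations must cover the same weeks")
    return sum(abs(f - o) for f, o in zip(forecasts, observed)) / len(observed)


def rank_models(model_forecasts, observed):
    """Return (model_name, error) pairs sorted from best to worst."""
    scores = {name: mean_absolute_error(values, observed)
              for name, values in model_forecasts.items()}
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical weekly case counts and forecasts, not real data.
    observed_weekly_cases = [120, 150, 210, 260]
    model_forecasts = {
        "model_a": [110, 160, 200, 300],
        "model_b": [130, 140, 220, 250],
        "model_c": [90, 120, 150, 200],
    }
    for name, error in rank_models(model_forecasts, observed_weekly_cases):
        print(f"{name}: mean absolute error = {error:.1f} cases per week")
```

Forecasts that come with uncertainty ranges would need a scoring rule that takes the whole range into account, which this point-forecast sketch does not do.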
Then here is a little work done by Micaela Martinez about the seasonality, why dozens of diseases wax and wane with the seasons and about COVID-19. So a lot of this modeling work is featured highly during this pandemic and also we’re doing our best to put high-quality information out there.
The role of the coordination center is in the background to help the modeling research. And so the big overarching data objective, you could call it, is to make data FAIR, following the FAIR guiding principles. So that means to make data Findable, Accessible, Interoperable, and Reusable. A lot of the information coming out during this outbreak is in a very ad hoc way.
You’ve seen there are numerous dashboards, numerous COVID-19 case trackers, numerous articles coming out, and we want to make these data well-annotated and easy to discover and use by researchers using data science principles. And so the MIDAS Coordination Center brings together data science principles and practices, such as the FAIR guiding principles and concepts such as good metadata and good discoverability, with what comes out of the research domain, and then uses that to help accelerate the modeling work and to provide high-quality modeling work.
So how to get involved. There are lots of ways to start to connect to the network through the MIDAS website. It gives you all the information about MIDAS people, MIDAS members, MIDAS projects, etc. You can request membership online, subscribe to the mailing list, etc. So that’s the first place to go. There is the webinar series as well every last Friday of the month. And so I would say go there if you’re interested to see the presentations by MIDAS researchers and also students for their work. There are emails that you can use and also follow us on social media. So there are lots of ways to get connected to research, as well as training, by the MIDAS network.
And then I’d like to acknowledge, as with anything, this is not an individual project. There are lots of people involved behind the scenes. We have a large team of programmers, coordinators, curators, and students, all working to make this happen. And also, of course, I’d like to acknowledge funding by NIGMS and by NIH Big Data to Knowledge and the Research Data Alliance to help move forward this data and modeling agenda.
Thank you very much, and back to Dorit for any questions that people may have.
Dr. Dorit Zuk:
Thanks, Wilbert, for a really interesting talk. So many things to talk about, but let’s stick to the training for a minute. The first question I had for you is: You talked about how to become an ID modeler and also what kind of questions ID modelers study, but perhaps you can talk a little bit about what sectors you would find infectious disease modelers in beyond academia, that of course you work in.
Dr. Wilbert van Panhuis:
Many people may think that infectious disease modeling is something that only happens in universities, but that’s really not necessarily the case. Lots of people come from a research background and they are doing research. But lots of people work in different settings. So modeling is not only done in universities for research questions. It’s done by government agencies that are working on infectious disease planning and infectious disease research as well.
Think about U.S. Centers for Disease Control, but also some state health departments, World Health Organization, many places use modeling already. You can also think about industry. Lots of students go to the pharmaceutical industry or insurance industry to model infectious diseases, and also to model the potential impact of vaccinations and profiles of products.
And then also insurance industries, of course, they need to know the risk of events, including infectious diseases, and they would also do a lot of modeling. So there is increasingly a use of modeling also in all the sectors that people go to.
Dr. Dorit Zuk:
Great, thanks. A question from the chat: you talked about how you become an infectious disease modeler or other type of modeler from a quantitative background or from a biological background, but what if you’ve got a social science background, which is slightly less integrated than those two seem to be? Do you think the fellowships and the courses you mentioned are enough, or do you think further study is needed?
Dr. Wilbert van Panhuis:
That’s a really interesting question. As I said, more and more people from social sciences are also interested in modeling. We’re actually at University of Pittsburgh working with a variety of departments where people want to use modeling in their own domain.
So within social sciences there is an increasing amount of modeling and modeling training that people can do. For infectious disease modeling specifically, I think people from social science backgrounds could as easily join and do the training programs, post-doctoral training, or even doctoral training, that is available with people in the MIDAS network or others.
Although I have to say that for most of the modeling work, you do need quantitative training and you do need a quantitative background. So if you have a social science background and you would like to do modeling and you don’t have the mathematics or the programming or computer science skills, then you could supplement your existing knowledge with coursework in some of those quantitative sciences and then go into doctoral work in infectious disease modeling, or go through some of the summer programs or short courses that are being offered.
I do have students, for example, that are coming in with a more non-quantitative background for a doctoral program, and what we do is they take courses and electives in quantitative sciences to help them gear their career towards infectious disease modeling.
Dr. Dorit Zuk:
Did you need to do that when you came with an MD to this field?
Dr. Wilbert van Panhuis:
Yeah, I had to. I’m not sure that I did it, but maybe I should have. I came in with my MD, my medical training. Of course, during medicine you don’t learn computer programming or statistics all that much, so I did have to do a PhD in epidemiology and then from there on in my post-graduate work and also as junior faculty, I learned a lot on modeling by collaborating with modelers. And in the end actually, I even did a K01, a training grant with NIH, with the Big Data to Knowledge program, to cross-train, even as a junior faculty, from medical sciences to computational sciences.
I spent about five years before I did the MIDAS Coordination Center to learn about data science and informatics before I made this switch. So yes, it did take a fair amount of investment to get that kind of training.
Dr. Dorit Zuk:
So lots of ways to do it. On that note, and then I’m going to switch to something else, those courses you mentioned in the slides. Are more details available on the MIDAS website or where can one find information?
Dr. Wilbert van Panhuis:
We have a training page on the MIDAS website, midasnetwork.us/training. People can go to the website if they are interested. What we’re doing right now is compiling information about all the short courses, modeling courses, seminars, webinars, etc., that are being done on infectious disease modeling.
Currently it’s a challenge because lots of people are doing webinars, and they pop up very rapidly, so it’s hard to keep track of what’s going on, but we’re doing our best to put as quickly as possible up on our website what events are taking place. And then as we move forward, we will post things like modeling textbooks, modeling short courses, curriculum, people in the MIDAS network that are doing training, and so our website is a good place to start your explorations.
Dr. Dorit Zuk:
I will say there is a public mailing list for people like me who are not modelers, but you get notices about the seminars, so that’s good.
Dr. Wilbert van Panhuis:
If you want to just follow MIDAS and what goes on, then you can join the public mailing list. All the announcements that are for broad dissemination will go through that. So there are multiple ways you can join, or be a part, or be notified of what’s going on.
Dr. Dorit Zuk:
Switching gears a little bit, but not completely, we’ve seen lots and lots of reports about models in the media, and I wonder what you think about communicating complex mathematical models to non-scientists, whether it be just the public or any other stakeholder? And as a second part to that question, just to make it complicated, how do you think trainees should think about that?
Dr. Wilbert van Panhuis:
There is a lot to unpack in that question. I would say there are a couple of aspects there. One is the communication part and then one is how is a trainee to do that. So the communication has changed, I would say, over the last…especially during this pandemic. So what’s happened in the past when people want to publish their work, as a scientist, you publish your paper in a peer reviewed journal and the system of peer review helps to review the scientific quality and rigor of the work, theoretically, and then when people publish papers it has been vetted by the community in some way.
Now what’s happened recently is scientists are putting their research out more and more early in the process because during the pandemic, of course, there is a need for early information to make your decisions. So for example, the first time we heard about the COVID-19 outbreak we wanted to know about the incubation period. How long does it take for this virus from infection to symptoms showing?
So if you see someone with symptoms, how long ago were they infected? That’s a really important piece of information. And so researchers start to study this question, putting information up as pre-prints. And pre-print servers are where you can post your research and it’s tagged with your name, and you can show it’s your work, so nobody else can claim it was their work, but it has not been peer reviewed yet.
And so that’s a complicated process, because it does release early information, but it does not quality-check this information. So what we’ve seen is an incredible amount of almost 100 to 200 papers per day coming out on pre-prints, and so it’s a deluge of information, and now people have to sort out the good research from not-so-good research. And the complicated part is that the methods are very complicated.
So modeling methods are typically not easy to understand if you don’t have the training, and so it’s hard for journalists or government decision makers or even people in government to decide, “Oh wait, this model I can use or this model I should not use.” So as a scientific community we’re trying, and our coordination center is trying as well, to find ways to catalog good models, good modeling studies, and models that are not so good.
One nice example for that is this forecasting study that I showed where you can actually see eight different models and the data, and you can see every day what the model is forecasting, and you can see which one was the right one. And so there are ways to kind of start to get at that question of what is good-quality modeling and what is modeling that is too quick and too dirty work to be used for some important decisions.
As a community we need to make that distinction. So that was kind of the first part of the question. How can trainees work on this and learn this? I think there are a couple of things. One is to first learn how to do rigorous research and how to do good research. And so doing research and doing modeling training with good modeling groups will help with that, because through mentorship and through training, you learn how to not only make good models but also how to behave and how to be a diligent, accountable scientist moving forward.
And the other thing is to think about how you can use data for your models. There is a lot of modeling that can be done with a lot of assumptions and shortcuts, but you can also try to make the model based on actual data sources, which can sometimes be a lot of work to go after those data sources and standardize them and make them work in your model, but it can also be an additional validation of your model.
And then lastly, I would say, there is the way the models are described in the papers. You can decide at any point in your research whether you’re going to publish a model or whether you’re going to not yet publish the model. When is it good enough? That’s a judgment that you can learn yourself, but you can also learn it by actively seeking out this kind of training through professional development opportunities, or through workshops on reproducibility of research or on ways that models should be described in the literature before you can publish them.
So I think more and more there is a part of our community that’s trying to move forward with best practices and consensus about how to approach reproducibility and quality of modeling and communication of modeling to the public, but there is also more and more responsibility with the individual researchers to take their work seriously and their communication seriously. And they can seek out training, and they should seek out training to develop themselves in those areas.
Dr. Dorit Zuk:
Thanks. As someone who is not a modeler but who is interested in communication, one other thing that I would add and ask you is, how important is it to be really clear about the caveats of your work and what it can and cannot definitively say?
Dr. Wilbert van Panhuis:
We learn when we do statistics or other types of research that you make your model and you do your research experiments, whether it’s a computational experiment or a real experiment, and then in your results description, your discussion, you describe the inference that you can make based on your experiment, and we have to be careful about making the correct inference.
So if I use data from a nursing home and I publish a COVID-19 model, then those results are about people in nursing homes. You cannot say that this is true for the general population. So those kinds of things have to be described carefully.
And I think looking carefully at how you describe your models and also, like you’re saying, put in what the model cannot do. So in your discussion section, you do delineate the limitations of your model and hopefully that will be picked up by the readers and the users.
One other interesting thing that we’re doing as a center in terms of the data science aspect is that we more and more want to not only have human-readable PDF narratives of model results, which are scientific papers, but we also want machine-interpretable information in a structured format that describes what the model did and what the model can be used for or not.
And so as we’re putting disclaimers and use limitations in the research papers, we can also put that in computational representations of the information so that later on we can have computer algorithms determine as well and learn what you can do with this model and what you cannot do with this model. But there is a lot of research that we can do in that kind of aspect, in science communication, especially now with the pre-prints and with the high amount of attention from the public, as well as from the media in infectious disease modeling.
And I do think that in that sense, the current time period will really help accelerate some of that work just because of the need of the pandemic.
Thank you. So switching gears just a little bit, and again complicated question, but in general, what types of data do you need to construct an accurate model for forecasting the course of an infectious disease outbreak or the results of an intervention strategy?
There again, there is a dichotomy between what data you use and what you want to do with the model. So a lot of modelers, I’ve been in many conversations with health departments where health departments would say or health agencies would say, “What can you model for us?”
And then modelers would say, “Well, what data do you have?”
And then they would say, “Well, that depends on what you can model.”
But then the modelers would say, “Well, the model depends on what data you have.”
So we have this chicken-and-egg thing going back and forth a lot. And so the typical answer is that, depending on the question that you want to answer with the model, you need certain data. And it’s typically a good idea to approach modeling just like any other research question in saying, “I have a specific biological phenomenon that I want to model.” Whether it’s vaccination for influenza, or mosquito control for malaria, or business closures for COVID-19, these are all similar questions once you’ve defined the question.
So let’s take, for example, what is the effect of transport reductions and mobility reductions on COVID-19 transmission? If that’s the question, then you need information about COVID-19, about the virus, about the biological behavior of the virus, but you do need the mobility data and the transportation data to look at that effect. And you need the patterns of transport and mobility. So for that question that’s set in that framework, you get those data sets and then model that question. If it’s about vaccination, potentially, you would need information about immunity against COVID-19 and how well immunity works. And then you need to unveil that part of the biology of the virus to be able to do that model accurately.
So there are lots of different layers to what you… And also the level of detail. If you’re okay with doing a model for the whole of Pennsylvania or the whole of the U.S., your data can be extremely coarse. But if you want to model local transmission, then you need to go local, and then you need to go to county-level and ZIP code-level data.
So I think it’s like with any research question. We can pay attention to what the questions are that we want to model, what components we want to include, and that will then define what type of model that you want to make and what type of data you need for that. But I would say that, of course, because it’s COVID-19 or because it’s certain infectious diseases, you always need data about the virus or the bacteria, the pathogen, as well as on the host population, and then the information about whatever particular thing you want to study about that disease and that host.
And then the question is the quality of the data. You want to be sure that you understand, at least, the quality of the data. Just like some people don’t understand modeling methods very well and should be careful to interpret what’s being written if they don’t understand it, it’s also important for a modeler to understand where the data come from, so that you can assess carefully whether the data are good enough to use in your model, and that it doesn’t have certain biases or other problems with it that you’re not aware of.
Are people using machine learning in this context to make predictive models? And how much is that being used?
Machine learning is used not as a model in itself, because it doesn’t necessarily include the mechanistic representation of disease transmission, but it is used in the context of forecasting and in training models on data. And so we’ve seen various groups in the MIDAS network that use machine learning to train models to forecast epidemics.
So use historical data to learn how the data relates to, for example, how data on population or historical case counts relate to future case counts. You can even tie that in with climate data or other types of data, and then machine learning is used to find the patterns in these data and then project forward what you can expect in the future, and so in that context machine learning is used.
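A minimal sketch of the forecasting idea described above, assuming the simplest possible learned relationship: next week's case count as a linear function of this week's. The function names and the case counts are made up for illustration; the forecasting groups mentioned in the talk use far richer models and real surveillance data.

```python
# Illustrative only: learn how this week's cases relate to next week's,
# then project that relationship forward.

def fit_ar1(series):
    """Least-squares fit of series[t+1] ~ slope * series[t] + intercept."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(series, steps):
    """Apply the fitted relationship repeatedly to project `steps` periods ahead."""
    slope, intercept = fit_ar1(series)
    last = series[-1]
    projected = []
    for _ in range(steps):
        last = slope * last + intercept
        projected.append(last)
    return projected


# Made-up weekly case counts, not real surveillance data.
weekly_cases = [10, 20, 40, 80, 160]
print(forecast(weekly_cases, 2))
```

Additional inputs such as climate or population data would enter as extra predictors in the fit; this sketch uses past case counts only.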
And of course in our case, in terms of making data accessible and standardized, machine learning is also used in the data preparation part where we want to extract data from literature and from tables and from existing data sources and do that automatically. And so you can use machine learning in that context as well.
So yeah, I think both machine learning and AI are increasingly used in infectious disease modeling as well, especially since we’re starting to deal with very large data sets and increasing availability of computational power. During the pandemic there has also been an enormous amount of computational resources made available by the tech community, at almost no cost, to infectious disease researchers, and as that continues, more and more computation-intensive methods will be used to represent the disease as well.
So you actually answered a couple of the questions that I had about how much computer resources modeling needs and whether they can use large data sets. But can MIDAS help with that when it comes to computer resources and access to the big data sources?
You’re right that a lot of the models can use computational power. There are different modeling methods, as I described, and different methods need more or less computational resources. So first, I would say it’s not necessary to have access to large-scale computing to do infectious disease modeling. Anybody can do infectious disease modeling, even with simple programs like R or MATLAB, and run small-scale models or models that make generalizations, and then the computational requirements are lower.
And we’ve seen excellent research being published with very limited computational resources. That being said, modeling detailed data, for example mobility patterns at very high spatial resolution based on cell phone data, tracking COVID-19 transmission like that, and also implementing behavior so that people do different things based on their circumstances, is very computationally intensive. Imagine making a model for the whole U.S. It’s 300 million people, and the model would have to track every person and what they do. That’s an incredibly computationally intensive system.
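To make the cost of tracking every person concrete, here is a toy individual-based sketch. Everything in it is an assumption for illustration (random mixing, eight contacts per day, a 5% transmission chance per contact, five infectious days); it is not the method of any MIDAS model. The point is only that the work grows with people times days.

```python
import random


def run_agent_model(num_people=1000, days=60, contacts_per_day=8,
                    transmission_prob=0.05, infectious_days=5, seed=1):
    """Toy model: every person is visited every simulated day."""
    rng = random.Random(seed)
    state = ["S"] * num_people   # S = susceptible, I = infectious, R = recovered
    days_left = [0] * num_people  # remaining infectious days per person
    state[0] = "I"
    days_left[0] = infectious_days
    person_updates = 0

    for _day in range(days):
        newly_infected = []
        for person in range(num_people):
            person_updates += 1
            if state[person] != "I":
                continue
            # Each infectious person meets a few randomly chosen others.
            for _contact in range(contacts_per_day):
                other = rng.randrange(num_people)
                if state[other] == "S" and rng.random() < transmission_prob:
                    newly_infected.append(other)
            days_left[person] -= 1
            if days_left[person] == 0:
                state[person] = "R"
        # Infections take effect at the end of the day.
        for other in newly_infected:
            if state[other] == "S":
                state[other] = "I"
                days_left[other] = infectious_days

    counts = {label: state.count(label) for label in "SIR"}
    return counts, person_updates


counts, updates = run_agent_model()
print(counts, updates)
```

With 1,000 people and 60 days this does 60,000 person-day updates. The same loop for 300 million people over a year would need on the order of 10^11 updates, before adding mobility or behavior.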
And so, yes, the MIDAS Coordination Center is set up to make computational resources available to modelers. So if any student or faculty member joins the MIDAS network for their modeling research, they have access to computational resources provided by the coordination center, and that is funded by NIGMS. And NIH has a program called STRIDES that makes large-scale computing from the big tech companies available at a lower cost for research.
As a coordination center, we have set up all the contracting and the infrastructure to make high-performance computing available to MIDAS members so that they don’t have to make their own contracts or they don’t have to necessarily even learn how to do it because there is staff to train also in the use of high-performance computing. And so as the field grows and the developments in computing increase, we make sure as much as we can to also give that and make those resources accessible to researchers and students in the network.
Cool, thanks. So specific interesting question, interesting to me. Do you have any insight on how one reconciles data findings that are not statistically significant but may be biologically significant, or vice-versa, and how you sort of figure that out?
That’s an interesting question. As you’re doing statistics training, you learn more and more about what the relevance is of statistical significance and biological significance.
On one end of the spectrum, you can make almost anything statistically significant if you grow your sample size, because significance is a function of the variability in the sample and the size of the sample. So you grow your sample; at some point you can make that significant.
On the other hand, biological significance may be very small. So just maybe for the other people as well, statistical significance is defined by, I’ll say it a little bit in lay terms, how consistent your results are and how varied the results are. If all the results, all your measurements, are close together and they point to the same outcome, that can be considered statistically significant. It’s very unlikely that this result is a false positive, that this result is actually not really a result but a coincidence. That’s statistical significance.
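The sample-size point can be sketched numerically. The example assumes a one-sample z-test with a known standard deviation, and a tiny fixed effect of 0.01 standard deviations; both the test choice and the numbers are illustrative assumptions, not from the talk.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a z-test for a mean difference `effect` with known `sd`."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


# Same tiny effect, growing sample size.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(effect=0.01, sd=1.0, n=n))
```

The effect never changes, but the p-value shrinks as the sample grows, so a difference that may be biologically negligible eventually crosses any conventional significance threshold.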
If you look at biological significance, it means: is this relevant to understanding the biology? And so there are lots of instances where studies are done, including in infectious diseases, where you see very interesting results that could have major biological implications. So whether the basic reproduction number for COVID-19 is two or three or four has massive implications for what we need to do in terms of vaccination and in terms of other things. So a shift in R naught of one or two points makes a very big difference biologically. But we may find something else that’s statistically significant but that has almost no biological relevance.
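One way to see why a shift of one or two points in R naught matters for vaccination is the textbook herd immunity threshold, 1 - 1/R0. This formula is not stated in the talk; it is a standard approximation that assumes a well-mixed population and a fully protective vaccine.

```python
def critical_vaccination_fraction(r0):
    """Fraction that must be immune to stop sustained spread, under 1 - 1/R0."""
    if r0 <= 1:
        return 0.0  # an outbreak with R0 <= 1 dies out without vaccination
    return 1.0 - 1.0 / r0


for r0 in (2, 3, 4):
    print(r0, round(critical_vaccination_fraction(r0), 3))
```

Under these assumptions the required coverage rises from one half of the population at R0 = 2 to three quarters at R0 = 4.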
And so how to reconcile this is that when you build a computational model and you’re using lots of factors together to build your model, so you use the incubation period, and you use the basic reproduction number, and you use the information about the vaccine, and you use the information about other components of society to build your model, you may have to use pieces of information that are not statistically significant, that have not been found to be statistically significant in an empirical test with lots of sample size.
But then what you can do is build your model to test not the significance with a numeric approach, but you can take your model and then see if the biological mechanism is actually working. And so we’ve seen studies, for example, that say, OK, some of the literature shows that having influenza can increase the chances of dying from pneumonia. So influenza can worsen pneumonia. So there can be studies on that, but you can also say, “I’m now going to run a clinical trial,” but you can also say, “Listen, I’m going to build a model, which people have done, and I’m going to represent influenza and pneumonia, and I’m going to look at different mechanisms through which influenza can worsen pneumonia.”
And then you look at all these different model results with your different hypotheses represented, and you compare those results with the data to see which one is the most likely to be the case. And so you can use models in that sense to explore mechanisms for which you may not have had statistical power, and may never get the statistical power, but you can still look at the mechanism involved. So models can combine these different components, whether statistically significant or not and biologically relevant or not, to see how it may work in society.
Thanks. I have a question here. How different are the models of transmission of different viruses, say for example COVID-19 versus flu versus measles or anything else you can think of?
They can be extremely similar in terms of the mathematical constructs. It’s absolutely amazing, if you look at the early modeling books, for example, those by Roy Anderson and Robert May (that was in the ’80s, there was a book published), of the mathematical equations that govern infectious disease transmission. If you look at the set of equations, they can reproduce an epidemic pattern almost exactly. So these basic equations govern a lot of the process.
So from disease to disease, the models don’t need to be very different, but what is different is the parameter values that you select. So you have to put in the incubation period, which may be very different from one virus to another, and you have to put in the basic reproduction number, and you have the transmission rate, and the contact patterns. Some diseases transmit more in children. Others transmit more in older people. So those virus-specific values have to be included, but the model itself, the structure of the model, can be very, very similar between different diseases.
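The idea that the structure stays the same while the parameter values change can be sketched with a basic SIR (susceptible, infectious, recovered) compartmental model. This is a simplified stand-in, not the specific equations from the book mentioned above, and the two parameter sets are hypothetical, not measured values for any real pathogen.

```python
def simulate_sir(r0, infectious_days, population=1_000_000,
                 initial_infected=10, days=365, dt=0.1):
    """Integrate a basic SIR model with simple Euler steps of size `dt` days."""
    gamma = 1.0 / infectious_days  # recovery rate per day
    beta = r0 * gamma              # transmission rate per day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    peak_infected = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak_infected = max(peak_infected, i)
    return {"final_size": r / population,
            "peak_fraction": peak_infected / population}


# Same model code, two hypothetical parameter sets.
print(simulate_sir(r0=1.5, infectious_days=4))
print(simulate_sir(r0=12.0, infectious_days=8))
```

Only the arguments differ between the two calls. Age-specific contact patterns, also mentioned above, would require splitting each compartment by age group, which this sketch leaves out.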
In fact, the big model that got a lot of publicity from Imperial College that was published a couple of weeks ago or months ago about COVID-19 used the same methodology that I showed here in the slides about the influenza pandemic in 2006. And so the model was almost the same, but it was adapted to represent a different virus.
Thanks. Back to data, because that seems to be underlying everything we’re talking about today. So I wanted to talk a bit about the FAIR principles, and you’re going to have to help me remember: Findable, Accessible, Interoperable, and Reusable. I know you’ve been, over the last few weeks, months, however much time we’ve been doing this, you’ve been looking at many, many types and sets of data that people have been putting out there. How difficult is it, or what do you need to do to try and make them as FAIR as possible?
That’s a really excellent question.
As I mentioned in the slides, a lot of information is being produced by researchers, by journalists, by volunteers, by anybody, to model and to inform about COVID-19. But a very, very, very small proportion of that follows the FAIR guiding principles. The FAIR guiding principles were published in I think 2014 or early ’15, about five years ago, about how we can make data ready for automatic interpretation. Some people, even my colleague Barend Mons, who is one of the FAIR originators, said FAIR stands for Fully AI Ready. So there was another term for that.
So to use AI, you need data in a format that computers can understand. And we are miles and miles behind in global health to make that happen. So a lot of the work that we’re going to do now, and what we are doing right now as a center and what needs to be done, is to create metadata. So metadata is information about a data set in a structured form that a computer can understand. And we use a standard schema for that, and we use standard vocabularies for that, that computers have learned how to understand. And so we annotate data sets with a very detailed set of metadata, which takes an incredible amount of work to create, but once that’s created, then everything will be accelerated because we can search through that data, we can edit, we can reason over that data, we can make computer algorithms, etc.
So what’s required is to use metadata and to give data standard identifiers, so that each data set has its own unique identifier and we can find it through that identifier, and there is good metadata and there is versioning, and the data are stored in a repository that can be accessed at some point. It doesn’t need to be openly accessible, but it does need to have some way to access it. And so as a center, we’re looking at all the COVID-19 data coming out, which is a very large task, and then we’re picking the high-priority data sets, moving them into the FAIR process, and then getting, hopefully, an infrastructure out of this pandemic that will serve effective modeling for many decades to come.
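As a rough illustration of what a structured, machine-readable metadata record with an identifier and a version might look like, here is a small sketch. The field names, the required-field list, and the identifier are hypothetical placeholders; they are not the schema or vocabularies the center actually uses.

```python
import json

# Hypothetical minimal set of fields; not an official metadata schema.
REQUIRED_FIELDS = ("identifier", "title", "version", "creator", "access", "license")


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


example_record = {
    "identifier": "doi:10.0000/example-dataset",  # placeholder, not a real DOI
    "title": "Example county-level case counts",
    "version": "1.0.0",
    "creator": "Example Health Department",
    "access": "restricted; available on request",  # need not be open, but stated
    "license": "CC-BY-4.0",
}

# A structured form that software can parse, search, and check.
print(json.dumps(example_record, indent=2))
print(missing_fields(example_record))
```

Because the record is structured, a program can check it for gaps or search across many such records without a person reading each one.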
Great. Thinking about that data coming from all different sources, you mentioned there are researchers, but there are also volunteers. There are electronic health records. And if data don’t have that metadata that you need at the source, is it possible to work with it to add some of it at least? Or is it a lost cause?
There is definitely a very interesting balance there. At some point, and we’re doing some separate work here with the Research Data Alliance to see what is the minimum level of metadata required for a researcher to decide, can I use this data or not? Whether it’s in a good format or not in a good format.
But as a researcher, if I give anybody data, the person would look at the data, ask me some questions, and then decide, I can use this or not. And so if we see data sets from whatever source and there is not metadata, what we do as a center is we contact the data creators, the people who have created the data, and we ask them the questions, or we look up information.
Can we find the information? Then we put it in the metadata schema. Now if there is a lot of missing data in the metadata schema, a researcher may decide that this is not enough for them to use, which would be a pity. That is why we want creators to think about this in advance for their data to be useful.
If there is enough metadata, we also represent that. There will be very complete metadata, and then researchers can use it. So in the end, we represent what we can. We have some internal standards of what we’re willing to put out on our website and publish or not.
And then there are community standards that people can decide on their own whether certain data are good enough for them or not. And a lot depends on the questions that they’re trying to ask with their models.
Thank you. We have a couple more minutes, and I’m going to switch gears completely and ask the question that came to me through the chat, and that is: K through 12 teachers and students are probably very interested in modeling now as well, since they see it in every newspaper. So how can we, and MIDAS in particular, the MIDAS Coordinating Center in particular, help with that?
Very interesting question. And we definitely want to benefit from that increased interest in modeling and in quantitative sciences. So there are, as I said before, lots of people coming into modeling need at least some quantitative background and quantitative training. Even when they are medical doctors like myself, you still need to train yourself in quantitative sciences.
So I think the best thing that people can do at that level, at the early level, is to improve as much as possible, or to gear towards, quantitative and science (STEM) training for their students. And there are lots of programs already in place by NIH and others that are trying to increase the number of students going into STEM training, and I think that’s something that can happen at that level.
We are also doing a lot. For example, here at Pitt through the DBMI program, there are summer programs for students from high schools, and even middle schools, who do modeling research with modeling groups during their summers, and then go on to undergrad and graduate school in modeling after that.
We also don’t want to pigeonhole students too early that may have very broad careers, and at a certain age it’s very difficult to say what their careers are going to look like, so having broad quantitative training is always a good idea in that sense, and it leaves all the doors open to students moving forward from there to start specializing in certain parts of modeling work, and that would set them up to have that opportunity. If they would choose not to go further into infectious disease modeling but do something else, that quantitative training would still benefit them many other ways. So I think at that level I would say that would be the best way to go.
Twelve-year-old me would have loved that.
Thanks very much. I see our time is up. Thanks for a really interesting discussion, Wilbert. And I remind everybody that this will be posted on the NIGMS website, and check out the MIDAS Coordination Center website at
Great. Thanks very much and bye everyone.
Thanks for having me. Bye.
This page last updated on 8/9/2021 11:39 AM