In this episode of
Expert Insights for the Research Training Community, Dr. Susan
Gregurick, computational biologist and director of the Office of Data
Science Strategy, discusses the development of computers, the internet,
networking, and analysis platforms, weaving in her personal journey. She
then describes the role of data and computational sciences in combating
COVID-19.
The original recording of this episode took place as a webinar on May 13,
2020, with NIGMS host Dr. Ming Lei. A Q&A session with webinar
attendees followed Dr. Gregurick’s talk.
Recorded on May 13, 2020
Welcome to Expert Insights for the Research Training Community—A podcast
from the National Institute of General Medical Sciences. Adapted from
our webinar series, this is where the biomedical research community can
connect with fellow scientists to gain valuable insights.
Dr. Ming Lei:
Good afternoon. And to those of you at the West Coast, Alaska, or
Hawaii, good morning.
My name is Ming Lei. I am the division director for Research Capacity
Building at NIGMS, and I am your host today. It’s a pleasure to welcome
you to the fifth webinar of the NIGMS training webinar series.
This is a difficult time. The pandemic has disrupted everybody’s life to
a certain extent. NIGMS created this webinar series to help keep our
training community together with useful and interesting talks and
conversations, and I hope you are all enjoying them.
A few reminders before we start the presentation. All webinars in this
series are recorded and some of them already have been and all of them
will be posted on the NIGMS website, so you can view them, actually all
of them, at any time, and I would encourage you to ask your friends to
view them when they have time. Secondly, there will be a Q&A session
after the presentation.
And our speaker today is Dr. Susan Gregurick. Because she will share her
scientific journey with you, I am not going to introduce her except
by sharing with you that as the NIH associate director for data science,
she is the leader at the center of all NIH data science activities.
So with that, Susan, take it away.
Dr. Susan Gregurick:
Thank you so much, Ming. And to all my friends at NIGMS, it’s a pleasure
to be here today.
I’ve been so excited and looking forward to this particular journey and
discussion with you for a week. It’s not often that I get to tell young
people about my own personal journey, and I hope that you see yourself
in a little bit of me and what I’ve done. I’m going to tell you about
what I’ve done in the computational sciences, which is my true love, and
how this has helped shape biomedicine and my own personal professional
So let me begin by telling you the beginning.
So I want to give you some historical perspective in the development of
computers, computer science, the internet—I actually watched the
internet’s birth more or less—networking, and analysis in my own
personal journey and how these have changed my professional life and
have helped me make my career choices.
And then I want to finish on something that’s relevant to every single
one of us around the world: How have we applied computing, internet,
technologies, analytics to address COVID-19? And we’re just at the start
of COVID-19, so we have a long ways to go.
So we’re going to have to go way back in the way-back machine to the
1980s. So the top song of 1982 is “Physical” by Olivia Newton-John,
which you may or may not have ever heard, but I’m sure you have seen the
movie E.T. It was the top-grossing movie in 1982. And I’m living
somewhere in a town in Michigan, and I’m a dancer. I actually take
ballet as well as highland dancing. I am a total goof-off. I am probably
in and out of school more than most. I’m a DJ at our local high school
My name is Susie at that time. I’m the homecoming queen. And I’m a total
closet geek. Nobody at my high school knows that I’m an avid reader of
science fiction. I’m reading Scientific American, which was about at my
level in high school. I’m taking classes at the local community
college—mostly in genetics and chemistry. I’m really fascinated with
science, but that was my secret life.
And here’s what computing looked like to me in 1980. So popular then was
the Commodore 64. I came from a community and a town where computers
were not very common, so even my high school did not have any computers,
but the local community college did. This is a typical computer science
room that I never got to visit when I was in high school, but I’m pretty
familiar with these.
And if you have never seen these before, these are punch cards. And so
when you write a computer program in the 1980s and before, you have to
translate it into the punch card system, and then you feed those punch
cards into a machine that’s not really quite visible in my picture.
And the worry of every computer scientist at that time was that you
dropped those punch cards, because they’re a program and they’re in
order, and if you drop those punch cards, you will spend a significant
amount of time and worry trying to get them back in order. Just imagine
trying to debug your program using a punch card system. It was so hard
to do, so much work.
And when I was in high school, this was one of the computational biology
highlights of 1982. By the way, 1982 is the year that I graduated from
high school. This is the story of the protein dynamics of a small little
tiny protein called BPTI, bovine pancreatic trypsin inhibitor. It’s
approximately 60 amino acids long, and you can see its ribbon structure
on the screen.
Wilfred van Gunsteren and Martin Karplus did molecular dynamics
trajectories of this little tiny protein for 25 picoseconds in vacuum,
and then they put it in a spherical shell of 2,647 non-polar waters, and
then they fixed it in a crystal structure and they tried to understand
the dynamics of movement of this protein in these three scenarios, and
that particular paper and that particular simulation was a tour de force
of computational biology the year I graduated from high school.
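A molecular dynamics trajectory of the kind described here is produced by repeatedly stepping Newton's equations of motion forward in small time increments. The sketch below is a minimal illustration only, assuming a single particle on a one-dimensional spring rather than a protein and its force field. The velocity Verlet integrator, the function names, and all parameter values are assumptions chosen for illustration and are not taken from the 1982 study.

```python
import math

SPRING_CONSTANT = 1.0  # hypothetical value, arbitrary units
MASS = 1.0             # hypothetical value, arbitrary units


def spring_force(x):
    """Force on a particle attached to a spring centered at x = 0."""
    return -SPRING_CONSTANT * x


def total_energy(x, v):
    """Kinetic plus potential energy of the particle."""
    return 0.5 * MASS * v * v + 0.5 * SPRING_CONSTANT * x * x


def velocity_verlet(x, v, force, mass, dt, steps):
    """Advance position and velocity step by step; return every (x, v) state."""
    states = [(x, v)]
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        states.append((x, v))
    return states


# 1,000 steps of 0.01 time units starting at x = 1 with zero velocity.
trajectory = velocity_verlet(1.0, 0.0, spring_force, MASS, 0.01, 1000)
x_end, v_end = trajectory[-1]

# For this toy system the exact answer is x(t) = cos(t), so the two
# printed numbers should agree closely.
print(x_end, math.cos(10.0))
```

The same loop structure carries over to many atoms: the position and velocity become arrays, and the force function becomes the expensive part of the calculation.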
And I was totally amazed that we could actually do calculations of
protein dynamics in these three different scenarios—in vacuum, in
non-polar solvent, and in crystal image. So moving a little bit forward
in the later 1980s, the top song is “Walk Like an Egyptian.” That was
the song when I was in college as an undergraduate. The top-grossing
movie of 1987 is Three Men and a Baby, actually a movie I never saw—it’s
not quite my interest—and I am at the University of Michigan and I’m an
undergraduate. And I graduate in the year 1987.
I’m a chemistry major and a math major. It’s not uncommon probably for
most of you to have dual majors. I am a research assistant. I am a
research assistant in mathematics. I’m a research assistant in geology.
And I’m also a research assistant in the medical school, where I am
developing hepatic imaging agents through synthetic organic chemistry,
and that is not my strength.
I do not do any more synthetic organic chemistry after undergraduate,
but at that time I thought that would be an interesting type of research
to explore. I’m also spending lots and lots of time looking for errors
in my code.
I want to just make one point to you as many of you are undergraduates.
One of the most valuable experiences that you can gain as an
undergraduate is to work in a lab. To work in a lab, to work with
graduate students, to work with other graduate students, to work with
postdoctoral researchers and your mentor, your PI mentor, will allow you
to see what research is really like. You know, it’s hard.
Sometimes you spend a lot of time working on a project and it doesn’t go
anywhere. There’s a lot of false starts. This is one of the most
valuable real-life learning experiences that you can have, and I
encourage everybody to take at least one semester and do research in a
laboratory. And, obviously, I am no longer a closet geek; I am an actual
geek at the University of Michigan.
I am known mostly in the chemistry and math department, but I do have a
lot of work that I do in coding as well. And what do computers look like
for me when I’m an undergraduate? This is one of the computers that I
worked on. It’s not my actual computer because I didn’t take that with
me. This is an IBM PS/2, and you can see that you can actually play
chess on this computer.
This is The Ohio State University, a big competitor to Michigan, by the
way. This is the supercomputing center in 1987. They are a powerhouse of
supercomputing. They are not the only ones, but I knew them well. And
this is the birth of programming languages. While you’ve probably heard
of Fortran, that was my primary language when I was coding in the late
’80s. C++, certainly, but PERL and these more interpretive and dynamic
languages really start developing in the late ’80s. What’s the
computational highlight from the year I graduated from college, which
was 1987? It is another computer simulation.
This is the diffusion of a substrate in an active site in an enzyme. And
this particular system is superoxide dismutase. And what I wanted to
show you is that unlike the last simulation, which is the dynamics in a
trajectory sort of way, these are more stochastic Brownian dynamic
simulations, and what was really super cool about Kim Sharp and Barry
Honig and Robert Fine’s work is that they actually put the charges in
the active site of the enzyme into the calculation.
And having the ability to have molecules have a charge gives you an
electrostatic [unintelligible] for what’s really happening in that
active site. And to me this was just a super cool simulation. I love the
work of Barry Honig. I’ve followed him for years, and I have watched the
field of electrostatic calculations go from point charges to probability
charges to all sorts of really innovative work, and so I just wanted to
share with you that one particular highlight.
Moving to a new decade—1990s. The top song in 1995 is “Gangsta’s
Paradise” by Coolio, featuring L.V., and the top movie, which I did see,
is Batman Forever. All those Batman movies are so great. And I am at the
University of Maryland. And, obviously, I have never left this area. I
am still living in Silver Spring today.
I am defending my PhD thesis in 1995. Just a side note. I took two years
off between my undergraduate and my graduate studies, and I worked at
the Naval Research Laboratory, where I was involved in the physical
characterization of organic molecules used for blood surrogates. And it
was a really wonderful experience because I got to see what it was like
to work in a very large team at the Naval Research Laboratory, and I got
to become much more proficient at NMR spectroscopy and IR spectroscopy
and Raman spectroscopy, and I so loved Raman and IR spectroscopy that
you’ll see it popping up in my future.
You see this character here on the giant steps. That’s my PhD thesis
advisor. That’s Millard Alexander. He’s still at the University of
Maryland. I think he might be emeritus at this point. But what did we
So I studied flux in reactive systems—systems like boron hydride—and I
studied what happens in those systems when the potential energy that
describes different excitation states cross and how do you actually
calculate curve crossing or reactions? That’s really the story of flux.
I developed a new genetic algorithm, which is a pretty cool algorithm,
for optimization of structures that have multiple potential energy
surfaces, PESs, and obviously I’m not in computational biology. I am a
serious homebrewer, and I got married to my colleague in physics.
And this is a later picture, but that is myself and my husband, Nicholas
Phillips. When I was a graduate student, I wanted to change careers. I
wanted to think differently about computation and what we can do with
our careers, so I changed from physics to computational biology.
Here’s what computers looked like in the 1990s. This is actually a
computer that I did most of my PhD work on. It’s an Apple Macintosh. I
was so lucky to watch the birth of Mozilla, Netscape, and a little
blurry for you is the HTML language that most of you probably know how
to program in and you’re very, very efficient in. But when I was in grad
school this was completely new—and so was this.
At one point, a list every day came out, a new website, and there was a
list of the top websites that had come out that day. And the first
webcam, that’s the coffee pot at Cambridge, where I actually visited and
did some work as a grad student. There it is.
You could see the level of the coffee pot at any particular time and you
would know when you could go down and get some new coffee. And here I’m
going to play for you in the way-back machine the sound that I will
never forget [dial-up handshake]. There it is. And that horrible sound
goes on and on. That is how we had to connect to the internet. That is
my dial-up modem.
So I had to sit at home and timeshare the one computer in our grad
school house, dial up to the internet, and do our work. And most of us
actually played games, and we had to have a lot of time in order to do
our work and play games. So you guys have such a wonderful
experience—always connected, always on—but for us, that was the sound
that we heard hour after hour throughout the night.
Here’s something that was super exciting when I was early in my graduate
school days. This is BLAST—Basic Local Alignment Search Tool—developed
by a number of colleagues, including David Lipman. David Lipman is still
at NCBI and NLM here at NIH.
This was a new approach to rapidly do sequence comparison of different
sequences by doing a basic alignment. And you would get a score, and
that would tell you, for example, where the gaps were, where the
insertions were. This particular algorithm has revolutionized the way we
do comparative genomics, and now you can do PSI-BLAST and mpiBLAST,
and there’s just so much work that’s happened. But yet I bet most
of you have used BLAST or one of its child prodigies in your own
research, and it was just remarkable.
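To make the idea of a scored alignment with gaps concrete, here is a minimal sketch. It is not BLAST itself, which gets its speed from a seed-and-extend heuristic. It uses the Smith-Waterman dynamic-programming method, the exact local alignment that BLAST approximates. The function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table: each cell holds the best score of any local
    # alignment ending at a[i-1] and b[j-1]; scores never drop below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GATTACA", "GATACA"))
# (10, 'GATTACA', 'GA-TACA')
```

The dash in the second string marks the gap: six matching letters score 12, and the single gap costs 2, for a total of 10.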
And this is really one of the reasons that I got inspired to think about
bioinformatics and data science, because I started to realize when I was
in physics that the world of data and the world of biology and the world
of computing were the next big thing, and I think that you might agree
that that’s actually true.
In the years since 1995, I have traveled to Israel for a postdoctoral
fellowship in computational biology. I was a professor of computational
biology at a university, University of Maryland Baltimore County, for a
number of years, and one of the projects that I worked on was this super
large protein complex called GroEL-GroES, that is a protein chaperone
complex, and it’s huge. It’s 14 subunits, but you can’t see it all. It’s
all together as a big complex. Each subunit is 58 kilodaltons. I
couldn’t even load that complex into memory in my computer when I was
working. I had to do very large parallel processing on supercomputers to
just do the calculations for how the GroEL-GroES chaperone complex and
the proteins that are inside that are in blue actually work.
I switched. I became a program director for the Department of Energy,
and I focused fully and totally on data—data platforms, data
computing—for energy and the environment, very particularly on
bioenergy, translating poplar and other types of soft woody plants into
bioenergy complexes. I decided to make that career change because I
wanted to have a bigger impact for a larger amount of science, and I
truly, truly am dedicated to data science.
I was a division director at NIGMS and I worked for Dr. Jon Lorsch, and
I was the director of Biophysics, Biomedical Technology, and
Computational Biosciences, and I really wanted to think about how we can
change the landscape for technology, incorporating much more new and
innovative technology as well as new ideas for team science.
And now I am the associate director for data science, where I am working
across NIH and across the community to make data, data resources,
findable, accessible, interoperable, and reusable. And I also am the
mother of two fantastic young adults, Andrew Phillips, who is a junior
in college studying, of all things, organic chemistry, and my daughter,
Abigail Phillips, who is finishing high school and hoping someday to
have a career in dance.
And I still brew beer. Almost every month I have another five-gallon
carboy of beer brewing.
And here we are today.
You have data at your fingertips, and you have wonderful platforms to
access and use that data. You’re always connected and you’re always on,
and that’s a wonderful thing. And maybe it’s a curse too, but it’s so
nice to never have to listen to that dial-up sound. You have
supercomputers like we’ve never seen before that can really address
problems of great complexity.
The problem I showed you, the GroEL-GroES chaperone complex, could easily
be handled today without any special workarounds with massively parallel
computing. You have R and R Shiny, and you write in codelets that you
can match right onto the bare metal with Kubernetes.
And you can package up your code into dockers and containers and move it
around to different cloud resources. And you’re working in a community.
You have GitHub. You share your software.
This is just an example of Jupyter, but there’s such a great
software-sharing community that’s available to you. So how can we use
all these tools that we have at our hands today to address a pandemic
that’s significant? How can we partner with industry for workflows and
tools and analysis? And how can we provide you the resources so that you
can get your work done?
I want to just give you three or four use cases of what we’re doing
right now at NIH that you can use to study COVID-19. And this is an
amazing story of two intramural researchers—one of them from NIDDK and
the other from NCI—so NCI is National Cancer Institute, and NIDDK is
National Institute of Diabetes and Digestive and Kidney Diseases.
And they, in three weeks, collected specimens from pathology, created
the digital images of those specimens, de-identified them, partnered
with a company called HALO, and put those whole-slide images up for you
to use for reference so that you can study and understand COVID-19.
Right now we have much more than eight reference cases because our two
intramural researchers are getting more and more samples every day from
hospitals from different countries, so I think we’re up to 19 reference
cases, but there’s more coming in every day.
And we’re going to integrate this particular resource into a much larger
resource in the near future, but just right now you can go and do some
limited artificial intelligence algorithm development on these
resources. And we’re partnering with the gaming and the video company to
create processing workflows for CT images. CT has been one of the types
of images that you can use to detect COVID-19 in patients, and so we’re
developing those workflows by using and leveraging gaming computers.
This is a very nice artificial intelligence classification.
And we are providing high-performance computing resources to the federal
government, to industry, to academic leaders around the world so that
you can use resources from the national labs, resources from IBM, from
Google, from AWS. Over 4 million CPU cores are available. The consortium
is taking applications every day. So if you have an idea that you think
would benefit from high-performance computing, this consortium is there.
The resources are free for you.
We’ve come a long way since those days of punch cards and 25 picosecond
dynamic simulations of tiny, tiny proteins, and I’m just wondering where
you, our new and brightest generation of scientists, will take us in the
future.
And with that, I would love to hear from you your questions, your
comments, and your thoughts. And I’m going to turn it back over to Ming.
Dr. Ming Lei:
Thank you. Thank you so much, Susan. I will say that computers, beer,
and a lovely family make for a very exciting life. So as I mentioned
earlier, we are going to have questions. I will ask the first one on
behalf of our audience. So for a biology major interested in a research
career, what would be the key computational and data science training or
skills that the student should pick up while he/she is in school?
Dr. Susan Gregurick:
That is a great question, Ming. I would say that there are a few common
ways in which biology is coming to look at data and look at studies that
you can start to take classes in now, and that would include getting
familiar with the programming language R, because quite a few software
tools are written for and in R. But if that’s a bit of a barrier, there
are also tools such as Galaxy, which are a little bit more plug and
play, and so using the tools available in Galaxy or Jupyter, you can
have a lot of different types of computational software like BLAST and
So getting familiar with those platforms and learning to use those tools
and understand what the results mean for your research would be a great
step forward. And Coursera is offering many different types of
computational classes available for students.
And I think NIH has offerings to make Coursera computational data
science classes freely available for NIH students, so we would be more
than happy to point you to those resources.
Dr. Ming Lei:
Great, great. There is actually a question from one of the students.
Where would I go to apply for access to computer resources?
Dr. Susan Gregurick:
From the HPC Consortium. There is a website, and the application is
processed through NSF, through a program called XSEDE. NSF will route
your application to the consortium, and the proposal is very
lightweight. It’s only, I believe, three pages, so you can certainly
easily apply for those resources, and then they will match the resource
needs to the application that you put in, so you can have access to many
different types of resources.
Dr. Ming Lei:
Related to that, NIH has training opportunities and resources available
as well, right?
Dr. Susan Gregurick:
Absolutely. There are a number of different training opportunities that
I did prepare as an extra slide, including our SRA metadata cloud,
BigQuery, and NIAID bioinformatics training resources. All of the
resources that I told you about today can be found on our website,
including the high-performance computing application.
And then there’s a number of training opportunities that we will be
having available, including if you really want to do computing on bare
metal, there’s a Kubernetes engine two-day course coming up later this
month. There are a number of other opportunities in the works that could
be either working with Google, GCP, or AWS. Some new opportunities for
machine learning as well as data engineering later in July.
Dr. Ming Lei:
Great. Another question is more about your own scientific journey. How
did you decide to change your field, and how did you update yourself
with the new field?
Dr. Susan Gregurick:
That is a great question. And it’s sort of a funny story. I was studying
physics, mostly in surface and gas-phase physics. And the funding was
starting to change when I was a grad student from that physics/Silicon
Valley type of funding much more into bioinformatics, and my PhD thesis
advisor said, “There are a few opportunities in your life when you can
do a career change, and from graduate school to postdoc is one of them.
If you want to make a change to computational biology,” because he saw
all the articles I was reading, “now is the time you need to do it.”
So I wrote to a number of people to get specifically training from
people who were prior physicists who had moved to computational biology,
and that is how I chose my postdoc was by working with somebody who had
also been a physicist so that we would have some common language. It was
a hard change.
I had taken very, very little biology classes when I was an undergrad,
and obviously no biology classes when I was a grad student, so I had a
huge lift to retrain myself. I was lucky that my postdoctoral advisor
was very patient with me as I did have to take additional training and
coursework in biology in particular.
And I will be the first to admit that I do not have the strength and
background that many of my colleagues at NIGMS have in biology, and I
often have to look to them for understanding about the meaning of the
systems that I’m trying to study in much more complex detail than I
Biology is so complicated, but it’s also so fascinating.
Dr. Ming Lei:
Great. This follow-up question is from a different angle. Do you have
advice for postdocs not classically trained in data computational
science wanting to transition into the field?
Dr. Susan Gregurick:
Yes, absolutely. I would take the similar thought that working with an
advisor or doing a one-year sabbatical as a junior assistant professor
with a colleague who has that training in wet lab experimentation but
has also made a transition to computation will help you a lot.
So you might need to take an additional year of postdoc or sabbatical to
train in computational sciences but working hands-on in the lab with
other people in the computational field will give you a lot of insight.
I also took apart a lot of code to learn how it worked, and that is a
good way to learn how something works is to take it apart and then try
to learn how to put it back together again.
Dr. Ming Lei:
OK, this is a closely related one. What are the computational
bioinformatics opportunities as a prospective postdoc at NIH?
Dr. Susan Gregurick:
There are a number of computational fellowships that one can apply for.
There’s also a lot of funding for new investigators in computational
data science, and you happen to be looking at the institute that has,
I’d say, the largest amount of computational and data science funding
opportunities, NIGMS, and so working with them to get funding in one of
their programs is absolutely a wonderful opportunity.
Dr. Ming Lei:
Another one related to this, what level of math and statistics would you
need to be able to take advantage of the bioinformatic tools you
Dr. Susan Gregurick:
I would say that having a good basic understanding of mathematics and
statistics will always help you. In fact, when I was looking at majors
when I was in college, I was thinking of double majoring in computer
science and one of my colleagues told me that it’s much better to major
in math because math is the foundation of most computer science. And
that’s true, I see that now.
So having a strong mathematics background can never do you wrong. But if
there’s a little barrier, then having a good foundation for statistics
will definitely be a very important tool to have in your toolbox.
Dr. Ming Lei:
Another one, what programming language will be suitable to understand
Dr. Susan Gregurick:
I have so many favorites, but probably they’re a little old and outdated
now, and what I see is that people find R and R Shiny to be very useful,
and many of our professional PIs are writing their programs in R. So if
I had to pick one, it would be R, but if you ask me what my favorite
programming language is, it’s actually PERL. I loved PERL so much. I did
not like Java very much and I certainly didn’t like many of the threaded
languages, but I just absolutely loved PERL, but I don’t think that’s
very useful. I think R is probably going to be your best bet.
Dr. Ming Lei:
Great. What would be your advice with gaining computational skills you
want to incorporate into research rather than enter the field as a
whole? What would be the best way for an undergrad to approach a
Dr. Susan Gregurick:
So you want to approach a mentor and gain experience? I’m trying to
understand how to parse that.
Dr. Ming Lei:
The first part is are there ways to gain computational skills you want
to incorporate into research but not really want to become a
card-carrying data scientist.
Dr. Susan Gregurick:
I would say learning some of the more popular software tools, like
BLAST, for example, is a great tool. Just learning how to use it and
what those results mean for your own research would probably do you very
well. So you would never have to write any or much code at all using
existing software, but it will really help you if you sort of know the
basics and know the results and know the foundation of some of those
more popular tools.
Dr. Ming Lei:
OK, here is a specific one, which tools would you recommend for cryo-EM
image processing to determine protein structures?
Dr. Susan Gregurick:
I am not an expert in that, but I think there are some tools, something
like Cryolan is one tool that I’ve heard, and I believe that’s been
forwarded to the cloud and actually I believe that NVIDIA worked on that
as well for cryo-EM. There are probably other more popular and better
tools; that’s just one that I know about because of the partnership with
Dr. Ming Lei:
OK, there is also a pretty specific question that is does NIH have
open-resource for services such as sequencing samples from patients?
Dr. Susan Gregurick:
Absolutely. And this may or may not be available to the open community,
but our institute NHGRI, our National Human Genome Research Institute,
does do sequencing on patients, particularly also right now for COVID-19
as well as we have a national lab in Frederick, Maryland—Frederick
National Lab—which is doing sequencing on COVID-19 patients, as well as
developing serology testing and analyzing that data. I see that
CryoSPARC is another popular cryo-EM data processing tool.
Thank you so much. That must be Mary Ann Wu who has mentioned that. So
thank you very much. CryoSPARC is coming up as another popular tool.
Dr. Ming Lei:
So let’s go for another question. Given that sectors such as banking,
insurance, often offer much higher salaries to a student with that kind
of computational data science training, what would you tell those
students so that they would consider biomedical research as a rewarding
and viable career choice?
Dr. Susan Gregurick:
That’s a great question because it’s always on my mind as well. I would
say that being an investigator and a researcher in computational biology
and studying and understanding biology is rewarding for a number of
The flexibility that you have in your career and your career choices and
the types of work that you do, those are up to you. You make the
decisions and you are the captain of your ship, and you make the
contributions to science, unlike in the private industry where the
captain of the ship is the CEO and the board of directors, and they make
a lot of the decisions and you are implementing.
Here, when you are a researcher in an academic setting, you are the one
who is discovering and pushing the field forward. And if that passion
for understanding, addressing questions, using your skills in computer
science or in the wet lab drives you, you will stay up day and night to
do it. You will find that the passion you have for research will not be
quenched by any lack of money that you may not have by not having moved
Dr. Ming Lei:
What are some of the big issues you are working on as the NIH associate
director for data science?
Dr. Susan Gregurick:
Right now the biggest issue we’re working on is with respect to
COVID-19, and that is that we have to very rapidly create and move an
infrastructure to get the data and the information to scientists in such
a way that they can use their algorithms to answer really important
Data science requires data, but it requires data to be well formatted,
to be well curated, to be annotated, to be in a common model so that we
can look across many different organizations, and that’s what we’re
working on right now. And we’re spending all of our days, most of our
nights, and even our weekends—and not just me, many people at NIH—to
move the data into a way that researchers can use it right now.
Dr. Ming Lei:
Which language do you think is best to start learning if she does not
have any knowledge of programming language prior to that?
Dr. Susan Gregurick:
I think the best one to begin with is still probably working in R. I
learned Fortran—they don’t even teach that anymore—in college. C++
underlies many of the programming languages that are used, so that’s
always a good language to learn, especially if you want to be a
heavy-hitting computer science person.
But if you’re looking to pick something up and be pretty proficient
quickly, I do recommend looking at R.
Dr. Ming Lei:
Is there a specific platform that is better to take computer science
courses online, like Coursera or Udemy? I’m sorry if I botched the
names. I’m not familiar with them. Is one better than the other?
I’m much more familiar with Coursera, and we have developed a
partnership with them so that we can provide training for a large number
of colleagues, so that is the one that I personally know the best and
would recommend, but there probably are others. My son is very fond of
Khan Academy, and he’s been taking a lot of courses, even when he was in
high school, through Khan Academy.
Here’s one question that requires some physician training, Susan. With
the transition from in-person to online, what would you recommend for
preventing your eyes from tiring due to staring at screens for a long
time?
I don’t know if I’m qualified to say or not, but my strategy is to take
lots of micro breaks, because I can certainly understand what you’re
saying in terms of eye strain. And also sitting down all day is not so
good either, so my personal recommendation, and I’m not a physician at
all, I’m a computer person, is to take micro breaks.
I think you have a brewer to take care of, right?
I do, yes.
Does NIH work with the big tech companies?
Indeed. Yes, we do. We have partnerships through our STRIDES program
with Google and AWS. We partner with Palantir, which is a very large
analytics platform. We partner with NVIDIA, which is a gaming chip
developing company. We partner with smaller companies. I don’t know if
Halo is super small, but that’s the platform that we put the website up
on. So we do partner with a number of tech companies.
We’ve talked to a number of folks who are in the AI space to look at
partnerships. We partner with the national labs and with other agencies,
such as NSF. We’re looking to partner with sister agencies such as the
VA. That’s how science moves forward: by working together. Each
partnership offers strengths, and we have a strength too. We don’t
duplicate each other’s work; we partner and together we move science
forward.
Do you recommend any data science bootcamps for more structured
I have to say that I have a colleague who is in my office, her name is
Allissa Dillman, and she runs a number of codeathons and bootcamps, and
so I would love to encourage you to take one of her bootcamps. And in
order to see which one is running, you have to go to my website, and I
just now see that we did not put it up there. But if you go to the
datascience.nih.gov website, you’ll be able to find the bootcamps that we’re
I’ve done a number of jamborees and bootcamps in my past, and I’ve
always loved the ones that focused on writing analytic tools for
sequence analysis and metabolic pathway analysis. Those are my personal
favorites, but she runs bootcamps on sequence analysis; she runs
bootcamps on understanding electronic healthcare record data. She runs
so many very different types of bootcamps. But I would say that
attending one of her bootcamps would probably be a lot of fun. She’s
young and much more in tune with where computer science is going than
me. I haven’t coded in more than, I don’t know, 10 years now, I think.
All right. I’m interested in learning Python. Do you have any advice on
how I should learn?
I only have some experience with working on R.
Yeah, Python. I can just tell you my strategy for how I learned was to
get code, take it apart, and then work with it. Put in new subroutines,
new algorithms, and see if I could get it to do something new. That’s
how I helped my son learn programming, so I would suggest if you’re
interested in Python, get some code written in Python from GitHub and
see if you can play around with it.
There are great books by O’Reilly on understanding computer code at a
little easier level, and I would also get one of the O’Reilly books. The
Python book is particularly fun. We have that at our house.
I’m interested in bioinformatics with a biology background. I don’t have
any physics background. If I want to know more about physics, where
should I start?
That’s a great question. There are a lot of primers that you can get to
understand some of the underlying physics behind the bioinformatics.
Sometimes it’s just helpful to take a paper that you’re interested in
and read some of the references or some of the underlying methodology.
So you find a paper that you’re interested in and you see some
methodology, then go back to your textbooks and learn a little bit more
from the methodology that’s in the paper. Or you could always take a
class in physics, although they tend to be not completely relevant to
the paper that you’re reading. So that would be my suggestion.
OK, here is one that is more current. What type of information is
available in association with NIH COVID-19 samples? For example, is
there specific phenotypic information, like GI or cardiovascular
symptoms and the severity? Or medications that patients were on prior to
infection, such as ACE inhibitors? Is there proteomic or RNA sequence
data associated with histological samples you mentioned?
That is a great question, because COVID-19 is such a hydra of a disease.
It’s been hard for us to get our hands around it, so we’re looking at
understanding and making available some of the very basic underlying
electronic healthcare record data that will tell you about medications
and about prior conditions, but in a de-identified way so that you wouldn’t be
able to trace it back to a particular individual, but you could look at
correlations from what is presented in the patient who has COVID-19 when
they enter the hospital with respect to what they have taken in terms of
drugs or in terms of prior conditions.
In terms of proteomics and sequences, we have much less data on that.
It’s hard to get those data. The healthcare system tends to be a little
bit taxed, and so right now getting proteomic samples has been more
challenging and we are just now getting sequencing samples from COVID-19
Putting all that information together is our grand challenge at this
point. We think we can make some of the data available. As you can tell,
it’s coming in a staged way because we have the pathology images
available right now.
We don’t even have the CT images available for researchers. They are in
the queue. They need to be de-identified. They need to be associated
with the appropriate standards and metadata so that you can use them. So
even getting those CT images is taking a long time.
Getting the other data like electronic healthcare record data
de-identified, we hope we can get that done by this summer, but it’s
going to take some time. And the sequencing data, that might be even
longer, so you can see the struggles that we have just to get the data
out for researchers to use.
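As a rough illustration of the de-identification step described above, here is a minimal Python sketch. It is not NIH's process: the field names (`patient_id`, `name`, `address`, `phone`), the salt, and the `deidentify` helper are all hypothetical, and real de-identification of healthcare records involves much more than this (dates, locations, free text, and so on). It only shows the basic idea of removing direct identifiers while keeping the medications and prior conditions that a researcher would want to correlate.

```python
import hashlib

# Hypothetical names of fields that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def deidentify(record: dict, salt: str) -> dict:
    """Return a copy of `record` without direct identifiers.

    The original patient ID is replaced by a salted hash so that rows
    belonging to the same patient can still be linked to each other.
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    raw = (salt + str(record["patient_id"])).encode("utf-8")
    cleaned["pseudo_id"] = hashlib.sha256(raw).hexdigest()[:16]
    return cleaned


if __name__ == "__main__":
    example = {
        "patient_id": 123,
        "name": "Example Person",
        "medications": ["ACE inhibitor"],
        "prior_conditions": ["hypertension"],
    }
    print(deidentify(example, salt="keep-this-secret"))
```

The salt has to stay private; otherwise someone who knows the original IDs could recompute the hashes and undo the pseudonymization.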
Great. What do you think of current, state-of-the-art research on
protein structural prediction?
I do have a favorite, and I’ve been involved in protein structure
determination and prediction for a while. In terms of determining the
structures, certainly X-ray scattering was a popular way to determine
structure for many, many years. I certainly worked in X-ray structure as
well as neutron scattering, which is not as refined as X-ray.
Now we see cryo-EM blossoming into a real serious research tool for
actual atom-specific structures. In comparison, also in protein
structure prediction, there was the—I don’t know if you’re familiar with
CASP, Critical Assessment of Structure Prediction, competition that was
run every two years. So I don’t know what number we’re up to now, but
when I was working on it, people were doing homology modeling, so taking
a standard and trying to align an unknown sequence to that standard.
They were working on threading. I did a lot of threading. I did a lot of
genetic algorithm, protein structure predictions, some molecular
dynamics. And then there was the work by David Baker which looked at
little tiny windows of protein and mapping them onto existing
structures. And that approach seems to have been quite successful.
I think the field is still moving in that direction of micro-threading.
I cannot believe I forgot Rosetta. Rosetta, that was his program. I
think the field really pushed forward with his revolutionary work in
Rosetta, and now I imagine what’s happening is much more looking at
artificial intelligence to gain information about higher structures to
even move further into what those new structures might be.
So out of initial protein structure prediction, I think the door really
opened up with David Baker’s work, but prior to that there was an awful
lot of BLAST-type based algorithms.
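The homology-modeling idea mentioned above, aligning an unknown sequence to a known standard, rests on sequence alignment. The sketch below is a small Needleman-Wunsch global alignment in Python. It is only an illustration, not the code used in CASP, Rosetta, or BLAST, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions; real tools use substitution matrices and more careful gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = global_alignment("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

In homology modeling the aligned positions would then be used to map the unknown sequence onto the known structure; that step is not shown here.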
As we move closer to the end of the hour, the questions are getting more
futuristic. Here is one. Do you think physically writing code will be
less important in 5 to 10 years when you can use platforms like Galaxy
for basic and translational biological research?
Actually, I think you’re kind of right. I think that people are
producing codelets, little micro bits of code that can be swapped in and
out in a modular way. And so my old way of taking a giant code (I had
to work on CHARMM, which is fairly huge) and trying to add subroutines to
it, will change to microcoding codelets where you just swap out little
So that’s the idea of Galaxy, and platform-based coding is probably
going to be much more standard for many, many folks in the future. I
think that computer science is moving in really interesting and fun
directions, and I look forward to watching what you guys do.
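The idea of swapping small pieces of code in and out can be sketched in a few lines of Python. This is not Galaxy's API; the `run_pipeline` helper and the step functions are made up for illustration. Each step is a small function that takes a sequence string and returns a new one, so a workflow is just a list of steps that can be rearranged or replaced.

```python
from typing import Callable, Iterable

# A "codelet" here is any function that maps a sequence string to a new one.
Step = Callable[[str], str]


def run_pipeline(sequence: str, steps: Iterable[Step]) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        sequence = step(sequence)
    return sequence


def clean(seq: str) -> str:
    """Remove whitespace and upper-case the sequence."""
    return "".join(seq.split()).upper()


def transcribe(seq: str) -> str:
    """DNA to RNA: replace T with U."""
    return seq.replace("T", "U")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


if __name__ == "__main__":
    # Swap one step for another without touching the rest of the workflow.
    print(run_pipeline(" acg t\n", [clean, transcribe]))
    print(run_pipeline("aacg", [clean, reverse_complement]))
```

Because every step has the same shape, adding a new analysis means writing one more small function rather than editing a large program.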
Good. Here’s a question. Are there any online training courses that
include the biophysics branch of bioinformatics?
I would think so, but off the top of my head I don’t have those online
courses—although I do know that through NIGMS we have funded a number of
big data online training courses. Through the societies there are
definitely training courses, so the Biophysical Society would be a great
place to look for those online training courses in biophysics.
Are there more questions? I’ll wait a little bit. Going once…going
Thank you so much, Susan. This was a fantastic hour. I hope everybody
Thank you so much. It’s been a real pleasure to tell you about my
personal journey and data science/computational science and where we are
now with COVID-19. And I hope that you will take the opportunity to look
at the online training resources that are available and also look at our
website, and do participate in any one of these training opportunities
offered through our STRIDES partnerships with AWS and Google, or through
our NCBI courses and webinars and through the NIAID bioinformatics
All right, thank you all. Stay safe and be well.