Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US. Hello and welcome to the MicroBinfie podcast. I'm your host today, and we're going to talk about SARS-CoV-2 again. Do you miss us talking about it yet? We will be interviewing two rising stars at CDC, Lingzi Xiaoli and Jill. Lingzi's day job is in the Pertussis Lab at CDC, working on the whole genome sequencing process in both the wet and dry lab, and she's got a microbiology background. And Jill's day job is in the Waterborne Disease Prevention Branch at CDC, working on the comparative genomics of Cryptosporidium. So welcome to you both. Thank you. Hello. Yeah, thanks for having us. OK, so there are a lot of bioinformatics projects around SARS-CoV-2, around assembly, characterization, and phylogenomics. And what we've been seeing over the past year and a half is that a lot of the basics aren't there and a lot of basic benchmarking just isn't being done, probably because people are firefighting all the time and trying to keep up with the tidal wave of data coming through. And I know in the US you guys have scaled up from about 50,000 genomes in December to well over a million now, which is just mind-boggling how quickly you've gone. So I was wondering, you've got this new benchmark data set out for SARS-CoV-2 genomes, which is much needed. So Lingzi, could you tell us about the SARS-CoV-2 benchmarking data set, how it came about and where it came from? We started the project this April because we did a needs assessment with our state partners, and they said they would like to have a benchmark data set which they could use for building their QC metrics as well as validating their bioinformatics pipelines. So when Jill, me, and Lee were deployed to TOAST, we began this project, and we hoped to design different data sets which could meet different needs, for labs at different stages of development, because there are some labs which have just started sequencing, while others are more advanced. So we hope these data sets can fit the needs of different users. Have you guys been using it internally? Yes, actually we did. We used the VOI/VOC and the non-VOI/VOC data sets for comparing different available bioinformatics pipelines. And as far as I know, there are some other groups internally at CDC that are using the data sets as well. So I've noticed in the US that there are quite a lot of competing methods and bioinformatics pipelines, which is great. You know, we always do need different methods, because some will obviously not work as expected. But I was wondering, have you got your partners, say within SPHERES, to use these benchmarking data sets yet, or is that coming soon?
So I'll jump in on this too, just because I'm part of the project, and say that I put in a little request on our various Slack message boards to see who has used it externally yet. I did get a nice response from Todd Treangen, I hope I'm pronouncing his name right, from the Harvest suite, and they said that they're starting to use it internally for theirs, and I'm looking forward to what comes out of that. That's awesome. It is much needed. I know I reviewed a paper recently, for a journal I won't mention, where people had used one of the data sets in your benchmark data set for benchmarking their new process. I think it was to look for intra-host variation or something silly like that, which we can't really do with ARTIC data, but there you go. OK, so I remember previously that we, myself, Nabil and yourself, Lee, wrote a paper benchmarking MLST callers. Do you remember that paper many years ago? It's like a lifetime ago, pre-COVID. Yeah, and we used some of the Gen-FS data. So is this new data set anything like the Gen-FS one? Yeah, this whole project is actually based on that original Gen-FS paper in PeerJ. And just a reminder, or a refresher for people who might not have heard of it, it was work with Ruth Timme as the lead author, and she's a previous guest on the show also, through the Gen-FS collaboration, which is where my home branch at CDC collaborates with the FDA's CFSAN, with FSIS and other agencies under USDA, and also NCBI. This is our central place where we talk about foodborne diseases and can collaborate over that kind of stuff, over the technical details. And we came up with a set of benchmarking data sets, so good memory. We came up with E. coli, Salmonella, Listeria, and Campylobacter data sets, and an in silico Salmonella data set. And it's just like, here guys, here are some simple outbreaks; you can run your bioinformatics pipelines on these and you should be able to recreate these phylogenies. And people have cited the paper and used it. This is based on the same spreadsheet format and the same downloading script, but we're presenting entirely new data sets. So just on that, there are a number of different data sets in there. What was the logic that you used to select these ones in particular? There's one in there called the Boston outbreak. I'm just curious why that one, for instance. So we selected the Boston outbreak data set to help labs understand the transmission of the virus in an outbreak setting. We specifically selected 63 samples to try to represent three introductions. And you will have both the completed genomes as well as the raw data and the phylogenetic tree for this data set, so any users who use this data set will be able to see the three independent introductions in the phylogenetic tree. Yeah, it's quite interesting. There's a wide variety of data in there. There's one for a cohort of samples, the same samples being run with different sequencing platforms and different wet lab approaches. There's a group that seems to be representative of particular variants of interest or variants of concern, and then another that are not variants of concern. And then you have, interestingly, a bunch of failed QC samples as well. So that's quite a different set there. I mean, who would be the intended audience, and who has been using these so far?
Well, I think actually the failed QC ones are probably the most interesting, you know, from our point of view. I know, Nabil, you're looking at QC of coronavirus genomes, and it's just this kind of Wild West at the moment where no one has really defined QC metrics properly for coronavirus genomes. It's just, you know, if it's 90 percent complete it gets into GISAID, that's great. But what about all the rest of the genomes? Yeah, I was going to say, I think there are a lot of people having those problems. I think it comes up a lot in many of the conversations that we had, right, with the states, of what do we do with all these genomes that are backed up, that you can't get submitted for one reason or another. So I think it's been an ongoing conversation and something that a lot of people have had problems with. Yeah. But it does seem like every bioinformatics pipeline kind of has its own QC metrics that it outputs. And then people have a certain pipeline, and then they pull off bits and pieces and add their own stuff for what they're interested in. So I think everybody probably has their own individual QC things, and I'm not sure that there's a complete standard as to what is required. Absolutely not. I mean, no one agrees on anything, and actually people complain. I know we've had complaints about data we've put into the ENA, into the INSDC, because people are like, oh no, you know, you've produced these tiny little genomes or whatever, and they're useless. But then our response is, well, these represent real world samples. You know, maybe they've come from one person sampled repeatedly over a few months, and this is what it actually looks like in the real world. And so what if it's a bad quality genome? It's still useful. It will have failed QC if you're just trying to build a pretty tree. But you can still answer many questions with these marginal samples as long as they're not contaminated. That's a whole different ballgame. So in terms of contamination, do you guys focus on that at all with your data sets? Yes, we have different samples. One example is host contamination. We have certain samples with over 65 percent host reads, and there are also failed negative controls, which are supposed to have no human reads at all, but you are able to see them in those water samples or non-template controls. So in our raw data, we often get a lot of human material. In fact, sometimes we get so much human material that we can guess the biological sex of the person who donated the sample, which is quite worrying, and we've actually used that for QC where we weren't exactly certain if samples were cherry-picked correctly, you know, like on a plate, and maybe stuff got mixed up. But I wouldn't necessarily fail something if there's too much human in there, because you might just get a load of human but also a load of SARS-CoV-2 reads. It might just be low, or it might be a high-CT sample.
But ideally, if you're running a pipeline that's producing data for publication, you don't want to see any human reads at the end. So that would be, here's a mixed sample, a lot of endogenous material plus COVID, and then you see if your tool works. And I think that's the nice thing about these datasets: certainly not all of them will be interesting to everybody, but at least one of them should be interesting to everyone. So how is it actually structured, like how do you retrieve data from it? It's quite a lengthy README. I mean, very briefly, what are you actually doing to fetch data? So we have put all the assemblies and the raw reads into a TSV file, so you can download the TSV file first and then use the Perl script that Lee has developed to download them all at one time. So simply by using the script, you will be able to get the raw reads as well as the assemblies or complete genomes at the same time. Okay, that's great. And yeah, in these tables it seems like there are the short read accessions, checksums for checking the files, and there's even a link to the paper and the tree related to that dataset, so that's pretty comprehensive. Does it also include the consensus sequences somewhere? Yes, they do, except for the VOI/VOC dataset. For that dataset, because we did data mining for the important lineages, we only have the SRA accession numbers for them, but we were able to link them with the GISAID assemblies. We didn't list the GISAID assembly accession numbers in the TSV itself, but you will be able to use the strain name to search for the GISAID assembly in the database. Jill, I know that you did a lot of work in figuring out how to get the right reference genome and which one was the right representative for each of the lineages. Do you want to go over how you did that? Sure, yeah. So this was kind of an iterative process with Lingzi, me, and Adrian and some other TOAST members, but I guess the first step was that we downloaded upwards of 380,000 sequences. And originally, right, we had talked and were like, oh, we'll just pick a few from each lineage and run them through QC and we'll find them really quickly. But then, as it turned out, in my brain I actually overrode you, and it was like, I'm going to do 30, because I feel like a few is not enough. And then we ended up doing that several times, and at the end I was like, that's it, we're just running all the sequences, because I'm tired of rerunning this. So if you have enough compute power, just brute force it at some point. Yeah, so the process we went through was we got a set of sequences and then ran a newer version of Pangolin on them to see: did the lineage still line up, or had things shifted too much? We didn't want to add something that was going to keep shifting depending on which Pangolin version you used. But obviously there's some amount of that you can't get around at all. So one does the best they can, which is why we ended up putting in a cutoff date of, was it May 5th? No, May 30th. But it was another thing that kept changing, right? There keep being new variants, and I think this was another thing that happened with the project, right, as more variants were cropping up: at what point do you cut the project off? At what point do you say, oh, this variant is really important, and include it?
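To make the download step Lingzi describes above a little more concrete, here is a minimal sketch of what fetching a dataset from one of these TSV spreadsheets might look like. The repository's actual tooling is a Perl script; this Python version is illustrative only, and the column names ("SRA_accession", "sha256sum"), the exact table layout, and the use of SHA-256 digests are assumptions for the example, not the repository's real schema.

```python
# Hypothetical sketch of the download step: parse the per-sample part of a
# dataset TSV and fetch raw reads with the SRA Toolkit. Column names and the
# digest type are assumptions; the real repo uses a Perl script for this.
import csv
import hashlib
import subprocess
from pathlib import Path

def fetch_dataset(tsv_path: str, outdir: str = "reads") -> None:
    Path(outdir).mkdir(exist_ok=True)
    with open(tsv_path, newline="") as handle:
        rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    # The spreadsheet is two-part: a dataset-level header block followed by a
    # per-sample table; here we assume the table starts at the row whose first
    # cell is "sample".
    start = next(i for i, r in enumerate(rows) if r[0].lower() == "sample")
    header = rows[start]
    for row in rows[start + 1:]:
        record = dict(zip(header, row))
        acc = record.get("SRA_accession", "")
        if not acc.startswith(("SRR", "ERR", "DRR")):
            continue  # assembly-only entries have no run accession
        # prefetch and fasterq-dump are standard SRA Toolkit commands.
        subprocess.run(["prefetch", acc], check=True)
        subprocess.run(["fasterq-dump", "--split-files", "-O", outdir, acc],
                       check=True)
        # Optionally verify the download against the recorded digest.
        expected = record.get("sha256sum")
        if expected:
            fq = Path(outdir) / f"{acc}_1.fastq"
            digest = hashlib.sha256(fq.read_bytes()).hexdigest()
            assert digest == expected, f"checksum mismatch for {acc}"

if __name__ == "__main__":
    fetch_dataset("sars-cov-2-voivoc.tsv")  # hypothetical file name
```

The two-part spreadsheet layout (a dataset-level header followed by a per-sample table with accessions and checksums) is described later in the episode; everything else in the sketch is a stand-in.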
So I think at some point we just had to say stop here, or else the data set would never come out. Yeah, is Delta in there? Yeah, so at the time, I think this was even before it was being called Delta, but yes, there are, what was it, B.1.617 point something, there's a couple of them in there, right? Yes, there were at least two from the Delta variants. Yeah, actually we had already finished searching for the other lineages, but then the Delta variant came out, and we had to go back, run our automated scrape again, and search for the Delta variants. But I think, as more variants emerge, we can continue to do this work to find the representative genomes. And so what is your plan then for adding in new variants of concern as they arise? Because I presume, you know, Delta is all over the world at this point, but I presume in a few weeks' time we'll have another variant sweeping the world. So are you going to version it in a different way, or use git commits or git releases, or what is the plan? I'll step in on this, because there are actually two projects here. There's the stuff that we're putting in for the manuscript, so we have to put in a cutoff date and version it and say this is the version we're putting in for publication. But then, yeah, this is a literally evolving project and we have to have some way to accept new things. So we've already had a couple of requests on how to get new data sets or new lineages or whatever. So the GitHub repo is going to stay alive and stay organic. We're going to have a cutoff, a version for the publication, but then the second project is basically an ongoing organic thing to continue accepting new data and trying to curate something kind of stable. Yeah, that's really useful, actually. A lot of people ask for representatives of different lineages just to play around with different detection tools that they're working on, so it's really excellent that this is available, even if it is frozen in time. Yeah, and to piggyback off of what Lee's saying, I think the nice thing about letting this be organic, right, is that sometimes we talk about science being very collaborative, but there still is a level of competition about who's getting what paper out. And I think in this instance, right, it's of global importance that we work to get something out and that everybody works together. And I think there's been so much more collaboration in the last couple of years within the scientific community that wouldn't have happened otherwise. And so I think it's really nice to see it going about that way, and that we're not trying to compete with anybody else's data set, but we would just like to help have a repo for these things, because we do think it's useful and people were asking for it. I think what we need now is another paper, right, a horse race using these data sets, comparing all of the different bioinformatics pipelines out there. I mean, there are a lot of them at this point. So which one do you think will win? The one we use, obviously. We hope everybody will be a winner, because this will help combat the pandemic, and we hope this will end soon. I don't know that there needs to be one, right? I guess, of course, I'm competitive like anyone else, right, there needs to be a winner in something. But our hope for the data set, right, is to show that it doesn't matter if you use this pipeline or that pipeline, you should be able to get the same answer.
And the pipelines that people are deciding to use, it was really more about their infrastructure and how things are set up, right? Like, does this work for cloud computing? Does this work for whatever their server setup is? Have they built something off of it already? So I think it's not meant to say one pipeline is better than another necessarily, but rather to ask, is it giving you a different answer, and if it is giving you a different answer, then you need to figure out why that is and whether that's a problem. I guess maybe not a competition between pipelines, but we could have continuous integration, if you think about it, you know, continuously running these data sets through every new version of a pipeline that comes out. If there are changes in the code, which obviously is happening all the time, then at least we'd all know about it pretty rapidly. It sounds like a great use of the CDC enterprise GitHub space. I would definitely endorse that if somebody wanted to help us do that. Let's spec it out now, right? What would be required to actually get that working? If you think about the pipelines: for Illumina, the main one we use is from the Connor lab, which is a fork of the ARTIC fieldbioinformatics toolkit, whatever it's called; then there's the original Nanopore one from ARTIC; you have one from the Broad, don't they use Terra? And God, there's a few others. I think every group in the world has created their own at this point. What do you guys use? So we're kind of getting back to how we went about doing the QC on them. We ran a program called Titan, which can be run off of Terra, or it now has a Conda install available. But this was a program that came out of kind of an iterative process. There was a series of things that came out of the United States public health community. There was Monroe and Cecret. Cecret was written by Erin Young in Utah. I mean, several people had collaborated to build these pipelines together, and this is ultimately what kind of spun out of it. So it's not necessarily an endorsement. Again, I think it really came down to it being the most useful thing for us with the system that we were using, and we got it to run really quickly. And so that's why we ended up going in that direction. And it gave a lot of the QC metrics that we were interested in. Again, I wonder, Andrew, going back to some of these questions about pipelines, right? If you really pull the pipelines apart, I wonder how different they actually are. Right? Like, you can put different names on them, but if they're really just using a lot of the same programs underneath, then, I mean. I'm a scientist and I would argue that they're all different, even if they use the same fundamental building blocks, because people will make errors. They'll use slightly different parameters. They'll join them up using different versions of software, that kind of thing. So even minor things can cause huge problems. Like we've seen with amplicons, particular amplicons not being correctly identified, or SNPs in particular regions, because the variant caller that's being used is, you know, a bit finicky at certain points with certain data, but not in other cases. So the one concrete example I can think of was there was some back and forth between me and some other people with some data where the consensus sequences had particular bases, particular nucleotides, at particular sites.
And it came down to the fact that there was a slight mix of reads in the sample. Some variation was at a low level, but for some variant callers that was enough to trick them into picking the alternative base, and for others, they picked the dominant one. And that was just down to the thresholds being used. And I think another one was probably masking that base entirely as ambiguous. So that can happen. And that is a problem: if we're talking about a single SNP making the difference in calling a new variant, then that is something we need to keep track of. And so that's why I think a data set like this is really useful, because people essentially can now play around with whatever tool they want, but we can all come back to this and say, well, the tool can do whatever it wants, but it has to comply with, or at least be able to process, this kind of data. We could go even one step further, right? If we had the physical samples for these data sets, could you imagine what we could do? Because you could post out an aliquot of these samples and say, here we go, you sequence those and we've got data here. And so you can see, end to end, does this match up with what's been done already? Like positive controls, but on a much larger scale. Actually, we thought about it, but it's really hard to do this right now because RNA degrades so easily. And then there may be batch-to-batch differences, and it has special requirements for shipment as well. That's why we didn't go that route. So we said, okay, maybe we can do something with in silico panels to help people right now, at this moment. Surely you can ship it all at minus 80 and it'd be fine. No, it doesn't work like that. Minus 80 Celsius or not, RNA degrades if you look at it funny; it's just not going to work. But there is scope for something like that, where we can have more control over the entire workflow and cross-check it against different groups. And yeah, this is definitely a step in the right direction. Well, then going back to this discussion that Andrew brought up, so if pipeline A that we use and pipeline B have cutoffs of 4x and 5x depth and they both use Pangolin, I wasn't thinking about that. I do agree. Even if they use the same underlying software, it'd be good to continue to use continuous integration and test them out, I guess. Yeah. We found with the ARTIC workflows that there is a specified default coverage threshold. I mean, everything is customizable, but let's just say coverage threshold. And at the moment for us, by default, it's 20x for Nanopore or 10x for Illumina. People do jack that up, or even bring it down a bit, depending on what they're doing, and you can get different results. Yeah. And you go, oh well, it's exactly the same code, it's exactly the same pipeline, and in these fringe areas you can get slightly different results. And then that level of variation will dictate how low or how poor your data can be before you say, well, this is unreliable and I can't use it. In particular with really high CT samples, some people try and sequence everything to completion. But if you've got CT 36, 37, 38, the reality is there's not much in there to actually sequence. And if you sequence the hell out of it, at some point, if you do 10,000x or 20,000x for a sample, you're going to get a full genome, but it may not be a real genome. Yeah. I'm curious if you've included data sets that have samples that vary in CT. Yeah, that's true.
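As a toy illustration of the threshold effects discussed just above, the sketch below shows how the very same pileup of reads can yield a different consensus base, or an N, depending on the depth and allele-frequency cutoffs a pipeline applies. This is not the logic of any particular variant caller or of the ARTIC workflows; the function and its default values are invented purely for demonstration.

```python
# Toy example: the same site called differently under different thresholds.
from collections import Counter

def consensus_base(counts: Counter, min_depth: int = 10,
                   min_fraction: float = 0.7) -> str:
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # below the coverage cutoff, mask the site
    base, n = counts.most_common(1)[0]
    # If the dominant base is below the frequency threshold the site is
    # ambiguous; many pipelines would mask it, others keep the majority call.
    return base if n / depth >= min_fraction else "N"

# A site with a minor variant at ~30% of reads: pipelines with slightly
# different thresholds disagree on whether it is called or masked.
site = Counter({"C": 14, "T": 6})
print(consensus_base(site, min_depth=10, min_fraction=0.7))   # "C"
print(consensus_base(site, min_depth=10, min_fraction=0.75))  # "N" (ambiguous)
print(consensus_base(site, min_depth=30, min_fraction=0.7))   # "N" (low depth)
```

The point of the example is only that small differences in cutoffs, not the core algorithm, can flip a base or mask it, which is exactly the kind of discrepancy the benchmark datasets are meant to surface.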
You know, you're on the paper, do you remember? No, I mean, we didn't do it religiously as a stepped series, but we did make the conscious decision to get a range. Yeah. So another scenario where I think this can be very useful in the future is that, as we move away from, say, academics and public health labs doing the sequencing to more commercial providers doing it for us, those are going to operate more as black boxes. And as a quality assurance step, we do need to have these kinds of benchmark data sets to make sure that those black boxes are performing as we expect them to perform. Yeah. I guess that comes back to what you were saying about the original motivation for this. That's probably why your state labs were asking for this in the first place. But this is definitely going to be used globally as other people come into that space, where they are trying to do this on an almost routine level. It's strange to have to think of a global pandemic as routine, but here we are. You did mention a couple of scripts, Jill. What language are they written in? And maybe we can talk about some of the tests that were developed as part of this package. You mean the stuff that we used for the data mining? Let's start with the code that's presented in the repository as it is, what language is it? And let's talk about the tests for the repository itself. And then, yeah, if you want to extend to the data mining that was used to generate the representative set in particular, then by all means. Yeah. So for the stuff in the repository, I think Lee can probably answer that better, because he wrote it. Yeah. So the code base is actually pretty well rooted in the original Timme et al. paper. It's written in Perl. I haven't made a lot of changes to it, and the original format for the TSV hasn't changed too much. It's a two-parter. There's a big header describing the whole dataset at the top, so like what the dataset name is, what the species is, the intended use. And there's a second part to it where it describes the data sample by sample. It has accessions for either the Sequence Read Archive, or assembly identifiers, or GenBank identifiers. And then there is a continuous integration part to it too; I'll speak to that as well. I think it's funny: we just set up continuous integration to download the datasets every time and check them against the hash sums. And do you guys remember this? We actually ran out of our 2,000 free minutes on GitHub because it was taking so long to download all five datasets for every single commit that we did. So I actually stopped it for a little while. But it is how we figured out that some of the hash sums were wrong on one of the datasets. And at the risk of talking too much, I'll just describe this one more thing and then I'll let you guys talk more after this. We had this conundrum where we have a contaminated dataset, right? But we're not allowed to put personally identifiable information online. So if we have human reads in the dataset and we want to show those human reads, we actually cannot show those human reads, because they belong to some person. And it was probably unconsented as well, as most surveillance is. Yes. So even though this person probably doesn't even know that the sample is part of our data set, and we don't even know who the person is, we were trying to figure out what to do with it.
And I think we came up with an elegant solution, where we're taking the reads that map to human in that data set and switching them out with the actual human reference genome. And so that way it's sort of simulated, but not really: it's realistic quantities, it's realistic positions on the genome, it's just not that person's uniquely identifiable SNPs. Yeah, that's quite an elegant solution. I was just thinking the same thing, sort of just pull it from the human reference genome. If people don't... Sorry, go ahead. On the same note, I just need to give credit to the current TOAST members who came up with that idea too. I believe it was Mike Weigand and Chris Gulvik. Do you guys know if anybody else was in on that discussion? I just want to make sure I give credit where it's due. It was very elegant. Sorry, Nabil. Go ahead. No, I was just going to point out that the reason this works is because the human reference genome is a composite of separate fragments from all over the place, from a bunch of different people. So it's an average, and thanks to the law of averages, there is no one who is truly average, so it cannot be identifiable to a particular person. It's all sort of mixed together. I thought it was Craig Venter or someone like that, you know, that it's just his genome. No, he did his one on his own. I thought so too, actually. Okay. I don't know; I only really know about the genomes of bacteria and viruses, I guess. No. The one problem with that, though, is that I think what people will point out is that the human reference genome is sort of Eurocentric, even though it is an amalgam, which has absolutely no relevance here, because you're just giving some dummy reads for people to detect anyway. And do you have any negative controls in there, in your failed QC dataset? Yes, we do have negative controls, but those are contaminated ones. I guess if you want a true negative control test, it's just touching an empty file. Well, I mean, in terms of negative controls, we have lots of negative controls, but we don't release them publicly. But I presume we could, actually. Yeah, sure. But we should put them somewhere with a lot of caveats so people don't freak out that these things are just totally blank. Imagine we have negative controls with no reads that specifically map to the COVID genome, to the SARS-CoV-2 genome, but there's just other trash in there. So it's not that the read file is empty; there's a proportion of reads in there, but they just don't map to anything. It's just total garbage. So instead of dancing around it too much, what is in the failed QC dataset? There are 24 samples that failed basic QC metrics, covering eight possible failure scenarios. It's not in the GitHub repository yet, so we can't actually look. But Lingzi or Jill, do you want to speak to what's in the dataset? Yeah, so there are samples that, for example, have very low coverage, or have an amplicon dropout, or are failed negative controls, or have high host contamination. So we have specifically selected samples in those categories and put them in our failed QC dataset. We should also mention that it's going to be part of a BioProject. So when that dataset comes out, there will be a BioSample for each of those. We had a whole back and forth with NCBI about making that BioProject, because we didn't want to confuse people.
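For readers curious what the read-swapping idea described above might look like in practice, here is a rough sketch, assuming a BAM aligned against a combined human plus SARS-CoV-2 reference. It is not the actual TOAST implementation: the contig names, the placeholder quality strings, and the neglect of read pairing and soft-clipping are all simplifications for illustration only.

```python
# Sketch of the de-identification idea: reads that align to human are
# re-emitted as the reference sequence at the same coordinates, so the
# quantities and positions stay realistic but no donor-specific variation
# survives. Assumes a BAM aligned to a combined human + SARS-CoV-2 reference.
import pysam

HUMAN_CONTIGS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}

def scrub_human_reads(bam_path: str, human_fasta: str, out_fastq: str) -> None:
    ref = pysam.FastaFile(human_fasta)
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(out_fastq, "w") as out:
        for read in bam:
            if read.is_unmapped or read.reference_name not in HUMAN_CONTIGS:
                continue  # keep only reads placed on human contigs
            start = read.reference_start
            length = read.query_length
            # Take the reference bases under the read's footprint instead of
            # the read's own (potentially identifiable) sequence.
            seq = ref.fetch(read.reference_name, start, start + length).upper()
            qual = "I" * len(seq)  # placeholder base qualities
            out.write(f"@{read.query_name}\n{seq}\n+\n{qual}\n")

# Example (hypothetical file names):
# scrub_human_reads("sample.bam", "GRCh38.fa", "scrubbed_human.fastq")
```

A real version would also preserve read pairing and quality profiles; the point here is just the principle that reference-derived sequence can stand in for the donor's reads at the same positions.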
Do you remember some of the details on that? I think the major concern, right, is that if you're putting out essentially a bunch of useful but crappy sequencing, you don't want somebody to pull it by accident thinking that these are actually usable genomes and try to do something else dataset-wise with it. It should be pretty clear that it's supposed to be failed QC and not good sequencing. Yeah, this is an interesting problem that I've run into. Should you set the taxon ID in NCBI to SARS-CoV-2? Because what I've found is that people just scoop up every single data point based on that taxon ID and treat it as if it were a genuine sample, all nice and tidy. And if you've been fiddling with it and putting in weird contamination and so on, it's not what they're expecting, and it'll break some of their stuff. So how do you describe it in the INSDC? And I guess that's where you're stuck: how do you describe it in the INSDC so that people are aware of it, but it doesn't just automatically get absorbed into people's routine analytical tools? You see, I would say tough luck. If you're going to blindly ingest data and not cite anyone, you know, the 200 different studies that have produced that data, and not understand where the data has come from, well then, best of luck to you. But it's not like GISAID, where you're required to acknowledge everyone who's created the data. On the INSDC you don't have to acknowledge anyone, and you get even more data, so it's even more dangerous. So I would say tough. Do you all remember any details on how we are noting that it's failed QC or anything? I remember that at least on the BioProject we're writing down that these are failed QC genomes, but otherwise, I guess, according to Andrew, this is going to be some tough love for some people. Sorry. It's only because people have complained about some of our data. You know, we put it out there for completeness, in the spirit of open data and open sharing, and providing not just the really good, super high-quality samples, but also the ones that are maybe CT 36, which haven't sequenced as well but are still valuable as a scientific resource. Yeah. So I just get really annoyed by people, because ultimately, when they ingest all of these genomes into their pipeline, they're not going to cite us, and we'll get zero credit for all the time and effort we've put into this. You know, at least with GISAID we get an acknowledgement in a supplementary along with 500 other people. So yeah, I'm not sympathetic at all. What you could do, actually, is call the samples something like failed QC 1, failed QC 2, you know, so they should know. Yeah. If you're putting it in the name and they're still skipping over it, well, then they're not paying any attention. Okay. So that all sounds really interesting. How do people get in touch with you and contribute their own data sets to this repository? Yeah. So as we kind of brought up before, we would like this to be an iterative process, and even you guys talking about your negative controls, right? I was going to say, it sounds like you need a centralized repository that's open source that you could put this in.
So I think in those cases, right, you can email toast@cdc.gov, and we've been compiling those. We need to get the paper out first and have a first release, but the hope would be that this could be something we can continue to add to. And if people have a data set that they think would be useful to the community, then we would like to include it, so that we have something for everybody out there and all the data gets the most use, right? Because I think a major thing is that everybody's time is so limited in these situations, and resource-wise, right, people can't get tips or reagents. So let's not duplicate efforts if we don't need to, and just have data sets out there to show people what they can expect. Yeah. Is it possible to put in a pull request as well? For sure, please do. Maybe one last note from me too: I think there are about 30 people on our author list now, if that's right. And so I've been wary of trying to name every single person to thank them, because there are so many. This has been an absolutely huge effort. So when Lingzi first came on to the technical assistance team, TOAST, I didn't realize how much of a burden I was putting on her. I was like, Lingzi, here, go talk to everybody and get data for us. And it turned out to be this absolute mess. So for just coordinating and starting it off, amazing. For analyzing and making sure we got every one of the correct reference genomes, Jill went through this huge effort. For the actual data, I mean, there have been like 20 people to talk with, including you guys, Nabil and Andrew, including Danny Park, including about 10 different people at CDC. And so, to everybody who's kind of an unsung hero on this, we really do appreciate it. Yeah, I would really like to echo what Lee said. I don't know that it was always a joy, but I mean, it really was a joy to work on TOAST and to be able to interact with the state partners. And even when it was just, you know, listening to their frustration, I think it was a great learning experience. It was surprising to me how really small the group of bioinformaticians is that works across the states to get all of this done, and yeah, I'm just so impressed by the community. And I also feel very thankful for them. I don't think they get enough credit for what they've had to accomplish in the last couple of years, right? I mean, we're building a benchmark dataset to test these pipelines, but, right, they were just slapping pipelines together as they went, in the mess of everything, and to be able to do all of that. And I think the other thing that I've thought about as being really important, as we're making these datasets, is that if you find mistakes or things to improve, it's never the intention to shame somebody for a mistake in their code or something, right? It should always just be about trying to make sure that our systems are working properly, and so, yeah, keeping that collaborative nature together amongst the group. I just really appreciate the folks in public health who have been working over the past couple of years to make this happen, and their patience with us as we got this done, because I know they asked for this a while ago. So we want to try to deliver.
I just want to take this opportunity to thank everybody who helped us build this dataset, especially, like I said, our state partners. We have a very close collaboration with them, and they generously shared their data with us; even though they are so busy, they still made time for us. And also our collaborators outside of CDC. Sometimes I send emails requesting something at midnight, and they reply to me, you know, 10 minutes later. Everybody has worked so hard to combat the pandemic, and everybody has contributed so much. So I want to take this opportunity to thank everybody who has been involved in this entire process. Thank you so much. So that's all we have time for today. Thank you again to Jill and Lingzi for telling us so much about this upcoming paper and the SARS-CoV-2 benchmarking datasets it describes. And I hope you guys can join us another time to talk about more interesting projects that you've been up to. Thank you so much for listening to us at home. If you liked this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.