Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Once again, the Canadians have taken over the podcast and will continue to talk about their public health response and genome epidemiology of COVID-19. We cross back to their conversation now. So far, we've talked about tools. We've talked about variants of concern, and no podcast would be complete, especially one that Will and I are involved in, if we didn't talk about metadata. As we've already discussed, there are a lot of players that are contributing information to our National Genomic Surveillance Database. There are a lot of different metadata streams coming into that database. So how is the metadata being harmonized for analysis? Will, I'm looking right at you. From my experience, I was a bioinformatician and a scientist in the public health lab for a period of about eight years. So the challenges associated with data sharing in the public health system are something that I'm highly familiar with, but also sympathetic to, because it's not just that people are not sharing data. They're often not sharing data for the right reasons, in terms of privacy and in terms of logistics, right? Preparing the data for sharing often consumes a lot of valuable time, especially during a pandemic.
So early on in the CanCoGen effort, we divided the metadata into three different tiers. The first tier we sort of labelled the low-risk, de-identified information, the minimum metadata that can be released publicly and used by researchers right away. And then we have a middle tier that consists of a bit more than just the minimum metadata, which can be useful for national surveillance and national tracking. So in other words, data that will be shared between the province and the national lab. And then we have what we call the local metadata, in other words, metadata that are identifiable, which each of our partners is encouraged to keep for themselves. Those data, of course, will later on be valuable for research as well. We might not have time to de-identify them or process them for research now, but we nevertheless encourage these data types to be kept in a standardized format so that they can later on be de-identified and used for research purposes. So we established these three tiers of data grades, and that formed the framework of the CanCoGen metadata management plan. Little did we know that the minimum metadata part would actually take a few months to negotiate. We realized that what one province or one jurisdiction defined as non-identifiable or de-identified information differed from another. For example, age, gender, and the date of sampling are normally considered non-identifiable, given that there's a sufficient number of people sharing the same values, but they nevertheless became a point of discussion that took a few months to sort out within Canada. And we even had to consult legal experts to help us understand whether there are any legal barriers saying that these data types indeed constitute identifiable information. And we ended up writing memos to the public health authorities to help them understand that the global consensus is that these types of information are indeed not considered identifiable.
And the release of such information, as we know by now, especially the date of sampling, can really help with global tracking of the virus and with understanding the transmission of the virus. So that was sort of our metadata organization framework. And in terms of the metadata harmonization, indeed it's very much Emma who's leading the work as part of CanCoGen and the PHA4GE consortium. What does PHA4GE stand for? So PHA4GE stands for Public Health Alliance for Genomic Epidemiology. You think a working group chair would know that, Will, don't you? You might think that. Emma. Oh, right. I think she, yeah. So she's trying to quiz me rather than Matt Norman. Yeah, yeah. It was no fun to let her get away with that like that. So- Hey Matt, I'm just asking the questions here. You guys are the ones in charge of that. Very early on, we established a long list of metadata that are relevant to SARS-CoV-2 sampling and published that as part of the PHA4GE consortium, and of course adopted it for CanCoGen use. So within that large data set is the minimum metadata that I mentioned before, but the overall standard also consists of additional data fields that we deem relevant. And this is very much done in consultation with academic researchers, with public health workers, and with other types of stakeholders that are involved in the pandemic response. And over time, over the last year, we have added additional fields to keep track of, for example, the reason for doing sequencing, which can help with the interpretation. In epidemiology, the reason for doing a sample collection or data collection, and the so-called denominator used in the calculation, are often very key to the analysis. So tracking information about why samples were collected and the scope of sample collection and the inclusion, exclusion criteria and so on.
These are things that we added into the standard over time to help improve the granularity and the level of detail of the information, and the standard has been made publicly available. And indeed we work in an open access, open data framework to ensure that these standards are widely available and anyone can use them. And indeed anyone can contribute to them and help us improve these standards. And also within the CanCoGen project, we have established a group of programmers to help develop tools that standardize the data collection process. So in my group, we developed a tool called the DataHarmonizer that can be used to capture the metadata in a standardized format. It has built-in validation to ensure that the data is in a format that is consistent with the standard, and it also has export functions that allow the data to be exported in a format that's compatible with public repositories and other organizational needs, such as reporting to the national lab within CanCoGen, where there's a specific format that's needed. The DataHarmonizer essentially handles the data conversion for you. And as I mentioned before, some of these data cleaning efforts are time consuming. So these tools are designed to help streamline the efforts for data standardization and harmonization. And Emma also leads a curation team that helps stakeholders to manually clean up the raw data and do the conversions. Sometimes that's with the help of the DataHarmonizer, and sometimes indeed the data are so incompatible that they need to be manually harmonized. And we have a designated team that helps to achieve that so we can have high quality metadata. So I just want to give a quick shout out to the previous MicroBinfie podcast episode where we discussed the PHA4GE metadata standard. We'll link to that, as well as any of the tools that we've talked about in this episode, in the show notes, so that if you're interested in using that standard, you'll be able to.
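For readers who want to see the idea in concrete terms, here is a minimal sketch of what validating a record against a tiered standard and exporting only the shareable fields could look like. It is not the DataHarmonizer's actual code or API. The field names, allowed values, tier labels, and the `validate` and `export_for_tier` helpers are all hypothetical and chosen only for illustration.

```python
import re
from datetime import date

# Tiers ordered from most shareable to least shareable (hypothetical labels).
TIERS = ("minimum", "national", "local")

# Hypothetical mini-standard: NOT the real PHA4GE or CanCoGen field list.
SCHEMA = {
    "sample_id": {"tier": "minimum", "required": True},
    "collection_date": {"tier": "minimum", "required": True, "kind": "date"},
    "age_bin": {"tier": "minimum", "required": False,
                "allowed": {"0-19", "20-39", "40-59", "60+"}},
    "purpose_of_sequencing": {"tier": "national", "required": False,
                              "allowed": {"baseline surveillance",
                                          "targeted surveillance"}},
    "patient_name": {"tier": "local", "required": False},
}

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate(record):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in record:
        if field not in SCHEMA:
            errors.append(f"{field}: not in the standard")
    for field, rule in SCHEMA.items():
        value = record.get(field)
        if value in (None, ""):
            if rule["required"]:
                errors.append(f"{field}: required value is missing")
            continue
        if rule.get("kind") == "date":
            try:
                # Require a complete YYYY-MM-DD date that is a real calendar day.
                if not ISO_DATE.match(str(value)):
                    raise ValueError
                date.fromisoformat(str(value))
            except ValueError:
                errors.append(f"{field}: expected a complete YYYY-MM-DD date")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: '{value}' is not an allowed value")
    return errors


def export_for_tier(record, tier):
    """Keep only the fields that may be shared at the given tier or below."""
    shareable = TIERS[: TIERS.index(tier) + 1]
    return {f: v for f, v in record.items()
            if f in SCHEMA and SCHEMA[f]["tier"] in shareable}
```

With a record that holds all five fields, `export_for_tier(record, "minimum")` would keep only the sample ID, collection date, and age bin, while `validate` would flag a partial date such as `2021-01`. Keeping the tier on each field definition, rather than in separate lists, means the same standard drives both the validation and the export.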
So at this point, I want to get maybe a little controversial. I want to address an elephant in the room. So Canada has sometimes been criticized in the media, and occasionally by its own scientists, for the slow pace of data sharing with public repositories. So I'm going to lob this grenade over to the both of you to comment on that, particularly why that might be the case and what steps are being taken to improve the situation. So I don't know who wants to start. I would say it's not just the sharing of data at all to the public databases, it's also sharing incomplete metadata, particularly in terms of incomplete date metadata. Art Poon, at Western, has done some really nice plots that demonstrate that among the Western countries that have been doing a lot of sequencing, Canada has a particularly low percentage of its genomes in GISAID that have complete date metadata. And that varies hugely by province as well. So it's really, again, it's a product. I mean, I'm an immigrant to Canada, so I can't badmouth it too much because they could still deport me. So they don't have to give me citizenship. I think it's largely a product of that kind of fragmentation, having all those individual provinces that all have their own legal systems and all have their own data privacy rules. And then you have the federal government, which has different relationships with each of those provinces, stronger and weaker ones, more contentious ones, based on, you know, which governments have been put in charge, for example. And so that kind of adds a whole other layer to that. And there are a lot of provinces that, looking forward, are maybe interested in being more independent from the federal government than others, so they are keen to avoid anything they see as an overreach of federal government power and enforcing of that power. So they're kind of looking to the future.
Although Will, I think, has been at the sharp end of a lot of these discussions and issues more than I have been. Yeah, so Will, I'm sure you have a few things to say about this. I think that the key challenge indeed, as Finn mentioned, is that there are different privacy laws in each province. And in addition to that, each province has its own privacy officers and lawyers interpreting these laws. So deriving a consensus that can be shared across Canada has proven to be challenging time and time again. There were previous attempts to try to come up with multilateral agreements for data sharing in public health, and all ended up being held up by differences of opinion across the country. And indeed, we are seeing that again during the COVID-19 response. Some provinces are very quick to share data. Other provinces are willing to share data, but have reservations, such as about releasing date information to its full extent. Because the, potentially, well, I'm actually not sure exactly who's making the calls, but the privacy officers or the public medical health officers and so on may deem those data to be identifiable and therefore refuse to release them publicly. And Canada as a whole indeed is what I would characterize as more risk averse when it comes to releasing information to the public. So in contrast to countries like Denmark, and some other countries that are quick to release this case-related information in a de-identified manner, Canada overall is just taking a more cautious approach. But while that's the main problem, indeed the elephant in the room, some of the considerations that came up during the discussion are nevertheless, I think, valid. For example, at the beginning of the discussion, attention was paid to the quality of the data being released.
So a lot of effort, as Finn has already alluded to, went into the QC of Canadian data and the hosting of Canadian data, to ensure that the data released are of high quality, and that includes both sequence data and metadata. So that also legitimately contributed to the delay. I think there's no one major reason that we overall are slow to respond, but many, many different reasons. And we have addressed them one after another as part of the CanCoGen coordinated effort. And I do think moving forward, we'll see the Canadian data being released much more quickly. And hopefully as part of this COVID-19 response, we will come up with a better system for releasing sequence data and the minimum metadata into the public repositories for global pandemic or global outbreak response. What's kind of interesting in Canada, and the kind of thing that we haven't talked about, is that the legal framework in the federal government does actually give the federal government the power to compel provinces and territories to share all their data. That's the Public Health Agency of Canada Act, set up in 2004 after the first SARS pandemic, as well as the Statistics Act. So those powers exist, but the problem is that so much of the function of Canada is based on there being relatively good working relationships between different provinces and different territories, as well as sovereign First Nations bands across Canada. That's a whole other aspect of kind of interesting complexity, being a settler nation that encompasses many other sovereign entities. There's a whole other layer of kind of international diplomacy within Canada, essentially. So that adds a lot of complexity to that. And there's, especially amongst some of those communities, a lot of reasonable hesitation and worry about the federal government exerting power. There's kind of an interesting dance going on between, you know, we're not going to compel these provinces.
We're not going to use these acts to compel the province to do this if you share the data with us. We're not going to force you if you do it. So there's an interesting, that kind of dance. But there's a lot of criticism about this. And, you know, some people have gone as far as pointing to the International Health Regulations set up by the WHO after, you know, after the first SARS pandemic, which basically say that all member nations must be able to share, you know, detailed epidemiological information with the WHO, and arguing that Canada, due to this provincial-federal gap, is actually in violation. Some people have gone as far as making that argument. So it's a huge, huge problem, but it's not one that's got one easy, quick solution, because it requires all of these moving parts to kind of move in concert without anyone getting trodden on along the way. At the end of the day, the public health system is there to serve the public, and health officers at both federal and provincial levels are making calls on the basis of what they think is best for the population. But the news media and so on have, time and time again, highlighted that Canadians do think that the data-sharing challenges seen in Canada are detrimental to the pandemic response. So I think there's an opportunity there to really assess what the public opinions indeed are when it comes to sharing some of this minimum de-identified information publicly and in a timely way to combat infectious diseases. And that would potentially free the public health authorities' hands a bit more, because they would know that they're doing it in accordance with the population's desires and wishes. I think that's a critical missing piece, and that's why everyone has been a bit more risk-averse, not wanting to be the one accused of breaching privacy or breaching people's trust.
So to sort of look towards wrapping things up, looking at the overall picture of SARS-CoV-2 sequencing in Canada, what are the things that you would say are working well and what are the ongoing challenges? You talked a lot about these things throughout the podcast today, but in summary, what would both of you say are the things that are working well and what are the sticky wickets that we're looking at? I mean, I would say one of the things that really is working well is essentially every time we've kind of cross-cut that particular problem of the federal and provincial systems and all those things, by having things like CanCoGen, having these different working groups, and having just ad hoc conference calls with all the people, even from, say, a province that doesn't like sharing data and really doesn't like the federal government. All the people working in the health labs there are working hard on the same problem and are eager to discuss things with their counterparts in other provinces. So there's some really great discussion and conversation going on about that. And I think it's good. There seems to be a lot of discussion about looking to the future, about trying to build more long-lasting infrastructure for this and really scaling up the use of genomic epidemiology, which, for three people who have all spent a lot of time trying to get genomic epidemiology working, you know, more broadly across Canada for foodborne disease and so on, is a good thing. And, you know, individual hospitals are asking, like, how do we ramp up our use of genomics? You know, this has been really important or really useful. So I think that's one of the really good things. I agree. And I would say it sums up to this: it is a trust-building exercise, and through this process we are indeed getting to know and highlighting some of the barriers and challenges. And I think we indeed need to overcome them and build better trust within the Canadian public health system.
So I think that's what's going really well, the communication. We all know that we spend a disproportionate amount of time in meetings and so on, but in a way those are necessary in order for a large group of people in a large country to achieve consensus and to share expertise and knowledge. And I see that really come across in the CanCoGen activities. What I don't think is working very well is that we still have that gap between what the practitioners are working on, you know, working tirelessly trying to build pipelines, working tirelessly trying to build tools for sharing data and so on, and the understanding of what needs to be put in place in order for data sharing to occur across Canada. What are some of the guidance documents, or what are some of the regulations, that need to be put in place? And as we mentioned earlier in this conversation, Canada indeed has to improve its overall framework for public health data sharing in order to be more responsive in the future. And for that, I don't think we've done enough work to understand that process. And within CanCoGen, that indeed has to be the focus for our year two of efforts: trying to understand the legal, the governance, and the ethical considerations, and trying to better understand the public health, sorry, the public opinion on public health data sharing across Canada. And by improving the social science aspect of data sharing, I think that's when we can really bring the technical work that we have put in place into its full utility. Great, and so in bringing this episode to a close, I just want to ask you one final question that probably overlaps with the last question: if you were advising a country with a decentralized health system that was starting to build capacity for SARS-CoV-2 sequencing, based on the lessons that you've learned, what advice would you give?
I mean, the first thing is to centralize as much of the, I think, analysis and data collection and curation as you can, just so there is at least, not necessarily even there being one final repository, but just some form of centralized data repository, some form of shared analytical platform. And there are lots of different ways of doing that. You know, there are things like IRIDA, you know, that we've talked about. There's, you know, SP3 for tuberculosis. Like, there are a lot of different ways of trying to do that. I would say the best way, and the way that we've tried to deal with that fragmented decentralization, is really trying to focus on QC metrics and having a robust set of QC metrics that are clearly communicated, but then also checked in that centralized fashion. So the genome is being checked in a centralized way. And then third, I think largely just try and do what the UK did, but try not to have a Tory government. Like, you know, the genomic epidemiology side of the UK, and many people listening to this are involved in that effort, has been excellent. It's been really world-leading, but the gap has been in the way that that information is being used at a government level, at a political level, to actually implement policy, I would say. And that's where there have been major issues. So yeah, trying not to have a conservative government tends to help. Yeah, well, indeed, if you look at the countries that responded well to COVID-19 versus countries that responded poorly, there is a high correlation with strong national leadership when it comes to, you know, disseminating expertise and policies, right? And, ironically, the strong national leadership more likely comes from a liberal government than from a conservative government, right?
Usually you associate strong national centralized leadership with conservative governments, but it's indeed actually the liberal governments that provide much more leadership in this area. And so I would echo Finn's comment, even though I don't know if that comment about the Tory government is either going to be made the tagline of this talk or be cut out, so we'll see. I mean, if you have a government that's already expressed its contempt towards starving people, allowing the ongoing use of food banks and the lack of protection of vulnerable people, such as the homeless, is it surprising that on a policy level, when there's a pandemic, those same vulnerable populations are going to be left particularly exposed? And I'm not saying Canada's done a particularly great job at this overall either. You know, there's been a huge issue of not protecting vulnerable populations and not really factoring in SES. There is a lot of work trying to go into that, but one of the problems we have in Canada is environmental. So we have an issue of, you know, very cold winters. So when you have shelter systems that then have to reduce their capacity to increase social distancing, you have people making a choice, for a vulnerable population with comorbidities, between a case fatality rate of around 10-plus percent for COVID-19 versus, what's the case fatality rate for freezing to death? So really, that is where, like, you know, we're bioinformaticians. Like, we're in the data, but there is a point where that's actually directly impacting policy. And that final link of the chain is something I think we should be trying to get more involved in. And I think that's actually a key part of trying to build a decentralized health system using genomics: the domain experts actually have to get far more involved in that political side, far more involved in that implementation side.
Because otherwise, A, we can get divorced from the entire process. We can lose track of what we're actually doing and whether it's, like, the impact of even our mistakes. Right. And we're gonna have made mistakes in this process when we do all the policy work. Yeah, okay. And using current technology, indeed, the strength of web-based technology is that you can have a highly connected decentralized system that still communicates and functions well. But the key there is to have, you know, a trust framework that enables data to be shared in that decentralized system, rather than setting up silos in a decentralized system. So my advice, and it doesn't have to be top three things, is that if you have a decentralized healthcare system, you should have a well-connected decentralized healthcare system, one that functions through sharing of data, sharing of information, and sharing of expertise. Okay, so I'm gonna end things on those calls to action. I want to thank Will and Finn for sharing all their expertise and their knowledge and their great banter here today. I'd also like to thank Lee, Nabil, and Andrew for inviting us on the show, as well as all of our CanCoGen partners and all the hardworking frontline workers everywhere who sacrificed their time, energy, and safety to keep us safe. So if you want to get ahold of us to continue these discussions, you can tap us on the shoulder at the MicroBinfie Slack, or you can contact us by email, and we'll be sure to pop all of the links to any of the tools that we mentioned today in the show notes. And with that, thanks everyone, and stay safe. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters.
The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.