Episode 104: The Kraken software suite

📅6 April 2023

⏱️00:21:50

🎙️Microbial Bioinformatics

👥Guests

Jennifer Lu

Staff Scientist, Johns Hopkins University Center for Computational Biology

Natalia Rincon

PhD Student, Biomedical Engineering, Johns Hopkins University

This episode explores the Kraken software suite, a powerful taxonomic classification tool used for metagenomic analysis and pathogen detection. Experts from Johns Hopkins University discuss the evolution of Kraken and its various versions, highlighting its significance in bioinformatics research.

Kraken Versions and Features

The discussion began with an overview of the original Kraken, which classifies reads by exact k-mer matching with a k-mer size of 31. Its design draws on Jellyfish, the k-mer counting tool. KrakenUniq is a variant that adds a unique k-mer count column to the report, showing how many distinct k-mers the reads assigned to each taxon cover. This gives users an additional way to verify a microbial identification: a taxon supported by many reads but very few unique k-mers is more likely to be a false positive.
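As a rough illustration of the two ideas above, here is a minimal Python sketch of exact k-mer matching and unique k-mer counting. It is not Kraken's implementation: the `kmer_to_taxon` dictionary, the majority-vote rule, and the function names are assumptions made for this example, and the toy data uses k=4 so it stays readable. Kraken itself works with lowest common ancestors in a taxonomy tree, and KrakenUniq estimates the counts rather than storing exact sets.

```python
from collections import Counter, defaultdict

K = 31  # k-mer size mentioned in the episode


def kmers(seq, k=K):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def classify_read(read, kmer_to_taxon, k=K):
    """Return the taxon with the most exact k-mer hits, or None if nothing matches.

    Simplification: a plain majority vote, not Kraken's taxonomy-aware scoring.
    """
    hits = Counter(
        kmer_to_taxon[kmer] for kmer in kmers(read, k) if kmer in kmer_to_taxon
    )
    return hits.most_common(1)[0][0] if hits else None


def unique_kmers_per_taxon(reads, kmer_to_taxon, k=K):
    """Count the distinct database k-mers observed for each taxon across all reads."""
    seen = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            taxon = kmer_to_taxon.get(kmer)
            if taxon is not None:
                seen[taxon].add(kmer)
    return {taxon: len(found) for taxon, found in seen.items()}


# Hypothetical toy database and reads (k=4 for readability).
toy_db = {"ACGT": "taxon_A", "CGTA": "taxon_A", "TTTT": "taxon_B"}
toy_reads = ["ACGTA", "ACGTA", "TTTTT"]

labels = [classify_read(r, toy_db, k=4) for r in toy_reads]
unique_counts = unique_kmers_per_taxon(toy_reads, toy_db, k=4)
```

In this toy case the duplicated read adds a second hit for `taxon_A` but no new unique k-mers, which is the kind of signal the unique k-mer column is meant to expose.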

Kraken 2

Kraken 2 was created to handle larger databases efficiently. It stores the database in a compact probabilistic data structure and indexes minimizers, shorter subsequences that stand in for each k-mer, instead of the full k-mers. The smaller memory footprint and faster classification make it practical for microbiome research and pathogen detection.
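To make the minimizer idea concrete, the sketch below picks the lexicographically smallest substring of length l from each k-mer. This is a simplified illustration only: the function names and the toy parameters (k=7, l=3) are assumptions for this example, and Kraken 2's own minimizer selection involves additional steps not shown here.

```python
def minimizer(kmer, l):
    """Return the lexicographically smallest substring of length l in kmer."""
    if not 0 < l <= len(kmer):
        raise ValueError("l must be between 1 and len(kmer)")
    return min(kmer[i:i + l] for i in range(len(kmer) - l + 1))


def minimizers_of_read(read, k, l):
    """Return the set of distinct minimizers over all k-mers of a read."""
    return {minimizer(read[i:i + k], l) for i in range(len(read) - k + 1)}


# Toy example: 8 overlapping 7-mers collapse to 2 distinct minimizers.
toy_read = "GATTACAGATTACA"
toy_minimizers = minimizers_of_read(toy_read, k=7, l=3)
```

Because neighboring k-mers often share the same minimizer, a database keyed on minimizers holds far fewer entries than one keyed on every k-mer, which is where the space saving comes from.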

Applications in Microbiome Analysis

Kraken is widely used in microbiome analysis, notably for pathogen detection. However, its results are only as accurate as the genomic data available in its database, which is why the discussion emphasizes bacterial and viral data. For infectious pathogen detection, KrakenUniq is combined with Bracken to estimate the abundance of the species present.

Importance of Genomic Data

The developers stressed that users need to know which genomes their database actually contains. Kraken cannot identify an organism that is missing from the database, so results depend on keeping it comprehensive and up to date.

Wider Usage in Bioinformatics

Kraken is widely used in bioinformatics beyond metagenomics. For instance, the reads from a single genome can be treated as a metagenome for quality control. If the reads are assigned to conflicting taxa, that points to contamination in the sample.

Contamination Detection

The team elaborated on how they use Kraken in contamination work. They detect contamination in pathogen genomes by comparing them against databases of bacterial, human, vertebrate, and plant genomes. For example, they have identified sequences contaminating eukaryotic pathogen genomes that originated from hosts such as chicken or cow.
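One way to picture this kind of check is to scan a classification report for species that should not be in the sample. The sketch below assumes the standard six-column, tab-separated Kraken-style report (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name); KrakenUniq reports carry extra columns and would need a different parser. The function names, the 1% threshold, and the read counts in the toy report are invented for illustration.

```python
def parse_report(text):
    """Parse a six-column, tab-separated Kraken-style report into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        pct, clade_reads, direct_reads, rank, taxid, name = line.split("\t")
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade_reads),
            "direct_reads": int(direct_reads),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),  # names are indented by taxonomic depth
        })
    return rows


def flag_unexpected(rows, expected, rank="S", min_percent=1.0):
    """Return (name, percent) for taxa at the given rank that are not expected."""
    return [
        (row["name"], row["percent"])
        for row in rows
        if row["rank"] == rank
        and row["name"] not in expected
        and row["percent"] >= min_percent
    ]


# Hypothetical report for a eukaryotic pathogen assembly (made-up read counts).
toy_report = "\n".join([
    "92.00\t920\t920\tS\t5833\t      Plasmodium falciparum",
    "7.50\t75\t75\tS\t9913\t      Bos taurus",
    "0.50\t5\t5\tS\t9606\t      Homo sapiens",
])

flagged = flag_unexpected(parse_report(toy_report), expected={"Plasmodium falciparum"})
```

The threshold is a judgment call: set too low, it flags the stray misclassifications that any k-mer classifier produces; set too high, it can hide real host contamination.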

Future Developments

Looking ahead, the Kraken team intends to:

- Maintain all Kraken repositories.
- Enhance accuracy, speed, and usefulness.
- Develop new scripts and downstream analysis tools within the KrakenTools suite.

They recognize the growing need to reduce database sizes as more genomes become available and are exploring indexing and sketching techniques to address this.

Conclusion

Kraken remains an indispensable tool for metagenomic analysis and pathogen detection. As it continues to evolve, the Kraken team advises users to prioritize accurate data for effective pathogen identification and classification.

Extra notes

The podcast discusses the Kraken software suite, a taxonomic classification tool used for metagenomic analysis.
Kraken Versions and Tools:
- The original Kraken (developed by Derrick Wood) introduced exact k-mer matching using a k-mer size of 31.
- KrakenUniq: Builds on Kraken 1 by adding unique k-mer counting, which helps verify classifications by reporting how many unique k-mers support each taxon.
- Kraken 2: Uses a probabilistic data structure to significantly reduce database size, trading a slight reduction in accuracy for faster performance and lower computational resource requirements. It also employs minimizers to balance database size and accuracy.
- Kraken 2 with unique minimizer counting: Brings KrakenUniq's functionality to Kraken 2, counting unique minimizers rather than unique k-mers.
- Chunking for KrakenUniq: Allows large databases to be used without loading them entirely into memory, by comparing reads against one portion of the database at a time.
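The chunking idea in the last item can be sketched as follows: reads are compared against one piece of the database at a time, and the hits are accumulated before a final call is made. This is a hedged illustration, not KrakenUniq's code. Here each chunk is a small in-memory dictionary; in practice `db_chunks` could be a generator that loads one portion from disk at a time. The function name and the majority-vote rule are assumptions for this example.

```python
from collections import Counter


def classify_in_chunks(reads, db_chunks, k):
    """Classify reads against a k-mer database supplied one chunk at a time.

    db_chunks is any iterable of dicts mapping k-mer -> taxon, so only one
    chunk needs to be held in memory while it is being searched.
    """
    hits = [Counter() for _ in reads]
    for chunk in db_chunks:
        for read, read_hits in zip(reads, hits):
            for i in range(len(read) - k + 1):
                taxon = chunk.get(read[i:i + k])
                if taxon is not None:
                    read_hits[taxon] += 1
    # Final call per read once every chunk has been seen (simple majority vote).
    return [h.most_common(1)[0][0] if h else None for h in hits]


# Hypothetical toy database split into two chunks (k=4 for readability).
toy_chunks = [{"ACGT": "taxon_A"}, {"CGTA": "taxon_A", "TTTT": "taxon_B"}]
chunked_labels = classify_in_chunks(["ACGTA", "TTTTT", "GGGGG"], toy_chunks, k=4)
```

The trade-off is time for memory: every read is scanned once per chunk, so a smaller chunk size lowers peak memory but increases the number of passes.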
Additional Tools:
- Bracken: Uses a Bayesian algorithm to provide abundance estimation, quantifying species or genus-level abundance from Kraken classifications.
- Pavian: A graphical interface for analyzing Kraken reports, facilitating data interpretation and visualization.
- KrakenTools: A set of scripts for downstream analysis, providing users with statistical metrics and visualizations.
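To give a feel for what Bayesian abundance re-estimation means, the toy sketch below reassigns reads that were classified only to genus level back to the species within that genus, using Bayes' rule. It is a deliberately simplified stand-in and not Bracken's actual procedure: the function name, the use of species-level read counts as the prior, and the `genus_level_fraction` values (the share of each species' reads expected to stop at genus level) are all assumptions for this example.

```python
def redistribute_genus_reads(genus_reads, species_reads, genus_level_fraction):
    """Reassign genus-level reads to species in proportion to a Bayes posterior.

    posterior(species) is proportional to
    P(read stops at genus | species) * prior(species),
    where the prior is taken from the species-level read counts.
    """
    weights = {
        species: genus_level_fraction[species] * count
        for species, count in species_reads.items()
    }
    total = sum(weights.values())
    if total == 0:
        return dict(species_reads)  # nothing to base a redistribution on
    return {
        species: species_reads[species] + genus_reads * weight / total
        for species, weight in weights.items()
    }


# Hypothetical counts: 100 reads stuck at genus level, two species below it.
estimates = redistribute_genus_reads(
    genus_reads=100,
    species_reads={"species_1": 300, "species_2": 100},
    genus_level_fraction={"species_1": 0.5, "species_2": 0.5},
)
```

With equal likelihoods, the genus-level reads are split 3:1 in line with the species-level evidence, so the total read count is preserved while every read ends up at species level.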
Challenges and Methodologies:
- The increase in genome numbers necessitates smaller, more efficient databases.
- Pathogen detection and microbiome diversity estimation call for different tools; the guests offer tips on choosing between them based on the accuracy required and the resources available.
- Emphasis on the reliance on existing genomic databases, such as NCBI RefSeq, and their limitations in representing certain pathogen types.
Applications:
- Pathogen detection in clinical samples, allowing identification of infectious agents.
- Microbiome analysis across various environments, detecting prevalent microbes.
- Contamination detection in genomic data, leveraging Kraken's strengths in identifying unexpected taxonomic sequences.
Future Directions:
- Continuous development and maintenance of the Kraken suite, including updates to GitHub repositories.
- Exploration of new methods for reducing database sizes and improving classification efficiency.
- Potential hosting of databases on compute resources to manage the increasing number of available genomes.

This discussion highlights how Kraken and its associated tools are pivotal in microbial bioinformatics for metagenomic and pathogen detection analyses, while also outlining ongoing challenges in data management and accuracy enhancements.