REAPR (Recognising Errors in Assemblies using Paired Reads) is a recent tool (published May 2013) released from the Berriman group at the the Sanger Institute. It's used to look for assembly errors in genome assemblies. There's a paper about it in Genome Biology.
REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison. It can be used in any stage of an assembly pipeline to automatically break incorrect scaffolds and flag other errors in an assembly for manual inspection. It reports mis-assemblies and other warnings, and produces a new broken assembly based on the error calls.
It's good. Hunt, Otto et al. have done a great job in addressing real problem in bacterial genome assembly and their approach presented in the paper makes a lot of sense. I decided to try it out.
By the by, You may remember Otto from other bioinformatics tools like RATT, which I also really like.
I've always been taught that looking at the underlying reads that make up an assembly is really the only way to detect misassemblies. The easiest way to do this is to map sequencing reads back onto the resulting assembly and look for mapping errors. These errors might include:
- Low mapping coverage
- Abnormal insert sizes in paired end data
- Whether Reads of a particular direction have mapped to a region (all forward or all reverse) i.e. read strand coverage.
- Reads in the wrong orientation
- Reads clipping i.e. the edges of a read haven't mapped. CIGAR strings might look like 20S80M for a 100bp read.
According to the paper, REAPR uses the absence for each of the errors in bold listed above to score each base position of the assembly. If the combined metrics produce a poor score then the region is flagged with a warning. If the error occurs over a scaffolding gap, that is a run of Ns ('NNNNNNN'), then this is flagged as a major error.
I haven't explored REAPR's performance on any data I can show at the moment but I have run into a couple of pitfalls installing it, which I've reported below.
REAPR dependency issues
I installed and ran the program on my laptop with Ubuntu 13.04, perl version 5.10.1 (that's what my distro came with) and I noticed a few things.
REAPR relies heavily on third party tools, including samtools, tabix, smalt (a read mapper) & cmake. The source code for these have been bundled with the program and they are compiled on your computer as part of the installation process.
I had no problems compiling from source on my laptop but there were two dependencies that are required by REAPR that I had to install myself:
- Some file compression library that was provided by 'zlib-dev' (I grabbed it with apt-get).
I tried running this on our computing cluster and found that cluster did not like the REAPR version I had compiled on my laptop (It was specifically tabix that was throwing the error). I tried to compile it from source on the cluster but ran into the same dependency errors I listed above. I also found that the cluster had the exact same versions of samtools and tabix already installed.
Yes, I could add the missing C library to my compile path but I'm not fond of thrashing about in dependency hell so I decided it would be easier to alter the REAPR code to point to the precompiled versions of samtools and tabix on the cluster.
I did this by altering the locations of the various tools in the reapr.pl and task_preprocess.pl scripts, specfically in the code blocks similar to the one below.
$this_script = File::Spec->rel2abs($this_script); my ($scriptname, $scriptdir) = fileparse($this_script); $scriptdir = File::Spec->rel2abs($scriptdir); my $tabix = File::Spec->catfile($scriptdir, 'tabix/tabix'); my $bgzip = File::Spec->catfile($scriptdir, 'tabix/bgzip'); my $samtools = File::Spec->catfile($scriptdir, 'samtools'); my $version = '1.0.16'; my $ERROR_PREFIX = '[REAPR pipeline]';
As you can see in the code above, REAPR is hardcoded to point to the compiled versions bundled with the program itself, which works great IF the third party programs compile.
A long-term fix would be change to install script to first validate that a suitable version of third party programs are available on the $PATH. The program should compile and run the bundled third party software only if they are not present.
Samtools has very strict input
Samtools seems to have very strict rules for the kind of input that it will accept in faidx and similar commands.
REAPR authors are aware of this and have kindly written a module 'facheck', that can detect and fix your input files. Sadly, I found that this module did not catch all the errors in my input files and my analysis failed halfway through with complaints about file format. I think REAPR facheck wasn't catching blank lines properly ('^$'), or maybe it was just me.
From a little testing I found that samtools, tabix, or something in that pipeline does not like:
- Special characters in FASTA headers.
- Whitespace. Any whitespace. Anywhere.
- Blank lines i.e. '^$'
- Variable line length of FASTA sequence.
The 'sample' FASTA file below has examples of these errors.:
>SCAFFOLD420_BLAZEIT_XxX#$@-_-cov23. < crazy headers ATGATGATGATATGATGATGATGATGATGATGATGATGATGAT ATGATGATGATGATGATGATGTTGNNATATGATAGATAAAATTAAGATATATGt < variable length ATGATGATGATATGATGATGATGATGATGATGATGATGATGAT ATGATGATGATATGATGATGATGATGATGATGATGATGATGAT ATGATGATGATATGATGATGATGATGATGATGATGATGATGAT < Blank lines >SCAFFOLD423_BLAZEIT_#$@-_-YOLO25. ATGATGATGATATGATGATGATGATGATGATGATGATGATGAT ATGATGATGATATGATGATGATGATGATGATGATGATGATGAT < trailing whitespace ATGATGATGATATGATGATGATGATGATGATGATGATGATGAT .....
Ideally you want:
>scaffold1 ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG >scaffold2 ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG ATATATATGATAGTAGATGATGATGATGATGATGATGATGATG .....
With these errors aside, which are really my fault for trying to run it on crazy cluster architecture and bad input files, the program runs really smoothly.
I'll have to explore REAPR's accuracy on some test data (I'm thinking German outbreak strains E. coli O104) in a follow up post.