What does it take to switch over to BAM from FASTQ?

uBAM is a strictly better storage format for your sequencing data, and you should switch to it today.

claymation style people sit on the grass around a large computer with ai generated text displayed. UBAM is visible in the text.

Background

It has been over 14 years since the formalization of the FASTQ format (Cock et al. 2009), which describes sequences and their quality scores. For paired-end reads, sequences are encoded in separate files, usually called R1 and R2 ¹. Unfortunately, despite the publication, the FASTQ format is not entirely standardized! For example, a valid FASTQ file can use either four lines per entry or sequences split across multiple lines. Additionally, the defline itself is not entirely standardized and is basically free text.

However, we have had a lot of innovations in sequence formats since then. One of those innovations is the SAM/BAM format, the (binary) sequence alignment/map format. While this file format is typically used to store information about alignments of sequencing reads, it can also store just the unaligned sequencing data. Crucially, both reads from paired-end sequencing (i.e., R1 and R2) are stored in the same single file, and the format allows for metadata, as explained in this GATK post. We found at least one other blog post with this same sentiment from 2011. This idea isn’t new. It is frustratingly old.

FASTQ files, as a means for storing primary sequencing data before any downstream analysis, are integral to genomic epidemiology. State health labs sequence genomes, transfer the FASTQ files to an internal repository, and run quality checks (QC), usually through a quality assurance (QA) pipeline.

So what would it take to change this whole process to BAM files instead?
BAM files have many advantages including having only one file per sample instead of two, encoding extra information such as alignment data, and being indexed for random access. For our purposes here, BAM files will be unaligned BAM (uBAM) because they are not aligned against anything.
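
To make the "one file per sample" point concrete, here is a minimal sketch that counts the forward and reverse reads held in a single BAM. It relies on the standard SAM flags 0x40 (first in pair) and 0x80 (second in pair). The function name count_mates and the file name are our own placeholders, not part of samtools.

```shell
# Hypothetical helper: report how many R1 and R2 reads one BAM file holds.
# 0x40 = first read in pair, 0x80 = second read in pair (SAM flag bits).
count_mates() {
  bam="$1"
  r1=$(samtools view -c -f 0x40 "$bam")   # count records flagged as R1
  r2=$(samtools view -c -f 0x80 "$bam")   # count records flagged as R2
  echo "R1=$r1 R2=$r2"
}

# Usage (placeholder file name):
#   count_mates sample.bam
```

For a properly paired uBAM, we would expect the two counts to match.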

Sequencing

First, the sequencers would have to output uBAM files. Can they? Illumina will not output uBAM natively, but it does allow Local Run Manager modules. If one of these modules aligns against a reference, then an Illumina platform would at least produce a BAM. The Ion Torrent platforms produce a uBAM automatically. ONT sequencing outputs uBAM files after basecalling, at least with Dorado. PacBio generates BAM as its native format (they discontinued HDF5). For platforms that do not have this automation, we would need an easy conversion.

One such easy conversion is with samtools, e.g.,

samtools import -1 R1.fastq.gz -2 R2.fastq.gz --order ro -O bam,level=0 | \
  samtools sort -M - -o sorted.bam

Repository

Usually the FASTQ files end up in some kind of repository, organized by run or by organism. Instead of FASTQ files, it is easy to imagine that the repository now consists of the uBAM files alone. One might be concerned about taking additional space, but uBAM files may actually offer a storage savings over individual FASTQ files, in addition to the reduced complexity gained by combining forward and reverse reads. The above command uses -M in samtools sort, which sorts the reads by minimizers; this allows for much better compression and therefore a smaller uBAM file. In our example dataset, we transformed a 63M R1 and a 55M R2 file (118M combined) into an 81M unmapped sorted uBAM file. A 31% storage reduction, in this case, can represent huge file storage savings across an entire sequencing data repository.
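
For anyone who wants to check the arithmetic on their own data, here is a small sketch. The helper name percent_saved is our own invention; the numbers are the file sizes quoted above.

```shell
# Percent of storage saved when `before` units of data shrink to `after` units.
percent_saved() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.0f\n", (before - after) / before * 100 }'
}

# 63M (R1) + 55M (R2) of FASTQ versus the 81M uBAM from our example.
percent_saved $((63 + 55)) 81   # prints 31
```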

QA/QC

Now that the files are in the repository, how do you run QA on them? Some examples of a QA system include

Which of these pipelines can read a uBAM file? To our knowledge, none of them! This is one area where we need to see more adaptation of uBAM files.

Primary analysis

There are several primary analyses that can be performed right after QA/QC. For example, reads might need to be transformed into an assembly. They might also need multilocus sequence typing (MLST) so that all alleles are known for future comparisons. Some labs sketch genomes as they are sequenced, so certain tools can create those sketches. Finally, there might be individual genotyping operations such as Salmonella serotyping or virulence factor detection.

Assembly

For genome assembly, many labs use Shovill. However, Shovill does not natively read uBAM files. Therefore, this workflow breaks slightly unless there is some conversion step.

Other assemblers that people commonly use for bacterial genomes are SPAdes and SKESA. SPAdes does read uBAM natively and so that is good. However, it does not appear that SKESA can read uBAM natively.
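
Until those tools read uBAM directly, the conversion step could look something like the sketch below. It follows the pattern in the samtools fastq documentation: because a minimizer-sorted uBAM no longer keeps mates next to each other, the reads are collated by name first. The wrapper name ubam_to_fastq and the output naming scheme are our own assumptions.

```shell
# Hypothetical wrapper: turn a paired-end uBAM back into R1/R2 FASTQ files.
# $1 = input uBAM, $2 = prefix for the output files.
ubam_to_fastq() {
  bam="$1"
  prefix="$2"
  # collate groups mates together; fastq then splits them into two files.
  # -n keeps read names as-is; -0 and -s discard unpaired/singleton reads.
  samtools collate -u -O "$bam" \
    | samtools fastq -n \
        -1 "${prefix}_R1.fastq.gz" -2 "${prefix}_R2.fastq.gz" \
        -0 /dev/null -s /dev/null -
}

# Usage (placeholder names):
#   ubam_to_fastq sorted.bam sample
```

The resulting FASTQ pair can then be handed to Shovill or SKESA as usual, at the cost of temporarily recreating the very files we were trying to retire.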

MLST

MLST software usually takes FASTA or FASTQ files. At this point there are a million classic MLST software packages; for some additional information, please check out Page et al. 2017. For whole genome MLST software tools, we also could not find any packages that natively read uBAM. Please see lskatz’s previous blog post for an in-depth view into three of them.

Sketches

If you’re like us, you want to have a directory of at least some sketches from Mash (see the mashpit project for an exciting example!). According to the v2.3 usage menu, Mash does not appear to read uBAM natively. However, it is promising that the Sourmash library has read uBAM natively since version 2.
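
In the meantime, a stream can stand in for native support. The sketch below pipes reads out of a uBAM and into Mash without writing an intermediate file. It assumes your Mash build accepts "-" for standard input, as its usage text describes, and the function name sketch_ubam is hypothetical.

```shell
# Hypothetical helper: sketch the reads of a uBAM with Mash via a pipe.
# $1 = input uBAM, $2 = output prefix for the sketch.
sketch_ubam() {
  bam="$1"
  out="$2"
  # samtools fasta writes all reads to stdout; qualities are not needed here.
  # mash sketch -r treats the input as a read set, "-" reads from stdin.
  samtools fasta "$bam" | mash sketch -r -o "$out" -
}

# Usage (placeholder names):
#   sketch_ubam sorted.bam sample
```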

Genotyping

Generally in our experience, people base genotyping on either KMA, SRST2, SAUTE, or ARIBA. Looking at each of these software packages, we could not find any documentation that uBAM is natively read. However, we could find that FASTA and FASTQ were valid inputs. There are other software packages in the world for specific pathogens like Salmonella but for this generalized blog post, we did not investigate further.

Other compression methods

This article discusses the uBAM format, but there have been many attempts over the years to create other compression formats ². The CRAM format is probably the best example. CRAM has even better lossless compression of sequences than uBAM, and we confirmed it on our own sequences.

samtools import -1 1.fastq.gz -2 2.fastq.gz --order ro -O bam,level=0 | \
  samtools sort -O cram --output-fmt-option archive -M - -o archive.cram

When comparing the same sequences stored as FASTQ, BAM, and CRAM, we get an astonishing reduction.

-rw-------. 1 user users 81M May 30 20:03 unmapped.bam
-rw-------. 1 user users 55M May 31 09:18 unmapped.cram
-rw-------. 1 user users 55M Dec  6  2019 1.fastq.gz
-rw-------. 1 user users 63M Dec  6  2019 2.fastq.gz

The CRAM format is seamlessly incorporated into samtools, allowing for free conversion between formats. In fact, EBI stores a ton of CRAM files already. So why wouldn’t we recommend CRAM upfront? Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it. We can check a box on our nanopore that makes BAMs; we can’t do the same for CRAM. That said, given an ideal world, we would encourage the sequencing companies to consider that check box.
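
As a minimal sketch of that back-and-forth, the two helpers below wrap samtools view. The function names are our own, and we assume unaligned reads, so no reference genome is supplied.

```shell
# Hypothetical helpers for converting unaligned reads between CRAM and BAM.
# $1 = input file, $2 = output file.
cram_to_bam() {
  samtools view -b -o "$2" "$1"   # -b writes BAM
}

bam_to_cram() {
  samtools view -C -o "$2" "$1"   # -C writes CRAM
}

# Usage (placeholder names):
#   cram_to_bam archive.cram unmapped.bam
#   bam_to_cram unmapped.bam archive.cram
```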

Conclusion

The good part is that many, but not all, sequencing platforms output uBAM format natively. For those that don’t have this capability, we have a way to convert FASTQ to uBAM. However, even after acquiring a uBAM, we need software that natively reads it.

Bioinformatics software developers should be looking ahead to future-proof their software. They need to accept uBAM natively as input. Although it might seem straightforward, there are several links in the genomic epidemiology chain that need to be updated. These include QA/QC pipelines, primary analyses, and secondary analyses. As for me, Lee Katz, I should also say that I am guilty of this. Some of my own popular software, such as Mashtree and Lyve-SET, does not natively read uBAM files! I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. Therefore, from our own observations and personal experience, we can say there is some work ahead to get us moved over to uBAM files!

scroll of truth meme, the adventurer has finally found it after 15 years, the scroll of truth, uBAM is written on the scroll. he throws it away in disgust. Nyehh!

Footnotes

¹ To maintain focus in this article, I will gloss past interleaved reads.
² One such example of an organization trying to standardize is here.