White Paper Proposes Method to Reduce Sequencing Data Storage

In sequencing, the quality of individual bases is assessed through logarithmic quality scores, which are essential for accurate data analysis but significantly increase storage needs. Traditionally, these quality scores, expressed on the Phred scale, require more storage space than the actual base calls.
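For readers unfamiliar with the Phred scale, the logarithmic relationship can be sketched in a few lines. The formula Q = -10 * log10(P) is the standard Phred definition; the function names below are illustrative and do not come from the white paper.

```python
import math


def error_probability_to_phred(p: float) -> float:
    """Convert a base-call error probability into a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score back into an error probability."""
    return 10 ** (-q / 10)
```

Under this definition, each step of 10 on the scale corresponds to a tenfold drop in the estimated error probability.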

Researchers at Illumina examined a method that reduces the resolution of quality scores to eight levels or fewer. In their tests, the approach proved virtually lossless: variant calling and other analyses showed no significant differences compared with using the full quality scale.

Base quality scores are crucial for measuring confidence in base calls and improving the accuracy of sequencing data analysis. However, they currently require substantial storage space. For instance, while a base call uses 2 bits of storage, its corresponding quality score can use up to 5.3 bits. With sequencing output increasing, the associated storage and transfer costs have become a larger portion of the total sequencing cost.
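As a rough illustration of where figures like these come from, the sketch below computes the bits needed per symbol when every level is equally likely. The assumption that the full scale has about 40 distinct scores is ours, made only to show that the arithmetic lands near 5.3 bits; the article does not state how the figure was derived.

```python
import math


def bits_per_symbol(levels: int) -> float:
    """Bits needed to encode one symbol drawn uniformly from `levels` values."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return math.log2(levels)


base_call_bits = bits_per_symbol(4)    # four bases: A, C, G, T
full_scale_bits = bits_per_symbol(40)  # assumed ~40 distinct quality scores
binned_bits = bits_per_symbol(8)       # eight quality levels
```

Under these assumptions, eight levels need 3 bits per score before any further compression, compared with a little over 5 bits for the assumed full scale.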

The research demonstrated that reducing the quality score resolution does not compromise the accuracy of standard analyses or variant calling. The method replaces each range of scores (e.g., scores between 19 and 25) with a single representative score. Because the resulting data contains fewer distinct values, compression algorithms can encode it more compactly, which saves significant storage space.
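The binning step described above can be sketched as follows. This is a minimal illustration, not the white paper's implementation: the bin boundaries and representative scores are hypothetical (only the 19 to 25 range echoes the example in the text), the Phred+33 offset is the common FASTQ encoding, and the demo uses synthetic, uniformly random scores rather than real sequencing data.

```python
import bisect
import random
import zlib

# Hypothetical bin layout (illustrative only, not taken from the white paper):
# each tuple is (lowest score in the bin, representative score written out).
BINS = [(0, 0), (2, 6), (10, 15), (19, 22), (26, 27), (30, 33), (35, 37), (40, 40)]
LOWER_BOUNDS = [low for low, _ in BINS]


def bin_quality(q: int) -> int:
    """Map a Phred score to the representative score of its bin."""
    if q < 0:
        raise ValueError("Phred scores are non-negative")
    index = bisect.bisect_right(LOWER_BOUNDS, q) - 1
    return BINS[index][1]


def bin_quality_string(qualities: str, offset: int = 33) -> str:
    """Re-encode a FASTQ quality string (Phred+33 by default) with binned scores."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qualities)


def compressed_sizes(n: int = 10_000, seed: int = 0) -> tuple[int, int]:
    """Return zlib-compressed sizes (raw, binned) for synthetic random scores."""
    rng = random.Random(seed)
    raw = "".join(chr(rng.randint(0, 41) + 33) for _ in range(n))
    binned = bin_quality_string(raw)
    return len(zlib.compress(raw.encode())), len(zlib.compress(binned.encode()))


raw_size, binned_size = compressed_sizes()
```

With only eight distinct symbols in the binned string, a general-purpose compressor such as zlib produces a smaller output than it does for the full-resolution string; the size of the saving on real data depends on the actual score distribution.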

The proposed method is presented as an improvement over existing approaches such as the CRAM, cSRA, and SlimGene formats, which often employ lossy compression based on post-alignment data. In contrast, the new method reduces quality score resolution before alignment, so the storage savings apply from the start of the analysis pipeline.

The full white paper is available for download from Illumina.
