Comparison of RNA Reconstruction Assemblers for Long-Read RNA Sequences

Open Access
- Author:
- Rao, Saadya
- Area of Honors:
- Computer Science
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Mingfu Shao, Thesis Supervisor
Ting He, Thesis Honors Advisor
- Keywords:
- RNA Transcript Assembly
Long-read RNA Sequencing
StringTie
StringTie2
Scallop
Scallop2
PsiCLASS
IsoQuant
- Abstract:
- Ribonucleic acid (RNA) plays a crucial role in gene expression and protein synthesis, facilitating the regulation of biological processes through splicing. Accurately deciphering RNA sequences is essential for understanding gene activity and its implications in genetic diseases. This study evaluates the performance of six RNA transcript assemblers (StringTie, Scallop, PsiCLASS, StringTie2, Scallop2, and IsoQuant) on long-read RNA sequencing datasets generated on Oxford Nanopore Technologies and Pacific Biosciences platforms. The datasets, comprising both annotated and unannotated samples, were processed using default parameters to ensure comparability. The assemblies were then assessed against GENCODE v36 annotations, using GffCompare to calculate sensitivity (recall) and precision at the base, exon, intron, intron chain, transcript, and locus levels. The results show that StringTie performed well, especially at the exon and intron levels, but showed some limitations in sensitivity. StringTie2 improved upon its predecessor's precision, reducing the number of false positives in transcript identification. In contrast, Scallop showed notably low sensitivity across all datasets, indicating significant difficulty in capturing true transcripts. Scallop2, however, balanced sensitivity and precision by maintaining relatively high levels of both metrics, making it a more reliable option. IsoQuant consistently exhibited high precision across all levels, particularly when compared to Scallop2 and StringTie2, highlighting its utility for researchers prioritizing accuracy in transcript identification. Although the overall sensitivity of the assemblers varied, the analysis shows that the performance metrics improve significantly with the use of annotated datasets. Additionally, the study underscores the importance of utilizing comprehensive reference genomes to enhance assembly quality.
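The sensitivity and precision metrics named in the abstract can be illustrated with a minimal sketch. It is not the thesis's pipeline and not GffCompare's implementation, whose matching rules differ in detail by level. It assumes an exact-match comparison of intron chains, and the `IntronChain` representation, the function name `sensitivity_precision`, and the coordinates in the example are all hypothetical.

```python
from typing import Iterable, Tuple

# Assumed representation (hypothetical): an intron chain is identified by its
# chromosome, strand, and the ordered (start, end) coordinates of its introns.
IntronChain = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def sensitivity_precision(
    reference: Iterable[IntronChain],
    predicted: Iterable[IntronChain],
) -> Tuple[float, float]:
    """Return (sensitivity, precision) under exact intron-chain matching.

    sensitivity = matched / number of reference chains  (recall)
    precision   = matched / number of predicted chains
    """
    ref_set = set(reference)
    pred_set = set(predicted)
    matched = len(ref_set & pred_set)  # chains present in both sets
    sensitivity = matched / len(ref_set) if ref_set else 0.0
    precision = matched / len(pred_set) if pred_set else 0.0
    return sensitivity, precision


# Made-up example: four annotated chains, three assembled, two in common.
reference_chains = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "+", ((100, 200), (350, 400))),
    ("chr2", "-", ((500, 600),)),
    ("chr3", "+", ((10, 20), (30, 40), (50, 60))),
]
predicted_chains = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr2", "-", ((500, 600),)),
    ("chr2", "-", ((500, 650),)),  # not in the annotation: a false positive
]

sens, prec = sensitivity_precision(reference_chains, predicted_chains)
print(f"sensitivity={sens:.3f} precision={prec:.3f}")
```

Under this simplified definition, an assembler that reports few but correct transcripts scores high precision and low sensitivity, which is the kind of trade-off the abstract describes between the tools.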