Random Forest Classification in Copy Number Variation Discovery

Open Access
- Author:
- Jayakar, Gopal
- Area of Honors:
- Biochemistry and Molecular Biology
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Santhosh Girirajan, Thesis Supervisor
Santhosh Girirajan, Thesis Honors Advisor
Shaun Mahony, Faculty Reader
- Keywords:
- CNV
Machine Learning
Sequencing
Genomics
Bioinformatics
Random Forest
- Abstract:
- As sequencing technologies and machine learning methods advance, the potential to diagnose genetic diseases and conditions increases. The leading genetic sequencing platforms, Illumina included, all generate their output as short genetic sequences called “reads”, on the order of tens to hundreds of base pairs long. To produce a sequence of the entire human genome, specialized software stitches these short reads together until a large contiguous genetic sequence can be output. This approach to genetic sequence elucidation has several weaknesses, including difficulty detecting a type of genetic anomaly called a Copy Number Variation (CNV). CNVs are duplications or deletions in the genome and are larger than Single Nucleotide Polymorphisms (SNPs). CNVs have been implicated in the etiology of a wide range of conditions, including Intellectual Disability (ID), Autism, and Schizophrenia. Although the conditions listed here are all neurodevelopmental, CNVs have the potential to affect any area of health. Accurate identification of CNVs from available sequencing data therefore offers diagnostic potential (Clancy, 2008). Currently available CNV identification algorithms have high false positive rates, potentially suggesting incorrect diagnoses. The individual algorithms also show little concordance with one another, which makes correct CNV determination (termed a “CNV call”) difficult without an external source of validation. This project attempted to create a higher quality CNV calling algorithm by first polling several extant CNV algorithms, comparing and combining their outputs, and using various quality-control metrics to generate a summary of their results. These results are used as features in a random forest machine learning model.
Machine learning is a process by which computers can be trained to perform arbitrary tasks, and the random forest model is a machine learning approach that assigns a categorical value to each input, using a gold standard as a reference to reinforce correct predictions. The “gold standard” used for comparison was microarray SNP data. In this instance, the random forest model is assigned the task of deciding whether each input is a duplication, a deletion, or not a CNV. Using this approach, a higher quantity of CNV calls was recovered, with greater precision and recall, than any of the individual algorithms could produce. With further refinement, the methods used in developing this algorithm could be used in medical practice to diagnose a variety of conditions with genetic origins.
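
The workflow described in the abstract can be illustrated with a minimal sketch, under several assumptions. The abstract does not name the callers that were polled, the features that were built, or the overlap rule used to compare calls, so the caller names, the 50% reciprocal-overlap threshold, the toy coordinates, and every function name below are hypothetical. The trained random forest itself is not reproduced: a naive majority vote across callers stands in for the classifier so that the example runs with the Python standard library alone. The sketch turns per-caller calls into one feature vector per candidate region, assigns each region one of the three classes, and scores the result against a gold-standard label list using precision and recall.

```python
# Hypothetical caller names; the abstract does not say which algorithms were polled.
CALLERS = ["caller_a", "caller_b", "caller_c"]


def overlap_fraction(a, b):
    """Fraction of interval a = (start, end) that is covered by interval b."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(shared, 0) / (a[1] - a[0])


def build_features(region, calls_by_caller, min_overlap=0.5):
    """One feature per caller: +1 duplication, -1 deletion, 0 no supporting call.

    A call supports the region only if the two intervals overlap reciprocally,
    i.e. each covers at least min_overlap of the other (an assumed rule).
    """
    features = []
    for caller in CALLERS:
        value = 0
        for start, end, kind in calls_by_caller.get(caller, []):
            call = (start, end)
            if (overlap_fraction(region, call) >= min_overlap
                    and overlap_fraction(call, region) >= min_overlap):
                value = 1 if kind == "duplication" else -1
                break
        features.append(value)
    return features


def majority_vote(features):
    """Naive stand-in for the trained classifier: the sign of the summed votes."""
    total = sum(features)
    if total > 0:
        return "duplication"
    if total < 0:
        return "deletion"
    return "none"


def precision_recall(predicted, truth, label):
    """Precision and recall for one class, given parallel label lists."""
    pairs = list(zip(predicted, truth))
    tp = sum(p == label and t == label for p, t in pairs)
    fp = sum(p == label and t != label for p, t in pairs)
    fn = sum(p != label and t == label for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for illustration only.
regions = [(100, 200), (500, 900), (1000, 1100)]
calls = {
    "caller_a": [(110, 210, "deletion"), (520, 880, "duplication")],
    "caller_b": [(95, 190, "deletion"), (1000, 1100, "duplication")],
    "caller_c": [(500, 900, "duplication")],
}
gold = ["deletion", "duplication", "none"]  # stand-in for microarray-derived labels

features = [build_features(region, calls) for region in regions]
predicted = [majority_vote(row) for row in features]
```

In this toy data the third region is supported by a single caller, so the vote reports a duplication where the gold standard has none. That false positive lowers the precision for the duplication class to 0.5 while its recall stays at 1.0. In the approach the abstract describes, feature rows like these, together with the quality-control metrics, would instead be passed to a random forest trained against the gold standard.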