SNPs/indels - GATK/FreeBayes
Confidence is based on several variant-calling parameters and takes into account a number of bias and quality metrics, including:
Quality - variant calling quality based on the likelihood of reads supporting the alternate allele vs the reference allele.
Strand bias (Fisher's exact test) - accounts for possible bias toward calls supported by reads from only one strand.
Mapping quality - rank sum test comparing the mapping quality of reads supporting the alternate allele with that of reads supporting the reference allele.
Read position bias - accounts for possible bias in where the alternate allele falls within the reads.
Total depth in the called region.
Base quality - rank sum test of the base qualities of all bases that support the alternate allele.
Allele balance - the fraction of reads supporting the alternate allele vs the reference allele, which should match the expected homozygous/heterozygous ratios (this is turned off in somatic analysis).
There are several other metrics which are specific to the different variant callers.
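As an illustration of two of the metrics above, here is a minimal Python sketch of a strand-bias score and an allele-balance fraction. It is not the callers' actual code: the function names are hypothetical, and the Phred scaling follows the convention GATK uses when reporting its Fisher strand annotation.

```python
from math import comb, log10

def fisher_strand_pvalue(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test on the 2x2 table of allele vs strand."""
    n_ref = ref_fwd + ref_rev
    n_alt = alt_fwd + alt_rev
    n_fwd = ref_fwd + alt_fwd
    denom = comb(n_ref + n_alt, n_fwd)

    def prob(x):
        # Hypergeometric probability that x of the reference reads are forward,
        # given the fixed row and column totals.
        return comb(n_ref, x) * comb(n_alt, n_fwd - x) / denom

    observed = prob(ref_fwd)
    lo = max(0, n_fwd - n_alt)
    hi = min(n_ref, n_fwd)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-7))
    return min(1.0, p)

def phred(p):
    """Phred-scale a p-value; higher means stronger evidence of bias."""
    return -10 * log10(max(p, 1e-300))

def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0

# Alternate allele seen on the forward strand only: strong strand bias.
biased_p = fisher_strand_pvalue(ref_fwd=20, ref_rev=20, alt_fwd=20, alt_rev=0)
# Alternate allele seen equally on both strands: no evidence of bias.
balanced_p = fisher_strand_pvalue(ref_fwd=10, ref_rev=10, alt_fwd=10, alt_rev=10)
```

An allele balance near 0.5 is what a heterozygous call is expected to show, and a value near 1.0 is expected for a homozygous alternate call.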
In addition to the variant-caller-specific metrics, we also use:
VQSR (Variant Quality Score Recalibration) - run on a cohort of samples, using ML to score each variant based on its annotation values. In effect, it learns the thresholds for the metrics above from the data.
Region annotation - adjusts confidence in, for example, repetitive regions, homopolymers (e.g. CCCCC..), noisy regions and hard-to-sequence regions, and according to internal/Genoox frequency, etc.
Joint genotyping on families.
Whether or not the variant is flagged as false in public or internal databases (gnomAD, for example).
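The homopolymer case mentioned under region annotation can be illustrated with a short Python sketch. The function names and the minimum run length of 5 are assumptions made for the example, not the values used in the actual pipeline.

```python
from itertools import groupby

def longest_homopolymer(seq):
    """Return (base, length) of the longest run of one repeated base in seq."""
    best_base, best_len = "", 0
    for base, run in groupby(seq.upper()):
        n = sum(1 for _ in run)
        if n > best_len:
            best_base, best_len = base, n
    return best_base, best_len

def in_homopolymer(seq, min_run=5):
    """True if the sequence context contains a run of at least min_run bases.

    min_run=5 is an illustrative threshold only.
    """
    return longest_homopolymer(seq)[1] >= min_run

# Reference context around a hypothetical variant position.
context = "ACCCCCGT"
flagged = in_homopolymer(context)
```

A variant whose surrounding reference context is flagged this way would be a candidate for reduced confidence.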
CNVs
We use an ML-based model prediction score; the confidence is derived as part of the model building and Genoox proprietary algorithms. In addition, we use the log fold change of the coverage between the sample and the reference, after correcting for other factors such as GC content and repetitive regions.
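The log fold change of coverage can be sketched in Python as follows. This is a simplified illustration: the function name and the pseudocount are assumptions, and the GC-content and repeat corrections described above are not shown.

```python
from math import log2

def log2_fold_change(sample_cov, reference_cov, pseudocount=0.5):
    """log2 ratio of sample coverage to reference coverage for one region.

    The pseudocount (an illustrative choice) avoids division by zero and
    log(0) in regions with no coverage.
    """
    return log2((sample_cov + pseudocount) / (reference_cov + pseudocount))

# Coverage values are assumed to be already normalised for library size.
neutral = log2_fold_change(100, 100)               # no copy-number change
loss = log2_fold_change(50, 100, pseudocount=0)    # half the coverage
gain = log2_fold_change(150, 100, pseudocount=0)   # 1.5x the coverage
```

In a diploid region, a value near 0 suggests no change, a value near -1 is consistent with the loss of one copy, and a value near 0.58 with the gain of one copy.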
In somatic CNV calling, tumor purity (both estimated and provided) is also considered in the confidence score, as is the subclonal population.
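To show why tumor purity matters, the Python sketch below uses a commonly used relation between copy number, purity and the expected log2 coverage ratio. It assumes a diploid normal genome and ignores overall tumor ploidy and subclones. It is an illustration only, not the Genoox model, and the function name is hypothetical.

```python
from math import log2

def expected_log2_ratio(tumor_copies, purity, normal_copies=2):
    """Expected log2 coverage ratio for a region, given tumor purity.

    The sample is modelled as a mix of tumor cells (fraction = purity)
    carrying tumor_copies and normal cells carrying normal_copies.
    """
    mixed = purity * tumor_copies + (1 - purity) * normal_copies
    return log2(mixed / normal_copies)

# The same one-copy loss gives a weaker signal as purity drops.
pure = expected_log2_ratio(tumor_copies=1, purity=1.0)
half = expected_log2_ratio(tumor_copies=1, purity=0.5)
low = expected_log2_ratio(tumor_copies=1, purity=0.2)
```

At low purity the expected shift moves closer to 0, so the same event is harder to tell apart from noise.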
Fusions
In general, confidence is based on the number, mapping quality and alignment score of reads/pairs that support the fusion vs the reference. The supporting evidence is broken down into several types:
Number of paired reads that are mapped to each side of the fusion.
Number of single reads that are split between each side of the fusion (hard clipped).
Number of soft clipped reads - soft clipped reads whose clipped region matches the mate's sequence.
We also take into consideration noise in the breakend region, such as reads of the above types that map to other genome positions, or that are clipped or only partially mapped. A high level of noise reduces the confidence in the breakend.
Re-assembly of the reads above - instead of counting reads, we count assemblies on each side of the fusion.
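The clipped-read evidence types above can be told apart from the CIGAR string of each alignment, where `S` marks a soft clip and `H` a hard clip. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and it does not check mapping position, mate sequence or alignment score as a real fusion caller would.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clip_type(cigar):
    """Classify an alignment by the clipping at either end of its CIGAR."""
    ops = CIGAR_OP.findall(cigar)
    if not ops:
        return "no_alignment"  # e.g. "*" for an unaligned read
    edge_ops = {ops[0][1], ops[-1][1]}
    if "H" in edge_ops:
        return "hard_clipped"
    if "S" in edge_ops:
        return "soft_clipped"
    return "unclipped"

def count_evidence(cigars):
    """Count alignments of each clipping type near a candidate breakend."""
    return Counter(clip_type(c) for c in cigars)

# Hypothetical CIGAR strings for reads overlapping one side of a fusion.
counts = count_evidence(["100M", "60M40S", "40H60M", "30S70M", "*"])
```

Counts like these would be only one input. The number of read pairs spanning the fusion and the noise level around the breakend would be assessed separately.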