Understanding QC Metrics in Franklin

Sex Detection

Genetic sex is inferred from sequencing data using either the presence of Y chromosome markers (such as the SRY gene) or the zygosity distribution of variants on the X chromosome. Coverage of the SRY gene enables direct classification, while in its absence, zygosity analysis is used. If the data do not meet the reliability threshold, the result is reported as Unknown.

When coverage-based data is available:

If the SRY gene is covered by sequencing → Male
If the SRY gene is not covered → Female

When coverage-based data is not available:

If the number of variants on chromosome X is insufficient for zygosity-based analysis → Unknown
If chromosome X has mostly homozygous SNPs (homozygous-to-heterozygous SNP ratio > threshold) → XY
If chromosome X appears heterozygous → XX
If the sample qualifies for zygosity-based analysis but does not meet any criteria above → Unknown
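
The decision rules above can be sketched as a small function. This is only an illustration of the flow, not Franklin's implementation: the function name, the parameter names, and all default threshold values are hypothetical, since the actual thresholds are not stated here. In particular, the sketch assumes that "appears heterozygous" means a hom/het ratio below a second, lower cutoff.

```python
from typing import Optional


def infer_sex(sry_covered: Optional[bool],
              x_hom: int = 0,
              x_het: int = 0,
              min_x_variants: int = 50,    # hypothetical reliability threshold
              xy_min_ratio: float = 4.0,   # hypothetical hom/het cutoff for XY
              xx_max_ratio: float = 2.0):  # hypothetical hom/het cutoff for XX
    """Return a sex call following the rules described above.

    sry_covered is None when coverage-based data is not available.
    x_hom / x_het are counts of homozygous / heterozygous SNPs on chromosome X.
    """
    # Coverage-based path: SRY coverage decides directly.
    if sry_covered is not None:
        return "Male" if sry_covered else "Female"

    # Zygosity-based path: require enough chromosome X variants first.
    if x_hom + x_het < min_x_variants:
        return "Unknown"

    ratio = x_hom / x_het if x_het else float("inf")
    if ratio > xy_min_ratio:
        return "XY"   # mostly homozygous chromosome X SNPs
    if ratio < xx_max_ratio:
        return "XX"   # chromosome X appears heterozygous
    return "Unknown"  # qualifies for analysis but meets no criterion
```

For example, with these placeholder thresholds, infer_sex(None, x_hom=90, x_het=10) falls on the XY branch, while infer_sex(None, x_hom=75, x_het=25) lands between the two cutoffs and is reported as Unknown.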

Note: For small targeted panels, sex chromosome ploidy is not calculated; only sex detection is provided.

Average Depth (RefSeq Exome)

Represents the average sequencing depth across the coding regions of the RefSeq exome, with a padding of ±10 base pairs around each exon. This metric excludes untranslated regions (UTRs) and provides an overall view of how well the key exonic regions have been covered.

Included reads: aligned reads with mapping quality (MQ) ≥ 1, excluding duplicate reads.
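
As a rough illustration of how such a depth figure can be derived, the sketch below sums the bases that qualifying reads contribute to padded exon intervals and divides by the total interval length. The function name, the tuple layout of the reads, and the half-open coordinate convention are assumptions made for the example; only the ±10 bp padding, the MQ ≥ 1 filter, and the duplicate exclusion come from the description above.

```python
def average_depth(reads, exons, padding=10, min_mapq=1):
    """Mean per-base depth over padded exon intervals on one chromosome.

    reads: iterable of (start, end, mapq, is_duplicate), half-open coordinates.
    exons: list of (start, end) half-open intervals, assumed not to overlap
           each other after padding.
    """
    total_bases = 0
    covered_bases = 0
    for exon_start, exon_end in exons:
        lo, hi = exon_start - padding, exon_end + padding
        total_bases += hi - lo
        for start, end, mapq, is_duplicate in reads:
            # Apply the read filters: MQ >= 1 and not a duplicate.
            if mapq < min_mapq or is_duplicate:
                continue
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                covered_bases += overlap
    # Sum of per-base depths divided by the number of bases.
    return covered_bases / total_bases if total_bases else 0.0
```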

Average Depth (Panel)

Represents the average sequencing depth across the regions defined in the hard panel: gene list or BED file. This provides a targeted view of coverage for genes or regions defined as a clinical target.

Included reads: aligned reads with mapping quality (MQ) ≥ 1, excluding duplicate reads.

Average Variant Depth

Represents the average sequencing depth, determined by calculating the mean read depth of all variants detected within exons or within ±10 base pairs of exon–intron junctions.

Hom/Het Ratio

The ratio of homozygous to heterozygous variants. This ratio is used as a measure of variant calling accuracy.

For whole-exome analysis (diploid autosomes), this ratio is expected to fall between 0.45 and 0.8.

A high hom/het ratio might indicate consanguinity, population-specific low diversity, or artifacts of sample contamination filtering. A low hom/het ratio suggests contamination or sequencing artifacts that artificially inflate heterozygous calls.
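
A minimal sketch of the calculation, assuming genotypes are available as VCF-style strings such as 0/1 or 1/1. The function names are hypothetical, and the choice to skip homozygous-reference and missing genotypes is an assumption of the example rather than a documented Franklin rule; the 0.45 to 0.8 range is the one quoted above.

```python
def hom_het_ratio(genotypes):
    """Ratio of homozygous-alternate to heterozygous calls.

    genotypes: iterable of diploid genotype strings, e.g. '0/1', '1/1', '1|0'.
    Returns None when there are no heterozygous calls.
    """
    hom = het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid genotypes
        if alleles[0] == alleles[1]:
            if alleles[0] != "0":
                hom += 1  # homozygous for an alternate allele
        else:
            het += 1
    return hom / het if het else None


def in_expected_range(ratio, low=0.45, high=0.8):
    """Check a ratio against the whole-exome range quoted above."""
    return ratio is not None and low <= ratio <= high
```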

Ti/Tv Ratio

The ratio of the number of transition substitutions to the number of transversion substitutions. This ratio is used as a measure of variant calling accuracy.

It reflects the ratio between two types of single nucleotide substitutions:

Transitions (Ti): Substitutions between purines (A ↔ G) or between pyrimidines (C ↔ T). These are more common due to similar molecular structures.

Transversions (Tv): Substitutions between a purine and a pyrimidine (A ↔ C, A ↔ T, G ↔ C, G ↔ T). These occur less frequently and are more likely to be sequencing or calling artifacts when in excess.
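
The classification above can be expressed in a few lines. This is an illustrative sketch with hypothetical function names; it considers only single-base substitutions and ignores any other variant type passed in.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T, False for any purine<->pyrimidine change."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base substitutions."""
    ti = tv = 0
    for ref, alt in snps:
        bases = PURINES | PYRIMIDINES
        if ref.upper() not in bases or alt.upper() not in bases:
            continue  # not a single-nucleotide substitution
        if ref.upper() == alt.upper():
            continue  # no change, not a substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```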

Variant Quality

Shows the percentage of reported variants in a sample with a Phred quality score > 40*, representing extremely high-confidence calls.

*Phred score 40 = 1 in 10,000 error probability (≤0.01%).
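
The relationship between a Phred score Q and the error probability is P = 10^(-Q/10). The sketch below shows that conversion and how the percentage of variants above the threshold could be computed; the function names are hypothetical, and the input is assumed to be a plain list of per-variant quality scores.

```python
def phred_to_error_probability(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def percent_high_quality(quals, threshold=40):
    """Percentage of variants with a quality score strictly above threshold."""
    if not quals:
        return 0.0
    return 100 * sum(q > threshold for q in quals) / len(quals)
```

With this conversion, a score of 40 corresponds to an error probability of 0.0001, that is, 1 in 10,000.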

Number of SNPs (Excluding Low Quality)

Number of SNPs calculated by counting all SNPs that are classified as High, Medium or Low Confidence. This includes variants located in any region of the panel (not limited to coding or splice sites).

Note: This metric is distinct from those used to calculate Variant Quality or Average Variant Depth, which only include variants in exonic and splice regions.

Percent of Panel Covered (Depth ≥X)

Indicates the proportion of bases in the hard panel covered by at least X reads. The threshold X is configurable, for example ≥5, ≥10, or ≥30. The calculation includes reads with MQ ≥ 1 and excludes duplicate reads. The panel regions are defined by a preconfigured BED file or a gene list with exon padding, depending on the panel configuration.
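
As an illustration, given a list of per-base depths over the panel (already computed from reads that pass the MQ and duplicate filters), the metric reduces to a simple count. The function name and input layout are assumptions made for this sketch.

```python
def percent_covered(per_base_depth, min_depth):
    """Percentage of panel bases with depth >= min_depth."""
    n = len(per_base_depth)
    if n == 0:
        return 0.0
    return 100 * sum(d >= min_depth for d in per_base_depth) / n


# Example: eight panel bases evaluated at three thresholds.
depths = [0, 4, 5, 12, 30, 31, 9, 10]
coverage_by_threshold = {x: percent_covered(depths, x) for x in (5, 10, 30)}
```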

Median Coverage Depth

Reflects the median sequencing depth across all target bases in the assay's BED file: half of the bases are covered at a higher depth and half at a lower depth. Includes reads with MQ ≥ 1 and excludes duplicate reads.

Total Reads

The total number of sequencing reads generated for the sample, before filtering or alignment. Includes all reads, regardless of mapping quality or duplication status.

Reads Mapped to Reference

Shows how many reads successfully aligned to the reference genome, serving as an indicator of mapping efficiency.

On-Target Reads and Unique On-Target Reads

On-Target Reads: the number of reads that mapped to the assay target regions defined by the assay BED file.
Unique On-Target Reads: the non-duplicate reads that mapped to the assay target regions; a more accurate measure of usable sequencing data.
Default threshold setting: ≥1 base of overlap with a target region.
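
The ≥1 base overlap rule can be sketched as an interval check. The function names, the tuple layout, and the half-open coordinate convention (as in BED files) are assumptions of this example, not a description of Franklin's internals.

```python
def overlaps_target(read_start, read_end, targets, min_overlap=1):
    """True if the read shares at least min_overlap bases with any target.

    Coordinates are half-open: [start, end).
    """
    return any(
        min(read_end, t_end) - max(read_start, t_start) >= min_overlap
        for t_start, t_end in targets
    )


def count_on_target(reads, targets):
    """reads: iterable of (start, end, is_duplicate).

    Returns (on_target_reads, unique_on_target_reads).
    """
    on_target = unique = 0
    for start, end, is_duplicate in reads:
        if overlaps_target(start, end, targets):
            on_target += 1
            if not is_duplicate:
                unique += 1
    return on_target, unique
```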

Duplicate Reads

Represents reads flagged as duplicates in the aligned data: reads with identical start positions and sequence content, marked with the SAM flag 0x400. Duplicates typically indicate PCR artifacts or redundant fragments.
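
In the SAM format, 0x400 (decimal 1024) is the FLAG bit for a PCR or optical duplicate, so the check is a bitwise test. A minimal sketch, with hypothetical function names and plain integer flags as input:

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def is_duplicate(flag):
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def count_duplicates(flags):
    """Number of reads whose FLAG has the duplicate bit set."""
    return sum(is_duplicate(f) for f in flags)
```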

Percent Reads Mapped to Reference

Indicates the proportion of all reads that successfully aligned to the reference genome: (Reads Mapped to Reference ÷ Total Reads) × 100

Percent Unique Reads Mapped On-Target

Represents the fraction of total reads that are both non-duplicate and mapped to assay target regions: (Unique On-Target Reads ÷ Total Reads) × 100

Percent Reads Mapped On-Target

Indicates the proportion of all reads (including duplicates) that map to assay target regions: (On-Target Reads ÷ Total Reads) × 100

Percent Duplication

Measures the proportion of sequencing reads marked as duplicates, indicating potential PCR bias.
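
The percentage metrics in the preceding sections all derive from the read counts, as the sketch below shows. The function and key names are hypothetical. The first three formulas follow the definitions given above; the denominator for Percent Duplication is not stated in this article, so dividing by Total Reads is an assumption of the example.

```python
def read_percentages(total, mapped, on_target, unique_on_target, duplicates):
    """Derive percentage QC metrics from raw read counts."""
    def pct(count):
        return 100 * count / total if total else 0.0

    return {
        "percent_mapped": pct(mapped),
        "percent_on_target": pct(on_target),
        "percent_unique_on_target": pct(unique_on_target),
        # Assumption: duplicates are expressed relative to total reads.
        "percent_duplication": pct(duplicates),
    }


example = read_percentages(
    total=1000, mapped=980, on_target=700, unique_on_target=630, duplicates=120
)
```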

Mean Insert Size

Represents the average fragment length of sequenced DNA inserts, estimated from the distance between the alignments of paired-end reads. It can be useful for evaluating library prep consistency.

Median Insert Size

Shows the median fragment length of sequenced DNA, which is less affected by extreme values and outliers than the mean. This QC metric can help detect fragmentation or adapter contamination issues.
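
To illustrate why the median is more robust than the mean, the sketch below summarizes a list of template lengths such as the TLEN field of a SAM record, which is signed and set to 0 when unavailable. The function name and the input values are hypothetical.

```python
from statistics import mean, median


def insert_size_summary(template_lengths):
    """Return (mean, median) of absolute, non-zero template lengths."""
    sizes = [abs(t) for t in template_lengths if t != 0]
    if not sizes:
        return None, None
    return mean(sizes), median(sizes)


# One chimeric-looking outlier (5000) pulls the mean far above the median.
mean_size, median_size = insert_size_summary([300, -310, 290, 305, 0, 5000])
```

In this example the mean is 1241 while the median stays at 305, close to the typical fragment length.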

Contamination

Estimates the level of DNA contamination in a sample, which is often due to cross-sample mixing or handling issues. Elevated contamination can reduce variant calling accuracy. The analysis is performed by GATK CalculateContamination. See the GATK documentation for full details.

To learn more about CNV calling QC metrics, please refer to this article: CNV calling QC metrics.
