Normalization

Certain insertion or deletion alleles may be represented ambiguously when using conventional sequence normalization, resulting in significant challenges when comparing such alleles.

The VR-Spec describes a “fully-justified” normalization algorithm inspired by NCBI’s Variant Overprecision Correction Algorithm [1]. Fully-justified normalization expands an ambiguous representation over the entire region of ambiguity, yielding an unambiguous representation that may be readily compared with other alleles.

The VR-Spec RECOMMENDS that Alleles at precise locations be normalized to a fully-justified form unless there is a compelling reason to do otherwise.

The process for fully justifying two alleles (reference sequence and alternate sequence) at an interval is outlined below.

  1. Trim sequences:
    • Remove suffixes common to all alleles, if any. Decrement the interval end position by the length of the trimmed suffix.
    • Remove prefixes common to all alleles, if any. Increment the interval start position by the length of the trimmed prefix.
    • If neither allele is empty after trimming, the alleles share no common prefix or suffix and represent a substitution. Further normalization is not applicable, and the trimmed alleles are returned.
  2. Determine bounds of ambiguity:
    • Left roll: While the terminal base of all non-empty alleles is equal to the base prior to the current position, circularly permute all alleles rightward and move the current position leftward. When terminating, return left_roll, the number of steps rolled leftward.
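    • Right roll: Symmetric case of left roll. While the initial base of all non-empty alleles is equal to the base following the current position, circularly permute all alleles leftward and move the current position rightward. When terminating, return right_roll, the number of steps rolled rightward.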
  3. Update position and alleles:
    • To each trimmed allele, prepend the left_roll bases prior to the trimmed allele position and append the right_roll bases after the trimmed allele position.
    • Expand the trimmed allele position by decrementing the start by left_roll and incrementing the end by right_roll.
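
The outline above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names `trim` and `fully_justify` are hypothetical, coordinates are assumed to be zero-based interbase positions on an in-memory reference string, and the rolls are computed by indexing into the single non-empty allele rather than by physically permuting it.

```python
def trim(ref, alt, start, end):
    """Step 1: remove the common suffix, then the common prefix."""
    # Common suffix: shrink the interval from the right.
    n = 0
    while n < min(len(ref), len(alt)) and ref[-1 - n] == alt[-1 - n]:
        n += 1
    if n:
        ref, alt, end = ref[:-n], alt[:-n], end - n
    # Common prefix: shrink the interval from the left.
    n = 0
    while n < min(len(ref), len(alt)) and ref[n] == alt[n]:
        n += 1
    return ref[n:], alt[n:], start + n, end


def fully_justify(sequence, start, end, alt):
    """Return (start, end, ref, alt) expanded over the region of ambiguity."""
    ref, alt, start, end = trim(sequence[start:end], alt, start, end)
    if bool(ref) == bool(alt):
        # Both non-empty (substitution) or both empty (no change):
        # rolling does not apply, so return the trimmed alleles.
        return start, end, ref, alt

    allele = ref or alt  # the single non-empty allele
    n = len(allele)

    # Step 2a, left roll: after k right-rotations the terminal base of the
    # allele is allele[(n - 1 - k) % n]; compare it with the preceding base.
    left_roll = 0
    while (start - left_roll > 0
           and sequence[start - left_roll - 1] == allele[(n - 1 - left_roll) % n]):
        left_roll += 1

    # Step 2b, right roll: after k left-rotations the initial base of the
    # allele is allele[k % n]; compare it with the following base.
    right_roll = 0
    while (end + right_roll < len(sequence)
           and sequence[end + right_roll] == allele[right_roll % n]):
        right_roll += 1

    # Step 3: pad both alleles with the flanking reference bases.
    prefix = sequence[start - left_roll:start]
    suffix = sequence[end:end + right_roll]
    return (start - left_roll, end + right_roll,
            prefix + ref + suffix, prefix + alt + suffix)


print(fully_justify("TCAGCAGCT", 4, 6, "CAGCA"))
```

Because this sketch trims the suffix before the prefix, as in the outline, its intermediate trimmed allele and roll counts can differ from those of an implementation that trims in another order; the fully justified result should be the same.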

VR Justified Normalization: a demonstration of fully justifying an insertion allele.

Each step below gives three entries in order: Steps; Interbase Position and Alleles; Resulting Allele Set (all alleles in this entry result in the same empirical sequence change).

  0. Given allele S:g.5_6delinsCAGCA defined on reference sequence S=TCAGCAGCT
(4,6)
(“CA”, “CAGCA”)
\[TCAG \Bigl[ \frac{CA}{CAGCA} \Bigr] GCT\]
  1. Trimming

    Remove prefix common to all alleles, if any, and update start position. Remove suffix common to all alleles, if any, and update end position.

    Note: This example shows removing C prefix and A suffix. Equivalently in this case, CA prefix or CA suffix could be removed.

(5,5)
(“”, “AGC”)
\[TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\]
  2. Condition: One allele must be empty.

    If the reference allele is empty, the allele set represents an insertion in the reference.

    If the alternate allele is empty, the allele set represents a deletion in the reference.

    If neither is true, the allele set represents a substitution, which is not subject to further normalization.

   
  3. Roll Left

    Begin with trimmed alleles ①.

    While the terminal base of all non-empty alleles equals the base prior to the current position, circularly permute all alleles right one step and move the start left one position.

    Shown: The 4 incremental steps of rolling left.

(1,1)
(“”, “CAG”)
\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\\ TCAG \Bigl[ \frac{}{CAG} \Bigr] CAGCT \\ TCA \Bigl[ \frac{}{GCA} \Bigr] GCAGCT \\ TC \Bigl[ \frac{}{AGC} \Bigr] AGCAGCT \\ T \Bigl[ \frac{}{CAG} \Bigr] CAGCAGCT \\ \Rightarrow left\_roll = 4\end{split}\]
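  4. Roll Right

    Symmetric case of Roll Left. Begin with trimmed alleles ①. While the initial base of all non-empty alleles equals the base following the current position, circularly permute all alleles left one step and move the end right one position.

    Shown: The 3 incremental steps of rolling right.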

(8,8)
(“”, “AGC”)
\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\\ TCAGCA \Bigl[ \frac{}{GCA} \Bigr] GCT \\ TCAGCAG \Bigl[ \frac{}{CAG} \Bigr] CT \\ TCAGCAGC \Bigl[ \frac{}{AGC} \Bigr] T \\ \Rightarrow right\_roll = 3\end{split}\]
  5. Update position and alleles to fully justify within region of ambiguity.

    To each trimmed allele (①), prepend the left_roll preceding reference bases and append the right_roll following reference bases (corresponding to the interbase reference spans (1,5) and (5,8) respectively).

    Decrement the start position by left_roll, and increment the end position by right_roll.

(1,8)
(“CAGCAGC”, “CAGCAGCAGC”)
\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\\ T \Bigl[ \frac{CAGCAGC}{CAGCAGCAGC} \Bigr] T\end{split}\]
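
As a quick check on the demonstration above, the following sketch applies each representation shown in the table to the reference sequence and confirms that all of them produce the same resulting sequence. The helper `apply_allele` is a hypothetical name introduced for illustration, and positions are assumed to be zero-based interbase coordinates.

```python
def apply_allele(sequence, start, end, alt):
    """Replace the interbase interval (start, end) of `sequence` with `alt`."""
    return sequence[:start] + alt + sequence[end:]


S = "TCAGCAGCT"

# (start, end, alternate allele) for each representation in the table.
representations = [
    (4, 6, "CAGCA"),       # allele as given
    (5, 5, "AGC"),         # after trimming
    (4, 4, "CAG"),         # left roll, step 1
    (3, 3, "GCA"),         # left roll, step 2
    (2, 2, "AGC"),         # left roll, step 3
    (1, 1, "CAG"),         # left roll, step 4
    (6, 6, "GCA"),         # right roll, step 1
    (7, 7, "CAG"),         # right roll, step 2
    (8, 8, "AGC"),         # right roll, step 3
    (1, 8, "CAGCAGCAGC"),  # fully justified
]

results = {apply_allele(S, start, end, alt) for start, end, alt in representations}
print(results)  # a single-element set means every representation is equivalent
```

Only the fully justified form spans the whole region of ambiguity, which is why it can be compared directly with other alleles without first rolling them.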

References

[1] Holmes, J. B., Moyer, E., Phan, L., Maglott, D. & Kattman, B. L. SPDI: Data Model for Variants and Applications at NCBI. bioRxiv 537449 (2019). doi:10.1101/537449