VRS covers a fundamental subset of data types to represent variation, thus far predominantly related to the replacement of a subsequence in a reference sequence. Increasing its applicability will require supporting more complex types of variation, including:
- alternative coordinate types such as nested ranges
- feature-based coordinates such as genes, cytogenetic bands, and exons
- copy number variation
- structural variation
- mosaicism and chimerism
- rule-based variation
The following sections preview concepts that are planned or under way to support a broader representation of variation.
Intervals and Locations
VRS uses Interval (Abstract Class) and Location (Abstract Class) subclasses to define where variation occurs. The schema is designed to be extensible to new kinds of Intervals and Locations in order to support, for example, fuzzy coordinates or feature-based locations.
A NestedInterval is an Interval (Abstract Class) composed of an inner and an outer SimpleInterval. The NestedInterval allows for the definition of “fuzzy” range endpoints by designating a potentially included region (the outer SimpleInterval) and a required included region (the inner SimpleInterval).
|Field||Type||Limits||Description|
|type||string||1..1||Interval type; MUST be set to ‘NestedInterval’|
- Implementations MUST enforce values 0 ≤ outer.start ≤ inner.start ≤ inner.end ≤ outer.end. In the case of double-stranded DNA, this constraint holds even when a feature is on the complementary strand.
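The ordering constraint above can be checked at construction time. The following is a minimal sketch, not part of the specification: the Python class layout and the `type` value on `SimpleInterval` are assumptions made for illustration, while the field names `inner`, `outer`, `start`, and `end` follow the constraint as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleInterval:
    start: int
    end: int
    type: str = "SimpleInterval"  # assumed type label for this sketch


@dataclass(frozen=True)
class NestedInterval:
    inner: SimpleInterval  # region that is required to be included
    outer: SimpleInterval  # region that is potentially included
    type: str = "NestedInterval"

    def __post_init__(self):
        # Enforce 0 <= outer.start <= inner.start <= inner.end <= outer.end
        # by checking that each bound is no greater than the next one.
        bounds = (
            0,
            self.outer.start,
            self.inner.start,
            self.inner.end,
            self.outer.end,
        )
        if any(a > b for a, b in zip(bounds, bounds[1:])):
            raise ValueError(
                "expected 0 <= outer.start <= inner.start "
                "<= inner.end <= outer.end"
            )


# A fuzzy range: the variation certainly covers 20..30 and may extend to 10..40.
fuzzy = NestedInterval(inner=SimpleInterval(20, 30), outer=SimpleInterval(10, 40))
```

Rejecting invalid values in the constructor means that no invalid NestedInterval can exist downstream, which is one way to satisfy the "MUST enforce" wording; validation at serialization time would be an equally reasonable choice.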
Representation of complex coordinates based on relative locations or offsets from a known location. Examples include “left of” a given position and intronic positions measured from intron-exon junctions.
Imprecise chromosomal locations based on chromosomal staining.
Cytogenetic bands are defined by a chromosome name, band, and sub-band. In VRS, a cytogenetic location is an interval on a single chromosome with a start and end band and sub-band.
A GeneLocation is the symbolic location of a gene.
A gene location is defined by reference to a gene identifier from NCBI, Ensembl, HGNC, or another trusted public authority.
|Field||Type||Limits||Description|
|_id||CURIE||0..1||Location Id; MUST be unique within document|
|type||string||1..1||Location type; MUST be set to ‘GeneLocation’|
|gene_id||CURIE||1..1||CURIE-formatted gene identifier using NCBI numeric gene id.|
- gene_id MUST be specified as a CURIE, using a CURIE prefix of “NCBI” and CURIE reference with the numeric gene id. Other trusted authorities MAY be permitted in future releases.
- GeneLocations MAY be converted to SequenceLocation using external data. The source of such data and mechanism for implementation is not defined by this specification.
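As an illustration of the `gene_id` rule above, the sketch below models a GeneLocation and rejects identifiers that are not an "NCBI"-prefixed CURIE with a numeric reference. The class layout and the regular expression are assumptions for this example; the specification text does not prescribe an implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed CURIE shape for this sketch: the prefix "NCBI", a colon,
# and a numeric gene id as the reference.
_NCBI_GENE_CURIE = re.compile(r"NCBI:[0-9]+")


@dataclass(frozen=True)
class GeneLocation:
    gene_id: str
    _id: Optional[str] = None  # optional; must be unique within a document
    type: str = "GeneLocation"

    def __post_init__(self):
        if not _NCBI_GENE_CURIE.fullmatch(self.gene_id):
            raise ValueError(
                f"gene_id must look like 'NCBI:<numeric id>', got {self.gene_id!r}"
            )


location = GeneLocation(gene_id="NCBI:672")
```

Converting such a GeneLocation to a SequenceLocation would require an external gene-to-coordinate source, which is deliberately left out here because the text does not define one.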
Additional State (Abstract Class) concepts are planned for future consideration in the specification.
This concept is being refined. Please comment at https://github.com/ga4gh/vr-spec/issues/46.
CNVState represents variations in the number of copies of a segment of DNA. Copy number variations cover copy losses or gains at known or unknown locations (including tandem repeats). Variations MAY occur at precise SequenceLocations, within nested intervals, or at GeneLocations. There is no lower or upper bound on CNV sizes.
|Field||Type||Limits||Description|
|type||string||1..1||State type; MUST be set to ‘CNVState’|
|location||Location (Abstract Class)||1..1||The Location of the copy (‘null’ if unknown)|
|min_copies||int||1..1||The minimum number of copies|
|max_copies||int||1..1||The maximum number of copies|
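The fields in the table above could be modeled as follows. This is a sketch under stated assumptions: the location is represented as a plain dictionary for brevity, and the check that `min_copies` is non-negative and no greater than `max_copies` is an assumption of this example, since the concept is still being refined and the text states no such rule.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CNVState:
    location: Optional[dict]  # a Location object, or None ('null') if unknown
    min_copies: int
    max_copies: int
    type: str = "CNVState"

    def __post_init__(self):
        # Assumption of this sketch, not a rule from the specification:
        # copy counts are non-negative and form a valid range.
        if self.min_copies < 0 or self.min_copies > self.max_copies:
            raise ValueError("expected 0 <= min_copies <= max_copies")


# A copy gain of between 3 and 5 copies at a gene location.
gain = CNVState(
    location={"type": "GeneLocation", "gene_id": "NCBI:672"},
    min_copies=3,
    max_copies=5,
)

# An exact copy count at an unknown location: min and max coincide.
unplaced = CNVState(location=None, min_copies=2, max_copies=2)

# Plain-dictionary form, ready for JSON serialization.
gain_as_dict = asdict(gain)
```

Using a range rather than a single count lets one record express both exact and imprecise copy numbers.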
This concept is being refined. Please comment at https://github.com/ga4gh/vr-spec/issues/103
The aberrant joining of two segments of DNA that are not typically contiguous. In the context of joining two distinct coding sequences, translocations result in a gene fusion, which is also covered by this VRS definition.
A joining of two sequences is defined by two Location (Abstract Class) objects and an indication of the join “pattern” (advice needed on conventional terminology, if any).
Under consideration. See https://github.com/ga4gh/vr-spec/issues/28.
t(9;22)(q34;q11) in BCR-ABL
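To show how a cytogenetic description like the one above might map onto the two locations that define a joining, here is a minimal parsing sketch. The function name `parse_translocation` and the returned tuple layout are hypothetical, and the pattern only handles the simple two-chromosome short form; it is not a general parser for cytogenetic nomenclature, and it does not represent the join "pattern", which is still under consideration.

```python
import re

# Simple two-breakpoint short form only, e.g. "t(9;22)(q34;q11)":
# two chromosomes, then one band per chromosome in the same order.
_TRANSLOCATION = re.compile(
    r"t\((\d{1,2}|[XY]);(\d{1,2}|[XY])\)"
    r"\(([pq]\d+(?:\.\d+)?);([pq]\d+(?:\.\d+)?)\)"
)


def parse_translocation(description):
    """Return [(chromosome, band), (chromosome, band)] for a simple translocation."""
    match = _TRANSLOCATION.fullmatch(description.strip())
    if match is None:
        raise ValueError(f"unsupported translocation description: {description!r}")
    chrom_a, chrom_b, band_a, band_b = match.groups()
    return [(chrom_a, band_a), (chrom_b, band_b)]


breakpoints = parse_translocation("t(9;22)(q34;q11)")
```

Each resulting pair names a chromosome and band, which is the information a cytogenetic location would need for one side of the join.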
Some variations are defined by categorical concepts rather than by specific locations and states. These variations go by many names, including categorical variants, bucket variants, container variants, and variant classes. They are not described by any broadly recognized variation format, but modeling them is a key requirement for representing aggregate variation descriptions as commonly found in biomedical literature. Our future work will focus on a formal specification for representing these variations with sets of rules, which we currently call Rule-based Variation.
RuleLocation is a subclass of Location (Abstract Class) intended to capture locations defined by rules instead of specific contiguous sequences. This includes locations defined by sequence characteristics, e.g. microsatellite regions.