Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Pangenomic genotyping with the marker array

Pangenomic genotyping with the marker array We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect from large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods. The method is implemented in the open source software tool rowbowt available at https:// github. com/ alshai/ rowbo wt. Keywords Sequence alignment, Indexing, Genotyping This was shown in studies of archaic hominids [5], HLA Introduction genotypes [6] and structural variants [7]. A similar bias Given DNA sequencing reads from a donor individual, exists for methods that extract polymorphic sites along genotyping is the task of determining which alleles the with genomic context, and search for these sequences individual has at polymorphic sites. Genotyping from in the reads [8, 9]. In particular, the bias remains if the sequencing data, sometimes using low-coverage sequenc- flanking sequences are extracted from the reference and ing data together with genotype imputation, is a common so contain only reference alleles. task in human genetics [1] and agriculture [2]. In contrast Reference bias can be reduced by using a pangenome to variant calling, genotyping is performed with respect reference instead of a single linear reference. A pange- to a catalog of known polymorphic sites. For instance, nome can take various forms; it can be (a) a generating genotyping of a human can be performed with respect to graph for combinations of alleles, (b) a small collection of the 1000 Genomes Project call set, which catalogs posi- linear references indexed separately, or (c) a larger collec- tions, alleles and allele frequencies for tens of millions of tion of linear reference indexed together in a compressed sites [3]. way. Pangenome graphs (option a) and small collections Many existing genotypers start by aligning reads to a of linear references (option b) have been studied in recent single linear reference genome, e.g.  the human GRCh38 literature [10–14]. reference [4]. Because this reference is simply one exam- Two existing methods that use pangenome graphs are ple of an individual’s genome, genotyping is subject to BayesTyper [15] and PanGenie [16]. BayesTyper reference bias, the tendency to make mistakes in places works by matching k-mers extracted from the input reads where the donor differs genetically from the reference. to k-mers stored in a de Bruijn graph representing known polymorphisms and their surrounding contexts. After *Correspondence: tallying this evidence, BayesTyper calls genotypes Ben Langmead based on a generative model. PanGenie uses a graph langmea@cs.jhu.edu built from haplotypes, collapsed so that variants become Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA bubbles. PanGenie then scans the reads and tallies k-mers that appear in the graph. It then makes genotype © The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 2 of 17 calls based on the tallies of how often read k-mers match array while retaining its ability to deduce when a read- to k-mers along the alternate paths that make up the dis- to-pangenome match provides evidence for a particular tinct REF and ALT alleles. allele at a polymorphic site. The design of the marker Variant graphs like the ones used by BayesTyper and array flows from three observations. First, we can save PanGenie are effective for genotyping, but have draw - space by storing auxiliary information about polymor- backs when the goal is to reduce reference bias. First, phic sites (“markers”) only at the sites themselves. There haplotype information might be removed when adding are often far fewer sites harboring polymorphism than variants to the graph, or might be included in the graph there are BWT runs. Second, pangenome suffixes start - but not considered during the read mapping process. ing with the same allele tend to group together in the suf- This can cause graph-based tools to consider many extra - fix array, which can be exploited to compress the marker neous haplotype paths through the graph during geno- array structure. Third, while a suffix array entry is an off - typing, increasing running time. Second, variant graphs set into the pangenome requiring O(log n) bits, a marker can grow exponentially—in terms of the number of paths need only distinguish markers and alleles, and so requires through the graph—as variants are added, leading to a just O(logM) bits where M is the number of polymorphic rapid increase in resource usage and likelihood of ambig- sites. uous alignments. We sought a way to reduce reference bias by indexing Methods and querying many linear references at once while keep- Preliminaries ing index size and query time low. Such an approach can Consider a string S of length n from ordered alphabet  , take full advantage of linkage disequilibrium information with operator ≺ denoting lexicographical order. Assume in the panel, allowing no recombination events except S’s last character is lexicographically less than the others. those occurring in the panel. This avoids mapping ambi - Let F be an array of S’s characters sorted lexicographically guity from spurious recombination events between poly- by the suffixes starting at those characters, and let L be morphic sites [10]. an array of S’s characters sorted lexicographically by the We propose a new structure called the marker array suffixes starting immediately after them. The list L is the that replaces the suffix-array-sample component of the Burrows-Wheeler Transform [20] of S, abbreviated BWT. r-index with a structure tailored to the problem of col- The BWT can function as an index of S [21]. Given a lecting genotype evidence. Here we describe the marker pattern P of length m < n , we seek the number and loca- array structure in detail. We compare its space usage and tion of all occurrences of P in S. If we know the range query time to those of the standard r-index and explore BWT(S)[i..j] occupied by characters immediately preced- how accurately both structures are able to capture mark- ing occurrences of a pattern Q in S, we can compute the ers from a sequencing dataset. Finally, we benchmark ′ ′ range BWT(S)[i ..j ] containing characters immediately it using real whole-genome human sequencing data preceding occurrences of cQ in S, for any character c ∈  , and compare it to the BayesTyper and PanGenie since genotyping tools in terms of both genotyping accuracy and computational efficiency. We do this for variants ′ i = |{h : S[h]≺ c}| + |{h : BWT(S)[h] = c, h < i}|. genomewide and for variants in genes that are medically (1) relevant. j = |{h : S[h]≺ c}| + |{h : BWT(S)[h] = c, h ≤ j}| − 1. (2) Background The FM Index is a collection of data structures for exe - r‑index cuting such queries efficiently. It consists of an array C The r-index [17] is a compressed repeat-aware text index storing |{h : S[h] ≺ c}| for each character c, plus a rank that scales with the non-redundant content of a sequence data structure for BWT(S) , e.g.  a wavelet tree, that can collection. It uses O(r) space where r is the number of quickly tally the occurrences of a character c up to a posi- same-character runs in the Burrows-Wheeler Trans- tion of BWT . To locate the offsets of occurrences of P in form (BWT) of the input text. Past work shows that the S, the FM-index can additionally include some form of r-index can efficiently index collections of long-read- S’s suffix array. The suffix array SA is an array parallel to derived human genome assemblies [18] as well as large F containing the offsets of the characters in F. To save collections of bacterial genomes [19]. space, the FM-index typically keeps only a sample of SA , While the main contribution of the r-index was its e.g. a subset spaced regularly across SA or across S. strategy for storing and using a sample of the suffix array Let T ={T , T , . . . , T } be a collection of n similar 0 1 n [17], even this sample is large in practice. We propose texts where T is the reference sequence, and T , . . . , T 0 1 n a new marker array structure that replaces the suffix M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 3 of 17 Fig. 1 Top left: A multiple alignment for a collection of alternate haplotypes (H1–H9), and a reference sequence (R). Marked bases are in bold and alternate alleles are colored. Middle left: The text T, formed by concatenating rows of the multiple alignment (eliding gaps). Bottom left: The edit table E, with alternate-allele coloring. Right: A partial illustration of the marker array in relation to SA , the relevant suffixes themselves (truncated to fit), and the BWT . Colors and bolding highlight where marked bases and alternate alleles end up in the suffixes are alternative sequences. In the scenarios studied here, a genomes in rows and columns representing genomic off - T represents a human haplotype sequence, with all chro- sets. The elements are either bases or gaps. We call a col - mosomes concatenated, and T represents the GRCh38 umn consisting of identical bases and lacking any gaps a primary assembly of the human genome. Each T with uniform column. Any other column is a polymorphic col- i > 0 is an alternate haplotype taken either from the 1000 umn. Figure  1 illustrates a multiply-aligned collection of Genomes project call set [3] or from the HGSVC pro- haplotypes and the concatenated text T. ject [22, 23], each with chromosomes concatenated in the same order as T ’s. We use the terms “haplotype” and Marker array “genome” interchangeably here. Let the “marker array” M be an array parallel to the con- We assume that all the T ’s are interrelated through a catenated sequence T marking positions that fall in a multiple alignment, e.g. as provided in a Variant Call For- polymorphic column in the multiple alignment. Each ele- mat (VCF) file. The multiple alignment is a matrix with ment of M is a tuple recording the offset i with respect Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 4 of 17 to T where the polymorphism begins, as well as the edit starting from each row x ∈[i..j] . An FL step is the inverse operation describing how the sequence differs from the of an LF step. That is, if we write an LF mapping step in reference at this locus. Distinct edit operations are given terms of a rank query distinct integer identifiers, which are decoded using a i = |{h : S[h]≺ c}| + BWT.rank (i), BWT[i] (3) separate table E. Identifier 0 is the null operation, denot - ing that the reference allele appears without edits. For where S.rank (i) denotes the number of occurrences of example, say E ={1 : X → C} , where X → C denotes a c in S up to but not including offset i, then an FL step substitution that replaces the reference base with C. Then inverts this using a select query a marker array record m = (500, 0) marks a locus with no ′ ′ edit with respect to reference position T [500] . A record ′ 0 i =BWT.select (i − |{h : S[h]≺ F[i ]}|), F[i ] (4) m = (500, 1) denotes that a substitution changes that th where S.select (i) returns the offset of the i + 1 occur- base at T [500] to a C. An example is shown in Fig. 1 (bot- rence of c in S, i.e.  the c of rank i. Whereas LF takes a tom left). leftward step with respect to T, FL takes a rightward step. Consecutive substitutions are collapsed into a single By starting in each row x ∈[i..j] and performing a edit in the E table. Insertions and deletions (“indels” for sequence of |q|− 1 FL steps for each, we can visit each short) are treated somewhat differently; the offset car - offset of T overlapped by an occurrence of q. Checking rying the “mark” is the one just preceding the indel (just MA[k] at each step, where k is the current row, tells us to its left) in the multiple alignment. Importantly, the which marker is overlapped, if any. This is slow in prac - mark covers exactly one position in the genome, even tice, both because it requires O(|q|(j − i + 1)) FL steps if the insertion/deletion spans many bases. The marked in total, and because each step requires a select query, position must come to the left of the indel to ensure which is more costly in practice than a rank query. that suffixes starting at the marked position include the allele itself. In the multiple alignment in Fig.  1 (top left), for example, a deletion with respect to R occurs in the Heuristic backward‑search approach with smearing fourth-from-left column, but the marked position is in Say we perform a backward search starting with the the third-from-left column. rightmost character of q. At each step we are consid- The marker array MA is a permutation of M such that ering a range [i..j] of SA having a suffix of q as a prefix. marks appear in suffix-rank order: Using i and j, we can query MA[i..j] . However, this tells us instances where a suffix of q overlaps a marker, whereas Definition 1 The marker array MA is the mapping such our goal is to find where the whole query q overlaps a that MA[i] = M[SA[i]]. marker. If we report overlaps involving trivially short suf- fixes of q, many would be false positives. We propose to Thanks to suffix-rank order, identical M[i] ’s are often allow but reduce the number of such false positives by grouped into runs in MA , as seen in Fig. 1 (right). augmenting MA: A marker query for pattern string q returns all m ∈ M overlapped by an occurrence of q in T. We can begin to Definition 2 The augmented marker array MA is a answer this query using backward search (Eq.  1) with multimap such that MA [i] =[M[SA[i]],M[SA[i]+ 1], ..., P = q , giving the maximal SA range [i..j] such that q is a M[SA[i] + w]] prefix of the suffixes in the range. Having computed [i..j], we know that {MA[i], ...,MA[j]} contain markers over- That is, MA [i] is a (possibly empty) list containing lapped by q’s leftmost character. To recover the markers markers overlapping any of the positions T[i..i + w] . W e overlapped by the rest of q’s characters, two approaches call this a “smeared” marker array, since the marks are can be considered, detailed in the following subsections. extended (smeared) to the left by w additional positions. The FL approach recovers the overlapped markers in a Note that a length-w extension can overlap one or more straightforward way but uses O(|q|· occ) time, where other marked variants to the left. For this reason, MA occ is the number of times q occurs in T. The heuristic must be a multimap, i.e. it might associate up to w mark- backward-search approach requires only O(|q|) time but ers with a given position. is not fully sensitive, i.e. it can miss some overlaps. Using MA , we adjust the backward-search strategy so that instead of querying MA at each step, we query MA every w steps. If w is large enough—e.g.  longer than the FL approach length at which we see random-chance matches—we Say [i..j] is the maximal SA range such that all rows have can avoid many false positives. More space is required q as a prefix. We can perform a sequence of FL steps, M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 5 of 17 w previous section. The algorithm proceeds right to left to represent MA compared to MA since it is less sparse. w through the read, growing the match by one character if However, we expect MA to remain run-length com- possible. When we can no longer grow the match (i.e. the pressible for the same reason that MA is. range [i..j] becomes empty), we reset the range to the all- inclusive range [0..|T |− 1] and restart the matching pro- Genotyping a read cess at the next character. We use the term “extension” to Given a sequencing read, we would like to extract as refer to a consecutive sequence of steps that successfully much genotype information as possible while minimiz- extend a match. Note that this is a heuristic algorithm ing computational cost and false-positive genotype evi- that does not exhaustively find all half-MEMs between dence. Here we give a heuristic algorithm (Algorithm  1) w the read and the index, as the MONI algorithm does [24]. that handles entire sequencing reads, querying MA during the backward-search process as proposed in the Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 6 of 17 As discussed above, the algorithm only checks the (and non-empty) elements. Our sparse encoding for A marker array every w steps (line 14). As an additional consists of three structures. S is a length-|A| bit vector filter, the algorithm only performs a marker-array query with 1s at the positions where a run of identical entries when the current suffix-array range size is no larger than in A begins, and 0s elsewhere. E is a similar bit vector the number of haplotypes in the index ( N ). A range marking the last position of each of the x runs. (This exceeding that size indicates that we are seeing more variable E is distinct from the E table defined above than one distinct match in at least one haplotype, mean- in “Marker Array.”) To find whether an element A [i] is ing that the evidence is ambiguous. non-empty, we can ask whether we are between two The algorithm tallies evidence as it goes (line 18), but such marks; that is, A[i] is non-empty if and only if might later choose to ignore that evidence if certain S.rank (i + 1)> E.rank (i). 1 1 conditions are not satisfied (lines 10 and 27). For exam - X is a length-x array containing the element that is ple, if the evidence has a conflict—i.e.  one match indi - repeated in each of A’s non-empty runs, in the order cates a reference allele at a site but another match found they appear in A. If A[i] is not empty, the element during the same extension finds an alternate allele at appearing there is given by X [E.rank (i)]. that site—then all the evidence is discarded for that When encoding M or MA , the elements of X are simply extension. Similarly, evidence from an extension is dis- tuples. A complication exists for MA , since elements are carded if the tallied sites span multiple chromosomes. lists of up to w tuples. In this case, we keep an additional Finally, evidence from extensions failing to match at bit-vector B of size |X| where 1s denote left-hand bound- least 80 bp of the read (adjustable with --min-seed- aries in X that correspond to runs in A. E and B can be length option) is discarded. used together to access an element in A (Algorithm 2). We employ other heuristics to minimize mapping time not shown in Algorithm 1. For instance, we avoid wasted effort spent querying the wrong read strand. Extracting markers from VCF Specifically: rowbowt randomly selects an initial A Variant Calling Format (VCF) [25] file is used to strand of the read to investigate: forward or reverse encode a collection of haplotypes with the variants complement. If an extension from this strand meets the arranged in order according to a reference genome. In minimum seed-length threshold (80 by default), then the case of human and other diploid genomes, haplo- the other strand is not considered and analysis of the types are grouped as pairs. We refer to such a collection read is complete. Otherwise, rowbowt then goes on to of haplotypes as a “panel” and a single haplotype as a examine the opposite strand of the read. “panelist.” An VCF entry encodes a variant as a tuple consisting of a chromosome, offset, the allele found in the reference, the alternate allele found in one or more Sparse marker encoding panelists, and a sequence of flags indicating whether We encode the sparse arrays M , MA and MA in the each panelist has the reference or alternate version. We following way. Say that array A consists of empty and start from a VCF file to determine how to populate the non-empty elements. We consider A’s non-empty ele- marker arrays M , MA and/or MA , as well as the edit ments as falling into one of x maximal runs of identical array E. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 7 of 17 A single element of M is a tuple (r, e), where r is an off - This software is written in C++17, uses the open-source set in T and e is the edit operation describing how the MIT license, and builds on the Succinct Data Structure sequence differs from the reference. As a practical mat - Library (SDSL) v3.0 [27]. ter, we represent these tuples in a different way that more For querying the marker array, we use the rowbowt closely resembles the corresponding VCF records. Spe- implementation at https:// github. com/ alshai/ rowbo wt. cifically, a marker is encoded in a 64-bit word divided This repository contains the open source C++17 imple - into three fields. First, a 12 bit field identifies the chromo - mentation of rowbowt, distributed under the MIT some containing the marker. The chromosome ordering license. It is also a library, containing algorithms for is given at the beginning of the VCF file in the “header” building and querying indexes containing various struc- section. For example, if “chr1” is the first chromosome in tures discussed here, including the run-sampled suffix the header, then this chromosome is encoded as 0x000 array, marker array, and others. (using hexadecimal), and if “chr2” is the second chromo- some, it is encoded as 0x001. Second is a 54 bit field Results encoding the marker’s offset within the chromosome. We evaluated the efficiency and accuracy of our marker- Third is a 4-bit field storing which version of the variant array method for compiling genotype evidence. We first is present, with 0 indicating the reference allele, 1 indicat- generated multiple series of rowbowt indexes covering ing the 1st alternate allele, 2 the second alternate allele, various settings for three parameters: the window size w etc. This 64-bit representation allows for compact storage for the smeared marker array MA , the number of hap- of markers and easier random access to the marker array. lotypes indexed, and the minimum allele frequency for marked alleles. The rowbowt index consisted of three components: the run-length encoded BWT, the run-sam- pled suffix array, and the marker array. While we built the Diploid genotyping sampled suffix array for these experiments, the stand - In a diploid genome, it is possible for both alleles to ard marker-array-based method in rowbowt does not occur, i.e. for the genotype to be heterozygous. We use an require this array. existing approach [26] to compute genotype likelihoods We generated indexes for collections of 200,  400,  800, considering all possible diploid genotypes: homozygous or 1000 human chromosome-21 haplotypes from the reference (2 reference alleles), homozygous alternate (2 1000 Genomes Phase 3 reference panel [3] based on the alternate alleles), or heterozygous (1 reference, 1 alter- GRCh37 reference. We generated two sets of indexes: nate). Let g ∈{0, 1, 2} denote the number of reference- one where the marker array marks all polymorphic allele copies at the marked site; e.g.  g = 1 corresponds sites regardless of frequency (denoted “ AF > 0”), and to a heterozygous site and g = 2 to a homozygous ref- another where the marker array marks only those sites erence site. Let l be the number of times the reference where the less common allele occurs in greater than 1% allele was observed in the reads overlapping a particular of haplotypes, i.e.  has allele frequency over 1% (denoted marked site and let k be the count of all alleles (reference “ AF > 0.01”). In all cases, the marker array window size or alternate) observed. Let ǫ be the sequencing error rate. w was set to 19. Each haplotype collection was drawn We calculate the genotype likelihood as follows, adopting from a random subset of 500 individuals from the 1000 equation 2 of [26] while setting the ploidy to 2 and adopt- Genomes Phase 3 panel. The AF > 0 panel of 500 hap- ing a global rather than a per-base error rate: lotypes contained 1,  097,  388 polymorphic sites. The AF > 0.01 panel of the same haplotypes contained k k−l L(g) = [(2 − g)ǫ + g(1 − ǫ)] [(2 − g)(1 − ǫ) + gǫ] . 193, 438 polymorphic sites with allele frequency over 1%. We also included the GRCh37 reference sequence, con- To choose the most likely genotype , we compute: max sisting of all reference alleles, in each collection, corre- g = argmax L(g). sponding to the reference sequence called T above. max g∈{0,1,2} We generated a series of indexes with window size By default, rowbowt uses ǫ = 0.01. w ∈{13, 15, 17, 19, 21, 23, 25} . We generated two such series: one with no minimum allele frequency ( AF > 0 ) Implementation details and another with a 1% minimum frequency ( AF > 0.01 ). The code for constructing the marker array is imple - Each index was over the same set of 100 haplotypes. mented in the pfbwt-f spoftware package, with repository at https:// github. com/ alshai/ pfbwt-f. This Index size repository also contains an efficient implementation of We measured the size of the three main components the prefix-free-parse BWT construction algorithm [18]. of the rowbowt index: the augmented marker array, Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 8 of 17 Fig. 3 Mean time over 10 trials of aligning 10,000 simulated reads Fig. 2 Left: Size of rowbowt data structures as a function of the from HG01498 against the augmented marker array (marker) and number of haplotypes indexed, and with w = 19 . “Mark er array ” w the r-index suffix array (locate). Experiments are repeated for marker refers to the augmented marker array, MA . Right: Size of rowbowt collections including all alleles ( AF > 0 ) and for alleles having data structures as a function of the “smearing” window size w, with frequency at least 1% ( AF > 0.01 ). Left: The experiment is repeated number of haplotypes fixed at 100. Separate results are shown for for various window sizes w, and for 100 haplotypes. Right: The when there is no minimum allowed allele frequency ( AF > 0 ) and experiment is repeated for different numbers of indexed haplotypes, when the minimum frequency is 1% ( AF > 0.01) with w = 19 the run-length encoded BWT (RLE BWT) [28] and the runs in the augmented marker array, negatively affecting run-sampled suffix array (“r-index SA”) [17]. Figure  2 run-length compression. plots this measurement for collections of 200, 400, 800 In the right portion of Fig. 2, the RLE BWT and r-index and 1000 haplotypes for both AF > 0 and AF > 0.01 . SA have constant size because the w parameter does not All grow linearly with the number haplotypes grows, affect those data structures. In the left portion of Fig.  2, as expected. For AF > 0 , the augmented marker array showing size as a function of number of haplotypes, the is consistently larger than the run-sampled suffix array augmented marker array is almost always larger than (“r-index SA”). For AF > 0.01 , the augmented marker the r-index SA for AF > 0 as opposed to AF > 0.01 , array is much smaller, approaching the size of the RLE except at w = 13 . The slope of the array size is smaller for BWT. The AF > 0 array is larger because it contains pol- AF > 0.01 than for AF > 0. ymorphic sites with infrequent alleles; about 85% of the Overall, both the value of w and the number of hap- marked sites in the AF > 0 array have allele frequency lotypes in the index cause the augmented marker array under 1%. Further, rare alleles are less likely to form long to increase in size, but the inclusion of rare alleles (< 1% allele frequency) has the largest effect on its size. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 9 of 17 Query time We next measured query time for the augmented marker array strategy versus the locate-query strategy which uses the run-sampled suffix array. 150 bp simulated reads of 25x coverage were generated for one haplotype of HG01498, an individual that is part of the 1000-Genomes panel, but which we excluded from all our indexes. We simulated reads using Mason 2 mason_simulator [29] with default options. In the case of the marker-array strategy, we measured the time required to analyze the reads using the algo- rithm described above in “Genotyping a read.” In the case of the locate-query strategy, the MA query was replaced with a two-step process that first performed a locate query with respect to the run-sampled suffix array, then performed a lookup in the M array. To enable this mode, we further augmented the r-index with a representa- tion of M using the sparse encoding described above. To emphasize: the rowbowt strategy does not require the run-sampled suffix array or the M array; the MA effec - tively replaces them both. We repeatedly sampled 10,000 simulated reads and recorded the mean query time over 10 trials. As seen in Fig.  3 the augmented marker-array method (labeled “marker”) was consistently faster than locate method. This was true for all allele frequencies and window sizes tested. We found that the effect of w and allele-frequency cutoff was more pronounced with the larger reference panel AF > 0 . For the smaller panel ( AF > 0.01 ), quer y time was mostly invariant to both window size and allele Fig. 4 Precision (left) and recall (right) of the calls made when frequency. querying 25x simulated reads from HG01498 against the augmented marker array (marker) and the r-index (locate). Stratified by minimum allowed allele frequency (AF) in the index Genotyping accuracy We next measured the accuracy of the genotype infor- mation gathered using the augmented marker array. We simulated sequencing reads from one haplotype of marked sites that truly have the alternate allele, while the HG01498 to an average depth of 25-fold coverage. Indi- negative class consists of marked sites that truly have the vidual HG01498 was excluded from the indexes. As our reference allele. We measure: “truth” set for evaluation, we use the variant calls in the 1000 Genomes project callset for the same haplotype we TPs TPs Precision = Recall = , simulated from. For simplicity, this experiment treats the TPs + FPs TPs + FNs genome as haploid. Experiments in the next section will where TP stands for True Positive, FN stands for False account for the diploid nature of human genomes. Negative, etc. A single marked site can have conflicting evidence, due, Figure 4 shows precision and recall with respect to the for instance, to mismapped reads or sequencing errors. number of haplotypes in the index and the minimum For this evaluation, we make calls simply by finding the allele frequency of the haplotype collection. We observed frequently observed allele at the site. We ignore any that the AF > 0.01 indexes generally had better precision instances of alleles other than the ones noted in the VCF compared to the AF > 0 indexes, though at the expense file as the reference and alternate alleles. If the reference of recall. Precision and recall generally improve with the and alternate alleles have equal evidence, the reference addition of more haplotypes to the index. The augmented allele is called. marker array has similar recall to the locate procedure We calculate precision and recall according to the fol- across all haplotype sizes at the loss of precision. When lowing formulas. Here, the positive class consists of Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 10 of 17 rare variants are removed from the index ( AF > 0.01 ), substitutions, “Indel” includes indels no more than 50bp the gap in precision between the marker array and the long, and “SV” includes insertion or deletions longer than locate procedure lessens. This mild (less than 0.1%) loss 50bp, and “All” includes all variant types. More complex of precision is expected since algorithm described above structural variants like inversions and chromosomal rear- in “Genotyping reads” is still prone to some false posi- rangements are ignored. tives in the earlier part of the extension process. We analyzed the accuracy of rowbowt ’s diploid gen- otypes in two ways. First we considered allele-by-allele Diploid genotyping assessment precision and recall, considering the alternate (ALT) To assess diploid genotyping accuracy, we used data allele calls to be the positive class. Specifically, every dip - from the Human Genome Structural Variation Consor- loid genotype called by a method is considered as a pair tium (HGSVC) [22, 23]. The HGSVC called both simple of individual allele calls. If a given allele call is an alternate and complex genetic variants across a panel of 64 human (ALT) allele and there is at least one ALT allele present in genomes. Calls were made with respect to the GRCh38 the true diploid genotype at that site, it counts as a true primary assembly [4]. For input reads, we subsampled positive (TP). If the given allele is a reference allele (REF) reads from a 30-fold average coverage PCR-free read set and there is at least one REF allele in the true diploid gen- provided by the 1000 Genomes Project [3] (accession otype, this is a true negative (TN). If the given allele is an SRR622457). To create more challenging scenarios for ALT but the true genotype is homozygous REF, we count the genotypers, we randomly subsampled the read sets it as a false positive (FP). Finally, if the given allele is REF to make smaller datasets of 0.01, 0.05, 0.1, 0.5, 1, 2 and but the true genotype is homozygous ALT, this is a a false 5-fold average coverage. negative (FN). We assessed four genotyping methods. The first Second, we considered precision and recall with (bowtie2+bcftools) used Bowtie 2 [30] v2.4.2 to respect to sites that were either truly heterozygous or align reads to a standard linear reference genome, then called heterozygous. If a heterozygous call made by a used BCFtools v1.13 to call variants (i.e.  genotypes) method is truly heterozygous, this was counted as a true at the marked sites [26]. The second method used the positive (TP). False positives, false negatives, and true graph-based genotyper BayesTyper v1.5. The third negatives are defined accordingly. was PanGenie [16], a pangenome-based genotyper As seen in Fig.  5, rowbowt ’s ALT and HET preci- for short reads. PanGenie calls genotypes for variants sion were generally the highest of all the methods across that are represented in its index as bubbles in a pange- all variant categories, though BayesTyper sometimes nome. PanGenie represents known haplotypes as paths achieved higher ALT/HET precision for indels in the through the graph, accounting for these paths during the higher-coverage datasets and PanGenie had the highest genotyping process. PanGenie takes directed acyclic HET precision for SNVs and Indels for some coverages. pangenome graph as input, represented as a VCF file At the highest coverage level examined (5-fold), PanG- containing phased genotypes for many samples. PanGe- enie also achieved similar ALT precision to rowbowt nie further requires that the VCF have non-overlapping on SNVs and Indels. rowbowt dominates on precision variants. We generated a compliant input VCF using a of SV calls, and also exhibits the best recall for ALTs at workflow provided by the PanGenie project We used higher coverage. PanGenie exhibits slightly higher the “high-gq” quality filter also used in the PanGenie recall for HET SVs. study: genotype quality (GQ) ≥ 200 . The fourth method Figure  6 compares the methods on their F1 measure. assessed was rowbowt. rowbowt achieved the highest F1 scores for ALT and Prior to applying BayesTyper, we built a Bayes- HET SNBs. For HET SNV variants, rowbowt had the Typer-compatible VCF file containing all relevant vari - highest F1 score of 0.71, while for ALT SNV variants, the ants from the HGSVC haplotype panel. For rowbowt, F1 was 0.95. In the case of Indel variants, rowbowt pro- we created a rowbowt index from the genomes in the duced the highest F1 score of 0.81 for the ALT class and HGSVC haplotype panel. In both cases we excluded tied with PanGenie for the HET class, with an aggregate NA12878’s haplotypes from the panel prior to building score of 0.48. Fr SVs, rowbowt had the highest F1 score of the index. 0.55 in the ALT class and a comparable score of 0.21 in When evaluating, we stratified variants by complex - the HET class, just below PanGenie ’s score of 0.23. ity: the “SNV” category includes single-nucleotide Diploid genotyping assessment on medically relevant genes We obtained a list of medically relevant autosomal genes, The workflow can be found in the pipelines/run-from-callset collected from medical gene databases by the Genome subdirectory of the PanGenie repository. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 11 of 17 Fig. 5 Precision and recall for the four tested genotyping methods, both at the level of individual alleles (ALT precision/recall) and at the level of heterozygous variants (HET precision/recall). Note that the bowtie2+bcftools approach is generally unable to align reads across variants in the “SV” category, leading to low precision and recall Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 12 of 17 Fig. 6 F1 scores across variant types and tools. Computed genomewide (top) and over only those variants in medically relevant autosomal genes (bottom) in a Bottle (GiaB) project [31]. Among the 5,026 genes case of Indel variants, rowbowt was the leader with an included are the major MHC Class I genes (HLA-A, F1 score of 0.82 for the ALT class, and tied with PanG- HLA-B, HLA-C), MHC Class II genes (HLA-DP, HLA- enie with a score of 0.47 for the HET class. As for SVs, DQ, HLA-DR) and KIR genes. A complete list including rowbowt had the highest F1 score of 0.56 in the ALT coordinates can be found at the CMRG GitHub repo [32]. class, and a comparable score of 0.19 in the HET class, As seen in Fig.  7, rowbowt ’s performance relative to closely following PanGenie ’s score of 0.22. the other methods is very similar for these genes com- pared to the full set of genes assessed in Fig. 5. As seen in Computational performance Fig.  6, rowbowt generally outperformed other methods We compared the time and memory usage of each geno- on F1 score for medically relevant genes. For HET SNV typing method, dividing the computations performed variants, rowbowt achieved the highest F1 score of 0.72, by each method into a few distinct categories. For each while for ALT SNV variants, the F1 score was 0.96. In the phase, we measured both the wall-clock time elapsed and M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 13 of 17 Fig. 7 Precision and recall presented as in Fig. 5 but filtered to only the medically relevant autosomal genes, collected from medical gene databases by the Genome in a Bottle (GiaB) project Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 14 of 17 Fig. 8 Wall clock time and peak memory footprint for each phase of the genotyping workflow for four methods tested maximum memory (“maximum resident set”) used. Both the case of BayesTyper, this specifically consists of were measured with the Snakemake tool’s benchmark (a) using the KMC3 software [34] to count k-mers in the directive [33]. input reads, (b) using the bayestyper makeBloom We categorized and benchmarked three distinct types command to convert k-mer counts to Bloom filters for of computation. For bowtie2+bcftools, we defined each sample, and (c) using the bayestyper clus- the “Alignment” step as the process of using bowtie2 to ter command to identify variant clusters. In the case align reads to the linear reference genome. For rowbowt, of PanGenie, “Pre-genotyping” consists of: (a) read- we defined the “Alignment” step as the process of using ing input files, (b) k-mer counting, (c) path selection, (c) the rb_markers command to genotype the reads using determining unique k-mers. Also for PanGenie, the ini- the algorithm described in Methods. For the alignment- tial input VCF file must be preprocessed into a new VCF free methods PanGenie and BayesTyper, we use the file containing additional fields, accounting for about 1 h term “Pre-genotyping” for the initial phase that includes of computation, not included in our measurements. k-mer counting and other index building procedures. In M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 15 of 17 16 threads were used during the Alignment phase for graph-indexing approaches, which might consider all bowtie2+bcftools, rowbowt and PanGenie, while possible combinations of nearby alleles to be “valid,” even 32 threads were used for BayesTyper. if most combinations never co-occur in nature. For bowtie2+bcftools, we define the Genotyping While we examined only simple structural variants in step as the process of using bcftools call to call the form of insertions and deletions longer than 50 bp, variants from the BAM file output by bowtie2. For the genotyping method is readily extensible to more rowbowt We define the Genotyping phase as the pro - complex differences as well. Indeed, as long as we can cess of running the vc_from_markers.py script on mark the base or bases just to the left of the variant, we the output from rb_markers. For BayesTyper, we can mark any variant in a way that we can later genotype. define the genotyping step as the process of running the Our method will mark only a single base that is to the bayesTyper genotype command. PanGenie was left of the boundary (“breakpoint”) of a variant. This is run by providing 16 threads. Figure  8 shows the time- true whether the variant is large (i.e.  is a large insertion taken and peak memory footprint for each method and or deletion) or small (i.e.  an SNV). We choose to mark on defined phases. the base to the left of the boundary rather than the right The Genotype phases for both bowtie2+ bcftools because suffixes starting from the base to the right will be and rowbowt do not support multi-threading, so a sin- unaffected by the REF and ALT alleles; therefore, matches gle thread was used. For the BayesTyper and Pange- spanning that position do not necessarily carry evidence nome Genotype phases, we used 16 threads. for a particular allele. That said, an alternative approach Figure  8 shows the time taken and peak memory would be to consider “suffixes” extending in either direc - footprint for each method and each category. We tion—either left-to-right or right-to-left—by additionally observed that rowbowt was consistently faster than indexing the reverse of the reference sequences. If both the other methods in the Genotyping phase, sometimes forward and reverse versions of the index are available, by a large margin. We also observed that while row- our method can be made more robust by combining bowt has a higher memory footprint compared to the the evidence obtained from matches spanning both the bowtie2+bcftools method, it uses substantially left and right boundaries of the polymorphic variant. In less memory than BayesTyper and PanGenie, the the future, it will be important to measure whether the other pangenome-based methods. PanGenie has a increase in index size and genotyping memory footprint particularly high (>100 gigabyte) memory footprint in can justify this refinement. both the Pre-genotyping and Genotyping steps. We also note that our strategy marks only a single left-flank base position per variant, even in the case of Discussion an insertion or deletion where the alleles span different We proposed a family of novel marker array structures, numbers of bases. For instance, for an insertion where M , MA and MA that, together with a pangenome index the REF allele is A and the ALT allele is AGG . If we were like the r-index, allow for rapid and memory-efficient instead to mark both the left-flank base (A in this exam - genotyping with respect to large pan-genome refer- ple) and every base within the insertion (the two Gs in the ences. The augmented marker array is smaller and ALT allele), we are vulnerable to a bias resulting from the faster to query than the run-sampled suffix array — the fact that there are more opportunities to observe a match usual way to establish where matches fall when query- overlapping the longer allele than the shorter one. With- ing a run-length compressed index — especially when out any correction, this artificially inflates the evidence we limit the set of markers to just alleles at frequency from the longer allele. That said, it should be possible to 1% or higher. We further showed that the augmented correct for the bias, e.g. by penalizing the evidence from marker array can replace the sampled suffix array in the longer allele in a length-weighted fashion. In the simple genotyping experiments with moderate sacrifice future, it will be important to investigate whether this of precision, and that a marker array based genotyping enhanced marking strategy yields an overall improve- method outperforms the graph-based BayesTyper ment in genotyping accuracy. method. Since this work first appeared, other groups have pur - Pan-genome indexes allow for rapid analysis of reads sued related ideas. In particular, we note that the new while reducing reference bias. The indexes used in our MARIA index is capable list all the distinct columns of experiments consisted of many (up to 65) haplotypes, the multiple alignment overlapped by a match in the with none having a higher priority over the others, except r-index [35]. Further, that study defines a quantity called in the sense that results were expressed in terms of the r that captures something similar to the number of dis- standard reference. Our approach preserves all linkage tinct runs we expected to find in the MA or MA arrays. disequilibrium information. This is in contrast to some In the future it will be important to directly relate r to the Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 16 of 17 8. Denti L, Previtali M, Bernardini G, Schönhuth A, Bonizzoni P. MALVA: geno- quantities discussed here and to compare the advantages typing by mapping-free ALlele detection of known VAriants. iScience. and disadvantages of our genotyping-centric approach to 2019;18:20–7. the more general approach of MARIA. 9. Shajii A, Yorukoglu D, William Yu Y, Berger B. Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics. The rowbowt method can lead to future methods that 2016;32(17):538–44. use information about genotypes to build a personal- 10. Pritt J, Chen NC, Langmead B. FORGe: prioritizing variants for graph ized reference genome, containing exactly the genotyped genomes. Genome Biol. 2018;19(1):220. 11. Chen S, Krusche P, Dolzhenko E, Sherman RM, Petrovski R, Schlesinger F, alleles. Alignment to a personalized reference have been Kirsche M, Bentley DR, Schatz MC, Sedlazeck FJ, Eberle MA. Paragraph: a shown previously to be the best way to reduce reference graph-based structural variant genotyper for short-read sequence data. bias, even more effective than the best pangenome meth - Genome Biol. 2019;20(1):291. 12. Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, Jones ods [10, 13]. W, Garg S, Markello C, Lin MF, Paten B, Durbin R. Variation graph toolkit improves read mapping by representing genetic variation in the refer- Acknowledgements ence. Nat Biotechnol. 2018;36(9):875–9. We thank Massimiliano Rossi and Travis Gagie for many helpful discussions. 13. Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reduc- We thank Margaret Gagie for her careful editing.Part of this research project ing reference bias using multiple population genomes. Genome Biol. was conducted using computational resources at the Maryland Advanced 2021;22(1):8. Research Computing Center (MARCC). 14. n J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, Sibbesen JA, Hickey G, Chang PC, Carroll A, Gupta N, Gabriel S, Blackwell TW, Ratan A, Author contributions Taylor KD, Rich SS, Rotter JI, Haussler D, Garrison E, Paten B. Pangenom- TM and BL conceived the method. TM wrote the Rowbowt software tool. TM, ics enables genotyping of known structural variants in 5202 diverse NGKV and BL designed and ran the experiments. TM, NGKV and BL wrote and genomes. Science. 2021;374(6574):8871. revised the manuscript. All authors reviewed the manuscript. All authors read 15. Sibbesen JA, Maretty L, Krogh A. Accurate genotyping across variant and approved the final manuscript. classes and lengths using variant graphs. Nat Genet. 2018;50(7):1054–9. 16. Ebler J, Ebert P, Clarke WE, Rausch T, Audano PA, Houwaart T, Mao Y, Funding Korbel JO, Eichler EE, Zody MC, et al. Pangenome-based genome infer- N.S.K.V., T.M. and B.L. were supported by NIH grants R01HG011392 and ence allows efficient and accurate genotyping across a wide spectrum of R35GM139602 to BL. T.M. and B.L. were also supported by NSF DBI Grant variant classes. Nat Genet. 2022;54(4):518–25. 17. Gagie T, Navarro G, Prezza N. Optimal-Time Text Indexing in BWT-runs Bounded Space. In: Proceedings of the 29th Annual Symposium on Availability of data and materials Discrete Algorithms (SODA), pp. 1459–1477; 2018. rowbowt is available at https:// github. com/ alshai/ rowbo wt. 18. Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient construction of a complete index for pan-genomics read alignment. J Declarations Comput Biol. 2020;27(4):500–13. 19. Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead Competing interests B. Pan-genomic matching statistics for targeted nanopore sequencing. The authors declare that they have no competing interests. iScience. 2021;24(6): 102696. 20. Burrows M, Wheeler DJ. A block sorting lossless data compression algo- rithm. Technical Report 124, Digital Equipment Corporation 1994. Received: 31 March 2023 Accepted: 22 April 2023 21. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), pp. 390–398; 2000. 22. Chaisson MJP, Sanders AD, Zhao X, Malhotra A, Porubsky D, Rausch T, Gardner EJ, Rodriguez OL, Guo L, Collins RL, et al. Multi-platform discov- References ery of haplotype-resolved structural variation in human genomes. Nat 1. Davies RW, Kucka M, Su D, Shi S, Flanagan M, Cunniff CM, Chan YF, Myers Commun. 2019;10(1):1784. S. Rapid genotype imputation from sequence with reference panels. Nat 23. Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, Genet. 2021;53(7):1104–11. Sulovari A, Ebler J, Zhou W, Serra Mari R, et al. Haplotype-resolved diverse 2. Kim C, Guo H, Kong W, Chandnani R, Shuang LS, Paterson AH. Application human genomes and integrated analysis of structural variation. Science. of genotyping by sequencing technology to a variety of crop breeding 2021;372:6537. programs. Plant Sci. 2016;242:14–22. 24. Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: a 3. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini pangenomic index for finding maximal exact matches. J Comput Biol. JL, McCarthy S, McVean GA, Abecasis GR, et al. A global reference for 2022;29(2):169–87. human genetic variation. Nature. 2015;526(7571):68–74. 25. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Hand- 4. Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, saker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and Murphy TD, Pruitt KD, Thibaud-Nissen F, Albracht D, et al. Evalua- VCFtools. Bioinformatics. 2011;27(15):2156–8. tion of GRCh38 and de novo haploid genome assemblies demon- 26. Li H. A statistical framework for SNP calling, mutation discovery, associa- strates the enduring quality of the reference assembly. Genome Res. tion mapping and population genetical parameter estimation from 2017;27(5):849–64. sequencing data. Bioinformatics. 2011;27(21):2987–93. 5. Günther T, Nettelblad C. The presence and impact of reference bias on 27. Gog S, Beller T, Moffat A, Petri M. From theory to practice: Plug and play population genomic studies of prehistoric human populations. PLoS with succinct data structures. In: 13th International Symposium on Genet. 2019;15(7):1008302. Experimental Algorithms, (SEA 2014), pp. 326–337; 2014. 6. Brandt DY, Aguiar VR, Bitarello BD, Nunes K, Goudet J, Meyer D. Mapping 28. Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M. Flexible indexing bias overestimates reference allele frequencies at the HLA genes in the of repetitive collections. In: Kari J, Manea F, Petre I., editors. Unveiling 1000 genomes project phase I data. G3 (Bethesda). 2015;5(5):931–41. dynamics and complexity. vol. 10307, pp. 162–174. Springer, Cham; 2017. 7. Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, Boorgula Series Title: Lecture Notes in Computer Science. MP, Chavan S, Vergara C, Ortega VE, et al. Assembly of a pan-genome 29. Reinert K, Dadi TH, Ehrhardt M, Hauswedell H, Mehringer S, Rahn R, Kim from deep sequencing of 910 humans of African descent. Nat Genet. J, Pockrandt C, Winkler J, Siragusa E, Urgese G, Weese D. The SeqAn C++ 2019;51(1):30–5. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 17 of 17 template library for efficient sequence analysis: A resource for program- mers. J Biotechnol. 2017;261:157–68. 30. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. 31. Wagner J, Olson ND, Harris L, McDaniel J, Cheng H, Fungtammasan A, Hwang Y-C, Gupta R, Wenger AM, Rowell WJ, et al. Towards a comprehen- sive variation benchmark for challenging medically-relevant autosomal genes. 2021. 32. NIST: Medically Relevant Genes. [Online]. Available from: https:// github. com/ usnis tgov/ cmrg- bench marks et- manus cript/ tree/ master/ data/ gene_ coords/ unsor ted/ GRCh38_ mrg_ full_ gene. bed. Accessed 19 Mar 33. Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, Forster J, Lee S, Twardziok SO, Kanitz A, Wilm A, Holtgrewe M, Rahmann S, Nahnsen S, Köster J. Sustainable data analysis with Snakemake. F1000 Res. 2021;10:33. 34. Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017;33(17):2759–61. 35. Goga A, Baláž A, Petescia A, Gagie T. MARIA: multiple-alignment r -index with aggregation. 2022. arXiv 2209.09218. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in pub- lished maps and institutional affiliations. Re Read ady y to to submit y submit your our re researc search h ? Choose BMC and benefit fr ? Choose BMC and benefit from om: : fast, convenient online submission thorough peer review by experienced researchers in your field rapid publication on acceptance support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year At BMC, research is always in progress. Learn more biomedcentral.com/submissions http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Algorithms for Molecular Biology Springer Journals

Pangenomic genotyping with the marker array

Loading next page...
 
/lp/springer-journals/pangenomic-genotyping-with-the-marker-array-YtsNR1osUv
Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2023
eISSN
1748-7188
DOI
10.1186/s13015-023-00225-3
Publisher site
See Article on Publisher Site

Abstract

We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect from large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods. The method is implemented in the open source software tool rowbowt available at https:// github. com/ alshai/ rowbo wt. Keywords Sequence alignment, Indexing, Genotyping This was shown in studies of archaic hominids [5], HLA Introduction genotypes [6] and structural variants [7]. A similar bias Given DNA sequencing reads from a donor individual, exists for methods that extract polymorphic sites along genotyping is the task of determining which alleles the with genomic context, and search for these sequences individual has at polymorphic sites. Genotyping from in the reads [8, 9]. In particular, the bias remains if the sequencing data, sometimes using low-coverage sequenc- flanking sequences are extracted from the reference and ing data together with genotype imputation, is a common so contain only reference alleles. task in human genetics [1] and agriculture [2]. In contrast Reference bias can be reduced by using a pangenome to variant calling, genotyping is performed with respect reference instead of a single linear reference. A pange- to a catalog of known polymorphic sites. For instance, nome can take various forms; it can be (a) a generating genotyping of a human can be performed with respect to graph for combinations of alleles, (b) a small collection of the 1000 Genomes Project call set, which catalogs posi- linear references indexed separately, or (c) a larger collec- tions, alleles and allele frequencies for tens of millions of tion of linear reference indexed together in a compressed sites [3]. way. Pangenome graphs (option a) and small collections Many existing genotypers start by aligning reads to a of linear references (option b) have been studied in recent single linear reference genome, e.g.  the human GRCh38 literature [10–14]. reference [4]. Because this reference is simply one exam- Two existing methods that use pangenome graphs are ple of an individual’s genome, genotyping is subject to BayesTyper [15] and PanGenie [16]. BayesTyper reference bias, the tendency to make mistakes in places works by matching k-mers extracted from the input reads where the donor differs genetically from the reference. to k-mers stored in a de Bruijn graph representing known polymorphisms and their surrounding contexts. After *Correspondence: tallying this evidence, BayesTyper calls genotypes Ben Langmead based on a generative model. PanGenie uses a graph langmea@cs.jhu.edu built from haplotypes, collapsed so that variants become Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA bubbles. PanGenie then scans the reads and tallies k-mers that appear in the graph. It then makes genotype © The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 2 of 17 calls based on the tallies of how often read k-mers match array while retaining its ability to deduce when a read- to k-mers along the alternate paths that make up the dis- to-pangenome match provides evidence for a particular tinct REF and ALT alleles. allele at a polymorphic site. The design of the marker Variant graphs like the ones used by BayesTyper and array flows from three observations. First, we can save PanGenie are effective for genotyping, but have draw - space by storing auxiliary information about polymor- backs when the goal is to reduce reference bias. First, phic sites (“markers”) only at the sites themselves. There haplotype information might be removed when adding are often far fewer sites harboring polymorphism than variants to the graph, or might be included in the graph there are BWT runs. Second, pangenome suffixes start - but not considered during the read mapping process. ing with the same allele tend to group together in the suf- This can cause graph-based tools to consider many extra - fix array, which can be exploited to compress the marker neous haplotype paths through the graph during geno- array structure. Third, while a suffix array entry is an off - typing, increasing running time. Second, variant graphs set into the pangenome requiring O(log n) bits, a marker can grow exponentially—in terms of the number of paths need only distinguish markers and alleles, and so requires through the graph—as variants are added, leading to a just O(logM) bits where M is the number of polymorphic rapid increase in resource usage and likelihood of ambig- sites. uous alignments. We sought a way to reduce reference bias by indexing Methods and querying many linear references at once while keep- Preliminaries ing index size and query time low. Such an approach can Consider a string S of length n from ordered alphabet  , take full advantage of linkage disequilibrium information with operator ≺ denoting lexicographical order. Assume in the panel, allowing no recombination events except S’s last character is lexicographically less than the others. those occurring in the panel. This avoids mapping ambi - Let F be an array of S’s characters sorted lexicographically guity from spurious recombination events between poly- by the suffixes starting at those characters, and let L be morphic sites [10]. an array of S’s characters sorted lexicographically by the We propose a new structure called the marker array suffixes starting immediately after them. The list L is the that replaces the suffix-array-sample component of the Burrows-Wheeler Transform [20] of S, abbreviated BWT. r-index with a structure tailored to the problem of col- The BWT can function as an index of S [21]. Given a lecting genotype evidence. Here we describe the marker pattern P of length m < n , we seek the number and loca- array structure in detail. We compare its space usage and tion of all occurrences of P in S. If we know the range query time to those of the standard r-index and explore BWT(S)[i..j] occupied by characters immediately preced- how accurately both structures are able to capture mark- ing occurrences of a pattern Q in S, we can compute the ers from a sequencing dataset. Finally, we benchmark ′ ′ range BWT(S)[i ..j ] containing characters immediately it using real whole-genome human sequencing data preceding occurrences of cQ in S, for any character c ∈  , and compare it to the BayesTyper and PanGenie since genotyping tools in terms of both genotyping accuracy and computational efficiency. We do this for variants ′ i = |{h : S[h]≺ c}| + |{h : BWT(S)[h] = c, h < i}|. genomewide and for variants in genes that are medically (1) relevant. j = |{h : S[h]≺ c}| + |{h : BWT(S)[h] = c, h ≤ j}| − 1. (2) Background The FM Index is a collection of data structures for exe - r‑index cuting such queries efficiently. It consists of an array C The r-index [17] is a compressed repeat-aware text index storing |{h : S[h] ≺ c}| for each character c, plus a rank that scales with the non-redundant content of a sequence data structure for BWT(S) , e.g.  a wavelet tree, that can collection. It uses O(r) space where r is the number of quickly tally the occurrences of a character c up to a posi- same-character runs in the Burrows-Wheeler Trans- tion of BWT . To locate the offsets of occurrences of P in form (BWT) of the input text. Past work shows that the S, the FM-index can additionally include some form of r-index can efficiently index collections of long-read- S’s suffix array. The suffix array SA is an array parallel to derived human genome assemblies [18] as well as large F containing the offsets of the characters in F. To save collections of bacterial genomes [19]. space, the FM-index typically keeps only a sample of SA , While the main contribution of the r-index was its e.g. a subset spaced regularly across SA or across S. strategy for storing and using a sample of the suffix array Let T ={T , T , . . . , T } be a collection of n similar 0 1 n [17], even this sample is large in practice. We propose texts where T is the reference sequence, and T , . . . , T 0 1 n a new marker array structure that replaces the suffix M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 3 of 17 Fig. 1 Top left: A multiple alignment for a collection of alternate haplotypes (H1–H9), and a reference sequence (R). Marked bases are in bold and alternate alleles are colored. Middle left: The text T, formed by concatenating rows of the multiple alignment (eliding gaps). Bottom left: The edit table E, with alternate-allele coloring. Right: A partial illustration of the marker array in relation to SA , the relevant suffixes themselves (truncated to fit), and the BWT . Colors and bolding highlight where marked bases and alternate alleles end up in the suffixes are alternative sequences. In the scenarios studied here, a genomes in rows and columns representing genomic off - T represents a human haplotype sequence, with all chro- sets. The elements are either bases or gaps. We call a col - mosomes concatenated, and T represents the GRCh38 umn consisting of identical bases and lacking any gaps a primary assembly of the human genome. Each T with uniform column. Any other column is a polymorphic col- i > 0 is an alternate haplotype taken either from the 1000 umn. Figure  1 illustrates a multiply-aligned collection of Genomes project call set [3] or from the HGSVC pro- haplotypes and the concatenated text T. ject [22, 23], each with chromosomes concatenated in the same order as T ’s. We use the terms “haplotype” and Marker array “genome” interchangeably here. Let the “marker array” M be an array parallel to the con- We assume that all the T ’s are interrelated through a catenated sequence T marking positions that fall in a multiple alignment, e.g. as provided in a Variant Call For- polymorphic column in the multiple alignment. Each ele- mat (VCF) file. The multiple alignment is a matrix with ment of M is a tuple recording the offset i with respect Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 4 of 17 to T where the polymorphism begins, as well as the edit starting from each row x ∈[i..j] . An FL step is the inverse operation describing how the sequence differs from the of an LF step. That is, if we write an LF mapping step in reference at this locus. Distinct edit operations are given terms of a rank query distinct integer identifiers, which are decoded using a i = |{h : S[h]≺ c}| + BWT.rank (i), BWT[i] (3) separate table E. Identifier 0 is the null operation, denot - ing that the reference allele appears without edits. For where S.rank (i) denotes the number of occurrences of example, say E ={1 : X → C} , where X → C denotes a c in S up to but not including offset i, then an FL step substitution that replaces the reference base with C. Then inverts this using a select query a marker array record m = (500, 0) marks a locus with no ′ ′ edit with respect to reference position T [500] . A record ′ 0 i =BWT.select (i − |{h : S[h]≺ F[i ]}|), F[i ] (4) m = (500, 1) denotes that a substitution changes that th where S.select (i) returns the offset of the i + 1 occur- base at T [500] to a C. An example is shown in Fig. 1 (bot- rence of c in S, i.e.  the c of rank i. Whereas LF takes a tom left). leftward step with respect to T, FL takes a rightward step. Consecutive substitutions are collapsed into a single By starting in each row x ∈[i..j] and performing a edit in the E table. Insertions and deletions (“indels” for sequence of |q|− 1 FL steps for each, we can visit each short) are treated somewhat differently; the offset car - offset of T overlapped by an occurrence of q. Checking rying the “mark” is the one just preceding the indel (just MA[k] at each step, where k is the current row, tells us to its left) in the multiple alignment. Importantly, the which marker is overlapped, if any. This is slow in prac - mark covers exactly one position in the genome, even tice, both because it requires O(|q|(j − i + 1)) FL steps if the insertion/deletion spans many bases. The marked in total, and because each step requires a select query, position must come to the left of the indel to ensure which is more costly in practice than a rank query. that suffixes starting at the marked position include the allele itself. In the multiple alignment in Fig.  1 (top left), for example, a deletion with respect to R occurs in the Heuristic backward‑search approach with smearing fourth-from-left column, but the marked position is in Say we perform a backward search starting with the the third-from-left column. rightmost character of q. At each step we are consid- The marker array MA is a permutation of M such that ering a range [i..j] of SA having a suffix of q as a prefix. marks appear in suffix-rank order: Using i and j, we can query MA[i..j] . However, this tells us instances where a suffix of q overlaps a marker, whereas Definition 1 The marker array MA is the mapping such our goal is to find where the whole query q overlaps a that MA[i] = M[SA[i]]. marker. If we report overlaps involving trivially short suf- fixes of q, many would be false positives. We propose to Thanks to suffix-rank order, identical M[i] ’s are often allow but reduce the number of such false positives by grouped into runs in MA , as seen in Fig. 1 (right). augmenting MA: A marker query for pattern string q returns all m ∈ M overlapped by an occurrence of q in T. We can begin to Definition 2 The augmented marker array MA is a answer this query using backward search (Eq.  1) with multimap such that MA [i] =[M[SA[i]],M[SA[i]+ 1], ..., P = q , giving the maximal SA range [i..j] such that q is a M[SA[i] + w]] prefix of the suffixes in the range. Having computed [i..j], we know that {MA[i], ...,MA[j]} contain markers over- That is, MA [i] is a (possibly empty) list containing lapped by q’s leftmost character. To recover the markers markers overlapping any of the positions T[i..i + w] . W e overlapped by the rest of q’s characters, two approaches call this a “smeared” marker array, since the marks are can be considered, detailed in the following subsections. extended (smeared) to the left by w additional positions. The FL approach recovers the overlapped markers in a Note that a length-w extension can overlap one or more straightforward way but uses O(|q|· occ) time, where other marked variants to the left. For this reason, MA occ is the number of times q occurs in T. The heuristic must be a multimap, i.e. it might associate up to w mark- backward-search approach requires only O(|q|) time but ers with a given position. is not fully sensitive, i.e. it can miss some overlaps. Using MA , we adjust the backward-search strategy so that instead of querying MA at each step, we query MA every w steps. If w is large enough—e.g.  longer than the FL approach length at which we see random-chance matches—we Say [i..j] is the maximal SA range such that all rows have can avoid many false positives. More space is required q as a prefix. We can perform a sequence of FL steps, M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 5 of 17 w previous section. The algorithm proceeds right to left to represent MA compared to MA since it is less sparse. w through the read, growing the match by one character if However, we expect MA to remain run-length com- possible. When we can no longer grow the match (i.e. the pressible for the same reason that MA is. range [i..j] becomes empty), we reset the range to the all- inclusive range [0..|T |− 1] and restart the matching pro- Genotyping a read cess at the next character. We use the term “extension” to Given a sequencing read, we would like to extract as refer to a consecutive sequence of steps that successfully much genotype information as possible while minimiz- extend a match. Note that this is a heuristic algorithm ing computational cost and false-positive genotype evi- that does not exhaustively find all half-MEMs between dence. Here we give a heuristic algorithm (Algorithm  1) w the read and the index, as the MONI algorithm does [24]. that handles entire sequencing reads, querying MA during the backward-search process as proposed in the Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 6 of 17 As discussed above, the algorithm only checks the (and non-empty) elements. Our sparse encoding for A marker array every w steps (line 14). As an additional consists of three structures. S is a length-|A| bit vector filter, the algorithm only performs a marker-array query with 1s at the positions where a run of identical entries when the current suffix-array range size is no larger than in A begins, and 0s elsewhere. E is a similar bit vector the number of haplotypes in the index ( N ). A range marking the last position of each of the x runs. (This exceeding that size indicates that we are seeing more variable E is distinct from the E table defined above than one distinct match in at least one haplotype, mean- in “Marker Array.”) To find whether an element A [i] is ing that the evidence is ambiguous. non-empty, we can ask whether we are between two The algorithm tallies evidence as it goes (line 18), but such marks; that is, A[i] is non-empty if and only if might later choose to ignore that evidence if certain S.rank (i + 1)> E.rank (i). 1 1 conditions are not satisfied (lines 10 and 27). For exam - X is a length-x array containing the element that is ple, if the evidence has a conflict—i.e.  one match indi - repeated in each of A’s non-empty runs, in the order cates a reference allele at a site but another match found they appear in A. If A[i] is not empty, the element during the same extension finds an alternate allele at appearing there is given by X [E.rank (i)]. that site—then all the evidence is discarded for that When encoding M or MA , the elements of X are simply extension. Similarly, evidence from an extension is dis- tuples. A complication exists for MA , since elements are carded if the tallied sites span multiple chromosomes. lists of up to w tuples. In this case, we keep an additional Finally, evidence from extensions failing to match at bit-vector B of size |X| where 1s denote left-hand bound- least 80 bp of the read (adjustable with --min-seed- aries in X that correspond to runs in A. E and B can be length option) is discarded. used together to access an element in A (Algorithm 2). We employ other heuristics to minimize mapping time not shown in Algorithm 1. For instance, we avoid wasted effort spent querying the wrong read strand. Extracting markers from VCF Specifically: rowbowt randomly selects an initial A Variant Calling Format (VCF) [25] file is used to strand of the read to investigate: forward or reverse encode a collection of haplotypes with the variants complement. If an extension from this strand meets the arranged in order according to a reference genome. In minimum seed-length threshold (80 by default), then the case of human and other diploid genomes, haplo- the other strand is not considered and analysis of the types are grouped as pairs. We refer to such a collection read is complete. Otherwise, rowbowt then goes on to of haplotypes as a “panel” and a single haplotype as a examine the opposite strand of the read. “panelist.” An VCF entry encodes a variant as a tuple consisting of a chromosome, offset, the allele found in the reference, the alternate allele found in one or more Sparse marker encoding panelists, and a sequence of flags indicating whether We encode the sparse arrays M , MA and MA in the each panelist has the reference or alternate version. We following way. Say that array A consists of empty and start from a VCF file to determine how to populate the non-empty elements. We consider A’s non-empty ele- marker arrays M , MA and/or MA , as well as the edit ments as falling into one of x maximal runs of identical array E. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 7 of 17 A single element of M is a tuple (r, e), where r is an off - This software is written in C++17, uses the open-source set in T and e is the edit operation describing how the MIT license, and builds on the Succinct Data Structure sequence differs from the reference. As a practical mat - Library (SDSL) v3.0 [27]. ter, we represent these tuples in a different way that more For querying the marker array, we use the rowbowt closely resembles the corresponding VCF records. Spe- implementation at https:// github. com/ alshai/ rowbo wt. cifically, a marker is encoded in a 64-bit word divided This repository contains the open source C++17 imple - into three fields. First, a 12 bit field identifies the chromo - mentation of rowbowt, distributed under the MIT some containing the marker. The chromosome ordering license. It is also a library, containing algorithms for is given at the beginning of the VCF file in the “header” building and querying indexes containing various struc- section. For example, if “chr1” is the first chromosome in tures discussed here, including the run-sampled suffix the header, then this chromosome is encoded as 0x000 array, marker array, and others. (using hexadecimal), and if “chr2” is the second chromo- some, it is encoded as 0x001. Second is a 54 bit field Results encoding the marker’s offset within the chromosome. We evaluated the efficiency and accuracy of our marker- Third is a 4-bit field storing which version of the variant array method for compiling genotype evidence. We first is present, with 0 indicating the reference allele, 1 indicat- generated multiple series of rowbowt indexes covering ing the 1st alternate allele, 2 the second alternate allele, various settings for three parameters: the window size w etc. This 64-bit representation allows for compact storage for the smeared marker array MA , the number of hap- of markers and easier random access to the marker array. lotypes indexed, and the minimum allele frequency for marked alleles. The rowbowt index consisted of three components: the run-length encoded BWT, the run-sam- pled suffix array, and the marker array. While we built the Diploid genotyping sampled suffix array for these experiments, the stand - In a diploid genome, it is possible for both alleles to ard marker-array-based method in rowbowt does not occur, i.e. for the genotype to be heterozygous. We use an require this array. existing approach [26] to compute genotype likelihoods We generated indexes for collections of 200,  400,  800, considering all possible diploid genotypes: homozygous or 1000 human chromosome-21 haplotypes from the reference (2 reference alleles), homozygous alternate (2 1000 Genomes Phase 3 reference panel [3] based on the alternate alleles), or heterozygous (1 reference, 1 alter- GRCh37 reference. We generated two sets of indexes: nate). Let g ∈{0, 1, 2} denote the number of reference- one where the marker array marks all polymorphic allele copies at the marked site; e.g.  g = 1 corresponds sites regardless of frequency (denoted “ AF > 0”), and to a heterozygous site and g = 2 to a homozygous ref- another where the marker array marks only those sites erence site. Let l be the number of times the reference where the less common allele occurs in greater than 1% allele was observed in the reads overlapping a particular of haplotypes, i.e.  has allele frequency over 1% (denoted marked site and let k be the count of all alleles (reference “ AF > 0.01”). In all cases, the marker array window size or alternate) observed. Let ǫ be the sequencing error rate. w was set to 19. Each haplotype collection was drawn We calculate the genotype likelihood as follows, adopting from a random subset of 500 individuals from the 1000 equation 2 of [26] while setting the ploidy to 2 and adopt- Genomes Phase 3 panel. The AF > 0 panel of 500 hap- ing a global rather than a per-base error rate: lotypes contained 1,  097,  388 polymorphic sites. The AF > 0.01 panel of the same haplotypes contained k k−l L(g) = [(2 − g)ǫ + g(1 − ǫ)] [(2 − g)(1 − ǫ) + gǫ] . 193, 438 polymorphic sites with allele frequency over 1%. We also included the GRCh37 reference sequence, con- To choose the most likely genotype , we compute: max sisting of all reference alleles, in each collection, corre- g = argmax L(g). sponding to the reference sequence called T above. max g∈{0,1,2} We generated a series of indexes with window size By default, rowbowt uses ǫ = 0.01. w ∈{13, 15, 17, 19, 21, 23, 25} . We generated two such series: one with no minimum allele frequency ( AF > 0 ) Implementation details and another with a 1% minimum frequency ( AF > 0.01 ). The code for constructing the marker array is imple - Each index was over the same set of 100 haplotypes. mented in the pfbwt-f spoftware package, with repository at https:// github. com/ alshai/ pfbwt-f. This Index size repository also contains an efficient implementation of We measured the size of the three main components the prefix-free-parse BWT construction algorithm [18]. of the rowbowt index: the augmented marker array, Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 8 of 17 Fig. 3 Mean time over 10 trials of aligning 10,000 simulated reads Fig. 2 Left: Size of rowbowt data structures as a function of the from HG01498 against the augmented marker array (marker) and number of haplotypes indexed, and with w = 19 . “Mark er array ” w the r-index suffix array (locate). Experiments are repeated for marker refers to the augmented marker array, MA . Right: Size of rowbowt collections including all alleles ( AF > 0 ) and for alleles having data structures as a function of the “smearing” window size w, with frequency at least 1% ( AF > 0.01 ). Left: The experiment is repeated number of haplotypes fixed at 100. Separate results are shown for for various window sizes w, and for 100 haplotypes. Right: The when there is no minimum allowed allele frequency ( AF > 0 ) and experiment is repeated for different numbers of indexed haplotypes, when the minimum frequency is 1% ( AF > 0.01) with w = 19 the run-length encoded BWT (RLE BWT) [28] and the runs in the augmented marker array, negatively affecting run-sampled suffix array (“r-index SA”) [17]. Figure  2 run-length compression. plots this measurement for collections of 200, 400, 800 In the right portion of Fig. 2, the RLE BWT and r-index and 1000 haplotypes for both AF > 0 and AF > 0.01 . SA have constant size because the w parameter does not All grow linearly with the number haplotypes grows, affect those data structures. In the left portion of Fig.  2, as expected. For AF > 0 , the augmented marker array showing size as a function of number of haplotypes, the is consistently larger than the run-sampled suffix array augmented marker array is almost always larger than (“r-index SA”). For AF > 0.01 , the augmented marker the r-index SA for AF > 0 as opposed to AF > 0.01 , array is much smaller, approaching the size of the RLE except at w = 13 . The slope of the array size is smaller for BWT. The AF > 0 array is larger because it contains pol- AF > 0.01 than for AF > 0. ymorphic sites with infrequent alleles; about 85% of the Overall, both the value of w and the number of hap- marked sites in the AF > 0 array have allele frequency lotypes in the index cause the augmented marker array under 1%. Further, rare alleles are less likely to form long to increase in size, but the inclusion of rare alleles (< 1% allele frequency) has the largest effect on its size. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 9 of 17 Query time We next measured query time for the augmented marker array strategy versus the locate-query strategy which uses the run-sampled suffix array. 150 bp simulated reads of 25x coverage were generated for one haplotype of HG01498, an individual that is part of the 1000-Genomes panel, but which we excluded from all our indexes. We simulated reads using Mason 2 mason_simulator [29] with default options. In the case of the marker-array strategy, we measured the time required to analyze the reads using the algo- rithm described above in “Genotyping a read.” In the case of the locate-query strategy, the MA query was replaced with a two-step process that first performed a locate query with respect to the run-sampled suffix array, then performed a lookup in the M array. To enable this mode, we further augmented the r-index with a representa- tion of M using the sparse encoding described above. To emphasize: the rowbowt strategy does not require the run-sampled suffix array or the M array; the MA effec - tively replaces them both. We repeatedly sampled 10,000 simulated reads and recorded the mean query time over 10 trials. As seen in Fig.  3 the augmented marker-array method (labeled “marker”) was consistently faster than locate method. This was true for all allele frequencies and window sizes tested. We found that the effect of w and allele-frequency cutoff was more pronounced with the larger reference panel AF > 0 . For the smaller panel ( AF > 0.01 ), quer y time was mostly invariant to both window size and allele Fig. 4 Precision (left) and recall (right) of the calls made when frequency. querying 25x simulated reads from HG01498 against the augmented marker array (marker) and the r-index (locate). Stratified by minimum allowed allele frequency (AF) in the index Genotyping accuracy We next measured the accuracy of the genotype infor- mation gathered using the augmented marker array. We simulated sequencing reads from one haplotype of marked sites that truly have the alternate allele, while the HG01498 to an average depth of 25-fold coverage. Indi- negative class consists of marked sites that truly have the vidual HG01498 was excluded from the indexes. As our reference allele. We measure: “truth” set for evaluation, we use the variant calls in the 1000 Genomes project callset for the same haplotype we TPs TPs Precision = Recall = , simulated from. For simplicity, this experiment treats the TPs + FPs TPs + FNs genome as haploid. Experiments in the next section will where TP stands for True Positive, FN stands for False account for the diploid nature of human genomes. Negative, etc. A single marked site can have conflicting evidence, due, Figure 4 shows precision and recall with respect to the for instance, to mismapped reads or sequencing errors. number of haplotypes in the index and the minimum For this evaluation, we make calls simply by finding the allele frequency of the haplotype collection. We observed frequently observed allele at the site. We ignore any that the AF > 0.01 indexes generally had better precision instances of alleles other than the ones noted in the VCF compared to the AF > 0 indexes, though at the expense file as the reference and alternate alleles. If the reference of recall. Precision and recall generally improve with the and alternate alleles have equal evidence, the reference addition of more haplotypes to the index. The augmented allele is called. marker array has similar recall to the locate procedure We calculate precision and recall according to the fol- across all haplotype sizes at the loss of precision. When lowing formulas. Here, the positive class consists of Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 10 of 17 rare variants are removed from the index ( AF > 0.01 ), substitutions, “Indel” includes indels no more than 50bp the gap in precision between the marker array and the long, and “SV” includes insertion or deletions longer than locate procedure lessens. This mild (less than 0.1%) loss 50bp, and “All” includes all variant types. More complex of precision is expected since algorithm described above structural variants like inversions and chromosomal rear- in “Genotyping reads” is still prone to some false posi- rangements are ignored. tives in the earlier part of the extension process. We analyzed the accuracy of rowbowt ’s diploid gen- otypes in two ways. First we considered allele-by-allele Diploid genotyping assessment precision and recall, considering the alternate (ALT) To assess diploid genotyping accuracy, we used data allele calls to be the positive class. Specifically, every dip - from the Human Genome Structural Variation Consor- loid genotype called by a method is considered as a pair tium (HGSVC) [22, 23]. The HGSVC called both simple of individual allele calls. If a given allele call is an alternate and complex genetic variants across a panel of 64 human (ALT) allele and there is at least one ALT allele present in genomes. Calls were made with respect to the GRCh38 the true diploid genotype at that site, it counts as a true primary assembly [4]. For input reads, we subsampled positive (TP). If the given allele is a reference allele (REF) reads from a 30-fold average coverage PCR-free read set and there is at least one REF allele in the true diploid gen- provided by the 1000 Genomes Project [3] (accession otype, this is a true negative (TN). If the given allele is an SRR622457). To create more challenging scenarios for ALT but the true genotype is homozygous REF, we count the genotypers, we randomly subsampled the read sets it as a false positive (FP). Finally, if the given allele is REF to make smaller datasets of 0.01, 0.05, 0.1, 0.5, 1, 2 and but the true genotype is homozygous ALT, this is a a false 5-fold average coverage. negative (FN). We assessed four genotyping methods. The first Second, we considered precision and recall with (bowtie2+bcftools) used Bowtie 2 [30] v2.4.2 to respect to sites that were either truly heterozygous or align reads to a standard linear reference genome, then called heterozygous. If a heterozygous call made by a used BCFtools v1.13 to call variants (i.e.  genotypes) method is truly heterozygous, this was counted as a true at the marked sites [26]. The second method used the positive (TP). False positives, false negatives, and true graph-based genotyper BayesTyper v1.5. The third negatives are defined accordingly. was PanGenie [16], a pangenome-based genotyper As seen in Fig.  5, rowbowt ’s ALT and HET preci- for short reads. PanGenie calls genotypes for variants sion were generally the highest of all the methods across that are represented in its index as bubbles in a pange- all variant categories, though BayesTyper sometimes nome. PanGenie represents known haplotypes as paths achieved higher ALT/HET precision for indels in the through the graph, accounting for these paths during the higher-coverage datasets and PanGenie had the highest genotyping process. PanGenie takes directed acyclic HET precision for SNVs and Indels for some coverages. pangenome graph as input, represented as a VCF file At the highest coverage level examined (5-fold), PanG- containing phased genotypes for many samples. PanGe- enie also achieved similar ALT precision to rowbowt nie further requires that the VCF have non-overlapping on SNVs and Indels. rowbowt dominates on precision variants. We generated a compliant input VCF using a of SV calls, and also exhibits the best recall for ALTs at workflow provided by the PanGenie project We used higher coverage. PanGenie exhibits slightly higher the “high-gq” quality filter also used in the PanGenie recall for HET SVs. study: genotype quality (GQ) ≥ 200 . The fourth method Figure  6 compares the methods on their F1 measure. assessed was rowbowt. rowbowt achieved the highest F1 scores for ALT and Prior to applying BayesTyper, we built a Bayes- HET SNBs. For HET SNV variants, rowbowt had the Typer-compatible VCF file containing all relevant vari - highest F1 score of 0.71, while for ALT SNV variants, the ants from the HGSVC haplotype panel. For rowbowt, F1 was 0.95. In the case of Indel variants, rowbowt pro- we created a rowbowt index from the genomes in the duced the highest F1 score of 0.81 for the ALT class and HGSVC haplotype panel. In both cases we excluded tied with PanGenie for the HET class, with an aggregate NA12878’s haplotypes from the panel prior to building score of 0.48. Fr SVs, rowbowt had the highest F1 score of the index. 0.55 in the ALT class and a comparable score of 0.21 in When evaluating, we stratified variants by complex - the HET class, just below PanGenie ’s score of 0.23. ity: the “SNV” category includes single-nucleotide Diploid genotyping assessment on medically relevant genes We obtained a list of medically relevant autosomal genes, The workflow can be found in the pipelines/run-from-callset collected from medical gene databases by the Genome subdirectory of the PanGenie repository. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 11 of 17 Fig. 5 Precision and recall for the four tested genotyping methods, both at the level of individual alleles (ALT precision/recall) and at the level of heterozygous variants (HET precision/recall). Note that the bowtie2+bcftools approach is generally unable to align reads across variants in the “SV” category, leading to low precision and recall Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 12 of 17 Fig. 6 F1 scores across variant types and tools. Computed genomewide (top) and over only those variants in medically relevant autosomal genes (bottom) in a Bottle (GiaB) project [31]. Among the 5,026 genes case of Indel variants, rowbowt was the leader with an included are the major MHC Class I genes (HLA-A, F1 score of 0.82 for the ALT class, and tied with PanG- HLA-B, HLA-C), MHC Class II genes (HLA-DP, HLA- enie with a score of 0.47 for the HET class. As for SVs, DQ, HLA-DR) and KIR genes. A complete list including rowbowt had the highest F1 score of 0.56 in the ALT coordinates can be found at the CMRG GitHub repo [32]. class, and a comparable score of 0.19 in the HET class, As seen in Fig.  7, rowbowt ’s performance relative to closely following PanGenie ’s score of 0.22. the other methods is very similar for these genes com- pared to the full set of genes assessed in Fig. 5. As seen in Computational performance Fig.  6, rowbowt generally outperformed other methods We compared the time and memory usage of each geno- on F1 score for medically relevant genes. For HET SNV typing method, dividing the computations performed variants, rowbowt achieved the highest F1 score of 0.72, by each method into a few distinct categories. For each while for ALT SNV variants, the F1 score was 0.96. In the phase, we measured both the wall-clock time elapsed and M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 13 of 17 Fig. 7 Precision and recall presented as in Fig. 5 but filtered to only the medically relevant autosomal genes, collected from medical gene databases by the Genome in a Bottle (GiaB) project Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 14 of 17 Fig. 8 Wall clock time and peak memory footprint for each phase of the genotyping workflow for four methods tested maximum memory (“maximum resident set”) used. Both the case of BayesTyper, this specifically consists of were measured with the Snakemake tool’s benchmark (a) using the KMC3 software [34] to count k-mers in the directive [33]. input reads, (b) using the bayestyper makeBloom We categorized and benchmarked three distinct types command to convert k-mer counts to Bloom filters for of computation. For bowtie2+bcftools, we defined each sample, and (c) using the bayestyper clus- the “Alignment” step as the process of using bowtie2 to ter command to identify variant clusters. In the case align reads to the linear reference genome. For rowbowt, of PanGenie, “Pre-genotyping” consists of: (a) read- we defined the “Alignment” step as the process of using ing input files, (b) k-mer counting, (c) path selection, (c) the rb_markers command to genotype the reads using determining unique k-mers. Also for PanGenie, the ini- the algorithm described in Methods. For the alignment- tial input VCF file must be preprocessed into a new VCF free methods PanGenie and BayesTyper, we use the file containing additional fields, accounting for about 1 h term “Pre-genotyping” for the initial phase that includes of computation, not included in our measurements. k-mer counting and other index building procedures. In M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 15 of 17 16 threads were used during the Alignment phase for graph-indexing approaches, which might consider all bowtie2+bcftools, rowbowt and PanGenie, while possible combinations of nearby alleles to be “valid,” even 32 threads were used for BayesTyper. if most combinations never co-occur in nature. For bowtie2+bcftools, we define the Genotyping While we examined only simple structural variants in step as the process of using bcftools call to call the form of insertions and deletions longer than 50 bp, variants from the BAM file output by bowtie2. For the genotyping method is readily extensible to more rowbowt We define the Genotyping phase as the pro - complex differences as well. Indeed, as long as we can cess of running the vc_from_markers.py script on mark the base or bases just to the left of the variant, we the output from rb_markers. For BayesTyper, we can mark any variant in a way that we can later genotype. define the genotyping step as the process of running the Our method will mark only a single base that is to the bayesTyper genotype command. PanGenie was left of the boundary (“breakpoint”) of a variant. This is run by providing 16 threads. Figure  8 shows the time- true whether the variant is large (i.e.  is a large insertion taken and peak memory footprint for each method and or deletion) or small (i.e.  an SNV). We choose to mark on defined phases. the base to the left of the boundary rather than the right The Genotype phases for both bowtie2+ bcftools because suffixes starting from the base to the right will be and rowbowt do not support multi-threading, so a sin- unaffected by the REF and ALT alleles; therefore, matches gle thread was used. For the BayesTyper and Pange- spanning that position do not necessarily carry evidence nome Genotype phases, we used 16 threads. for a particular allele. That said, an alternative approach Figure  8 shows the time taken and peak memory would be to consider “suffixes” extending in either direc - footprint for each method and each category. We tion—either left-to-right or right-to-left—by additionally observed that rowbowt was consistently faster than indexing the reverse of the reference sequences. If both the other methods in the Genotyping phase, sometimes forward and reverse versions of the index are available, by a large margin. We also observed that while row- our method can be made more robust by combining bowt has a higher memory footprint compared to the the evidence obtained from matches spanning both the bowtie2+bcftools method, it uses substantially left and right boundaries of the polymorphic variant. In less memory than BayesTyper and PanGenie, the the future, it will be important to measure whether the other pangenome-based methods. PanGenie has a increase in index size and genotyping memory footprint particularly high (>100 gigabyte) memory footprint in can justify this refinement. both the Pre-genotyping and Genotyping steps. We also note that our strategy marks only a single left-flank base position per variant, even in the case of Discussion an insertion or deletion where the alleles span different We proposed a family of novel marker array structures, numbers of bases. For instance, for an insertion where M , MA and MA that, together with a pangenome index the REF allele is A and the ALT allele is AGG . If we were like the r-index, allow for rapid and memory-efficient instead to mark both the left-flank base (A in this exam - genotyping with respect to large pan-genome refer- ple) and every base within the insertion (the two Gs in the ences. The augmented marker array is smaller and ALT allele), we are vulnerable to a bias resulting from the faster to query than the run-sampled suffix array — the fact that there are more opportunities to observe a match usual way to establish where matches fall when query- overlapping the longer allele than the shorter one. With- ing a run-length compressed index — especially when out any correction, this artificially inflates the evidence we limit the set of markers to just alleles at frequency from the longer allele. That said, it should be possible to 1% or higher. We further showed that the augmented correct for the bias, e.g. by penalizing the evidence from marker array can replace the sampled suffix array in the longer allele in a length-weighted fashion. In the simple genotyping experiments with moderate sacrifice future, it will be important to investigate whether this of precision, and that a marker array based genotyping enhanced marking strategy yields an overall improve- method outperforms the graph-based BayesTyper ment in genotyping accuracy. method. Since this work first appeared, other groups have pur - Pan-genome indexes allow for rapid analysis of reads sued related ideas. In particular, we note that the new while reducing reference bias. The indexes used in our MARIA index is capable list all the distinct columns of experiments consisted of many (up to 65) haplotypes, the multiple alignment overlapped by a match in the with none having a higher priority over the others, except r-index [35]. Further, that study defines a quantity called in the sense that results were expressed in terms of the r that captures something similar to the number of dis- standard reference. Our approach preserves all linkage tinct runs we expected to find in the MA or MA arrays. disequilibrium information. This is in contrast to some In the future it will be important to directly relate r to the Mun et al. Algorithms for Molecular Biology (2023) 18:2 Page 16 of 17 8. Denti L, Previtali M, Bernardini G, Schönhuth A, Bonizzoni P. MALVA: geno- quantities discussed here and to compare the advantages typing by mapping-free ALlele detection of known VAriants. iScience. and disadvantages of our genotyping-centric approach to 2019;18:20–7. the more general approach of MARIA. 9. Shajii A, Yorukoglu D, William Yu Y, Berger B. Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics. The rowbowt method can lead to future methods that 2016;32(17):538–44. use information about genotypes to build a personal- 10. Pritt J, Chen NC, Langmead B. FORGe: prioritizing variants for graph ized reference genome, containing exactly the genotyped genomes. Genome Biol. 2018;19(1):220. 11. Chen S, Krusche P, Dolzhenko E, Sherman RM, Petrovski R, Schlesinger F, alleles. Alignment to a personalized reference have been Kirsche M, Bentley DR, Schatz MC, Sedlazeck FJ, Eberle MA. Paragraph: a shown previously to be the best way to reduce reference graph-based structural variant genotyper for short-read sequence data. bias, even more effective than the best pangenome meth - Genome Biol. 2019;20(1):291. 12. Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, Jones ods [10, 13]. W, Garg S, Markello C, Lin MF, Paten B, Durbin R. Variation graph toolkit improves read mapping by representing genetic variation in the refer- Acknowledgements ence. Nat Biotechnol. 2018;36(9):875–9. We thank Massimiliano Rossi and Travis Gagie for many helpful discussions. 13. Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reduc- We thank Margaret Gagie for her careful editing.Part of this research project ing reference bias using multiple population genomes. Genome Biol. was conducted using computational resources at the Maryland Advanced 2021;22(1):8. Research Computing Center (MARCC). 14. n J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, Sibbesen JA, Hickey G, Chang PC, Carroll A, Gupta N, Gabriel S, Blackwell TW, Ratan A, Author contributions Taylor KD, Rich SS, Rotter JI, Haussler D, Garrison E, Paten B. Pangenom- TM and BL conceived the method. TM wrote the Rowbowt software tool. TM, ics enables genotyping of known structural variants in 5202 diverse NGKV and BL designed and ran the experiments. TM, NGKV and BL wrote and genomes. Science. 2021;374(6574):8871. revised the manuscript. All authors reviewed the manuscript. All authors read 15. Sibbesen JA, Maretty L, Krogh A. Accurate genotyping across variant and approved the final manuscript. classes and lengths using variant graphs. Nat Genet. 2018;50(7):1054–9. 16. Ebler J, Ebert P, Clarke WE, Rausch T, Audano PA, Houwaart T, Mao Y, Funding Korbel JO, Eichler EE, Zody MC, et al. Pangenome-based genome infer- N.S.K.V., T.M. and B.L. were supported by NIH grants R01HG011392 and ence allows efficient and accurate genotyping across a wide spectrum of R35GM139602 to BL. T.M. and B.L. were also supported by NSF DBI Grant variant classes. Nat Genet. 2022;54(4):518–25. 17. Gagie T, Navarro G, Prezza N. Optimal-Time Text Indexing in BWT-runs Bounded Space. In: Proceedings of the 29th Annual Symposium on Availability of data and materials Discrete Algorithms (SODA), pp. 1459–1477; 2018. rowbowt is available at https:// github. com/ alshai/ rowbo wt. 18. Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient construction of a complete index for pan-genomics read alignment. J Declarations Comput Biol. 2020;27(4):500–13. 19. Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead Competing interests B. Pan-genomic matching statistics for targeted nanopore sequencing. The authors declare that they have no competing interests. iScience. 2021;24(6): 102696. 20. Burrows M, Wheeler DJ. A block sorting lossless data compression algo- rithm. Technical Report 124, Digital Equipment Corporation 1994. Received: 31 March 2023 Accepted: 22 April 2023 21. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), pp. 390–398; 2000. 22. Chaisson MJP, Sanders AD, Zhao X, Malhotra A, Porubsky D, Rausch T, Gardner EJ, Rodriguez OL, Guo L, Collins RL, et al. Multi-platform discov- References ery of haplotype-resolved structural variation in human genomes. Nat 1. Davies RW, Kucka M, Su D, Shi S, Flanagan M, Cunniff CM, Chan YF, Myers Commun. 2019;10(1):1784. S. Rapid genotype imputation from sequence with reference panels. Nat 23. Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, Genet. 2021;53(7):1104–11. Sulovari A, Ebler J, Zhou W, Serra Mari R, et al. Haplotype-resolved diverse 2. Kim C, Guo H, Kong W, Chandnani R, Shuang LS, Paterson AH. Application human genomes and integrated analysis of structural variation. Science. of genotyping by sequencing technology to a variety of crop breeding 2021;372:6537. programs. Plant Sci. 2016;242:14–22. 24. Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: a 3. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini pangenomic index for finding maximal exact matches. J Comput Biol. JL, McCarthy S, McVean GA, Abecasis GR, et al. A global reference for 2022;29(2):169–87. human genetic variation. Nature. 2015;526(7571):68–74. 25. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Hand- 4. Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, saker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and Murphy TD, Pruitt KD, Thibaud-Nissen F, Albracht D, et al. Evalua- VCFtools. Bioinformatics. 2011;27(15):2156–8. tion of GRCh38 and de novo haploid genome assemblies demon- 26. Li H. A statistical framework for SNP calling, mutation discovery, associa- strates the enduring quality of the reference assembly. Genome Res. tion mapping and population genetical parameter estimation from 2017;27(5):849–64. sequencing data. Bioinformatics. 2011;27(21):2987–93. 5. Günther T, Nettelblad C. The presence and impact of reference bias on 27. Gog S, Beller T, Moffat A, Petri M. From theory to practice: Plug and play population genomic studies of prehistoric human populations. PLoS with succinct data structures. In: 13th International Symposium on Genet. 2019;15(7):1008302. Experimental Algorithms, (SEA 2014), pp. 326–337; 2014. 6. Brandt DY, Aguiar VR, Bitarello BD, Nunes K, Goudet J, Meyer D. Mapping 28. Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M. Flexible indexing bias overestimates reference allele frequencies at the HLA genes in the of repetitive collections. In: Kari J, Manea F, Petre I., editors. Unveiling 1000 genomes project phase I data. G3 (Bethesda). 2015;5(5):931–41. dynamics and complexity. vol. 10307, pp. 162–174. Springer, Cham; 2017. 7. Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, Boorgula Series Title: Lecture Notes in Computer Science. MP, Chavan S, Vergara C, Ortega VE, et al. Assembly of a pan-genome 29. Reinert K, Dadi TH, Ehrhardt M, Hauswedell H, Mehringer S, Rahn R, Kim from deep sequencing of 910 humans of African descent. Nat Genet. J, Pockrandt C, Winkler J, Siragusa E, Urgese G, Weese D. The SeqAn C++ 2019;51(1):30–5. M un et al. Algorithms for Molecular Biology (2023) 18:2 Page 17 of 17 template library for efficient sequence analysis: A resource for program- mers. J Biotechnol. 2017;261:157–68. 30. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. 31. Wagner J, Olson ND, Harris L, McDaniel J, Cheng H, Fungtammasan A, Hwang Y-C, Gupta R, Wenger AM, Rowell WJ, et al. Towards a comprehen- sive variation benchmark for challenging medically-relevant autosomal genes. 2021. 32. NIST: Medically Relevant Genes. [Online]. Available from: https:// github. com/ usnis tgov/ cmrg- bench marks et- manus cript/ tree/ master/ data/ gene_ coords/ unsor ted/ GRCh38_ mrg_ full_ gene. bed. Accessed 19 Mar 33. Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, Forster J, Lee S, Twardziok SO, Kanitz A, Wilm A, Holtgrewe M, Rahmann S, Nahnsen S, Köster J. Sustainable data analysis with Snakemake. F1000 Res. 2021;10:33. 34. Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017;33(17):2759–61. 35. Goga A, Baláž A, Petescia A, Gagie T. MARIA: multiple-alignment r -index with aggregation. 2022. arXiv 2209.09218. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in pub- lished maps and institutional affiliations. Re Read ady y to to submit y submit your our re researc search h ? Choose BMC and benefit fr ? Choose BMC and benefit from om: : fast, convenient online submission thorough peer review by experienced researchers in your field rapid publication on acceptance support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year At BMC, research is always in progress. Learn more biomedcentral.com/submissions

Journal

Algorithms for Molecular BiologySpringer Journals

Published: May 5, 2023

Keywords: Sequence alignment; Indexing; Genotyping

References