Introduction

Small insertions/deletions (indels < 50 bp) can be represented differently between mappers and variant callers, or in the flanking sequence context. indelPost harmonizes these inter-mapper, inter-caller, and inter-read alignment ambiguities via realignment and read-based phasing to facilitate indel processing.

As an example, consider an inter-mapper ambiguity problem in DNA/RNA mappings. Suppose that you want to test the expression of the DNA indels:

_images/dna.svg

Matching the DNA indels to the corresponding transcriptome data is non-trivial as the indels are aligned differently in the RNA-Seq:

_images/rna.svg

To test the expression, first prepare inputs:

import pysam
from indelpost import Variant, VariantAlignment

# reference as pysam's FastaFile object
reference = pysam.FastaFile("/path/to/reference.fa")

# RNA-Seq BAM as pysam's AlignmentFile object
bam = pysam.AlignmentFile("/path/to/RNASeq.bam")

Make a Variant object to represent the target indel (the 2nd indel as example):

v = Variant("chrN", 9, "T", "TCCGTA", reference)

Next, make a VariantAlignment object. Upon creation, the VariantAlignment object realigns the RNA-Seq reads to find the target indel:

valn = VariantAlignment(v, bam)

Now, count_alleles() method counts the RNA-Seq reads supporing the target to check the expression:

cnt = valn.count_alleles()
print(cnt)

The alignment equivalency of the first DNA indel, the second DNA indel, and the RNA indel can be tested:

# first DNA indel
first = Variant("chrN", 8, "G", "GCCAC", reference)
valn_by_first = VariantAlignment(first, bam)

# indel in RNA-Seq
rna = Variant("chrN", 8, "G", "GCCACTCCGT", reference)
valn_by_rna = VariantAlignment(rna, bam)

# True
valn == valn_by_first == valn_by_rna

These indels are equivalent because they have the same phased indel representation. phase() returns a Variant representing the complex indel representation:

v_phased = valn.phase() # same result for the other 2 objects.

print(v_phased.chrom, v_phased.pos, v_phased.ref, v_phased.alt)

The last line outputs:

"chrN", 9, "T", "CCACTCCGTA"

See more Usage Examples .