API

indel processing on the reference

class indelpost.Variant

This class accepts a VCF-style variant representation as input. Equality holds between Variant objects if they are indentical in the normalized form.

Parameters:
  • chrom (string) – contig name.

  • pos (integer) – 1-based genomic position.

  • ref (string) – VCF-style reference allele.

  • alt (string) – VCF-style alternative allele.

  • reference (pysam.FastaFile) – reference FASTA file supplied as pysam.FastaFile object.

Raises:

ValueError – if the input locus is not defined in the reference or the input alleles contain letters other than A/a, C/c, G/g, T/t, and N/n.

count_repeats(by_repeat_unit=True)

counts indel repeats in the flanking reference sequences. The search window is defined by left_flank() and right_flank().

Parameters:

by_repeat_unit (bool) – count by the smallest tandem repeat unit. For example, the indel sequence “ATATATAT” has tandem units “ATAT” and “AT”. The occurrence of “AT” will be counted if True (default).

decompose_complex_variant(match_score=3, mismatch_penalty=2, gap_open_penalty=4, gap_extension_penalty=0)

returns a list of non-complex Variant objects decomposed by the Smith-Waterman local alignment with a given set of score/penalty.

Parameters:
  • match_score (integer) – default to 3.

  • mismatch_penalty (integer) – default to 2.

  • gap_open_penalty (integer) – default to 4.

  • gap_extension_penalty (integer) – default to 0.

generate_equivalents()

generates non left-aligned copies of Variant object.

indel_seq

returns the inserted/deleted sequence for non-complex indels. None for substitutions.

is_del

evaluates if variant_type is “D”.

is_indel

returns True if variant_type is “I” or “D”.

is_ins

evaluates if variant_type is “I”.

is_leftaligned

returns True if the indel position is the smallest possible value (left-aligned).

is_non_complex_indel()

returns True only if non-complex indel (False if complex indel or substitution).

is_normalized

returns True if left-aligned and the allele representations are minimal.

left_flank(window=50, normalize=False)

extracts the left-flanking reference sequence. See also right_flank().

Parameters:
  • window (integer) – extract the reference sequence [variant_pos - window, variant_pos].

  • normalize (bool) – if True, the normalized indel position is used as the end of the flanking sequence.

normalize(inplace=False)

normalizes Variant object.

Parameters:

inplace (bool) – returns None and normalizes this Variant object if True. Otherwise, returns a normalized copy of this object. Default to False.

query_vcf(vcf, matchby='normalization', window=50, indel_only=True, as_dict=True)

returns a list of VCF records matching this Variant object as dictionary (default).

Parameters:
  • vcf (pysam.VariantFile) – VCF file to be queried. Supply as pysam.VariantFile object.

  • matchby (string) –

    • “normalization” (default) matches by normalizing VCF records.

    • ”locus” matches by the normalized genomic locus.

    • ”exact” finds the exact matche without normalization.

  • window (integer) – searches the VCF records in indel position +/- window.

  • indel_only (bool) – returns matched indel and SNV records if False. Meaningful when matchby is “locus”.

  • as_dict (bool) – returns pysam.VariantRecord if False.

right_flank(window=50, normalize=False)

extracts the right-flanking reference sequence. See also left_flank().

Parameters:
  • window (integer) – extract the reference sequence [variant_end_pos, variant_end_pos + window].

  • normalize (bool) – if True, the normalized indel position is used as the start of the flanking sequence.

variant_type

returns “I” if the net allele-length change is gain, “D” if loss, “S” if signle-nucleotide substitution (zero net change), “M” if multi-nucleotide substitution (zero net change).

class indelpost.NullVariant

This class is returned when VariantAlignment cannot find the target indel in the BAM file. Boolean expression evaluates to False. Ref and Alt alleles are the reference base at the locus.

Parameters:
  • chrom (string) – contig name

  • pos (integer) – 1-based genomic position

  • reference (pysam.FastaFile) – reference FASTA file supplied as pysam.FastaFile object.

indel processing on alignment files (BAM)

class indelpost.VariantAlignment

This class accepts the target indel as Variant and the BAM file as pysam.AlignmentFile . Upon creation, the target indel is searched through realignment. Evaluation of the equality between VariantAlignment objects triggers local-phasing around the indels to test the alignment identity.

Parameters:
  • variant (Variant) – Variant object representing the target indel.

  • bam (pysam.AlignmentFile) – BAM file supplied as pysam.AlignmentFile object.

  • window (integer) – analyzes region input_indel_pos +/- window (defalt 50).

  • exclude_duplicates (bool) – True (default) to exclude reads marked duplicate by MarkDuplicate.

  • downsample_threshold (integer) – downsamples to the threshold if the covearge at the input locus exceeds the threshold (default to 1000).

  • mapping_quality_threshold (integer) – reads with a mapping quality (MAPQ) < the threshold (default 1) will be filtered.

  • base_quality_threshold (integer) – non-reference base-calls with a quality score < the threshold (default 20) will be masked as “N” in the analysis.

  • retarget_search_window (integer) – attempts to re-target in input_indel_pos +/- window (defalt 30) if the exact match is not found after realignment.

  • retarget_similarity_cutoff (float) – re-targets to the indel sequence with a highest Ratcliff/Obershelp similarity score > cutoff (default 0.7).

  • exact_match_for_shiftable (bool) – True to require exact (equivalent) match for indels that are equivalently alignable by shifting the position (default True)

  • match_score (integer) – score (default 3) for matched bases in Smith-Waterman local alignment.

  • mismatch_penalty (interger) – penalty (default 2) for mismatched bases in Smith-Waterman local alignment.

  • gap_open_penalty (integer) – penalty (default 3) to create a gap in Smith-Waterman local alignment.

  • gap_extension_penalty (integer) – penalty (default 1) to expend a created gap by one base in Smith-Waterman local alignment.

  • auto_adjust_extension_penalty (bool) – True (default) to auto-adjust the gap open and extend penalties to find the input indel by realignment.

  • no_realignment (bool) – True to only analyzed gap-aligned indels (default False)

count_alleles(self, fwrv=False, by_fragment=False, three_class=False, estimated_count=False, quality_window=None, quality_threshold=None)

returns a tuple of read counts: (#non_target reads, #target reads).

Parameters:
  • fwrv (bool) – breaks down to the forward and reverse read counts. ( (non_target-fw, non_target-rv), (target-fw, target-rv) ).

  • by_fragment (bool) – counts by fragment. Overlapping fw and rv reads are counted as one.

  • three_class (bool) – breaks down to (reference, non_reference_non_target, target) counts.

  • estimated_count (bool) – True to return estimated count when the coverage is higher than downsample_threshold.

  • quality_window (integer) – specifies the range of base call quality filter. indel pos +/- quality_window.

  • quality_threshold (integer) – filters reads with the median base call quality in indel pos +/- quality_window < quality_threshold.

fetch_reads(self, how='target')

returns a list of pysam.AlignedSegment objects, depending on the strategy specified.

Parameters:

how (string) – specifies the read fetch strategy. “target” (default) to retrieve reads supporting the target. “non_target” for reads w/o target. “covering” for reads covering the locus w/ and w/o the target.

get_contig(self)

returns Contig object built from reads supporting the target indel. FailedContig is returned if contig assembly is not successful.

get_target_indel(self)

returns Variant object representing the actual target indel analyzed. The target indel may be different from the input indel due to the internal realignment. NullVariant is returned when no target is found.

phase(self, how='local', local_threshold=20, longest_common_substring_threshold=15, indel_repeat_threshold=None, mutation_density_threshold=0.05)

returns a Variant object represeting a phased target indel. NullVariant is returned if no target is found. Phasing is limited to the exon where the target indel is located for RNA-Seq. Refer to get_phasables() to retrieve phasable variants across exons.

Parameters:
  • how (string) –

    • “local” (default) phasing by recursive elimination of longest common substrings (LCS) and locality scoring (see local_threshold).

    • ”greedy” phase all phasable events.

    • ”complex” phasing by the “local” option + exclusivity check. The exclusivity check removes phasable events that are also observed in non-target reads such as nearby homozygous germline events.

  • local_threshold (integer) – local (and complex) phasing method checks if phasable SNVs are locally clustered around the indel. For i-th base apart from indel, the score gains one if mismatch and, if match, loses 1*min(j/local_thresh, 1) where j = i for short indels (< 10-nt) and 0.6*i for longer indels.

  • longest_common_substring_threshold (integer) – removes common substrings between the reference and contig assembled from target reads that are longer than longest_common_substring_threshold (default 15).

  • indel_repeat_threshold (integer) – do not phase indels that repeat more than indel_repeat_threshold (default None)

  • mutation_density_threshold (float) – do not phase if the pileup contains too many mutations (possibly error-prone dirty region). In non-target reads, #non-ref bases/#all bases > mutation_density_threshold (default 0.05)

class indelpost.Contig

This class represents a consensus contig assembled from a subset (not all) of reads supporting the target indel.

get_alignment(self)

shows the contig alignment as namedtuple. ContigAlignment.chrom returns chromosome. ContigAlignment.aln returns an alignment dictionary as OrderedDict. The dictionary key is the position and the value is a tuple as (REF, ALT). ContigAlignment.spliced_intervals returns a list of spliced intervals involved in the contig. None if the region of interest is not spliced.

get_contig_seq(self, split=False)

returns the contig sequence.

Parameters:

split (bool) – splits the contig sequence at the target indel position.

get_phasables(self)

returns a list of Variant objects phasable with the target indel.

get_reference_seq(self, split=False)

returns the reference contig sequence.

Parameters:

split (bool) – splits the reference at the target indel position.

class indelpost.FailedContig

This class is returned when contig assembly has failed. Boolean expression evaluates to False.

Parameters:
  • target_not_found (bool) – True when contig was not constructed because the target indel was not found.

  • is_low_quality (bool) – True when contig was not constructed because the target indel was only suppported by low quality reads. Reads are defined low quality if 15% or more bases have quality score < base_quality_threshold.

  • failed_anyway (bool) – True when contig was not constructed due to the other reasons.