=======
PolyAna
=======

-------
Authors
-------
Smruthy Sivakumar, Dr. Jerry Fowler, Dr. Paul Scheet
UT MD Anderson Cancer Center
Houston, TX

------------
Introduction
------------

This Perl script provides a quality-control step for filtering
single nucleotide variants: it flags variants that are potentially
homopolymer-derived, based on their location in the reference genome.

It was originally developed to process Ion Torrent data in order to
remove any mutation that is within a homopolymer region. Ion
semiconductor technology detects an electrical signal caused by the
pH change from the release of a hydrogen ion when a nucleotide is
incorporated into a sequence. In regions of the genome that carry a
homopolymer, the hydrogen ion release increases in proportion to the
length of the homopolymer, so the voltage signal generated is
indicative of the homopolymer length. However, the technology
suffers from inaccurate estimation of homopolymer length, especially
beyond eight nucleotides. This results in false variant calls that
have to be removed during processing of such sequencing data.

We developed a quick and simple method, PolyAna, to filter
homopolymer-derived variants solely based on their position in the
reference genome. The minimum homopolymer length and the search window
are adjustable to control the filter’s stringency.
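
The position-based idea described above can be illustrated with a
short Python sketch. This is not the PolyAna implementation: the
function names, the "within"/"nearby"/"clear" labels, and the way the
window is measured are assumptions made for illustration only. The
defaults mirror the documented ones (minimum length 6, window 10).

```python
def homopolymer_runs(seq, min_length=6):
    """Return (start, end, base) for every run of a single base that is
    at least min_length long. Coordinates are 0-based, end exclusive."""
    seq = seq.upper()
    runs = []
    i = 0
    while i < len(seq):
        j = i
        # Extend j to the end of the current run of identical bases.
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        # Runs of N are skipped (an assumption of this sketch).
        if j - i >= min_length and seq[i] != "N":
            runs.append((i, j, seq[i]))
        i = j
    return runs


def classify_site(seq, pos, min_length=6, window=10):
    """Classify a 1-based position against homopolymer runs in seq.

    Hypothetical labels: 'within' (site inside a run), 'nearby' (a run
    overlaps the +/- window around the site), 'clear' (neither)."""
    idx = pos - 1
    lo = max(0, idx - window)
    hi = min(len(seq), idx + window + 1)
    label = "clear"
    for start, end, _base in homopolymer_runs(seq, min_length):
        if start <= idx < end:
            return "within"
        if start < hi and end > lo:
            label = "nearby"
    return label
```

Raising min_length or shrinking window makes such a filter less
stringent, since fewer sites are reported as homopolymer-associated.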

------
Syntax
------

perl PolyAna.pl --ref <reference_fasta> --variant_file <variant_file> [--homopolymer_min_length <min_length>]
[--window <window>] --output <output_file_name> [--help|-?]

OPTIONS:
  --ref <reference_fasta> : Provide path to the reference fasta file
  
  --variant_file <variant_file> : Provide path to the input variant file. The variant file must be tab-separated, with columns in the following order: Chr, Pos, Ref, Alt. You can include a header line starting with "#".
  
  --homopolymer_min_length <min_length> : Provide the minimum length to use as the cutoff for a homopolymer. (Default = 6)
  
  --window <window> : Provide the length of the window to check for homopolymers in the vicinity. (Default = 10)
  
  --output <output_file_name> : Provide full path and output file name
  
  --help|-? : Provides help for executing this script
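
The variant file layout described above (tab-separated Chr, Pos, Ref,
Alt, with an optional "#" header) can be illustrated with a small
Python reader. The function name read_variants is hypothetical, and
skipping every line that starts with "#" is an assumption of this
sketch, not a statement about how PolyAna.pl parses its input.

```python
import csv
import io


def read_variants(handle):
    """Read (chrom, pos, ref, alt) tuples from a tab-separated handle."""
    variants = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and header/comment lines starting with "#".
        if not row or row[0].startswith("#"):
            continue
        chrom, pos, ref, alt = row[:4]
        variants.append((chrom, int(pos), ref, alt))
    return variants


# Usage with an in-memory file; a real file would be opened with open().
example = io.StringIO("#Chr\tPos\tRef\tAlt\nchr1\t18961727\tG\tT\n")
variants = read_variants(example)
```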

-------------
Prerequisites
-------------

1) Perl 5.6.1 or higher
2) Getopt::Long
   Documentation: http://perldoc.perl.org/Getopt/Long.html
3) Bio::Perl 
   Installation instructions: http://bioperl.org/INSTALL.html
   Documentation: http://search.cpan.org/dist/BioPerl/Bio/Perl.pm
4) Bio::SeqIO
   Documentation: http://search.cpan.org/dist/BioPerl/Bio/SeqIO.pm


-------
Example
-------
Tested with perl v5.18.1, Getopt::Long 2.39, Bio::Perl 1.006001, Bio::SeqIO 1.006001


Example run: perl PolyAna.pl --variant_file example_input.txt --output example_output.txt --ref hg19.fasta


Example input:

chr1  18961727  G T
chr3  37067104  T A
chr3  47103877  A G
chr8  48732073  T A
chr15 40913096  A T

Example output:

CHR    POS       REF  ALT  POLYX  HP_LEN  HP_BASE  N_LEFT  N_RIGHT  OFFSET   CALL
chr1   18961727  G    T    1      NA      NA       T       C        NA       Potentially_Valid
chr3   37067104  T    A    21     21      T        A       A        [-4,16]  Homopolymer
chr3   47103877  A    G    2      10      T        T       A        1        Homopolymer(left)
chr8   48732073  T    A    2      10      A        T       A        1        Homopolymer(right)
chr15  40913096  A    T    4      4       A        T       T        0        Homopolymer(based_on_alt)

Description of output:

The output file contains one annotated row for every site listed in the input file.

CHR     : Chromosome location of the variant
POS     : Genomic position of the variant
REF     : Reference allele
ALT     : Alternate allele
POLYX   : Number of consecutive identical bases at the variant site
HP_LEN  : Homopolymer length (<int>, <NA> if not a homopolymer)
HP_BASE : Nucleotide base that is repeated within the homopolymer (<base(s)>, <NA> if not a homopolymer)
N_LEFT  : Base on the left of the homopolymer (<base(s)>)
N_RIGHT : Base on the right of the homopolymer (<base(s)>)
OFFSET  : Offset from the site to the homopolymer (<int>, [left, right] range if variant site is within a homopolymer)
CALL    : Final annotation call 
          Potentially_Valid - no homopolymer predicted in the neighborhood of the variant site
          Homopolymer - variant site within a homopolymer region
          Homopolymer(left/right/both/adjacent_left/adjacent_right) - homopolymer found immediately adjacent to or in
          the vicinity of the variant site (as specified by window size)
          Homopolymer(based_on_alt) - homopolymer predicted solely based on alt allele matching the homopolymer base
          Homopolymer_Stretch(left/right/adjacent_left/adjacent_right) - stretches of short homopolymers predicted
          immediately adjacent to or in the vicinity of the variant site (as specified by window size)
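
As a downstream illustration, the CALL column described above can be
used to keep only the sites annotated as Potentially_Valid. The Python
sketch below assumes the output is whitespace-delimited, that CALL is
the last column, and that the header row starts with "CHR"; the
function name keep_potentially_valid is hypothetical.

```python
def keep_potentially_valid(lines):
    """Return (chrom, pos, ref, alt) for rows whose CALL is Potentially_Valid."""
    kept = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row.
        if not fields or fields[0] == "CHR":
            continue
        if fields[-1] == "Potentially_Valid":
            kept.append(tuple(fields[:4]))
    return kept


# Rows taken from the example output in this README.
example_output = [
    "CHR POS REF ALT POLYX HP_LEN HP_BASE N_LEFT N_RIGHT OFFSET CALL",
    "chr1 18961727 G T 1 NA NA T C NA Potentially_Valid",
    "chr3 37067104 T A 21 21 T A A [-4,16] Homopolymer",
    "chr3 47103877 A G 2 10 T T A 1 Homopolymer(left)",
]
valid_sites = keep_potentially_valid(example_output)
```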

-------
Contact
-------

Email: SSivakumar@mdanderson.org

