CRC Wiki
CRC Wiki
Log in

Gatk

From CRC Wiki

General Description

GATK version 3.3

Basic Usage

module load bio #to load all modules
module load bio/gatk/3.3 #to load GATK alone

Useful Options


Further Information

To run bio module of GATK please type GenomeAnalysisTK, after loading.

See the official website: [http:// ] The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 Copyright (c) 2010 The Broad Institute For support and documentation go to http://www.broadinstitute.org/gatk



usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-args <arg_file>] [-I <input_file>] [-rbs <read_buffer_size>] [-et

      <phone_home>] [-K <gatk_key>] [-tag <tag>] [-rf <read_filter>] [-L <intervals>] [-XL <excludeIntervals>] [-isr 
      <interval_set_rule>] [-im <interval_merging>] [-ip <interval_padding>] [-R <reference_sequence>] [-ndrs] [-maxRuntime 
      <maxRuntime>] [-maxRuntimeUnits <maxRuntimeUnits>] [-dt <downsampling_type>] [-dfrac <downsample_to_fraction>] [-dcov 
      <downsample_to_coverage>] [-baq <baq>] [-baqGOP <baqGapOpenPenalty>] [-fixNDN] [-fixMisencodedQuals] 
      [-allowPotentiallyMisencodedQuals] [-OQ] [-DBQ <defaultBaseQualities>] [-PF <performanceLog>] [-BQSR <BQSR>] [-DIQ] 
      [-EOQ] [-preserveQ <preserve_qscores_less_than>] [-globalQScorePrior <globalQScorePrior>] [-S <validation_strictness>] 
      [-rpr] [-kpr] [-sample_rename_mapping_file <sample_rename_mapping_file>] [-U <unsafe>] [-sites_only] [-writeFullFormat] 
      [-compress <bam_compression>] [-simplifyBAM] [--disable_bam_indexing] [--generate_md5] [-nt <num_threads>] [-nct 
      <num_cpu_threads_per_data_thread>] [-mte] [-bfh <num_bam_file_handles>] [-rgbl <read_group_black_list>] [-ped 
      <pedigree>] [-pedString <pedigreeString>] [-pedValidationType <pedigreeValidationType>] [-variant_index_type 
      <variant_index_type>] [-variant_index_parameter <variant_index_parameter>] [-l <logging_level>] [-log <log_to_file>] 
      [-h] [-version]
-T,--analysis_type <analysis_type>                                                      Name of the tool to run
-args,--arg_file <arg_file>                                                             Reads arguments from the 
                                                                                        specified file
-I,--input_file <input_file>                                                            Input file containing sequence 
                                                                                        data (SAM or BAM)
-rbs,--read_buffer_size <read_buffer_size>                                              Number of reads per SAM file to 
                                                                                        buffer in memory
-et,--phone_home <phone_home>                                                           Run reporting mode (NO_ET|AWS|
                                                                                        STDOUT)
-K,--gatk_key <gatk_key>                                                                GATK key file required to run 
                                                                                        with -et NO_ET
-tag,--tag <tag>                                                                        Tag to identify this GATK run 
                                                                                        as part of a group of runs
-rf,--read_filter <read_filter>                                                         Filters to apply to reads 
                                                                                        before analysis
-L,--intervals <intervals>                                                              One or more genomic intervals 
                                                                                        over which to operate
-XL,--excludeIntervals <excludeIntervals>                                               One or more genomic intervals 
                                                                                        to exclude from processing
-isr,--interval_set_rule <interval_set_rule>                                            Set merging approach to use for 
                                                                                        combining interval inputs 
                                                                                        (UNION|INTERSECTION)
-im,--interval_merging <interval_merging>                                               Interval merging rule for 
                                                                                        abutting intervals (ALL|
                                                                                        OVERLAPPING_ONLY)
-ip,--interval_padding <interval_padding>                                               Amount of padding (in bp) to 
                                                                                        add to each interval
-R,--reference_sequence <reference_sequence>                                            Reference sequence file
-ndrs,--nonDeterministicRandomSeed                                                      Use a non-deterministic random 
                                                                                        seed
-maxRuntime,--maxRuntime <maxRuntime>                                                   Stop execution cleanly as soon 
                                                                                        as maxRuntime has been reached
-maxRuntimeUnits,--maxRuntimeUnits <maxRuntimeUnits>                                    Unit of time used by maxRuntime 
                                                                                        (NANOSECONDS|MICROSECONDS|
                                                                                        MILLISECONDS|SECONDS|MINUTES|
                                                                                        HOURS|DAYS)
-dt,--downsampling_type <downsampling_type>                                             Type of read downsampling to 
                                                                                        employ at a given locus (NONE|
                                                                                        ALL_READS|BY_SAMPLE)
-dfrac,--downsample_to_fraction <downsample_to_fraction>                                Fraction of reads to downsample 
                                                                                        to
-dcov,--downsample_to_coverage <downsample_to_coverage>                                 Target coverage threshold for 
                                                                                        downsampling to coverage
-baq,--baq <baq>                                                                        Type of BAQ calculation to 
                                                                                        apply in the engine (OFF|
                                                                                        CALCULATE_AS_NECESSARY|
                                                                                        RECALCULATE)
-baqGOP,--baqGapOpenPenalty <baqGapOpenPenalty>                                         BAQ gap open penalty
-fixNDN,--refactor_NDN_cigar_string                                                     refactor cigar string with NDN 
                                                                                        elements to one element
-fixMisencodedQuals,--fix_misencoded_quality_scores                                     Fix mis-encoded base quality 
                                                                                        scores
-allowPotentiallyMisencodedQuals,--allow_potentially_misencoded_quality_scores          Ignore warnings about base 
                                                                                        quality score encoding
-OQ,--useOriginalQualities                                                              Use the base quality scores 
                                                                                        from the OQ tag
-DBQ,--defaultBaseQualities <defaultBaseQualities>                                      Assign a default base quality
-PF,--performanceLog <performanceLog>                                                   Write GATK runtime performance 
                                                                                        log to this file
-BQSR,--BQSR <BQSR>                                                                     Input covariates table file for 
                                                                                        on-the-fly base quality score 
                                                                                        recalibration
-DIQ,--disable_indel_quals                                                              Disable printing of base 
                                                                                        insertion and deletion tags 
                                                                                        (with -BQSR)
-EOQ,--emit_original_quals                                                              Emit the OQ tag with the 
                                                                                        original base qualities (with 
                                                                                        -BQSR)
-preserveQ,--preserve_qscores_less_than <preserve_qscores_less_than>                    Don't recalibrate bases with 
                                                                                        quality scores less than this 
                                                                                        threshold (with -BQSR)
-globalQScorePrior,--globalQScorePrior <globalQScorePrior>                              Global Qscore Bayesian prior to 
                                                                                        use for BQSR
-S,--validation_strictness <validation_strictness>                                      How strict should we be with 
                                                                                        validation (STRICT|LENIENT|
                                                                                        SILENT)
-rpr,--remove_program_records                                                           Remove program records from the 
                                                                                        SAM header
-kpr,--keep_program_records                                                             Keep program records in the SAM 
                                                                                        header
-sample_rename_mapping_file,--sample_rename_mapping_file <sample_rename_mapping_file>   Rename sample IDs on-the-fly at 
                                                                                        runtime using the provided 
                                                                                        mapping file
-U,--unsafe <unsafe>                                                                    Enable unsafe operations: 
                                                                                        nothing will be checked at 
                                                                                        runtime (ALLOW_N_CIGAR_READS|
                                                                                        ALLOW_UNINDEXED_BAM|
                                                                                        ALLOW_UNSET_BAM_SORT_ORDER|
                                                                                        NO_READ_ORDER_VERIFICATION|
                                                                                        ALLOW_SEQ_DICT_INCOMPATIBILITY|
                                                                                        LENIENT_VCF_PROCESSING|ALL)
-sites_only,--sites_only                                                                Just output sites without 
                                                                                        genotypes (i.e. only the first 
                                                                                        8 columns of the VCF)
-writeFullFormat,--never_trim_vcf_format_field                                          Always output all the records 
                                                                                        in VCF FORMAT fields, even if 
                                                                                        some are missing
-compress,--bam_compression <bam_compression>                                           Compression level to use for 
                                                                                        writing BAM files (0 - 9, 
                                                                                        higher is more compressed)
-simplifyBAM,--simplifyBAM                                                              If provided, output BAM files 
                                                                                        will be simplified to include 
                                                                                        just key reads for downstream 
                                                                                        variation discovery analyses 
                                                                                        (removing duplicates, PF-, 
                                                                                        non-primary reads), as well 
                                                                                        stripping all extended tags 
                                                                                        from the kept reads except the 
                                                                                        read group identifier
--disable_bam_indexing                                                                  Turn off on-the-fly creation of 
                                                                                        indices for output BAM files.
--generate_md5                                                                          Enable on-the-fly creation of 
                                                                                        md5s for output BAM files.
-nt,--num_threads <num_threads>                                                         Number of data threads to 
                                                                                        allocate to this analysis
-nct,--num_cpu_threads_per_data_thread <num_cpu_threads_per_data_thread>                Number of CPU threads to 
                                                                                        allocate per data thread
-mte,--monitorThreadEfficiency                                                          Enable threading efficiency 
                                                                                        monitoring
-bfh,--num_bam_file_handles <num_bam_file_handles>                                      Total number of BAM file 
                                                                                        handles to keep open 
                                                                                        simultaneously
-rgbl,--read_group_black_list <read_group_black_list>                                   Exclude read groups based on 
                                                                                        tags
-ped,--pedigree <pedigree>                                                              Pedigree files for samples
-pedString,--pedigreeString <pedigreeString>                                            Pedigree string for samples
-pedValidationType,--pedigreeValidationType <pedigreeValidationType>                    Validation strictness for 
                                                                                        pedigree information (STRICT|
                                                                                        SILENT)
-variant_index_type,--variant_index_type <variant_index_type>                           Type of IndexCreator to use for 
                                                                                        VCF/BCF indices (DYNAMIC_SEEK|
                                                                                        DYNAMIC_SIZE|LINEAR|INTERVAL)
-variant_index_parameter,--variant_index_parameter <variant_index_parameter>            Parameter to pass to the 
                                                                                        VCF/BCF IndexCreator
-l,--logging_level <logging_level>                                                      Set the minimum level of 
                                                                                        logging
-log,--log_to_file <log_to_file>                                                        Set the logging location
-h,--help                                                                               Generate the help message
-version,--version                                                                      Output version information
annotator                       
  VariantAnnotator              Annotates variant calls with context information.
                                
beagle                          
  BeagleOutputToVCF             Takes files produced by Beagle imputation engine and creates a vcf with modified 
                                annotations.
  ProduceBeagleInput            Converts the input VCF into a format accepted by the Beagle imputation/analysis 
                                program.
  VariantsToBeagleUnphased      Produces an input file to Beagle imputation engine, listing unphased, hard-called 
                                genotypes for a single sample in input variant file.
                                
bqsr                            
  AnalyzeCovariates             Tool to analyze and evaluate base recalibration tables.
  BaseRecalibrator              First pass of the base quality score recalibration -- Generates recalibration table 
                                based on various user-specified covariates (such as read group, reported quality score, 
                                machine cycle, and nucleotide context).
                                
coverage                        
  CallableLoci                  Emits a data file containing information about callable, uncallable, poorly mapped, and 
                                other parts of the genome <p/>
  CompareCallableLoci           Test routine for new VariantContext object
  DepthOfCoverage               Assess sequence coverage by a wide array of metrics, partitioned by sample, read group, 
                                or library
  GCContentByInterval           Walks along reference and calculates the GC content for each interval.
                                
diagnosetargets                 
  DiagnoseTargets               Analyzes coverage distribution and validates read mates for a given interval and 
                                sample.
                                
diagnostics                     
  BaseCoverageDistribution      Simple walker to plot the coverage distribution per base
  CoveredByNSamplesSites        Print intervals file with all the variant sites for which most of the samples have good 
                                coverage
  ErrorRatePerCycle             Compute the read error rate per position
  FindCoveredIntervals          Outputs a list of intervals that are covered above a given threshold.
  ReadGroupProperties           Emits a GATKReport containing read group, sample, library, platform, center, sequencing 
                                data, paired end status, simple read type name (e.g.
  ReadLengthDistribution        Outputs the read lengths of all the reads in a file.
                                
diffengine                      
  DiffObjects                   A generic engine for comparing tree-structured objects
                                
examples                        
  GATKPaperGenotyper            A simple Bayesian genotyper, that outputs a text based call format.
                                
fasta                           
  FastaAlternateReferenceMaker  Generates an alternative reference sequence over the specified interval.
  FastaReferenceMaker           Renders a new reference in FASTA format consisting of only those loci provided in the 
                                input data set.
  FastaStats                    Calculate basic statistics about the reference sequence itself
                                
filters                         
  VariantFiltration             Filters variant calls using a number of user-selectable, parameterizable criteria.
                                
genotyper                       
  UnifiedGenotyper              A variant caller which unifies the approaches of several disparate callers -- Works for 
                                single-sample and multi-sample data.
                                
haplotypecaller                 
  HaplotypeCaller               Call SNPs and indels simultaneously via local re-assembly of haplotypes in an active 
                                region.
  HaplotypeResolver             Haplotype-based resolution of variants in 2 different eval files.
                                
indels                          
  IndelRealigner                Performs local realignment of reads to correct misalignments due to the presence of 
                                indels.
  LeftAlignIndels               Left-aligns indels from reads in a bam file.
  RealignerTargetCreator        Emits intervals for the Local Indel Realigner to target for realignment.
                                
missing                         
  QualifyMissingIntervals       Walks along reference and calculates a few metrics for each interval.
                                
phasing                         
  PhaseByTransmission           Computes the most likely genotype combination and phases trios and parent/child pairs
  ReadBackedPhasing             Walks along all variant ROD loci, caching a user-defined window of VariantContext 
                                sites, and then finishes phasing them when they go out of range (using upstream and 
                                downstream reads).
                                
qc                              
  CheckPileup                   Compare GATK's internal pileup to a reference Samtools pileup
  CountBases                    Walks over the input data set, calculating the number of bases seen for diagnostic 
                                purposes.
  CountIntervals                Count contiguous regions in an interval list.
  CountLoci                     Walks over the input data set, calculating the total number of covered loci for 
                                diagnostic purposes.
  CountMales                    Walks over the input data set, calculating the number of reads seen from male samples 
                                for diagnostic purposes.
  CountReadEvents               Walks over the input data set, counting the number of read events (from the CIGAR 
                                operator)
  CountReads                    Walks over the input data set, calculating the number of reads seen for diagnostic 
                                purposes.
  CountRODs                     Prints out counts of the number of reference ordered data objects encountered.
  CountRODsByRef                Prints out counts of the number of reference ordered data objects encountered along the 
                                reference.
  CountTerminusEvent            Walks over the input data set, counting the number of reads ending in 
                                insertions/deletions or soft-clips
  ErrorThrowing                 A walker that simply throws errors.
  FlagStat                      A reimplementation of the 'samtools flagstat' subcommand in the GATK
  Pileup                        Emulates the samtools pileup command to print aligned reads
  PrintRODs                     Prints out all of the RODs in the input data set.
  QCRef                         Quality control for the reference fasta
  ReadClippingStats             Read clipping statistics for all reads.
                                
readutils                       
  ClipReads                     Read clipping based on quality, position or sequence matching
  PrintReads                    Renders, in SAM/BAM format, all reads from the input data set in the order in which 
                                they appear in the input file.
  ReadAdaptorTrimmer            Utility tool to blindly strip base adaptors.
  SplitSamFile                  Divides the input data set into separate BAM files, one for each sample in the input 
                                data set.
                                
rnaseq                          
  SplitNCigarReads              Splits reads that contain Ns in their cigar string (e.g.
                                
simulatereads                   
  SimulateReadsForVariants      Generates simulated reads for variants
                                
validation                      
  GenotypeAndValidate           Genotypes a dataset and validates the calls of another dataset using the Unified 
                                Genotyper.
                                
validationsiteselector          
  ValidationSiteSelector        Randomly selects VCF records according to specified options.
                                
varianteval                     
  VariantEval                   General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv 
                                ratios, and a lot more)
                                
variantrecalibration            
  ApplyRecalibration            Applies cuts to the input vcf file (by adding filter lines) to achieve the desired 
                                novel truth sensitivity levels which were specified during VariantRecalibration
  VariantRecalibrator           Create a Gaussian mixture model by looking at the annotations values over a high 
                                quality subset of the input call set and then evaluate all input variants.
                                
variantutils                    
  CalculateGenotypePosteriors   Calculates genotype posterior likelihoods given panel data
  CombineGVCFs                  Combines any number of gVCF files that were produced by the Haplotype Caller into a 
                                single joint gVCF file.
  CombineVariants               Combines VCF records from different sources.
  FilterLiftedVariants          Filters a lifted-over VCF file for ref bases that have been changed.
  GenotypeConcordance           Genotype concordance (per-sample and aggregate counts and frequencies, NRD/NRS and site 
                                allele overlaps) between two callsets
  GenotypeGVCFs                 Genotypes any number of gVCF files that were produced by the Haplotype Caller into a 
                                single joint VCF file.
  LeftAlignAndTrimVariants      Left-aligns indels from a variants file.
  LiftoverVariants              Lifts a VCF file over from one build to another.
  RandomlySplitVariants         Takes a VCF file, randomly splits variants into two different sets, and outputs 2 new 
                                VCFs with the results.
  RegenotypeVariants            Regenotypes the variants from a VCF.
  SelectHeaders                 Selects headers from a VCF source.
  SelectVariants                Selects variants from a VCF source.
  ValidateVariants              Validates a VCF file with an extra strict set of criteria.
  VariantsToAllelicPrimitives   Takes alleles from a variants file and breaks them up (if possible) into more 
                                basic/primitive alleles.
  VariantsToBinaryPed           Converts a VCF file to a binary plink Ped file (.bed/.bim/.fam)
  VariantsToTable               Emits specific fields from a VCF file to a tab-deliminated table
  VariantsToVCF                 Converts variants from other file formats to VCF format.
  VariantValidationAssessor     Annotates a validation (from Sequenom for example) VCF with QC metrics (HW-equilibrium, 
                                % failed probes)