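The decoys are extra sequences added to the reference. In this file they consist mainly of human sequence that is missing from the GRCh37 primary assembly, alongside the Epstein-Barr virus genome (human herpesvirus 4, contig NC_007605). Their purpose is to act as a "sponge": a read that aligns better to a decoy than to the primary assembly is placed on the decoy, so it does not produce a spurious alignment on the chromosomes and can be filtered out later.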
    bwa mem -M -t 8 \
        human_g1k_v37_decoy.fasta \
        sample_R1.fastq.gz \
        sample_R2.fastq.gz \
        | samtools view -bS - \
        | samtools sort -o sample_aligned.bam -
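Once the sorted BAM exists, it can be useful to see how many reads the decoy actually absorbed. The sketch below parses the four-column output of `samtools idxstats` (contig name, length, mapped reads, unmapped reads). The function name `decoy_summary` is hypothetical, and the contig names `hs37d5` and `NC_007605` are assumptions about how the decoy and EBV contigs are named in your copy of the reference.

```shell
# Summarise how many mapped reads landed on the decoy and EBV contigs.
# Reads `samtools idxstats` output on stdin (tab-separated:
# contig name, contig length, mapped reads, unmapped reads).
# Assumption: the contigs of interest are named hs37d5 and NC_007605.
decoy_summary() {
    awk -F'\t' '
        $1 == "hs37d5" || $1 == "NC_007605" { decoy += $3 }
        { total += $3 }
        END {
            pct = (total > 0) ? 100 * decoy / total : 0
            printf "decoy_mapped=%d total_mapped=%d decoy_pct=%.2f\n", decoy, total, pct
        }'
}

# Usage (needs a coordinate-sorted, indexed BAM):
#   samtools index sample_aligned.bam
#   samtools idxstats sample_aligned.bam | decoy_summary
```

A small percentage is typical for human samples; an unusually large one may point to contamination or a sample that does not match the reference, though the threshold is a judgment call.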
While NCBI primarily hosts GRCh38, its legacy collection includes v37.
If you are about to download it, pause and ask whether you truly need GRCh37. If yes, this decoy version is the correct choice over plain hg19. If not, move to GRCh38.
Whether you are setting up a pipeline for whole-genome sequencing, configuring a GATK Best Practices workflow, or trying to reproduce legacy data, you will likely need this specific file. This article provides a deep dive into what this file is, why the "decoy" sequences matter, and a step-by-step guide on how to download human_g1k_v37_decoy.fasta safely and efficiently.
Download sources: FTP from 1000 Genomes, the GATK resource bundle, Illumina iGenomes, or NCBI.
After alignment, you should remove reads that mapped to decoys.
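One way to do this, sketched here under stated assumptions, is to list every contig except the decoy and EBV contigs from the reference's `.fai` index and pass those names to `samtools view` as regions. The function names are hypothetical, the contig names `hs37d5` and `NC_007605` are assumed, and the input BAM must be coordinate-sorted and indexed. Note that region-based extraction also drops reads that have no mapped position at all, and it does not rewrite the mate fields of reads whose mates were on a decoy.

```shell
# Print the names of all contigs in a .fai index except the decoy and EBV contigs.
# Assumption: those contigs are named hs37d5 and NC_007605.
non_decoy_contigs() {
    awk -F'\t' '$1 != "hs37d5" && $1 != "NC_007605" { print $1 }' "$1"
}

# Write a BAM that holds only reads placed on non-decoy contigs.
# Arguments: input BAM (sorted and indexed), reference .fai, output BAM.
drop_decoy_reads() {
    # Word splitting is intentional: each contig name becomes one region argument.
    # shellcheck disable=SC2046
    samtools view -b -o "$3" "$1" $(non_decoy_contigs "$2")
}

# Usage:
#   samtools index sample_aligned.bam
#   drop_decoy_reads sample_aligned.bam human_g1k_v37_decoy.fasta.fai sample_nodecoy.bam
```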
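Do not confuse this with hs37d5.fa, the GRCh37-plus-decoy reference released by the 1000 Genomes Project for its phase 2 analysis. human_g1k_v37_decoy.fasta is the name used for the copy in the Broad Institute's GATK resource bundle. They are compatible but not necessarily byte-identical (header lines can differ), so compare checksums or sequence dictionaries before treating one as a drop-in replacement for the other.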
The Broad Institute often provides these files via its Google Cloud Storage buckets or the legacy FTP servers associated with GATK. While the Broad has moved toward newer reference builds, the v37 decoy remains a staple in its "Resource Bundle."

How to Download via Command Line
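A download along these lines can be sketched as a resumable transfer followed by a checksum comparison. The function names are hypothetical, and `REFERENCE_URL` and `EXPECTED_MD5` in the usage comment are placeholders: take the real URL and digest from whichever source you choose. The sketch assumes `curl` and the GNU coreutils `md5sum` command are installed.

```shell
# Succeed only if the MD5 digest of file $1 equals the hex string in $2.
verify_md5() {
    [ "$(md5sum "$1" | awk '{ print $1 }')" = "$2" ]
}

# Download URL $1 to path $2, resuming a partial file if one exists,
# then compare the result against the expected MD5 digest in $3.
fetch_reference() {
    curl --fail --location --continue-at - --output "$2" "$1" || return 1
    verify_md5 "$2" "$3"
}

# Usage (placeholders, not real values):
#   fetch_reference "$REFERENCE_URL" human_g1k_v37_decoy.fasta.gz "$EXPECTED_MD5" \
#       && echo "download verified"
```

Resuming matters here because the compressed reference is several hundred megabytes or more, and a checksum comparison catches silently truncated files before they reach an indexing step.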
After downloading, you must prepare the file for use in bioinformatics pipelines like GATK or BWA.

Decompress:
    gunzip hs37d5.fa.gz

Index for SAMtools:
    samtools faidx hs37d5.fa

Create a Sequence Dictionary (for GATK/Picard):
    gatk CreateSequenceDictionary -R hs37d5.fa

Index for BWA (if aligning):
    bwa index hs37d5.fa

4. Key Differences to Note
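Unlike the GRCh38 analysis sets, this GRCh37-based file carries no ALT contigs. It includes only the primary assembly (chromosomes, mitochondrial genome, and unplaced or unlocalized scaffolds) plus the EBV and decoy sequences.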
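To check this against your own copy, the contig names in the `.fai` index can be grouped by type. This is a minimal sketch: the function name `contig_classes` is hypothetical, and the name patterns (plain `1` to `22`, `X`, `Y`, `MT`, `GL…` scaffolds, `NC_007605`, `hs37d5`) are assumptions based on b37-style naming.

```shell
# Count contigs by category from a samtools .fai index (path in $1).
# Assumption: the file uses b37-style contig names.
contig_classes() {
    awk -F'\t' '
        $1 ~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/ { n["chromosome"]++; next }
        $1 == "MT"                          { n["mitochondrial"]++; next }
        $1 ~ /^GL[0-9]+\.[0-9]+$/           { n["unplaced_or_unlocalized"]++; next }
        $1 == "NC_007605"                   { n["ebv"]++; next }
        $1 == "hs37d5"                      { n["decoy"]++; next }
                                            { n["other"]++ }
        END {
            split("chromosome mitochondrial unplaced_or_unlocalized ebv decoy other", order, " ")
            for (i = 1; i <= 6; i++) printf "%s\t%d\n", order[i], n[order[i]]
        }' "$1"
}

# Usage:
#   contig_classes human_g1k_v37_decoy.fasta.fai
```

A non-zero `other` count suggests the file is not a b37-style reference, for example an hg19 build with `chr`-prefixed names or haplotype contigs.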