Bioinformatics for Beginners – File Formats Part 3. – Alignments
The most commonly used file formats for sequence-based alignments are SAM and BAM. These files can contain information about mapped and unmapped reads, the contigs of the reference sequence that was used, and more, such as the aligner and the parameters it was run with.
SAM
The SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
A SAM file has two sections:
- Header section:
- The header section is not mandatory, but most NGS software requires it.
- It contains information about five main topics:
- alignment file: format version, sorting;
- reference sequence(s): e.g. name, length, species, url;
- read group: sequencing lane, sample, sequencing center, library etc.;
- program: aligner name and version, parameters used for the alignment;
- custom comment(s).
- Each line of the header section starts with ‘@’ followed by a two-letter record type code.
- Alignment section:
- Every read in the alignment (and sometimes each unmapped read too) is represented by one row consisting of tab-delimited fields (basically columns).
- If a read is mapped to more than one location, every mapping will have its own row in the SAM file.
- There are 11 mandatory fields in each row:
- read name
- bitwise flag (it codes information about the read e.g. mapped/unmapped, paired/not paired, mapped to forward/reverse strand etc.) -> for a “flag decoder”, see here
- reference sequence name
- starting position of the mapped read on the reference sequence (1-based, leftmost mapped base)
- mapping quality
- CIGAR string (this is basically a short description of the alignment)
- reference name for the mate (for paired data)
- position of the mate (for paired data)
- observed template length, i.e. the distance spanned by the read pair on the reference (for paired data)
- nucleotide sequence of the read
- per base quality of the read
- there are several optional fields; for these, see the format specification.
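To make the field list above more concrete, here is a minimal Python sketch that splits one alignment row into the 11 mandatory fields plus any optional ones. The field names follow the SAM specification (QNAME, FLAG, RNAME and so on); the `SamRecord` class, the `parse_alignment_line` function and the example row are made up for illustration and are not taken from any library.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str           # read name
    flag: int            # bitwise flag
    rname: str           # reference sequence name
    pos: int             # 1-based leftmost mapping position
    mapq: int            # mapping quality
    cigar: str           # CIGAR string
    rnext: str           # reference name for the mate
    pnext: int           # position of the mate
    tlen: int            # observed template length
    seq: str             # nucleotide sequence of the read
    qual: str            # per base quality of the read
    optional: List[str]  # any optional TAG:TYPE:VALUE fields, left unparsed


def parse_alignment_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment row into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(
            f"expected at least 11 tab-delimited fields, got {len(fields)}"
        )
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        optional=fields[11:],
    )


# Illustrative row (hypothetical data); '*' means the qualities are not stored.
example = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
record = parse_alignment_line(example)
```

For real work a dedicated library is usually the safer choice; the sketch only shows how the columns line up with the list of fields.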
For a short(ish) introduction with some examples, see here.
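The bitwise flag is easier to understand with a small decoder. The sketch below tests each bit defined in the SAM specification and collects a description for every bit that is set. The `FLAG_BITS` table and the `decode_flag` function are illustrative names, and the description strings are paraphrased rather than quoted from the specification.

```python
from typing import List

# Bit values as defined in the SAM specification; descriptions are paraphrased.
FLAG_BITS = {
    0x1: "read is paired",
    0x2: "read mapped in a proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first read in pair",
    0x80: "second read in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """Return a description for every bit that is set in the flag."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
meaning_of_99 = decode_flag(99)
```

A flag of 99 therefore describes a paired read, mapped in a proper pair, whose mate is on the reverse strand and which is the first read of the pair.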
Very simple header example:
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
The same example with explanations:
@HD <- this just means that we have a header line
VN:1.3 <- the file format version is 1.3
SO:coordinate <- reads are sorted by mapping coordinate
@SQ <- in this row, we have information about (one of) the reference sequence(s)
SN:ref <- the reference sequence is named ‘ref’
LN:45 <- the reference sequence is 45 bp long
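The same header can also be read programmatically. The following Python sketch assumes the header fields are separated by tabs, as they are in a real SAM file, and turns each line into a record type plus a dictionary of tags. The function name `parse_header_line` and the `"text"` key used for comment lines are choices made for this example only.

```python
from typing import Dict, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a header line into its record type and its TAG:VALUE pairs."""
    if not line.startswith("@"):
        raise ValueError("header lines must start with '@'")
    parts = line.rstrip("\n").split("\t")
    record_type = parts[0][1:]  # drop the leading '@'
    if record_type == "CO":
        # Comment lines hold free text instead of TAG:VALUE pairs.
        return record_type, {"text": "\t".join(parts[1:])}
    tags = {}
    for part in parts[1:]:
        # Split at the first ':' only, so values such as URLs stay intact.
        tag, _, value = part.partition(":")
        tags[tag] = value
    return record_type, tags


header_lines = ["@HD\tVN:1.3\tSO:coordinate", "@SQ\tSN:ref\tLN:45"]
parsed = [parse_header_line(line) for line in header_lines]
```

All values are kept as strings here; converting `LN` to an integer would be a natural next step if the length is needed for calculations.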
Very simple read example:
read quality: 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
File extension: .sam
BAM
A BAM file contains the same information as the SAM file, but stores the data in a compressed, binary form.
File extension: .bam
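Because a BAM file is compressed and binary, it cannot be inspected with a text editor. BAM uses a gzip-compatible block compression (BGZF), and the decompressed data starts with the four magic bytes `BAM\1`, so a rough check is possible with Python's standard `gzip` module. This is only a sketch: `looks_like_bam` is a hypothetical helper, and the sample bytes below are ordinary gzip data standing in for a real BAM file.

```python
import gzip
import io
import zlib

BAM_MAGIC = b"BAM\x01"


def looks_like_bam(raw: bytes) -> bool:
    """Return True if the bytes decompress to data starting with the BAM magic."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError, zlib.error):
        # Not gzip-compressed at all (e.g. a plain-text SAM file) or corrupted.
        return False


# Stand-in data: gzip-compressed bytes that begin with the magic string.
fake_bam = gzip.compress(BAM_MAGIC + b"\x00" * 4)
plain_sam = b"@HD\tVN:1.3\tSO:coordinate\n"
```

With a file on disk, the same idea would be to read its bytes in binary mode and pass them to the function; actually reading the alignments is better left to a dedicated tool or library.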