I keep running into the same problem when analyzing small RNA-seq data: the adapter trimming step.
Overview of small RNAseq (Illumina)
- RNA is size fractionated using columns or PAGE
- 3' and 5' adapter ligation
- cDNA synthesis
- PCR amplification
- Sequencing
The read length depends on the machine, and recent ones like the HiSeq can provide ~200bp.
The problem, however, arises when the RNA insert is shorter than the maximum read length of the machine, which is common with small RNAs such as miRNAs. The read then runs past the insert into the 3' adapter, and if the small RNA plus the complete 3' adapter is longer than the maximum read length, only part of the adapter ends up in the read.
The first step of analysis is trimming of the 3' adapter (Illumina Truseq: TCGTATGCCGTCTTCTGCTTGT).
Several tools are available for this job. What they do is check for an overlap between the start of the adapter sequence and the 3' end of the read, and then clip the aligned region.
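To make the overlap check concrete, here is a minimal sketch in Python. It is not how any particular trimming tool is implemented: it assumes exact matching only (real trimmers usually tolerate some mismatches), and the function name `trim_3p_adapter` and the `min_overlap` parameter are made up for illustration. The example insert is the mature miR-21 sequence written as DNA.

```python
ADAPTER = "TCGTATGCCGTCTTCTGCTTGT"  # 3' adapter sequence quoted above


def trim_3p_adapter(read, adapter=ADAPTER, min_overlap=5):
    """Clip a 3' adapter from a read using exact matching only.

    Simplified illustration: no mismatches or indels are tolerated.
    """
    # Try every possible adapter start position, leftmost first.
    for start in range(len(read)):
        tail = read[start:]
        # Compare only the part of the adapter that fits in the read.
        overlap = min(len(tail), len(adapter))
        if overlap < min_overlap:
            # Remaining overlaps only get shorter, so stop here.
            break
        if tail[:overlap] == adapter[:overlap]:
            return read[:start]
    return read


mirna = "TAGCTTATCAGACTGATGTTGA"   # 22 nt insert
read = mirna + ADAPTER[:14]        # 36 nt read: insert plus partial adapter
print(trim_3p_adapter(read))       # adapter part is clipped

short_tail = mirna + ADAPTER[:3]   # only 3 adapter bases made it into the read
print(trim_3p_adapter(short_tail)) # left untouched with min_overlap=5
```

The second call shows the situation this post is about: with a minimum overlap of 5, a read carrying only 3 adapter bases is left unclipped.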
Now the problem is this:
You can't be really sure of very short alignments, because they may not actually originate from the adapter. This means you have to specify a minimum alignment length for clipping. I usually set it to 5 (chosen intuitively).
But if a piece of sequence shorter than that really did come from the adapter, it stays on the read, and there is no way to clip it with confidence.
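A rough way to put a number on that intuition: if the four bases were equally likely and independent (an assumption that real sequences do not strictly satisfy), the chance that the last k bases of an adapter-free read match the first k adapter bases is (1/4)^k. The helper name below is made up for illustration.

```python
def chance_match_probability(overlap, base_freq=0.25):
    """Probability that `overlap` bases match the adapter purely by chance.

    Assumes independent bases, each matching with probability `base_freq`
    (0.25 corresponds to uniform A/C/G/T usage, which is a simplification).
    """
    return base_freq ** overlap


for k in (3, 5, 8, 10):
    p = chance_match_probability(k)
    print(f"overlap {k:2d} nt: P(chance match) = {p:.2e}")
```

For an overlap of 5 this gives 1 in 1024, so under these assumptions roughly one adapter-free read in a thousand would be clipped by mistake, while any real adapter remnant of 4 nt or less stays on the read.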
The real trouble starts when aligning the reads to the reference sequence. Aligners like bowtie (which I prefer to use) generally take a user-defined argument for the number of allowed mismatches, and bowtie generally doesn't perform very well if you allow a lot of mismatches.
Leftover adapter bases show up as mismatches at the 3' end, so you might lose a really valuable read.
Alternative
To avoid this problem I sometimes trim the reads to around 25 nt (for miRNA profiling). This creates a new problem:
You can't really tell whether the read came from a pre-miRNA (a longer RNA) or from a mature miRNA (a smaller RNA that arises by processing of the pre-miRNA).
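A small sketch of why the length information disappears. The function name `hard_trim` is made up, and the flanking sequence used for the longer insert is an arbitrary placeholder, not real pre-miRNA sequence.

```python
ADAPTER = "TCGTATGCCGTCTTCTGCTTGT"   # 3' adapter sequence quoted above
MATURE = "TAGCTTATCAGACTGATGTTGA"    # 22 nt mature miR-21 (as DNA)
FLANK = "ACGTACGTACGTAC"             # placeholder flank, NOT real precursor sequence


def hard_trim(read, length=25):
    """Keep only the first `length` bases of a read."""
    return read[:length]


mature_read = MATURE + ADAPTER[:14]   # short insert followed by adapter
precursor_read = MATURE + FLANK       # longer insert that fills the whole read

for name, read in (("mature", mature_read), ("precursor", precursor_read)):
    trimmed = hard_trim(read)
    print(name, len(trimmed), trimmed)
```

Both reads come out at 25 nt and share the same first 22 bases, so read length no longer says anything about the size of the original RNA.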
Does anyone have experience with this, or an idea about how to solve it?