Sequencing bias: comparison of different protocols of MicroRNA library construction
- Geng Tian†1, 2, 4,
- XuYang Yin†3, 4,
- Hong Luo4,
- XiaoHong Xu4,
- Lars Bolund4, 5 and
- XiuQing Zhang4
© Tian et al; licensee BioMed Central Ltd. 2010
Received: 7 August 2009
Accepted: 6 September 2010
Published: 6 September 2010
MicroRNAs (miRNAs) are 18-25 nt small RNAs that play critical roles in many biological processes. The majority of known miRNAs were discovered by conventional cloning and Sanger sequencing. Next-generation sequencing (NGS) technologies enable in-depth characterization of the global repertoire of miRNAs, and different protocols for miRNA library construction have been developed. However, the possible bias in relative expression levels and sequences introduced by different library preparation protocols has rarely been explored.
We assessed three different miRNA library preparation protocols, SOLiD and Illumina versions 1 and 1.5, using cloning or SBS sequencing of total RNA samples extracted from skeletal muscles of Hu sheep and Dorper sheep, and then validated 9 miRNAs by qRT-PCR. Our results show that SBS sequencing data correlate highly with Illumina cloning data. Compared with the Illumina data, the SOLiD data show a more dispersed length distribution, more frequent variation of nucleotides near the 3'- and 5'-ends, a higher frequency of reads containing end secondary structure (ESS), and a higher frequency of reads that do not map to known miRNAs. qRT-PCR results showed the best correlation with SOLiD cloning data. The fold differences between Hu sheep and Dorper sheep measured by qRT-PCR and by SBS sequencing correlated well (r = 0.937), and the fold differences of miR-1 and miR-206 were similar among SOLiD cloning, qRT-PCR and SBS sequencing data.
Sequencing depth can influence the quantitative measurement of miRNA abundance, but the discrepancy it caused was not statistically significant, as a high correlation was observed between Illumina cloning and SBS sequencing data. Bias of length distribution, sequence variation, and ESS was observed between data obtained with the different protocols. SOLiD cloning data differ from Illumina cloning data mainly because of distinct methods of adapter ligation. The good correlation between qRT-PCR results and SOLiD data might be due to the similarities of the hybridization-based methods. The fold difference analysis indicated that methods based on hybridization may be superior for quantitative measurement of miRNA abundance. Because the sheep genome sequence is not available, our data may not fully explain the bias affecting natural miRNAs in sheep or miRNA expression in other mammals; unbiased, artificially synthesized miRNAs would help in evaluating methods of miRNA library preparation.
MicroRNAs (miRNAs) are an abundant group of small RNAs ranging in length from 18 to 25 nucleotides, averaging 22 nucleotides, that perform post-transcriptional regulation of the expression of genes involved in a wide variety of biological processes. The complex biogenesis of mature miRNAs has recently been reviewed . Sequences of more than 10883 miRNAs have been deposited in the miRBase database [2, 3], the majority of them having been discovered by the traditional cloning approach. Bioinformatics predictions with experimental validation indicate that the total number of miRNAs is significantly higher than previously estimated . It is essential to characterize the whole repertoire of miRNAs and to fully understand their integrated expression patterns. Next-generation sequencing (NGS) techniques enable these efforts at lower cost and have been applied in miRNA studies in many species of animals, plants and viruses.
Sample preparation is of major importance for NGS, and assessing the quality of a library preparation by cloning validation before sequencing is necessary . Different commercial protocols for miRNA library preparation have been developed. Illumina, Inc. published a miRNA sample preparation protocol (V1) for SBS sequencing in 2007, which requires a minimum of 4 days. In 2009 Illumina, Inc. proposed an alternative protocol (V1.5) which requires only one day of sample preparation. Applied Biosystems, Inc. developed a miRNA library preparation protocol for the SOLiD (Sequencing by Oligonucleotide Ligation and Detection) system, which also requires a one-day procedure, but its adapter ligation principle is based on hybridization. These protocols can be applied with all current sequencing techniques, though the downstream procedures can vary.
The cloning frequency of an individual miRNA should generally reflect its relative abundance in a sample, and the novel NGS methods offering a much richer source of sequence information should provide more accurate quantitative expression measurements . However, in reality biases caused by sample preparation cannot be avoided, sometimes leading to inaccurate conclusions. A systematic bias in the cloning protocol has previously been detected: miRNA clone counts did not correlate well with their concentrations in the pool . Biased cloning efficiencies were also observed for two different miRNAs from the same cluster, leading to discrepancies between cloning frequency and small RNA blot results .
Different protocols of library preparation may influence the cloning frequency significantly. The adapter ligation efficiency can be affected by the 5'- and 3'-end nucleotides or the secondary structure of miRNAs, and the number of polymerase chain reaction cycles or gel isolation procedures may also influence the results. In this article we compared sequencing data of libraries constructed by the above-mentioned three different protocols, and validated some results by qRT-PCR using stem-loop primers . Bias of length, sequence variation, and ESS were observed for all three protocols. Based on our data, we suggest that methods such as SOLiD and qRT-PCR, based on hybridization, may provide better quantitative measurement of miRNA abundance.
Results and discussion
Statistics for cloning and Illumina SBS sequencing libraries
The libraries constructed from Dorper sheep and Hu sheep, using the Illumina V1 protocol, were used for Illumina SBS sequencing. About 6 million raw reads were obtained for each library. Eighty-four percent of the Hu sheep reads mapped to known Ovis aries sequences, and 111,078 unique reads were observed. For Dorper sheep, 82% of the reads mapped to known Ovis aries sequences, and 147,044 unique reads were observed. About 5.8 and 5.6 million reads were obtained after adapter removal for Dorper sheep and Hu sheep, respectively. Of these reads, 82% and 84% for Dorper sheep and Hu sheep, respectively, have previously been annotated as known RNAs (rRNA, tRNA, snRNA etc., incl. miRNA) or repeat regions, or are contained within the boundaries of protein-coding genes. The number of reads annotated as known microRNAs was 4,812,498 and 4,904,192 for Dorper sheep and Hu sheep, respectively.
Length distribution for libraries
Different protocols generate different bias of sequence variation and end secondary structure (ESS)
The sequence variations of miR-1 and miR-206 in our data were assessed with the WebLogo tool (Additional files 1, 2, 3, 4, 5, and 6). The sequences obtained by SOLiD cloning display higher diversity than those from Illumina. We observed clearly more frequent variation of nucleotides near the 3'-end. The adenine and thymine at the 3'-end of miR-1 were truncated in the majority of SOLiD cloning sequences. Variations of 5'-end nucleotides were also found in SOLiD cloning data, but were rare in all the data from Illumina protocols. Comparing the sequences obtained by the two Illumina protocol versions, we observed generally high conservation, though nucleotides near the 3'-end showed slightly more diversity with the Illumina V1.5 protocol.
Positions 16 and 17 of miR-1 show lower conservation and more diversity in SOLiD cloning sequences (Additional files 1 and 4). The same phenomenon can be observed at positions 17 and 18 of miR-206 in SOLiD data, but not in Illumina data (Additional files 2 and 5). All of these positions lie near the 3'-end, whereas the positions near the 5'-end were highly conserved in both SOLiD and Illumina data. The diversity of the positions near the 3'-end may be caused by the hybridization-based adapter ligation in the SOLiD protocol.
Bias of ESS also existed between the two versions of Illumina protocols (Figure 3). Data acquired with the V1.5 protocol contained more than 10% 3'-ESS for miR-1 sequences, while the data of the V1 protocol did not contain any ESS for miR-1 sequences.
Relative abundance of miRNAs varied for different protocols
Correlation coefficients between qRT-PCR and sequencing data for Dorper sheep and Hu sheep
A relatively higher frequency of sequences that do not map to known miRNAs was observed for SOLiD cloning data (about 20%), including the sequences mapping to mRNAs, repeats, or rRNA genes, and sequences that do not map to any known sequences (Figure 1).
qRT-PCR results correlate the best with SOLiD cloning data
Fold difference analysis between Hu sheep and Dorper sheep using qRT-PCR and sequencing
In the present study, we assessed three different protocols of miRNA library construction using cloning or SBS sequencing, and validated our results by qRT-PCR. SBS sequencing provided a high-throughput and deep measurement for miRNA expression, while the sequencing depth of cloning was much lower, though a concatemerization cloning strategy was developed . SBS sequencing data correlated better with qRT-PCR results than did Illumina cloning data, indicating that sequencing depth would influence the quantitative measurement of miRNA abundance, but the discrepancy caused by it was not significant, as seen from the high correlation between SBS and Illumina cloning data.
Finally, we assessed the relative abundance of 9 miRNAs by qRT-PCR. Reverse transcription (RT)-PCR with stem-loop primers is based on hybridization, as is the SOLiD protocol, which could explain the high correlation between qRT-PCR and SOLiD cloning data. The fold differences between Hu sheep and Dorper sheep measured by qRT-PCR and SBS sequencing correlated significantly, and the fold differences for miR-1 and miR-206 from SOLiD cloning were similar to those obtained with SBS sequencing and qRT-PCR, indicating that methods using a hybridization principle may be more suitable for quantitative measurement of miRNA abundance. Moreover, qRT-PCR has been widely used for validation of microarray results [13, 14] and its accuracy has been recognized.
Total RNA preparation and DNase I treatment, Isolation of small RNAs
Total RNA from skeletal muscle tissues of Hu sheep and Dorper sheep was extracted using Trizol (Invitrogen, Carlsbad, CA) according to the manufacturer's protocol. About 10 µg of total RNA was treated with DNase I (NEB) and then purified by ethanol precipitation.
miRNA library construction
The Illumina V1.5 protocol has been previously described . Pre-adenylated 3' adapter deoxyoligonucleotides with blocked 3' ends were used. A truncated form of T4 RNA ligase 2, Rnl2, was used for 3' adapter ligation without ATP. The 5' adapter, ATP, and T4 RNA ligase were then added to the ligation mix without purification. Reverse transcription PCR was then performed to create and amplify cDNA constructs, which were gel-purified.
The SOLiD protocol allows simultaneous 5' and 3' adapter ligation to the ends of small RNAs, a method based on hybridization of N6 at the end of the adapters. After ligation of adapters, reverse transcription was performed followed by RNase H digestion and cDNA library amplification. The library was finally size-selected and purified.
Cloning and SBS sequencing
Cloning of miRNAs was performed as described previously . The resulting cDNA libraries following the three protocols for Hu sheep and Dorper sheep were cloned and transformed into competent cells. Plasmids were isolated from individual colonies and sequenced. The sequences were subsequently processed to remove vector sequences and used for BLASTN analysis against the miRBase database [2, 3].
SBS sequencing using Illumina Genome Analyzer was performed for cDNA libraries of Hu sheep and Dorper sheep constructed by Illumina V1 protocol. 10 pM of each sample was used for cluster generation. After hybridization of sequencing primer, 35 cycles of base incorporation were carried out on the 1 G analyzer. Image analysis and basecalling were performed using Illumina Pipeline. The sequence tags obtained after purity filtering were sorted and annotated. The reads mapping to known miRNAs were annotated using the miRBase database .
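The adapter removal and tag-sorting steps can be pictured with a minimal sketch. This is not the Illumina Pipeline: the adapter string, the size window, and the toy reads are placeholders chosen only for illustration, and real reads would come from the pipeline's purity-filtered output.

```python
from collections import Counter

# Hypothetical 3' adapter fragment, used only for illustration.
ADAPTER = "TCGTATGCC"

def trim_adapter(read, adapter=ADAPTER):
    """Return the insert preceding the first adapter occurrence, or None if absent."""
    idx = read.find(adapter)
    return read[:idx] if idx != -1 else None

def collapse_reads(reads, min_len=18, max_len=25):
    """Trim reads, keep inserts in the assumed miRNA size range, count identical tags."""
    counts = Counter()
    for read in reads:
        insert = trim_adapter(read)
        if insert is not None and min_len <= len(insert) <= max_len:
            counts[insert] += 1
    return counts

# Toy reads (invented): two identical inserts, one distinct insert,
# one insert that is too short, and one read without any adapter.
reads = [
    "TGGAATGTAAAGAAGTATGTAT" + ADAPTER + "GTCT",
    "TGGAATGTAAAGAAGTATGTAT" + ADAPTER + "GT",
    "TGGAATGTAAGGAAGTGTGTGG" + ADAPTER,
    "ACGT" + ADAPTER,
    "TGGAATGTAAAGAAGTATGTATAAAAAAAA",
]
tags = collapse_reads(reads)
for tag, count in tags.most_common():
    print(tag, count)
```

The resulting table of unique tags and their counts is the kind of input that would then be matched against known miRNA sequences.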
Prediction of End Secondary Structure (ESS) for miR-1 and miR-206
We predicted the secondary structure of the miR-1 reads in the sequencing data with the mfold web server using default parameters [18, 19]. Whenever a stem-loop structure could form at the 5'- or 3'-end of a miRNA sequence, so that a terminal double-stranded structure would appear, we considered the sequence to contain ESS. We counted the reads with and without 5'-ESS, 3'-ESS, or both.
Assessment of sequence variation of miR-1 and miR-206 by WebLogo
We assessed the variation of miR-1 and miR-206 sequences in our data with the WebLogo tool . To reflect the variation of 3'-end sequences, we added "N" to the vacant positions at the 3'-end to bring all the sequences to the same length. miR-1 and miR-206 sequences from cloning and Illumina GA data were all assessed.
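The padding step described above can be sketched as follows. This is an illustrative reimplementation, not WebLogo itself: the function names and the toy reads are assumptions, and the per-position frequencies are only a rough stand-in for what a sequence logo displays.

```python
from collections import Counter

def pad_to_same_length(seqs, pad="N"):
    """Right-pad every sequence with `pad` so all match the longest one."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, pad) for s in seqs]

def position_frequencies(seqs):
    """Return, for each alignment column, a dict of symbol -> relative frequency."""
    padded = pad_to_same_length(seqs)
    n = len(padded)
    freqs = []
    for column in zip(*padded):
        counts = Counter(column)
        freqs.append({symbol: c / n for symbol, c in counts.items()})
    return freqs

# Toy reads (invented): two full-length reads and two reads truncated at the 3'-end.
reads = [
    "TGGAATGTAAAGAAGTATGTAT",
    "TGGAATGTAAAGAAGTATGTAT",
    "TGGAATGTAAAGAAGTATGT",
    "TGGAATGTAAAGAAGTATGTA",
]
padded = pad_to_same_length(reads)
freqs = position_frequencies(reads)
print(padded[2])   # truncated read, padded with N at the 3'-end
print(freqs[-1])   # symbol frequencies at the last position
```

Padding on the 3' side keeps the 5'-ends aligned, so truncations show up as "N" at the terminal positions rather than shifting the whole alignment.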
Real-time quantitative RT-PCR (qRT-PCR)
We selected the following nine miRNAs: miR-1, miR-206, miR-378, miR-486-5p, miR-140, miR-191, miR-16, let-7b, and let-7f. miR-16 was used as the reference. The sequences of the primers are listed in Additional file 9. Stem-loop primers were preheated at 95°C for 3 min, then gradually cooled down to room temperature. 10 ng of purified total RNA was used as template in a total reaction volume of 10 µl. 10 nM of each miRNA-specific reverse transcription primer together with 10 U RNase Out, 5 U Superscript II, 5 mM DTT and 20 mM dNTP were used for each RT reaction. Samples were incubated at 16°C for 30 min, then at 42°C for 30 min, and finally at 75°C for 15 min to inactivate the Superscript II enzyme.
Four microlitres of RT product were used as template for a 20 µl real-time PCR reaction. All reactions were assayed in triplicate. Real-time PCR was performed using a TaKaRa SYBR Premix Ex Taq kit according to the manufacturer's protocol on an Applied Biosystems StepOnePlus Real-time PCR System. The reaction conditions were modified as follows: 95°C for 30 sec, followed by 40 cycles of 95°C for 5 sec and 63°C for 31 sec. miR-16 was used to normalize the results. The relative abundance of miRNAs and the fold difference between Hu sheep and Dorper sheep were calculated using the 2^(-ΔΔCT) method.
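As an illustration of the 2^(-ΔΔCT) calculation, the sketch below normalizes a target miRNA to a reference and compares two samples. The function names and all Ct values are invented for the example; they are not measurements from this study.

```python
def delta_ct(ct_target, ct_reference):
    """Ct of the target miRNA minus Ct of the reference (here, miR-16)."""
    return ct_target - ct_reference

def relative_abundance(ct_target, ct_reference):
    """Abundance of the target relative to the reference: 2^(-delta Ct)."""
    return 2.0 ** (-delta_ct(ct_target, ct_reference))

def fold_difference(ct_target_a, ct_ref_a, ct_target_b, ct_ref_b):
    """Fold difference of sample A over sample B: 2^(-delta delta Ct)."""
    ddct = delta_ct(ct_target_a, ct_ref_a) - delta_ct(ct_target_b, ct_ref_b)
    return 2.0 ** (-ddct)

# Invented mean Ct values, for illustration only.
hu_target, hu_ref = 18.0, 22.0
dorper_target, dorper_ref = 19.0, 22.0

hu_abundance = relative_abundance(hu_target, hu_ref)
fold = fold_difference(hu_target, hu_ref, dorper_target, dorper_ref)
print(hu_abundance, fold)
```

With these toy numbers the target reaches threshold one cycle earlier in the first sample after normalization, which corresponds to a two-fold difference. The calculation assumes roughly 100% amplification efficiency for both target and reference.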
Fold difference analysis using qRT-PCR and sequencing data
The ΔΔCT values of 8 miRNAs between DNase-treated total RNA samples of Hu sheep and Dorper sheep were calculated with miR-16 as the endogenous reference to normalize the qRT-PCR results. The log ratios of read counts of the same 8 miRNAs between the Illumina GA data of Hu sheep and Dorper sheep were calculated. Because of limited sequencing depth, some low-abundance miRNAs cannot be detected by cloning. We therefore selected the two muscle-specific miRNAs, miR-1 and miR-206, which were abundant in all sequencing data, and calculated the log ratio of read counts of these two miRNAs between the data of Hu sheep and Dorper sheep. The fold differences between Hu sheep and Dorper sheep obtained from qRT-PCR and from sequencing data were compared (Additional file 10).
We thank Tian Wei for data analysis, Gan Shang Quan for providing the RNA samples, and Søren Nørby for revising the manuscript. This work was supported by a grant from the Ministry of Science and Technology of China (863 program: 2006AA02A301).
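The comparison can be sketched as below. Several details are assumptions made for the example rather than the procedure used here: read counts are scaled by library size before taking the log2 ratio, the qRT-PCR side is expressed as -ΔΔCT so that both measures are on a log2 scale, and every number and miRNA label is invented.

```python
import math

def log2_ratio(count_a, count_b, total_a, total_b):
    """log2 ratio of library-size-scaled read counts (sample A over sample B)."""
    return math.log2((count_a / total_a) / (count_b / total_b))

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented read counts and library sizes, for illustration only.
hu_counts = {"miR-a": 400, "miR-b": 100, "miR-c": 800}
dorper_counts = {"miR-a": 200, "miR-b": 200, "miR-c": 200}
hu_total, dorper_total = 10_000, 10_000

# Invented qRT-PCR values expressed as -delta delta Ct (log2 fold difference).
neg_ddct = {"miR-a": 0.9, "miR-b": -1.2, "miR-c": 2.1}

names = sorted(hu_counts)
seq_log2 = [log2_ratio(hu_counts[m], dorper_counts[m], hu_total, dorper_total)
            for m in names]
qpcr_log2 = [neg_ddct[m] for m in names]
r = pearson_r(seq_log2, qpcr_log2)
print(round(r, 3))
```

Working on the log2 scale keeps up- and down-regulation symmetric around zero, which makes a linear correlation between the two methods easier to interpret.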
- Winter Julia, Jung Stephanie, Keller Sarina, Gregory Richard, Diederichs Sven: Many roads to maturity: microRNA biogenesis pathways and their regulation. Nature Cell Biol. 2009, 11: 228-234. 10.1038/ncb0309-228.
- Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: MiRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006, 34: D140-D144. 10.1093/nar/gkj112.
- The miRBase Database. [http://www.mirbase.org/]
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005, 434: 338-345. 10.1038/nature03441.
- Lu Cheng, Meyers Blake, Green Pamela: Construction of small RNA cDNA libraries for deep sequencing. Methods. 2007, 43: 110-117. 10.1016/j.ymeth.2007.05.002.
- Bar M, Wyman SK, Fritz BR, Tewari M: MicroRNA Discovery and Profiling in Human Embryonic Stem Cells by Deep Sequencing of Small RNA Libraries. Stem Cells. 2008, 26: 2496-2505. 10.1634/stemcells.2008-0356.
- Landgraf Pablo, Rusu Mirabela, Sheridan Robert, et al: A mammalian microRNA expression atlas based on small RNA library sequencing. Cell. 2007, 129: 1401-1414. 10.1016/j.cell.2007.04.040.
- Reddy Matta Alavala, Zheng Yun, Jagadeeswaran Guru, Macmil Simone, Graham Wiley, Roe Bruce, Desilva Udaya, Zhang Weixiong, Sunkar Ramanjulu: Cloning, characterization and expression analysis of porcine microRNAs. BMC Genomics. 2009, 10: 1471-2164. 10.1186/1471-2164-10-65.
- Chen Caifu, Ridzon Dana, Broomer Adam, et al: Real-time quantification of microRNAs by stem-loop RT-PCR. Nucleic Acids Res. 2005, 33: e179. 10.1093/nar/gni178.
- Callis Thomas, Deng ZhongLiang, Chen Jian-Fu, Wang Da-Zhi: Muscling through the microRNA world. Exp Biol Med (Maywood). 2008, 233: 131-138. 10.3181/0709-MR-237.
- Pfeffer Sebastien, Lagos-Quintana Mariana, Tuschl Thomas: Cloning of small RNA molecules. Current Protocols in Molecular Biology. 2005, 26.4.1-26.4.18.
- Nichols NM, Tabor S, McReynolds LA: RNA ligase. Curr Protoc Mol Biol. 2008, Chapter 3 (Unit 3.15).
- Bruchova Hana, Merkerova Michaela, Prchal Josef: Aberrant expression of microRNA in polycythemia vera. Haematologica. 2008, 93: 1009-1016. 10.3324/haematol.12706.
- Gibcus Johan, Tan Ping Lu, Harms Geert, Schakel Nynke Rikst, de Jong Debora, Blokzijl Tjasso, Moller Peter, Poppema Sibrand, Kroesen Bart-Jan, van den Berg Anke: Hodgkin lymphoma cell lines are characterized by a specific miRNA expression profile. Neoplasia. 2009, 11: 167-176.
- Hafner Markus, Landgraf Pablo, Ludwig Janos, et al: Identification of microRNAs and other small regulatory RNAs using cDNA library sequencing. Methods. 2008, 44: 3-12. 10.1016/j.ymeth.2007.09.009.
- Sunkar Ramanjulu, Girke Thomas, Jain Kumar Pradeep, Zhu Jian-Kang: Cloning and characterization of microRNAs from rice. The Plant Cell. 2005, 17: 1397-1411. 10.1105/tpc.105.031682.
- Ambros V, Bartel B, Bartel DP, Burge CB, Carrington JC, Chen X, Dreyfuss G, Eddy SR, Griffiths-Jones S, Marshall M, Matzke M, Ruvkun G, Tuschl T: A uniform system for microRNA annotation. RNA. 2003, 9: 277-279. 10.1261/rna.2183803.
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003, 31: 3406-15. 10.1093/nar/gkg595.
- Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol. 1999, 288: 911-940. 10.1006/jmbi.1999.2700.
- Crooks Gavin, Hon Gary, Chandonia John-Marc, Brenner Steven: WebLogo: A sequence logo generator. Genome Research. 2004, 14: 1188-1190. 10.1101/gr.849004.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.