Characterization of unknown genetic modifications using high throughput sequencing and computational subtraction
BMC Biotechnology volume 9, Article number: 87 (2009)
When generating a genetically modified organism (GMO), the primary goal is to give a target organism one or several novel traits by using biotechnology techniques. A GMO will differ from its parental strain in that its pool of transcripts will be altered. Currently, there are no methods that are reliably able to determine if an organism has been genetically altered if the nature of the modification is unknown.
We show that the concept of computational subtraction can be used to identify transgenic cDNA sequences from genetically modified plants. Our datasets include 454-type sequences from a transgenic line of Arabidopsis thaliana and published EST datasets from commercially relevant species (rice and papaya).
We believe that computational subtraction represents a powerful new strategy for determining if an organism has been genetically modified as well as to define the nature of the modification. Fewer assumptions have to be made compared to methods currently in use and this is an advantage particularly when working with unknown GMOs.
Genetically modified organisms have been engineered through the stable integration of a recombinant genetic cassette into the genome of a recipient organism. The purpose of generating a genetically modified organism (GMOs) is, like breeding in general, to provide the new variety with novel features, and for introduced traits to be inheritable, the nuclear or organellar genome has to be altered. Protein coding mRNAs represent a causal starting point for most metabolic processes and structural components of a cell, and a cell's pattern of RNA transcription reflects the coding potential of its genome. For a genetic modification to have an effect, it is thus also vital that it changes the coding capacity of the recipient cell.
The strategy most commonly used when generating genetically modified plants that are commercially relevant is to introduce a genetic construct that either confers some kind of advantage when it comes to farming/storage or increases the nutritional quality of the end product. Among the most widely used genetic features are genes that encode herbicide tolerance, insect resistance or improve content of key nutrients http://www.agbios.com/. In addition to these trait genes, various selection markers are also usually introduced in order to simplify the process of GMO generation. These genes include herbicide resistance genes such as the bialaphos resistance gene (bar) from Streptomyces hygroscopicus , antibiotic resistance genes such as the neomycin phosphotransferase II gene (nptII) from Escherichia coli found in the Flavr Savr tomato  or positive selection markers such as the phosphomannose isomerase gene (pmi) from E. coli (used in for instance Golden Rice, see ). Careful examination of the pool of transcripts found in a plant should therefore reveal whether or not a plant has been genetically modified.
Recently, a new strategy for identification of foreign nucleic acids (DNA or RNA) called computational subtraction has been described for pathogen discovery in human diseases of unknown etiology . In short, the approach takes advantage of the fact that for a growing number of species the complete genomic sequence has now been generated, and sequencing costs have been dropping dramatically in recent years. Using sequence similarity search algorithms it is thus possible to analyze DNA or RNA sequence data from a sample, compare the sequences against a set of reference sequences, and filter away all the endogenous ('expected') reads, leaving a small collection of sequences that do not appear to stem from the organism in question. This principle appears to work well even when subtracting short sequence tags , and should be an efficient way to identify for instance unexpected transcripts.
We have attempted to use high massively parallel pyrosequencing and the concept of computational subtraction to look for allochthonous transcripts in a transgenic line of Arabidopsis thaliana. We also explore the concept of computational subtraction in silico using expressed sequence tag (EST) data from transgenic rice and papaya.
The cDNA sequencing of transgenic A. thaliana gave a total of 79,990 reads, yielding 17,457,856 bases (average read length: 218 bases) and the raw data were deposited in GenBank's Short Read Archive (SRA) as submission SRA009344: http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&cmd=search&term=SRA009344+. Sequence tag extraction gave a total of 58,933 high quality 75-basepair sequences. Computational subtraction was performed on the tag datasets and very few A. thaliana sequences remained after the second round of subtraction (Table 1). The remaining pool of sequence tags consisted almost exclusively of sequences with a high degree of sequence similarity to the pBI121 vector sequence (Table 1). Thirteen tags did not match the pBI121 vector or our reference transcriptome/genome sequences, but these sequences were all close matches to A. thaliana accessions or other plant sequences in the NCBI nt database. The maximum bitscore possible using our megablast settings and sequence length (75 basepairs) was 149, and average score obtained for the remaining 146 sequences was 145.5 when megablast was used against the T DNA (transfer DNA) region of pBI121. For the collection of 75-basepair prokaryotic tags on the other hand, only a very small number of tags were subtracted (Table 1).
A number of transgenic EST reads could be identified in both the rice and the papaya sequence collections (Figure 1). Both the trait genes and selection markers seemed to have reasonable expression levels, and some reads from papaya also showed some diversity in the 5' end of the coat protein transcript (Figure 1). The two different sequences found corresponded to two different versions of transgenic papaya; one with the complete transcript from the papaya ringspot virus and one earlier version where a composite sequence comprising a part of the papaya ringspot virus genome as well as a part of the cucumber mosaic virus genome was used .
Most of the methods currently used for characterization of (unknown) genetic modifications rely on PCR . This approach assumes some knowledge about the target sequence, as it relies on primer design. High density array-based methods that make fewer assumptions about the nucleic acids to be detected have been suggested and developed [8, 9], but even here some basic assumptions have to be made. By using high throughput sequencing of either a cDNA or a genomic/organellar DNA library, it should be possible to detect any novel transcript or genetic construct. The exception would be if one works with cDNA and the target organisms' only novel feature on the expression level is the increased or reduced expression of an otherwise endogenous gene .
Computational subtraction might also be performed using genomic DNA instead of mRNA. The number of sequences that need to be derived for computational subtraction to be effective when working with transcripts will depend upon the frequency and length of the transgenic mRNA versus the pool of endogenous mRNA and small transgenic transcripts and/or a low level of expression will require deeper sequencing. The same principle applies to computational subtraction using genomic DNA, but here the size of the inserted construct relative to the target genome will be the most important factor . Using A. thaliana transformed with pBI121 as an example, the insert size is 6,192 bases (GenBank accession number AF485783) and the genome size of A. thaliana is 125,000,000 basepairs  (excluding mitochondrial and chloroplast genome). If we had sequenced 58,933 genomic tags, we could have expected only to find <3 sequence tags that had sequence overlap with the insert.
One way to increase the likelihood of picking up GM-specific nucleic acids would be to do an in vitro physical subtraction of the DNA/RNA before library preparation. This would reduce the amount of nucleic acids that the sample would have in common with a (wildtype) reference and increase the relative amount of the GM-associated DNA or transcripts. There are kits available for performing suppressive subtractive hybridization based on published techniques  and subtractions can also be performed commercially (offered by for instance by Eurofins MWG/Operon, see Products & Services at http://www.eurofinsdna.com/).
Regardless of what the starting material for library preparation is, the target organisms' transcriptome and/or genome must be well characterized. Sequence filtering might be done using data from a close relative (see for instance the use of mouse data in ), but this alone will not be sufficient when working with a high number of sequence reads. At time of writing, ten large plant genome sequencing projects had been completed: Arabidopsis thaliana (thale cress), Glycine max (soybean), Phoenix dactylifera (date palm tree), Medicago truncatula (barrel medic), Oryza sativa (rice), Populus trichocarpa (black cottonwood), Sorghum bicolor (sorghum), Vitis vinifera (wine grape), Carica papaya (papaya) and Zea mays (corn). Many more species are slated to be sequenced in the near future http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html, so we believe that for the major crop species this will not be a limiting factor for long.
A possible example of the potential benefits of such an approach was observed in our collection of downloaded EST libraries where a library from a unpublished project entitled 'Subtractive cloning of differentially expressed mRNA from transgenic rice plants' was found (library name: Oryza sativa cv. Pusa Basmati-1). This library comprised only 9 sequences, but even with this small number a reads, a transgenic EST could be detected. The 242 basepair sequence found (accession number AJ309294) was a 100% match with the gene trapping Ds/T-DNA vector pDsG8 designed for insertion mutagenesis in rice .
The data generated in this study can also be used to search for other novel transcripts than those that represent transgenic candidates. Careful examination of the 5.568 transcripts that were found that did not match the reference A. thaliana transcriptome but matched the genome sequence well (Table 1; 5,727-159 = 5,568), revealed several potentially novel, spliced endogenous genes (data not shown). We do not believe that these transcripts are directly linked to the genetic modification, but this merely demonstrates how versatile data generated using high throughput sequencing of cDNA libraries can be.
As the amount of available sequences data increases and DNA sequencing costs drop, we believe that a sequencing-based approach using computational subtraction will be feasible for the detection, characterization and risk assessment of genetic modifications. In this pilot study we have shown that transgenic cDNA can be detected using genetically modified plants as a model system.
Plant growth and RNA isolation
A. thaliana seeds from plants vacuum infiltrated with Agrobacterium  carrying the pBI121 35S:GUS Ti plasmid (also includes the nptII selection marker; Clontech, Mountain View, CA, USA) were surface sterilized and grown on Murashige and Skoog medium  without kanamycin for 10 days in growth chambers at 22°C for 8 h of dark and 16 h of light (100 μEm-2s-1). 10 day old frozen A. thaliana seedlings were grinded using a pestle and mortar in the presence of liquid nitrogen and total RNA was isolated using the Spectrum Plant Total RNA kit (Sigma, St. Louis, MO, USA) following the manufacturer's recommendations. RNA was eluted once in 50 μl of elution solution. Quantification of RNA was done using a NanoDrop ND-1000 Spectrophotometer (Thermo Scientific NanoDrop Products, Wilmington, DE, USA).
Library construction, sequencing and computational subtraction
The mRNA was DNase I treated using a deoxyribonuclease I kit (Sigma) and subsequently reverse transcribed using the SMART PCR cDNA Synthesis Kit (Clontech). Briefly, first-strand synthesis was done using the 3' SMART CDS Primer II A oligonucleotide and PrimeScript Reverse Transcriptase (Takara Bio Inc., Shiga, Japan) in combination with the SMART II A oligonucleotide. cDNA was amplified using the 5' PCR primer II A, and six 50 μl reactions (150 ng DNase-treated RNA per sample as template) were done using 21 PCR cycles. Amplification products were pooled, phenol/chloroform/isoamyl alcohol extracted and ammonium acetate/ethanol precipitated. The pellet was dissolved in molecular grade water and DNA quantification confirmed that the yield was as expected using this kit and protocol.
Nebulization was used to fragment 5 ug of cDNA. Adaptors where appended to the fragments and one GS LR25 sequencing kit (PTP 25 × 75) was used according to manufacturer's recommendations (Roche Applied Science, Indianapolis, IN, USA). Sequencing and library construction was done at the Centre for Ecological and Evolutionary Synthesis' Ultra-high Throughput Sequencing Platform (University of Oslo, Norway) using the 454 Genome Sequencer FLX System (Roche Applied Science).
From the raw data, tags with high sequence quality were extracted. A 75 basepair window was slid through the reads, and when a window that did not overlap with the SMART PCR cDNA linkers or 454 sequencing key and that had an average sequence quality score  above 30 was found, a tag was extracted and the algorithm proceeded to the next read.
Sequence subtractions were performed using megablast  against a collection of reference mRNA sequences from A. thaliana (TAIR8_cdna, downloaded from The Arabidopsis Information Resource ftp site: ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR8_blastsets/ with word size 20, no filter for low complexity regions and a high expect (e) value (1000). All of the sequences that gave a match were removed, and the procedure was repeated using the most recent release of the A. thaliana nuclear (downloaded from the National Center for Biotechnology Information ftp site: ftp://ftp.ncbi.nih.gov/genbank/genomes/A_thaliana/ as well as the mitochondrial and chloroplast genome (NC_000932 and NC_001284, respectively).
In order to test the robustness of the subtraction, a random collection of 75-basepair sequences was extracted from a set of 200+ completely sequenced bacterial genomes. Trait genes used in biotechnology are often of prokaryotic origin, and we used this to simulate what would happen if expression of an unknown prokaryotic gene was to be detected in a pool of endogenous A. thaliana transcripts.
Rice and papaya EST libraries
To test the feasibility of finding cDNA sequences derived from inserted GMO cassettes in a transcript libraries prepared from other plant species, we searched the National Center for Biotechnology Information (NCBI) EST database ftp://ftp.ncbi.nih.gov/repository/dbEST/ for sequence collections derived from genetically modified plants. Focusing on transgenic lines that had an associated publication and that did not merely overexpress endogenous genes we ranked all the libraries found according to size. The largest library was from genetically modified papaya (Carica papaya). This cDNA sequence collection had been compiled as a part of the work to characterize the SunUp papaya genome and transcriptome . The six sets of papaya data contained a total of more than 75,000 sequences (EST libraries PY01-PY06; http://www.ncbi.nlm.nih.gov/sites/entrez). The second largest library found (UniGene library 14238) had been derived from GM rice (Oryza sativa) and contained 5,455 sequences. It was an unpublished part of a 2005 article by Dr. Oh and colleagues ( and Dr. Yeon-Ki Kim, personal communication). The rice line had been transformed with a construct containing the abscisic acid responsive element binding transcription factor 3 gene (ABF3) from A. thaliana as well as the bar gene (phosphinothricin acetyltransferase) as selection marker.
Unfortunately, neither of these two sequence collections appeared to have been filtered for sequence quality or accurately trimmed to remove cloning vector sequences before being submitted. This made efficient computational subtraction intractable (in spite of both the rice and papaya genomes being publicly available), so we decided to instead specifically screen the two libraries for the presence of non-endogenous transcripts (as opposed to removing endogenous transcripts through filtering). The EST sequences were analyzed using BLAST sequence similarity searches and the information that could be obtained from the published ABF3 rice and SunUp papaya GMO cassette construct maps (see references above), the GMO Detection Method Database  and the nt sequence collection hosted by NCBI.
Thompson CJ, Movva NR, Tizard R, Crameri R, Davies JE, Lauwereys M, Botterman J: Characterization of the herbicide-resistance gene bar from Streptomyces hygroscopicus. EMBO J. 1987, 6: 2519-2523.
Sheehy RE, Kramer M, Hiatt WR: Reduction of polygalacturonase activity in tomato fruit by antisense RNA. Proc Natl Acad Sci USA. 1988, 85: 8805-8809. 10.1073/pnas.85.23.8805.
Ye X, Al-Babili S, Kloti A, Zhang J, Lucca P, Beyer P, Potrykus I: Engineering the provitamin A (beta-carotene) biosynthetic pathway into (carotenoid-free) rice endosperm. Science. 2000, 287: 303-305. 10.1126/science.287.5451.303.
Weber G, Shendure J, Tanenbaum DM, Church GM, Meyerson M: Identification of foreign gene sequences by transcript filtering against the human genome. Nat Genet. 2002, 30: 141-142.
Tengs T, LaFramboise T, Den RB, Hayes DN, Zhang J, DebRoy S, Gentleman RC, O'Neill K, Birren B, Meyerson M: Genomic representations using concatenates of Type IIB restriction endonuclease digestion fragments. Nucleic Acids Res. 2004, 32: e121-10.1093/nar/gnh120.
Tripathi S, Suzuki J, Gonsalves D: Development of genetically engineered resistant papaya for papaya ringspot virus in a timely manner: a comprehensive and successful approach. Methods Mol Biol. 2007, 354: 197-240.
Holst-Jensen A, Ronning SB, Lovseth A, Berdal KG: PCR technology for screening and quantification of genetically modified organisms (GMOs). Analytical and Bioanalytical Chemistry. 2003, 375: 985-993.
Nesvold H, Kristoffersen AB, Holst-Jensen A, Berdal KG: Design of a DNA chip for detection of unknown genetically modified organisms (GMOs). Bioinformatics. 2005, 21: 1917-1926. 10.1093/bioinformatics/bti248.
Tengs T, Kristoffersen AB, Berdal KG, Thorstensen T, Butenko MA, Nesvold H, Holst-Jensen A: Microarray-based method for detection of unknown genetic modifications. BMC Biotechnol. 2007, 7: 91-10.1186/1472-6750-7-91.
Conner AJ, Barrell PJ, Baldwin SJ, Lokerse AS, Cooper PA, Erasmuson AK, Nap JP, Jacobs JME: Intragenic vectors for gene transfer without foreign DNA. Euphytica. 2007, 154: 341-353. 10.1007/s10681-006-9316-z.
LaFramboise TL, Hayes DN, Tengs T: Statistical analysis of genomic tag data. Stat Appl Genet Mol Biol. 2004, 3: Article 34-
Kaul S, Koo HL, Jenkins J, Rizzo M, Rooney T, Tallon LJ, Feldblyum T, Nierman W, Benito MI, Lin XY, et al: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
He CK, Dey M, Lin ZH, Duan FP, Li FL, Wu R: An efficient method for producing an indexed, insertional-mutant library in rice. Genomics. 2007, 89: 532-540. 10.1016/j.ygeno.2006.11.014.
Bechtold N, Ellis J, Pelletier G: In-Planta Agrobacterium-Mediated Gene-Transfer by Infiltration of Adult Arabidopsis-Thaliana Plants. Comptes Rendus de l Academie des Sciences Serie Iii-Sciences de la Vie-Life Sciences. 1993, 316: 1194-1199.
Murashige T, Skoog F: A revised medium for rapid growth and bio assays with tobacco tissue cultures. Physiol Plant. 1962, 15: 473-497. 10.1111/j.1399-3054.1962.tb08052.x.
Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res. 2008, 18: 763-770. 10.1101/gr.070227.107.
Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.
Ming R, Hou S, Feng Y, Yu Q, onne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KL, et al: The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature. 2008, 452: 991-996. 10.1038/nature06856.
Oh SJ, Song SI, Kim YS, Jang HJ, Kim SY, Kim M, Kim YK, Nahm BH, Kim JK: Arabidopsis CBF3/DREB1A and ABF3 in transgenic rice increased tolerance to abiotic stress without stunting growth. Plant Physiol. 2005, 138: 341-351. 10.1104/pp.104.059147.
Dong W, Yang L, Shen K, Kim B, Kleter GA, Marvin HJ, Guo R, Liang W, Zhang D: GMDD: a database of GMO detection methods. BMC Bioinformatics. 2008, 9: 260-10.1186/1471-2105-9-260.
Rott ME, Lawrence TS, Wall EM, Green MJ: Detection and quantification of roundup ready soy in foods by conventional and real-time polymerase chain reaction. J Agric Food Chem. 2004, 52: 5223-5232. 10.1021/jf030803g.
The authors would like to thank Kjetil Fosnes for help with RNA isolation, Tore Brembu for assisting with plant transformations and Laura MacConaill for advice on the computational subtraction. This work was financially supported by the Research Council of Norway (grant number 170363/D10 and 178288/I10) and by the European Commission through the Sixth Framework Program, integrated project Co-Extra http://www.coextra.org; contract FOOD-2005-CT-007158.
TT conceived the idea and wrote the final version of the manuscript. HZ prepared cDNA libraries, JB and ABK did the computational subtraction, MAB and HGS provided mRNA from transgenic A. thaliana varieties and AHJ provided funding, supervised and guided the project together with KGB. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Tengs, T., Zhang, H., Holst-Jensen, A. et al. Characterization of unknown genetic modifications using high throughput sequencing and computational subtraction. BMC Biotechnol 9, 87 (2009). https://doi.org/10.1186/1472-6750-9-87
- Genetically Modify Organism
- Sequence Collection
- Papaya Ringspot Virus
- Genetically Modify Organism Detection
- Transgenic cDNA