Production of soluble mammalian proteins in Escherichia coli: identification of protein features that correlate with successful expression

Background: In the search for generic expression strategies for mammalian protein families, several bacterial expression vectors were examined for their ability to promote high yields of soluble protein. Proteins studied included cell surface receptors (Ephrins and Eph receptors, CD44), kinases (EGFR-cytoplasmic domain, CDK2 and 4), proteases (MMP1, CASP2), signal transduction proteins (GRB2, RAF1, HRAS) and transcription factors (GATA2, Fli1, Trp53, Mdm2, JUN, FOS, MAD, MAX). Over 400 experiments were performed in which the expression of 30 full-length proteins and protein domains was evaluated with 6 different N-terminal and 8 C-terminal fusion partners. Expression of an additional set of 95 mammalian proteins was also performed to test the conclusions of this study.

Results: Several protein features correlated with soluble protein expression yield, including molecular weight, the number of contiguous hydrophobic residues and the number of low complexity regions. There was no relationship between successful expression and protein pI, grand average of hydropathicity (GRAVY), or sub-cellular location. Only small globular cytoplasmic proteins with an average molecular weight of 23 kDa did not require a solubility enhancing tag for high level soluble expression. Thioredoxin (Trx) and maltose binding protein (MBP) were the best N-terminal protein fusions to promote soluble expression, but MBP was most effective as a C-terminal fusion. 63 of 95 mammalian proteins expressed at soluble levels of greater than 1 mg/l as N-terminal H10-MBP fusions, and those that failed possessed, on average, a higher molecular weight and a greater number of contiguous hydrophobic amino acids and low complexity regions.

Conclusions: By analysis of the protein features identified here, this study will help predict which mammalian proteins and domains can be successfully expressed in E. coli as soluble product and which are best targeted for a eukaryotic expression system. In some cases proteins may be truncated to minimise molecular weight and the numbers of contiguous hydrophobic amino acids and low complexity regions to aid soluble expression in E. coli.


Background
The production of purified proteins is important for several experimental approaches aimed at assigning gene function, including antibody generation for immunocytochemistry and immunoprecipitation studies [1][2][3], in vitro mapping of protein-protein, protein-DNA or protein-RNA interactions [4,5] and structure determination [6]. The availability of proteins is also important for biomedical applications such as small molecule drug discovery and the production of therapeutic proteins and vaccines. In these situations it is essential to be able to reliably express the proteins in a heterologous system and purify them so that they possess the same folds and structure as they would in a natural in vivo state. To achieve this on a whole proteome scale a generic approach must be taken to the expression of protein families, unlike the traditional protein chemistry approach of optimising the isolation of individual proteins on a case-by-case basis. E. coli has been the expression system of choice for the majority of laboratories engaged in high-throughput, multiplexed cloning, expression and purification of proteins for structural genomics [7]. The advantages of E. coli as an expression host include well studied physiology and genetics, the availability of advanced genetic tools [8][9][10], rapid growth, high-level protein production rates achieving up to 10-30% of total cellular protein, ease of handling in a standard molecular biology laboratory, low cost and the ability to multiplex both expression screening [11] and protein production [12]. There are, however, several disadvantages of expression in a prokaryotic system, particularly for eukaryotic proteins. The lack of eukaryotic chaperones, of specialised post-translational modifications, of targeting to sub-cellular locations and of stabilising binding partners with which to form complexes can result in protein mis-folding and aggregation. For example, when 2078 randomly selected C. elegans full-length genes were cloned and expressed in E. coli only 11% yielded soluble protein [13]. Similarly, of 44 cloned human proteins, 12 were expressed solubly and 4 were purified to homogeneity [14]. With the exception of full-length membrane proteins, protein solubility has been shown to be a good indicator of correct folding as determined by functional binding [15,16] or enzymatic [17] assays. Purification of inclusion bodies followed by in vitro refolding has been used in a number of cases, but refolding conditions are highly protein specific and so are unlikely to be useful for high-throughput protein expression.
There are several fall-back strategies for expression of correctly folded eukaryotic proteins in E. coli, one of which is to truncate long multi-domain proteins into separate domains, as has been performed for the Ephb2 receptor [15,18,19]. Reducing translation rates, so that proteins have an increased chance of folding into a native state before aggregating with folding intermediates, can be achieved by lowering the temperature after induction [20] or inducing with lower concentrations of IPTG [21]. Alternative approaches include: co-expressing stabilising binding partners (see review [7]) or chaperones [22]; the induction of chaperones by heat shock [23] or chemical treatment [24]; or the use of genetically modified host strains that can conduct oxidative protein folding in the cytoplasm [25,26], over-express rare tRNAs [27] or lipid rafts [28]. Perhaps one of the most successful generic strategies to enhance the expression of soluble proteins is fusion with solubility enhancing tags, such as maltose binding protein (MBP), thioredoxin (Trx) and glutathione-S-transferase (GST) [29][30][31].
The aim of this work was to ask whether it is possible to derive some general conclusions regarding which expression strategy would most likely result in the expression of soluble, functionally active mammalian protein on a family-by-family or domain-by-domain basis. A deep-mining approach was taken to maximise the chances of successful expression by examining the soluble expression of 30 different proteins using 14 different expression vectors. This study allowed us to draw several conclusions regarding the best strategies to adopt for the soluble expression of different mammalian proteins in bacteria. The conclusions were tested by the expression of an additional 95 mammalian proteins.

Expression clone construction
The 30 proteins chosen for this expression study are listed in Table 1. With the exception of GFP, they are all human or mouse proteins, and represent several diverse protein families with extra-cellular, cytoplasmic and nuclear cell locations. The list includes a mixture of full-length and truncated proteins expected to be easy or more challenging to express in a bacterial system. Protein truncations were designed to express individual domains annotated from the SwissProt [32] or Pfam [33] databases or following previous examples of successful expression [15]. The genes were isolated from cDNA using a nested PCR strategy [34] or provided by the FlexGene Consortium (http://www.hip.harvard.edu/flex_gene/index.htm) and sequence confirmed. A recombinational cloning strategy termed "GATEWAY" cloning [35,36] was employed, based on a modification of the phage lambda site-specific recombination system [37]. Primers were designed using the nearest neighbour algorithm [38] and open reading frames (ORFs) were PCR amplified from first strand cDNA with 5' attB1 and 3' attB2 linkers and then recombined with pDONR221 (Invitrogen) to give a set of entry clones, which were sequence confirmed and then recombined with various destination vectors to give the expression constructs. Two sets of clones were generated for each ORF, with and without stop codons, for expression with N- or C-terminal tags respectively. Recombinational cloning was useful in this study because the same set of ORFs could be cloned into a large set of different expression vectors without the need to check for compatible restriction sites in each vector or their absence within the ORFs.
For this study a set of destination vectors was constructed by modifying pET-DEST42 (see Materials and Methods). The T7 promoter was chosen over other promoters commonly used for bacterial expression because of the high specificity and processivity of T7 RNA polymerase and the wide choice of expression strains currently available. Briefly, multicloning sites were created either 5' of the attR1 or 3' of the attR2 recombination sites for insertion of DNA inserts encoding N- or C-terminal tags respectively. The expression vectors contained a T7lac promoter [39] for improved control of basal expression. The N-terminal tag expression vectors contained a sequence at the translational start site to provide a partial match with the downstream box (ATG AAT CAC CAT), shown to provide enhancement of translation [40], and a decahistidine (H10) tag for enhanced affinity for nickel resins compared with hexahistidine (H6) tags (data not shown). A fusion partner was inserted between the H10 tag and recombination sites to examine the effect on soluble protein expression. Unlike previous tag comparisons [29][30][31], here the same promoter and 5'-UTR sequence were employed so that any expression differences observed would be due purely to the presence of the fusion partner. A vector was also included in this study (pDEST17) with a T7 promoter and no downstream lac operator, which would add a H6 tag at the N-terminus (Figure 1).

Figure 1: Expression vector constructs after recombination between the destination and entry plasmids (example shown: pDEST-C102).

Effect of different N-terminal fusions on expression
Expression plasmids generated by recombination reactions were used to transform E. coli BL21(DE3), an expression strain containing chromosomally integrated T7 RNA polymerase gene (λDE3 lysogen) under the control of the lacUV5 promoter. To handle a large number of expression experiments (420 total) and associated manipulations to screen for total and soluble expression in E. coli, the recombinational cloning, transformation, growth of expression cultures and cell lysis and filtration separation of insoluble protein were performed in 96-well plate format. Figure 2 shows Western blots for total and soluble protein expression 2 hours after induction with 1 mM IPTG as described in Materials and Methods. The method for separating total from soluble proteins was based on that of Knaust and Nordlund [11] and consisted of detergent lysis of harvested cells followed by filtration through a 0.65 µm 96-well filter plate, which separates larger inclusion bodies from the soluble fraction. The filtration method agrees well with traditional centrifugation methods to separate soluble from insoluble protein [11,41] and has the advantage that multiple samples can be processed in parallel. Quantitation was achieved by separating the proteins by SDS-PAGE, electro-blotting onto PVDF membranes and detecting His tagged proteins with an anti-His5 monoclonal antibody followed by probing with an anti-mouse Cy-5 labelled antibody. The advantage of expression analysis by Western blot, compared to dotblots, is that this allows one to quantitate the expression levels of full-length constructs and eliminate the contribution from cleaved protein tag. It was found that Western blots based on fluorescence detection [42] gave a greater dynamic range of detection compared with detection based on enzymatic amplification such as horse radish peroxidase (data not shown). A His-tagged protein molecular weight ladder was used for normalisation to eliminate any blot to blot variation. 
Table 2 shows the results of this analysis, quantitating expression yields in terms of mg expressed protein per litre of induction media for total and soluble expression. Expression yields greater than 2 mg/l are highlighted in bold. A first grouping of proteins expressed solubly with the H10 tag alone, without a solubility enhancing fusion partner. The molecular weight of these proteins ranged from 9-35 kDa and averaged 22.8 kDa. These proteins are all expressed in the cytoplasm, have an average of 1 low-complexity region, 3.8 contiguous hydrophobic amino acids (hp_aa), pI of 6.6, grand average of hydropathicity index (termed GRAVY [43], where an increasingly positive number indicates increased hydrophobicity) of -0.32, 2.6% cysteine residues and no coiled-coil structures. A second grouping of proteins was observed where soluble expression was improved when expressed as H10-Trx or H10-MBP fusions compared with the H10 tag alone. This grouping included GRB2, Efnb2(EC1 or 2), MAD, MAX, Efna1 (FL and EC). The molecular weight of these proteins ranged from 16-25 kDa and averaged 20.5 kDa. These proteins were a mixture of those expressed in the cytoplasm, nucleus and extra-cellular, have an average of 0.71

Table footnote: Numbers correspond to total (T) or soluble (S) expression yield (mg/l). Yields greater than 2 mg/l are in bold; nc, not cloned.

This may result in a faster rate of transcript synthesis, after induction with IPTG, and hence translation rates (due to an increased concentration of mRNA) for pDEST17 compared with pN110. If translation rate exceeds the rate of protein folding, then increased production of insoluble protein would occur.

MBP-H10 was the most effective tag at the C-terminus to promote protein solubility, with an average full-length construct soluble yield of 5.0 mg/l, which compares well with an average of 5.8 mg/l when this tag is fused at the N-terminus. The order of C-terminal tags to promote soluble expression was similar for total expression: MBP-H10 > GST-H10 > V5-H6 > Dhfr-H10 ~ GFP-H10 ~ Trx-H10 > H10 ~ DHFR-H10. Thioredoxin was not as effective a solubility enhancing tag when fused at the C-terminus, with an average soluble yield of only 0.7 mg/l compared with 6.0 mg/l when fused to the N-terminus.
Several correlations with protein features are seen when one groups the MBP fusions according to soluble protein expression levels. For the first group, where soluble expression levels were in the range of 5-50 mg/l, the average molecular weight, pI and GRAVY score were 20.6 kDa, 5.9 and -0.58 respectively. The average numbers of contiguous hydrophobic amino acids, low complexity and coiled-coil regions were 3.1, 0.56 and 0.22 respectively. The second group displayed soluble expression levels between 1-5 mg/l. Here, the average molecular weight, pI and GRAVY score were 25.1 kDa, 7.9 and -0.39 respectively and the average numbers of contiguous hydrophobic amino acids, low complexity and coiled-coil regions were 4.3, 0.71 and 0 respectively. The last group displayed soluble expression levels between 0-1 mg/l. Here the average molecular weight, pI and GRAVY score were 41.1 kDa, 6.2 and -0.51 respectively and the average numbers of contiguous hydrophobic amino acids, low complexity and coiled-coil regions were 5, 2.43 and 0.21 respectively. There were representatives of nuclear, cytoplasmic and extra-cellular proteins in all three groupings.

Table footnote: Numbers correspond to total (T) or soluble (S) expression yield (mg/l). Yields greater than 2 mg/l are in bold.
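The grouping used above (binning constructs by soluble yield and then averaging a feature within each bin) can be sketched in a few lines of Python. This is only an illustration: the function names, the treatment of the bin boundaries (a yield of exactly 1 or 5 mg/l is placed in the higher bin) and the example records are assumptions, and the records are made-up values, not data from this study.

```python
from statistics import mean

def yield_group(soluble_mg_per_l):
    """Assign a soluble-yield bin using the cut-offs quoted in the text.

    Assumption: boundary values (exactly 1 or 5 mg/l) go to the higher bin,
    and anything above 5 mg/l is placed in the top bin.
    """
    if soluble_mg_per_l >= 5:
        return "5-50 mg/l"
    if soluble_mg_per_l >= 1:
        return "1-5 mg/l"
    return "0-1 mg/l"

def group_means(records, feature):
    """Average one numeric feature within each soluble-yield bin."""
    groups = {}
    for rec in records:
        groups.setdefault(yield_group(rec["soluble"]), []).append(rec[feature])
    return {name: mean(values) for name, values in groups.items()}

# Hypothetical records for illustration only; not data from this study.
records = [
    {"name": "protA", "soluble": 12.0, "mw_kda": 18.0},
    {"name": "protB", "soluble": 7.0, "mw_kda": 22.0},
    {"name": "protC", "soluble": 2.5, "mw_kda": 26.0},
    {"name": "protD", "soluble": 0.2, "mw_kda": 45.0},
]

print(group_means(records, "mw_kda"))
```

The same helper can be called with any other numeric feature key (for example a count of low complexity regions) once it is present in each record.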

Expression of a test set of 95 mammalian proteins
A diverse set of proteins was chosen to test the conclusions of this study (Table 4). They range from proteins that are well annotated, some of which have been expressed in E. coli previously (Nfkb1), to those that contain no PfamA domains and have not been expressed in E. coli previously (Maat1, BC031407, Ttyhl, 1500001H12RIKEXT2, Ext2, KIAA1136, G2 and KIAA1549). They included 24 proteins of unknown function with no annotated PfamA domains. All cDNAs were amplified from a primary cDNA library, cloned into pDONR221 and sequence confirmed prior to transfer to pDEST-N112-MBP (Figure 1) for expression as N-terminal H10-MBP fusions. In some cases primers were designed to clone protein fragments in order to express particular PfamA domains or to minimise the molecular weight or the numbers of low complexity (LC) regions or contiguous hydrophobic amino acids (hp_aa). For proteins with no PfamA annotations, such as BC031407, SMART sequence analysis [44] was performed to identify the low complexity regions of the protein and truncations were designed accordingly. Protein expression and quantitation of intact soluble fusion protein product were performed as for the N- and C-terminal tag comparison study. The total and soluble expression levels (mg of protein per litre culture) are listed in the last column of Table 4 together with selected protein features. 63 of the 95 proteins yielded soluble expression levels of greater than 1 mg/l, and the average molecular weight, number of LC regions and hp_aa for these proteins were 24.4 kDa, 0.9 and 3.7 respectively. For the 32 proteins that failed to give soluble product of the correct size, the average molecular weight, number of LC regions and hp_aa were 37.1 kDa, 1.8 and 4.5 respectively.

Correlation between protein properties and solubility
To guide future expression strategies for new proteins, particularly regarding the choice of expressing a full-length protein in a bacterial or eukaryotic system and also where to truncate multi-domain containing proteins, it is interesting to investigate if the proteins expressed in a soluble form in this study share any common properties. Recently Goh et al. [45] used data generated by a structural genomics consortium to examine the ability of proteins to progress from cloning to expression and purification to crystallisation. The data used was very large, consisting of 27,000 targets from over 120 organisms and a number of important features were inferred that correlated with success including percentage composition of charged residues, occurrence of hydrophobic patches and length. Although a large study, there was a problem with interpretation of all the data-sets as it was unclear whether targets were simply waiting in the pipeline or had failed. Also structural genomics targets are often initially biased in favour of easy to express proteins, not representative of the whole proteomes of these organisms.
The present study, focused on mammalian proteins from several diverse families, examined the relationship between successful soluble expression with various protein properties. Several protein features were identified in this study to correlate with soluble expression, which had not previously been shown experimentally. For both the N and C-terminal tag expression studies it was observed that the presence of several features did not correlate with successful expression including protein pI, grand average of hydropathicity index (GRAVY) [43], sub-cellular location, the cysteine content as a percentage of the total number of amino acids and the number of coiled-coils. Protein pI has been linked to sub-cellular location [46] with a bimodal distribution observed in bacterial and archaeal genomes and trimodal pattern in eukaryotes. Proteins are thought to be less soluble at a pH environment near their pI. GRAVY simply calculates overall hydrophobicity of the linear polypeptide sequence with increasing positive score indicating greater hydrophobicity, but no account is taken of the way the protein folds in three dimensions or the percentage of residues buried in the hydrophobic core of the protein. In a recent study Luan et al. [47] tested the soluble expression of 10,167 full-length C. elegans ORFs and found that protein hydrophobicity was an important factor for an ORF to yield a soluble expression product. This different result may be attributable to the fact that the C. elegans study included a greater proportion of membrane proteins. Therefore the lack of correlation between GRAVY score and soluble expression we observed may be true for non-membrane proteins or for proteins where the trans-membrane domain has been deleted.
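As a concrete illustration of the point that GRAVY ignores three-dimensional structure, the score is simply the mean Kyte-Doolittle hydropathy value over all residues in the sequence. The sketch below is a minimal stand-alone implementation; the function name and the choice to reject non-standard residues are assumptions, and it is not the software used in this study.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Return the grand average of hydropathicity for a protein sequence.

    The score is the sum of per-residue hydropathy values divided by the
    sequence length; positive values indicate a more hydrophobic sequence.
    Assumption: non-standard residue letters raise a KeyError.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

# Two toy peptides (hypothetical, for illustration only).
print(gravy("AILV"))   # strongly hydrophobic peptide, positive score
print(gravy("DEKR"))   # strongly charged peptide, negative score
```

Because the calculation is a plain average over the linear sequence, two proteins with the same composition but very different folds receive the same score.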
There was a strong correlation between successful soluble expression and molecular weight of the protein. Small proteins with an average molecular weight of 22.8 kDa did not need to be fused with solubility enhancing proteins for soluble expression, whereas proteins that required fusion with N-terminal MBP or Trx for soluble expression had an average molecular weight of 40.4 kDa, and those where the addition of a N-terminal fusion could not rescue soluble expression had an average size of 51.4 kDa. The same pattern also emerged in the C-terminal fusion study. The decreasing probability of successful soluble expression of mammalian proteins with increasing molecular weight is likely due to increasing protein complexity, perhaps requiring specialised eukaryotic chaperones for folding or stabilising binding partners. The majority of proteins solubly expressed in this study contained single domains and, as fusion proteins, were either capable of self-folding or were folded with the aid of prokaryotic chaperones. Braun et al. found a similar relationship with their set of 32 human proteins with 4 different N-terminal fusions [30].
An inverse correlation was observed in this study between the number of contiguous hydrophobic amino acids (hp_aa; AILFWV) and soluble expression. This ranged from an average of 3.8 hp_aa for those proteins not requiring a N-terminal fusion for high level soluble expression to 5 hp_aa for proteins requiring a N-terminal fusion for successful expression and 5.6 hp_aa where expression failed under the conditions described here. This pattern was repeated in the C-terminal fusion study, where proteins that expressed well had an average of 3.1 hp_aa whereas proteins that expressed poorly had an average of 5 hp_aa. In a study of the sequences of 2753 non-membrane proteins it was found that sequences of three or more consecutive hydrophobic residues are suppressed [48]. Low complexity regions are regions of a protein of biased composition containing a small number of amino acids [33] and can have a disordered structure important for protein function [49]. Here we found that the greater the number of low complexity regions contained within the target protein, the less likely soluble expression would be achieved. This was true for both the N- and C-terminal fusion protein studies, with 0.6-1 low complexity regions for proteins easy to express in a soluble form and 2.4-3 low complexity regions for proteins difficult to express. Low complexity regions are less common in bacterial proteins and these may be targets for proteolytic degradation in vivo.
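The contiguous hydrophobic residue feature can be computed directly from sequence. The sketch below assumes that hp_aa means the length of the longest uninterrupted stretch of residues drawn from the set AILFWV; the text gives the residue set but not the exact counting rule, so this interpretation and the function name are assumptions.

```python
HYDROPHOBIC = set("AILFWV")  # residue set quoted in the text

def longest_hydrophobic_run(sequence):
    """Length of the longest uninterrupted run of AILFWV residues.

    Assumption: hp_aa is taken to be the longest contiguous stretch;
    the counting rule used in the study itself is not stated.
    """
    longest = current = 0
    for aa in sequence.upper():
        if aa in HYDROPHOBIC:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other residue breaks the run
    return longest

# Toy sequences (hypothetical, for illustration only).
print(longest_hydrophobic_run("MKAILFWVDE"))
print(longest_hydrophobic_run("GAVGLLLG"))
```

A sliding scan like this is linear in sequence length, so it is cheap to apply to every candidate truncation of a multi-domain protein.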
Some interesting conclusions were drawn when soluble expression was measured for an additional set of 95 mammalian proteins expressed as H10-MBP fusions (Table 4).
In several cases (ELF1, Fli1, Ldb1, BC031407, Nfkb1 and RelA-p65) truncating the proteins to minimise the molecular weight and the numbers of low complexity regions and contiguous hydrophobic amino acids made the difference between failed expression and good soluble protein expression. For proteins such as BC031407, with no annotated PfamA domains, it was found that truncating at low complexity regions was a good method to identify a fragment that could express in a soluble form of the correct size (protein 81). Although we found that successful soluble expression of the 95 protein set correlated with lower molecular weight, number of low complexity regions and contiguous hydrophobic amino acids compared with proteins that failed to express solubly with the correct size, validating our earlier conclusions, there were some exceptions. For example Elf1 and Gata1 both expressed well despite having 4 and 6 low complexity regions respectively and molecular weights of 66 and 42.5 kDa, whereas some smaller proteins such as the PDZ domains of Dlgh3 and Grip1 failed to express. It may be that there are additional protein features, such as the ability to form a stabilising interaction with a binding partner, that are also important for soluble expression. Also ensuring correct protein domain boundaries may be important since the annotated Pfam domain boundaries, based on sequence alignment, do not always match the structural or folding domain boundaries.

Protein fusions that enhance protein solubility
There have been three recent comparative studies in which sets of proteins were cloned into several expression vectors and the effects of the fusion partner on total and soluble expression yield were examined. Hammarstrom et al. [29] cloned 27 human proteins (MW < 20 kDa) into various expression vectors and ranked the tags' ability to promote soluble expression as Trx ~ MBP ~ Gb1 > ZZ > NusA > GST > His6. Another study ranked tags in terms of increased expression and yield after purification as GST ~ MBP > CBP > His6 when comparing the expression of 32 human proteins where the molecular weight varied from 17-110 kDa [30]. Here GST was preferred because of the weak affinity between MBP and amylose resin. In a third study of 40 different proteins (10 mammalian, 3 plant and 2 insect) with 8 different tags, MBP gave the best overall results in terms of total and soluble expression [31]. However, these studies used different combinations of promoter and fusion partner, so it was unclear whether the observed effect was due purely to expression with the fusion partner or to variable rates of transcript synthesis that would also affect translation rates.
In this study it was found that, on average, N-terminal fusion partners are preferable for optimal protein expression. When proteins are expressed with their native N-terminus, as in our C-terminal fusion proteins, total expression levels can be more variable than when expressed with a constant N-terminal tag. This may be because of variable RNA secondary structures in the region around the start codon, which could interfere with ribosome binding. An additional explanation is that during translation the protein of interest emerges from the ribosome first and initiates an incorrect, irreversible folding pathway before the soluble fusion partner has been translated and folded. The mis-folded protein would then be targeted for proteolytic degradation by the host cell, resulting in lower total expression levels. This scenario is more likely when expressing mammalian proteins in a bacterial system, which lacks specific eukaryotic chaperone proteins. It has been shown previously that proteins prone to mis-folding and aggregation can arrest GFP folding when GFP is fused at the C-terminus [17]. However, when the soluble protein is fused at the N-terminus, it would be translated first and perhaps increase the solubility of the folding intermediates of the downstream protein domain, increasing their half-lives prior to irreversible aggregation. This would allow greater reversibility in the individual steps along the folding pathway and increase the probability that the protein would eventually reach the lowest free energy native conformation.
It was found that Trx and MBP were the best N-terminal protein fusions to promote protein solubility. The best C-terminal fusion to promote protein solubility was MBP, and this may be acting as a true intra-molecular chaperone [50], able to promote folding of the N-terminal protein fusion. The mechanism could be due to direct binding to folding intermediates [51], allowing stabilisation prior to correct folding and inhibition of aggregate formation. The observation that MBP was effective at enhancing soluble expression when fused at the C-terminus, in contrast to thioredoxin, suggests that MBP can actually reverse the process of incorrect folding that would have started prior to the translation of the downstream MBP. This property was not observed for thioredoxin when fused to the C-terminus, suggesting either that, in three dimensions, different proximal faces of the fusion partners have different solubility enhancing properties or that thioredoxin does not possess any chaperone properties and acts only as a solubility enhancer. Alternatively, the folding of thioredoxin may be more prone to inhibition than that of MBP. There are also examples where MBP fusions can form soluble inclusion bodies [52,53], and this cannot be ruled out as a possibility here, although there are several examples where MBP fusion proteins are fully functionally active [50,52,54,55].
It must be stressed here that although protein solubility is a useful indicator of correct folding, additional measurements need to be performed to give supporting evidence for correct folding. These may include removing the protein fusion with a protease and analysis of the cleaved protein of interest by a variety of biophysical and functional assays such as analysis of monodispersity by light scattering [52], NMR [56,57], CD spectropolarimetry, bis-ANS binding [53], ligand binding or enzymatic activity. In this study a protease cleavage site was not included in the vector constructs because the main use of the proteins generated in our laboratory will be in high-throughput antibody production where the cleavage of the fusion partner is unnecessary.
GFP did not significantly enhance soluble protein expression when fused to the C-terminus of the proteins in this study, supporting the use of this tag as an indicator of soluble protein expression of fused ORFs [17,41]. The observation that the V5-His6 tag resulted in a higher average soluble expression level than the His10 tag (1.7 compared with 0.3 mg/l) indicates that the identity of the peptide tag can also affect overall solubility of expressed proteins.

Conclusions
What guidelines have emerged from this study in developing a strategy for the production of soluble mammalian proteins in E. coli? If the protein has a molecular weight of less than 30 kDa and contains one or fewer low complexity regions and fewer than 4 contiguous hydrophobic amino acids, expression of the full-length protein in E. coli should give good levels of soluble protein. As a generic strategy we would recommend expressing the protein with a fusion partner; we found MBP and Trx to be the best fusions to enhance protein solubility as N-terminal tags, with MBP being superior as a C-terminal fusion. C-terminal fusions are desirable for proteins such as the P450s where N-terminal tags can inhibit functional activity. When fused to an optimal fusion partner, nuclear, cytoplasmic and extra-cellular domains were equally likely to be expressed solubly. For larger proteins over 50 kDa, truncations should be considered to express specific protein domains and to minimise the molecular weight, number of low complexity regions and contiguous hydrophobic amino acids. In conclusion, this study will help enable a systematic expansion in the number of mammalian proteins and domains that can be successfully expressed in E. coli as soluble product, and also predict which are best targeted for a eukaryotic expression system.
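The guidelines above can be written down as a simple rule of thumb. The sketch below is one possible encoding, not a validated predictor: the function name, the returned labels and the handling of proteins between 30 and 50 kDa (routed to the fusion strategy) are assumptions, and the thresholds are the approximate cut-offs quoted in the text.

```python
def suggest_strategy(mw_kda, low_complexity_regions, hp_aa):
    """Suggest an E. coli expression strategy from three protein features.

    Thresholds follow the guidelines in the text:
    - under 30 kDa, at most 1 low complexity region and fewer than
      4 contiguous hydrophobic residues: full-length expression is
      expected to give soluble protein;
    - over 50 kDa: truncation to individual domains should be considered;
    - otherwise (assumption): use a solubility enhancing fusion partner.
    """
    if mw_kda < 30 and low_complexity_regions <= 1 and hp_aa < 4:
        return "express full-length"
    if mw_kda > 50:
        return "consider truncation"
    return "use MBP or Trx fusion"

# Hypothetical feature values for illustration only.
print(suggest_strategy(22.8, 1, 3))
print(suggest_strategy(40.4, 2, 5))
print(suggest_strategy(66.0, 4, 5))
```

As the exceptions noted in the discussion show, a rule like this can only rank candidates for screening; it does not replace an expression test, and a fusion partner may still be used in every case as the generic strategy.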

Materials
Oligonucleotides were synthesised by Qiagen-Operon (Cologne, Germany) or Sigma-Genosys (Haverhill, UK). All restriction enzymes were from New England Biolabs (Hitchin, UK). The vectors pET-DEST42, pDEST17 and pDONR201 and E. coli DB3.1 and BL21(DE3)Star pLysS, Gateway BP and LR clonase enzyme mix, pre-cast 4-12 % NuPAGE Bis-Tris gels and PVDF membranes (0.45 µm pore size) were all from Invitrogen (Paisley, UK). Entry plasmids in both open (minus stop codon) or closed format (plus stop codon) containing the full-length genes for GRB2, HRAS, JUN, FOS, MAD, MAX, CDK2, CDK4, CDKN1B, CASP2, MMP1, CDKN2A and CD44 were provided by Pascal Braun and Josh LaBaer (Harvard Institute of Proteomics, Cambridge, USA). A full length clone containing the full-length human EGFR ORF was provided by the RIKEN BioResource Center (Tsukuba, Japan) and Efna1 from the Mammalian Gene Collection (MGC) archived at the Wellcome Trust Sanger Institute (Hinxton, UK). First strand synthesis human and mouse cDNA was from BD Biosciences (Oxford, UK). Plasmid, gel extraction and PCR purification kits and 6xHis protein ladder were purchased from Qiagen (Crawley, UK). The expression strain BL21(DE3), BugBuster protein extraction reagent and His tag monoclonal antibody was from Merck Biosciences (Nottingham, UK). The 96-well multiscreen-DV durapore filter plate with 0.65 µm pore size was from Millipore (Watford, UK) and Cy5-labelled goat antimouse IgG from Amersham Biosciences (Little Chalfont, UK). Europium labelled antibodies and DELFIA reagents were from Perkin Elmer (Beaconsfield, UK) and all other chemicals unless otherwise stated were from Sigma-Aldrich (Gillingham, UK).

N-Terminal fusion GATEWAY destination vector construction
To prepare pET-DEST42-MCS, a multi-cloning site encoding the recognition sequences for NdeI, KpnI, DraIII and BfrBI was inserted into pET-DEST42 (Invitrogen) at nt 396, between the Shine-Dalgarno sequence and the attR1 recombination site. Inverse or whole plasmid PCR was performed on pET-DEST42 with the 5'-phosphorylated, PAGE-purified primer pair 20 (5' TACCCACGAAGTGATGCATACAAGTTTGTACAAAAAAGCTGAACG 3') and 21 (5' CCCATATGTATATCTCCTTCTTAAAGTTAAACAAAATTATTTCTAGAG 3') in a 20 µl reaction containing 10 ng [...] sion vectors the forward and reverse primers were 5' AAAAAGCAGGCTTCGAAGGAGATAGAACCATGG 3' and 5' AGAAAGCTGGGTT 3' respectively, with the forward primer encoding the Shine-Dalgarno and Kozak sequences and the start codon. PCR 2 products were analysed by 1 % TBE-agarose electrophoresis [58] and correct size fragments were then subjected to an adapter PCR step to complete the flanking attB1 and attB2 sites. This consisted of a PCR reaction as described above using 1 µl of a 50-fold dilution of the PCR 2 reaction in a total volume of 20 µl and primer pair 113 (5' GGGGACAAGTTTGTACAAAAAAGCAGGCT 3') and 114 (5' GGGGACCACTTTGTACAAGAAAGCTGGGT 3'), except that the annealing temperature was 45°C, only 12 cycles were used and the extension time was 2 mins. The products of the adapter PCR were purified with a 96-well PCR clean-up kit (Qiagen), eluted in 100 µl 10 mM Tris-HCl (pH 8.5) and had an average concentration of 40 ng/µl. Recombinational cloning of attB-flanked PCR products with an attP-containing pDONR vector to generate a set of entry plasmids was as described previously [35], except that pDONR221 (Invitrogen) was used. The ORFs within sequence-confirmed, attL-containing entry plasmids were then recombined with the various attR destination vectors described above to generate sets of expression plasmids. The LR recombination reactions [35] were used to transform E. coli DH5α cells, miniprep plasmid DNA was prepared and this was used to transform the various BL21(DE3) expression strains used in this study.
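The adapter PCR step relies on the 3' end of each adapter primer matching the 5' tail carried by the corresponding PCR 2 primer. As a hedged illustration, the sketch below measures that overlap for the primer sequences quoted above (written without line-break hyphens). The function name is hypothetical and the check is a plain string comparison, not a model of primer annealing.

```python
def tail_overlap(adapter, primer):
    """Length of the longest 3' suffix of `adapter` that is also a 5' prefix of `primer`."""
    for n in range(min(len(adapter), len(primer)), 0, -1):
        if adapter.endswith(primer[:n]):
            return n
    return 0


# Sequences as quoted in the text (5' to 3')
PCR2_FORWARD = "AAAAAGCAGGCTTCGAAGGAGATAGAACCATGG"
PCR2_REVERSE = "AGAAAGCTGGGTT"
ADAPTER_113 = "GGGGACAAGTTTGTACAAAAAAGCAGGCT"  # completes attB1
ADAPTER_114 = "GGGGACCACTTTGTACAAGAAAGCTGGGT"  # completes attB2

if __name__ == "__main__":
    print("attB1 overlap (nt):", tail_overlap(ADAPTER_113, PCR2_FORWARD))
    print("attB2 overlap (nt):", tail_overlap(ADAPTER_114, PCR2_REVERSE))
```

Both adapter primers share a 12-nucleotide overlap with the tails of the PCR 2 primers, which would be consistent with the lower annealing temperature (45°C) used for the adapter step.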

Expression screening and quantitation
All BL21(DE3) transformants were selected and propagated in the presence of 100 µg/ml ampicillin. A single antibiotic-resistant colony was used to inoculate 0.5 ml 2xYT media in a 96-deep well block containing the appropriate antibiotics and shaken at 210 rpm at 37°C. When the average OD600 had reached 1 (3 hrs for BL21(DE3)), 60 µl was transferred to 1.2 ml 2xYT media in a 96-deep well block containing the appropriate antibiotics and placed on a shaking incubator at 37°C. When the OD600 reached 0.5 (2 hrs for BL21(DE3)), IPTG was added to a final concentration of 1 mM and shaking was continued at 25°C for 12 hours. Total protein was analysed by transferring a 20 µl aliquot of the induced culture to a 96-well PCR plate containing 20 µl of 2 × NuPAGE LDS loading buffer (Invitrogen), 0.1 M DTT, heating to 95°C for 10 mins and cooling on ice prior to loading 10 µl onto a 17-well 4-12 % NuPAGE Bis-Tris gel with a multi-channel gel loading syringe (Hamilton). Soluble protein was extracted by transferring 290 µl of induced culture to a shallow well plate and centrifuging at 3000 g for 5 mins; the supernatant was removed and the cells were resuspended in 58 µl BugBuster containing 1.4 units of benzonase and 58 units of recombinant lysozyme (Novagen). For the C-terminal tag and expression strain comparison this buffer was also supplemented with 0.58 µl protease inhibitor cocktail set III 10-fold diluted in DMSO (Novagen). The cell pellets were resuspended with a multi-channel pipette and incubated with slow shaking for 20 mins at room temperature prior to transfer to 96-well multiscreen-DV durapore filter plates with 0.65 µm pore size (Millipore). The filter plate was placed on top of a shallow 96-well plate and centrifuged at 1000 g for 2 mins. 4 µl of the filtrate was then added to a 96-well plate containing 5 µl of 4 × NuPAGE LDS loading buffer (Invitrogen) and 11 µl of 182 mM DTT, and the plate was heated at 95°C for 5 mins and loaded onto a 17-well 4-12 % NuPAGE Bis-Tris gel.
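The volumes quoted for the soluble-fraction samples can be checked with simple dilution arithmetic. The sketch below is illustrative only: the helper name is hypothetical, and the concentration factor ignores the volume of the cell pellet itself.

```python
def final_concentration(stock, stock_volume_ul, total_volume_ul):
    """Concentration after diluting `stock_volume_ul` of `stock` into `total_volume_ul`."""
    return stock * stock_volume_ul / total_volume_ul


# Soluble-fraction gel sample: 4 µl filtrate + 5 µl 4x LDS buffer + 11 µl 182 mM DTT
total_ul = 4 + 5 + 11
lds_final_x = final_concentration(4, 5, total_ul)      # LDS buffer strength (x)
dtt_final_mm = final_concentration(182, 11, total_ul)  # DTT in mM

# Cells from 290 µl of culture are resuspended in 58 µl of lysis reagent
lysate_concentration_factor = 290 / 58

if __name__ == "__main__":
    print(f"Sample volume: {total_ul} µl")
    print(f"LDS buffer: {lds_final_x:.1f}x, DTT: {dtt_final_mm:.1f} mM")
    print(f"Lysate is about {lysate_concentration_factor:.0f}-fold concentrated relative to culture")
```

On these figures the soluble sample ends up at 1 × loading buffer and roughly 100 mM DTT, in line with the 0.1 M DTT quoted for the total-protein samples.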
A His-tagged molecular weight ladder (Qiagen) was also loaded onto each gel. Gel electrophoresis and electro-transfer to PVDF membrane were as described [58]. Blots were blocked with 3 % Marvel milk powder in PBS-Tween (PBS with 0.1 % Tween) for either 1 hour at room temperature or over-night at room temperature, washed with PBS-Tween and incubated with 40 ng/ml anti-His5 tag monoclonal antibody (Novagen) in 3 % Marvel, PBS-Tween for 1 hr. They were then washed 3 × with PBS-Tween, incubated with 1 µg/ml Cy5-labelled goat anti-mouse IgG in 3 % Marvel, PBS-Tween for 1 hr, washed 3 × with PBS-Tween and 2 × with PBS, and dried at 37°C for 10 mins between blotting paper. The blots were scanned on a Typhoon 8600 variable mode imager (Amersham) in fluorescence scan mode with a 633 nm excitation laser, 670 nm emission filter, 600 V PMT and 200 µm/pixel scan resolution. The integrated fluorescence intensity volumes of bands on the blot were quantitated using ImageQuant TL software (Amersham). Conversions to protein yield were made using a calibration curve of purified His-tagged single chain antibody (scFv). Differences between the molecular weight (MW) of the scFv (31 kDa) and each expressed fusion protein were taken into account by multiplying each protein quantitation by the ratio MW construct (kDa) / 31. The numbers were normalised to eliminate blot-to-blot variation using the His-tagged molecular weight ladder (Qiagen).
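The molecular weight correction described above is a single multiplication, sketched below. The function names are hypothetical. The ladder normalisation is shown as a simple ratio to a reference ladder signal, which is an assumption, since the text does not state the exact normalisation procedure or the form of the calibration curve.

```python
SCFV_MW_KDA = 31.0  # molecular weight of the scFv calibration standard (from the text)


def mw_corrected_quantity(scfv_equivalent, construct_mw_kda):
    """Scale a quantity read off the scFv calibration curve by MW construct / 31."""
    return scfv_equivalent * construct_mw_kda / SCFV_MW_KDA


def normalise_to_ladder(value, ladder_signal, reference_ladder_signal):
    """Assumed normalisation: rescale by the ratio of a reference ladder signal
    to the ladder signal measured on the same blot as `value`."""
    return value * reference_ladder_signal / ladder_signal


if __name__ == "__main__":
    # Hypothetical example: a 62 kDa fusion reading 10 units on the scFv curve,
    # on a blot whose ladder signal is twice that of the reference blot.
    corrected = mw_corrected_quantity(10.0, 62.0)
    print(normalise_to_ladder(corrected, ladder_signal=2.0, reference_ladder_signal=1.0))
```

The correction reflects that the anti-His signal scales with the number of tagged molecules, so a construct heavier than the scFv corresponds to proportionally more protein mass per unit of signal.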
Sanger Institute, UK) for sharing their cDNA isolation protocol. This work was supported by The Wellcome Trust.