Integrated application of uniform design and least-squares support vector machines to transfection optimization

Background Liposome-based transfection of mammalian cells presents a great challenge for biologists. Because mammalian cells protect themselves from exogenous insults, they tend to show poor transfection efficiency. To achieve high efficiency, several transfection conditions have to be optimized, such as the amount of liposome, the amount of plasmid, and the cell density at transfection. However, this process can be time-consuming and labor-intensive. Fortunately, several mathematical methods developed in the past decades may help resolve this issue. This study investigates the possibility of optimizing transfection efficiency with the least-squares support vector machine, a method that requires only a few experiments while maintaining fairly high accuracy. Results A protocol consisting of 15 experiments was performed according to the principle of uniform design. In this protocol, the amount of liposome, the amount of plasmid, and the number of cells seeded 24 h before transfection were set as independent variables, and transfection efficiency was set as the dependent variable. A model was derived from the independent variables and their respective dependent variable. Another protocol of 10 experiments was performed to test the accuracy of the model. The model showed high accuracy. Compared with the traditional method, the integrated application of uniform design and least-squares support vector machine greatly reduced the number of required experiments. Moreover, higher transfection efficiency was achieved. Conclusion The integrated application of uniform design and least-squares support vector machine is a simple technique for obtaining high transfection efficiency. With this novel method, the number of required experiments is greatly reduced while higher efficiency is obtained. The least-squares support vector machine may be applicable to many other optimization problems.


Background
Protein expression is central to life functions, and characterizing it in normal and diseased states is essential for quantifying altered patterns of gene expression. This is especially true now that the sequencing of the human genome has been completed. To gain insights into protein expression, cells have to be transfected with various expression vectors based on plasmids, viral vectors, transposons, and so on. Transfection is one of the most common and indispensable procedures in cell biology. However, in the course of evolution, eukaryotic cells have come to show low transfection efficiency, which protects their genomes from exogenous insults. The difficulty is especially evident in the cotransfection of mammalian cells. Theoretically, if the transfection efficiency for a single kind of plasmid is E, which ranges from 0 to 1, the efficiency of double and triple cotransfection may decline to E² and E³, respectively. Therefore, it is of great importance to improve efficiency.
To enhance transfection, several strategies have been developed, which fall into two categories: viral and non-viral gene delivery carriers. Among non-viral carriers, cationic liposomes have the widest application. Cationic liposomes are positively charged liposomes that interact with negatively charged DNA molecules to form a stable complex. They consist of a positively charged lipid and a co-lipid. A variety of positively charged lipid formulations are commercially available and many others are under development. Lipofection, one of the most frequently cited cationic-lipid methods, was first reported by Felgner in 1987 to deliver genes to cells in culture [1]. Lipofection has been used to deliver linear DNA, plasmid DNA, and RNA to a variety of cells in culture. Liposomes offer several advantages in delivering genes to cells. (1) Liposomes can combine with both negatively and positively charged molecules. (2) Liposomes offer the DNA a degree of protection from degradative processes. (3) Liposomes can carry large pieces of DNA, potentially as large as a chromosome. (4) Liposomes can be targeted to specific cells or tissues. In addition, liposomes avoid problems inherent in viral vectors, specifically concerns over immunogenicity and contamination with replication-competent virus. Liposomes thus provide a highly adaptable and flexible system capable of gene delivery both in vitro and in vivo. Current limitations on the in vivo application of liposomes revolve around low transfection efficiency and transient gene expression. Liposomes also display a small degree of cellular toxicity and appear to be inhibited by serum components. Overcoming these problems should greatly facilitate their application to a variety of gene delivery mechanisms.
Several factors significantly affect the transfection efficiency of cationic liposomes, such as the vigor of the host cells, the amount of plasmid, the amount of transfection agent, and the cell density. However, the vigor of the host cells is hard to control because it has no quantitative index. The other three factors are controllable and can be adjusted according to the host cells and the transfection agent. Adjusting these three factors is nevertheless time-consuming and laborious: most researchers may spend two to three months optimizing transfection. Fortunately, several mathematical methods offer promising avenues for resolving this issue.
There are several ways to design computer experiments, such as Latin Hypercube Sampling (LHS) and Uniform Design (UD). LHS was proposed by three researchers in North America [2]. UD was first developed by Fang et al. in 1980 [3] and has been popular ever since. UD seeks design points that are scattered uniformly over the domain. Its main advantages can be summarized as follows: first, it greatly reduces the number of experiments without sacrificing representativeness; second, it yields a regression model from the results, which can predict the values of the independent variables at which the dependent variable reaches its maximum.
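As an illustration of how a uniform design table can be generated, the following Python sketch builds a 15-run, 3-factor table by the good lattice point construction and keeps the generator set with the lowest centered L2 discrepancy. The construction method, the discrepancy criterion, and all function names are assumptions made for this example; the text does not state how the table used in this study was obtained.

```python
from itertools import combinations
from math import gcd


def glp_design(n, generators):
    """Good-lattice-point table: entry (i, j) is (i * h_j) mod n, with 0 written as n."""
    return [[(i * h) % n or n for h in generators] for i in range(1, n + 1)]


def centered_l2_discrepancy(design, n_levels):
    """Squared centered L2 discrepancy of a table of levels 1..n_levels."""
    # Map level u to the midpoint (u - 0.5) / n_levels of its cell in [0, 1].
    pts = [[(u - 0.5) / n_levels for u in row] for row in design]
    n, s = len(pts), len(pts[0])
    term1 = (13.0 / 12.0) ** s
    term2 = 0.0
    for x in pts:
        prod = 1.0
        for xk in x:
            d = abs(xk - 0.5)
            prod *= 1.0 + 0.5 * d - 0.5 * d * d
        term2 += prod
    term3 = 0.0
    for x in pts:
        for z in pts:
            prod = 1.0
            for xk, zk in zip(x, z):
                prod *= (1.0 + 0.5 * abs(xk - 0.5) + 0.5 * abs(zk - 0.5)
                         - 0.5 * abs(xk - zk))
            term3 += prod
    return term1 - 2.0 * term2 / n + term3 / (n * n)


def best_uniform_design(n, s):
    """Try every set of s generators coprime to n; keep the most uniform table."""
    candidates = [h for h in range(1, n) if gcd(h, n) == 1]
    best = None
    for gens in combinations(candidates, s):
        table = glp_design(n, gens)
        cd2 = centered_l2_discrepancy(table, n)
        if best is None or cd2 < best[0]:
            best = (cd2, gens, table)
    return best


# 15 runs, 3 factors (cells, plasmid, liposome), 15 levels per factor.
cd2, generators, table = best_uniform_design(15, 3)
for run, levels in enumerate(table, start=1):
    print(run, levels)
```

Each column of the resulting table is a permutation of the levels 1 to 15, so every level of every factor is tested exactly once. In this construction the last run always takes the top level of all factors.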
Support vector machine (SVM), a relatively new algorithm for classification and regression, was developed in the 1990s [4,5]. It is well suited to estimation from finite samples and can therefore solve many practical problems in which samples are limited. Its practical success may be attributed to solid theoretical foundations based on Vapnik-Chervonenkis theory and to the minimization of structural risk [6]. To apply SVM to our transfection optimization, we used the least-squares support vector machine (LS-SVM), which is increasingly popular for regression problems [7]. It can be argued that LS-SVM yields better generalization for regression problems on finite samples [8].

Results and discussion
As shown in Table 1 and Figure 1, transfection efficiency varied greatly with the amount of plasmid, the amount of LipofectAMINE, and the number of seeded cells. If these three independent variables were not matched, transfection efficiency declined sharply. In Table 1 and Figure 1, experiment L had the lowest efficiency (13.49%) because the ratio of plasmid to transfection agent was too low, while experiment K had the highest efficiency owing to the well-matched ratios among the three independent factors. According to the established model, transfection efficiency would reach its maximum if 2.1 × 10⁵ cells, 0.66 μg of plasmid and 1.32 μg of LipofectAMINE were used.
This was in accordance with the observed data (Table 2 and Figure 2). Moreover, the measured transfection efficiencies coincided closely with the values derived from the model (Figure 3). Thus, by virtue of UD and LS-SVM, only 15 experiments, which can be performed in two 24-well plates, are needed to find the optimal transfection conditions, whereas more than 15³ experiments (three factors at 15 levels each give 15³ = 3,375 combinations) are needed to attain the same purpose by the traditional method. Moreover, if more accurate conditions were demanded, the number of experiments would greatly exceed 15³.
The LS-SVM training parameters were tuned by grid search. The most common performance assessment methods are probably k-fold cross-validation [9] and the leave-one-out procedure. In k-fold cross-validation, the training data are randomly split into k mutually exclusive subsets (the folds) of approximately equal size. An LS-SVM model is trained on k-1 subsets and then tested on the remaining subset. This procedure is repeated k times so that each subset is used for testing exactly once. Averaging the test errors over the k trials gives an estimate of the expected generalization error. The leave-one-out procedure can be viewed as an extreme form of k-fold cross-validation in which k equals the number of examples. Leave-one-out is known as an almost unbiased estimation method for small-sample problems such as ours. Therefore, for each pair of parameters, training and testing were repeated fifteen times by the leave-one-out procedure, and the resulting MSE was reported. Part of the results is listed in Table 3. As shown in Table 3, the minimum MSE was found at the parameter pair (γ = 42, C = 1500), at which the LS-SVM model reaches its peak estimated performance. Once the optimal parameters for model construction are known, the corresponding model (final model) is validated by predicting the validation data and comparing these predictions with the real observations.
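The leave-one-out grid search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the kernel is assumed to be exp(-γ‖x - z‖²), C is assumed to weight the squared errors so that the LS-SVM linear system contains K + I/C, and all function names are hypothetical.

```python
from math import exp


def rbf(x, z, gamma):
    """Assumed kernel form: exp(-gamma * ||x - z||^2)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def solve(A, rhs):
    """Solve A v = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        v[r] = (M[r][n] - sum(M[r][c] * v[c] for c in range(r + 1, n))) / M[r][r]
    return v


def lssvm_fit(X, y, gamma, C):
    """Solve [[0, 1^T], [1, K + I/C]] [b, alpha]^T = [0, y]^T for b and alpha."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf(X[i], X[j], gamma) for j in range(n)]
        row[i + 1] += 1.0 / C
        A.append(row)
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]


def lssvm_predict(x, X, b, alpha, gamma):
    """Evaluate f(x) = b + sum_i alpha_i * K(x, x_i)."""
    return b + sum(a * rbf(x, xi, gamma) for a, xi in zip(alpha, X))


def loo_mse(X, y, gamma, C):
    """Leave-one-out: train on n - 1 samples, test on the held-out one, n times."""
    errors = []
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        b, alpha = lssvm_fit(X_tr, y_tr, gamma, C)
        errors.append((y[i] - lssvm_predict(X[i], X_tr, b, alpha, gamma)) ** 2)
    return sum(errors) / len(errors)


def grid_search(X, y, gammas, Cs):
    """Return (mse, gamma, C) for the grid point with the lowest leave-one-out MSE."""
    return min((loo_mse(X, y, g, C), g, C) for g in gammas for C in Cs)
```

Here X is a list of input vectors and y the list of measured efficiencies; grid_search(X, y, gammas, Cs) returns the lowest leave-one-out MSE together with the parameter pair that attains it. Inputs would normally be rescaled to comparable ranges before training.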
The discrepancies between the predicted values and their respective observed data are listed in Table 4. Table 4 shows that the maximum of the observed data was N7 (92.32%) and that the maximum of the values predicted by LS-SVM was also N7 (84.04%). The error ratio between the observed data and the LS-SVM predictions was less than 10%. Thus, LS-SVM has excellent predictive ability (generalization ability) on our problem. The joint influence of each pair of the three variables on the predicted value is shown in Figures 4, 5 and 6. In each three-dimensional surface, every mesh point in the (x, y)-plane stands for a combination of two variables and the z-axis stands for the predicted value. Figures 4, 5 and 6 show that the change in the LS-SVM predicted value on the ten test samples was consistent with the observed data.
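As a worked check for sample N7, and assuming that the error ratio is the absolute difference divided by the observed value (the text does not define it), the two reported efficiencies give:

```latex
\[
\frac{\lvert y_{\mathrm{obs}} - y_{\mathrm{pred}} \rvert}{y_{\mathrm{obs}}}
= \frac{\lvert 92.32 - 84.04 \rvert}{92.32}
= \frac{8.28}{92.32}
\approx 0.090
\]
```

That is, about 9.0% for N7 under this definition, which is consistent with the stated bound of 10%.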
The contribution of each independent variable was also evaluated. Table 5 shows the average MSE when one specific variable was ignored. As indicated by Table 5 and Figures 7, 8 and 9, the amount of plasmid has the most significant effect on transfection efficiency, followed by the amount of LipofectAMINE, while the density of seeded cells has the least effect. This result agrees with our experience.
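The variable-removal analysis can be sketched as follows. To keep the example short, a Gaussian kernel smoother stands in for the LS-SVM model, and the MSE is assumed to be estimated by leave-one-out, so the sketch shows only the procedure (remove one variable, re-estimate the error), not the numbers in Table 5. The function names and the `width` parameter are hypothetical.

```python
from math import exp


def smoother_predict(x, X_train, y_train, width=1.0):
    """Stand-in regressor (Gaussian kernel smoother), NOT the LS-SVM of this study."""
    # Inputs are assumed to be scaled so that the weights do not all underflow to 0.
    weights = [exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * width ** 2))
               for xi in X_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)


def loo_mse(X, y, predict=smoother_predict):
    """Leave-one-out mean squared error of `predict` on (X, y)."""
    total = 0.0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        total += (y[i] - predict(X[i], X_tr, y_tr)) ** 2
    return total / len(X)


def ablation_mse(X, y, names, predict=smoother_predict):
    """Leave-one-out MSE after removing each input variable in turn."""
    result = {}
    for k, name in enumerate(names):
        X_without = [row[:k] + row[k + 1:] for row in X]
        result[name] = loo_mse(X_without, y, predict)
    return result
```

Under this reading, the variable whose removal raises the MSE the most is the one that contributes the most to the prediction.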
Owing to UD, the number of test points required can be enormously reduced, especially when the experimental region has many factors and multiple levels, while the results still reflect the major characteristics of the experimental system. As an efficient fractional factorial design, UD has been widely applied in manufacturing, system engineering, pharmaceutics, and natural sciences in the past decades [10][11][12]. In this research, UD was used to lay out the factors that significantly influence transfection efficiency in a smaller, more manageable set of experiments. For a computer experiment that needs wide coverage of the entire design region with a limited number of runs, UD is a good choice.
The SVM is a machine learning technique with a strong theoretical foundation that has been used to improve classification accuracy in biological applications [13][14][15][16][17][18][19]. It is a maximum-margin classifier that can solve nonlinear classification problems by learning an optimal separating hyperplane in a higher-dimensional feature space. With non-linear kernel functions such as the Gaussian kernel, the SVM can learn complex, non-linear decision functions. LS-SVM is a reformulation of the standard SVM. It is closely related to regularization networks and Gaussian processes, but it additionally emphasizes and exploits primal-dual interpretations from optimization theory. In our experiment, LS-SVM mapped the original input space into a high-dimensional feature space with a Gaussian kernel and then learned the smoothest hyperplane that fits the training data. From statistical learning theory, this hyperplane can be expected to have excellent generalization ability and few local extrema. Taken together, UD greatly reduces the number of experiments without sacrificing representativeness, and LS-SVM yields better generalization for regression problems on finite samples. Therefore, the integrated application of UD and LS-SVM should have high prediction accuracy and contribute to transfection optimization.

Conclusion
This paper investigates the integrated application of UD and LS-SVM to transfection in order to obtain precise information on the optimal conditions. Based on our experiments, UD and LS-SVM appear to be highly efficient and to perform well even when the experiments carried out are extremely scarce. With the established model, we are able to obtain the optimal transfection conditions and the highest transfection efficiency that can be reached. Thus, the time and the number of experiments required to improve transfection efficiency can be greatly reduced, while the efficiency achieved may even be higher than with traditional methods. LS-SVM seems to predict the optimal transfection conditions more accurately than the highest transfection efficiency; in practice, however, it is the information on the optimal transfection conditions for which we usually have the more stringent requirements. It should be pointed out that the vigor of the host cells and the purity of the plasmid also have a crucial effect on transfection efficiency. However, these factors are uncontrollable in most settings. Further interpretation of the results obtained from other host cell lines is required. These issues are part of our ongoing research.

Cell Culture
The 293FT cell line was maintained in DMEM supplemented with 100 mL/L fetal calf serum, 2 mmol/mL L-glutamine, 100 μg/mL penicillin and 100 units/mL streptomycin. The cells were incubated in a humidified incubator at 37°C containing 50 mL/L CO₂. Cell viability was estimated by the trypan blue dye exclusion method. The 293FT cells were seeded into 24-well plates 24 h prior to transfection. Three wells of cells were transfected for every experiment. The cells were transfected using Lipo-

Experimentations
In a protocol consisting of 15 experiments, the amount of liposome, the amount of plasmid, and the number of seeded cells were set as independent variables, while transfection efficiency was set as the dependent variable. Each independent variable had 15 levels. The ranges of the independent variables were set according to the manufacturer's instructions. The protocol was performed according to the principle of UD (Table 1). Each transfection efficiency (dependent variable) was determined by flow cytometry. The expression of GFP in each experiment was also observed under a fluorescence microscope (Eclipse 80I, Nikon, Tokyo, Japan). A model was constructed using LS-SVM. The fitted value for each measured transfection efficiency was also derived from the established model. Another protocol consisting of 10 experiments was designed around the predicted optimal conditions at which the dependent variable would reach its maximum (Table 2). The observed GFP expression is shown in Figure 1 and Figure 2. All the observed data in Table 1 and Table 2 are the mean values of three independent experiments.

Development of the LS-SVM based models for prediction of transfection efficiency
In the regression formulation, the goal is to estimate an unknown continuous-valued function from a finite set of noisy samples (x_i, y_i), i = 1, ..., n, where the d-dimensional input is x ∈ R^d and the output is y ∈ R. In SVM regression formulations, the input x is first mapped into an m-dimensional feature space by some fixed (nonlinear) mapping, and then a linear model is constructed in this feature space [8]. In mathematical notation, the linear model (in the feature space) f(x, ω) is given by Equation (1), where g_j(x), j = 1, ..., m denotes a set of nonlinear transformations, and b is the "bias" term.
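Equation (1) itself is not reproduced in this text. Under the definitions just given, a linear model in the feature space takes the following standard form, shown here as a reconstruction rather than a quotation:

```latex
\[
f(x, \omega) = \sum_{j=1}^{m} \omega_j \, g_j(x) + b
\]
```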
The quality of estimation is measured by the loss function L(y, f(x, ω)). SVM regression uses a type of loss function called the ε-insensitive loss function, proposed by Vapnik. Compared with the standard SVM, LS-SVM computes the solution by solving a linear system instead of a quadratic programming problem. This is due to the use of equality instead of inequality constraints in the problem formulation. It is well known that LS-SVM generalization performance (estimation accuracy) depends on a good setting of the meta-parameters: the parameter C and the kernel parameters. The main performance metric of LS-SVM is the prediction risk (Equation (4)), defined as the mean square error (MSE) between the values estimated by LS-SVM and the true values for the testing inputs.
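The equations referred to in this paragraph are not reproduced in this text. For orientation, the standard textbook forms are sketched below, assuming that C weights the squared errors and that K is the kernel matrix induced by the feature mapping; the notation and numbering of the original equations may differ.

```latex
% epsilon-insensitive loss used by standard SVM regression
\[
L_{\varepsilon}\bigl(y, f(x,\omega)\bigr)
= \max\bigl(0,\ \lvert y - f(x,\omega)\rvert - \varepsilon\bigr)
\]
% LS-SVM: squared errors with equality constraints
\[
\min_{\omega,\, b,\, e}\ \frac{1}{2}\lVert\omega\rVert^{2}
+ \frac{C}{2}\sum_{i=1}^{n} e_i^{2}
\quad \text{subject to} \quad
y_i = f(x_i,\omega) + e_i,\quad i = 1,\dots,n
\]
% resulting linear system in the bias b and the dual variables alpha
\[
\begin{pmatrix} 0 & \mathbf{1}^{\top} \\ \mathbf{1} & K + C^{-1} I \end{pmatrix}
\begin{pmatrix} b \\ \alpha \end{pmatrix}
=
\begin{pmatrix} 0 \\ y \end{pmatrix},
\qquad
K_{ij} = \sum_{k=1}^{m} g_k(x_i)\, g_k(x_j)
\]
% prediction risk on n_test testing inputs
\[
\mathrm{MSE} = \frac{1}{n_{\mathrm{test}}}
\sum_{i=1}^{n_{\mathrm{test}}} \bigl(y_i - f(x_i,\omega)\bigr)^{2}
\]
```

Because the constraints are equalities, the optimality conditions reduce to the single linear system in b and α shown above, which is why no quadratic programming solver is needed.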
Therefore, to ensure good generalization performance, the main issue in applying LS-SVM is the proper setting of these parameters for a given data set. The choice of a particular kernel type and of the kernel function parameters is usually based on application-domain knowledge and should also reflect the distribution of the input values of the training data [20]. Here, we show an example of SVM regression using radial basis function (RBF) kernels (Equation (5)