This is a web interface to help in the process of building a "good predictor." We have implemented a few strategies that seem to be relatively popular. However, as the name of the application suggests, "This is Not A Substitute for A Statistician".
We hope that, by making available a tool that builds simple yet (if we believe the literature; see below) powerful predictors, AND cross-validates the whole process, we can make people aware of the widespread problem of "selection bias". At the same time, we hope to make it easy for people to see that some data sets are "easy" and some are "hard": whatever method you use, with some data sets you can always do a great job, whereas with others the situation seems hopeless. Finally, Tnasas can be used as a benchmark against some of the (overly) optimistic claims that are occasionally attached to new methods and algorithms. This is a particularly important feature, since many new predictor methods are proposed in the literature without careful comparisons with alternative methods; Tnasas offers a simple, effective way of comparing the performance of newly proposed methods and can, itself, become a benchmarking tool. As a side issue, we hope it will be easy to see that with many data sets it is very hard to do a good predictive job with only a few genes (apparent successes with few genes are often a mirage caused by selection bias).
In addition, we want to make people aware of the problems related to the instability of solutions: we often find many solutions that are equally good from the point of view of prediction error rate, but that share very few, if any, genes.
Tnasas allows you to combine several classification algorithms with different methods for gene selection.
We have included several methods that have been shown to perform very well with microarray data: support vector machines (SVM), k-nearest neighbor (KNN), diagonal linear discriminant analysis (DLDA), random forest, and shrunken centroids (Dudoit et al., 2002; Romualdi et al., 2003; Díaz-Uriarte and Alvarez de Andrés, 2005).
Full explanations of the methods used can be found in the references. What follows is a quick summary:
What genes should we use when building the predictor? Often, researchers would like to build the predictor using only genes that are "relevant" for the prediction. In addition, using only "relevant" genes can lead to "better" predictions ("better" in the sense of, e.g., smaller variance). How should we select those genes?
A popular approach is to first select a set of genes, with some relatively simple, fast method, and then feed these selected genes to our predictor. This is called, in the machine learning literature, the "filter approach" (e.g., Kohavi & John, 1998), because we first "filter" the predictor variables (in our case genes), keep only a subset, and then build the predictor. We provide three ways of ranking genes, and each can be used with any of the above class-prediction algorithms (except shrunken centroids, i.e. PAM, which includes gene selection as part of the algorithm itself).
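As an illustration only (this is not the actual Tnasas code; the function and object names, such as filter.genes, X, y, and n.genes, are our own), the filter approach could be sketched in R as follows: rank every gene with the F-ratio and keep the top-ranked ones before fitting any classifier.

## Illustrative sketch of the filter approach; not the code used internally
## by Tnasas. 'X' is a genes-by-samples matrix, 'y' a factor of class labels.

f.ratio <- function(x, y) {
  ## one-way ANOVA F statistic for a single gene
  anova(lm(x ~ y))[["F value"]][1]
}

filter.genes <- function(X, y, n.genes = 50) {
  scores <- apply(X, 1, f.ratio, y = y)               # rank every gene
  order(scores, decreasing = TRUE)[seq_len(n.genes)]  # indices of the top genes
}

## The selected genes are then fed to the chosen classifier, e.g.:
## sel <- filter.genes(X, y, n.genes = 50)
## fit <- e1071::svm(t(X[sel, ]), y)   # svm() expects samples as rows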
Suppose we select the first 50 genes with the F-ratio, as explained above; then we use, e.g., DLDA, and we report the error rate of our classifier using 10-fold cross-validation. This sounds familiar, but it is actually a bad idea: the error rate we have just estimated can be severely biased downwards.
This is the problem of selection bias, which has been discussed in the microarray literature by Ambroise & McLachlan (2002) and Simon et al. (2003) (the problem has been well known in statistics for a long time). Essentially, the problem is that we use all the subjects to do the filtering, and therefore the cross-validation of only the DLDA, with an already selected set of genes, cannot account for the effect of pre-selecting the genes. As just said, this can lead to severe underestimates of the prediction error (and the references given provide several alarming examples). In addition, it is very easy to obtain "excellent" predictors with completely random data if we do not account for the preselection (we present some numerical examples with KNN and SVM in our R course).
A way to obtain better estimates of the prediction error is to use cross-validation over the whole process of "select-genes-then-build-predictor": leave aside a set of samples, and with the remaining samples do the gene selection and building of the predictor, and then predict the left out samples. This way, the left-out samples do not participate in the selection of the genes, and the "selection bias" problem is avoided.
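A rough sketch of this idea, reusing the filter.genes() helper from above (again, these are our own illustrative names, with KNN standing in for whichever classifier is chosen), would be:

## Cross-validating the whole "select-genes-then-build-predictor" process:
## gene selection is redone inside each fold, so the left-out samples never
## influence which genes are kept.

cv.error <- function(X, y, n.genes = 50, K = 10) {
  folds <- sample(rep(seq_len(K), length.out = length(y)))
  errors <- 0
  for (k in seq_len(K)) {
    train <- folds != k
    sel <- filter.genes(X[, train, drop = FALSE], y[train], n.genes)
    pred <- class::knn(train = t(X[sel, train, drop = FALSE]),
                       test  = t(X[sel, !train, drop = FALSE]),
                       cl = y[train], k = 3)
    errors <- errors + sum(pred != y[!train])
  }
  errors / length(y)   # CV error rate that includes the gene selection step
}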
In the procedures we have implemented, the problem of selection bias is taken into account: all the cross-validated error rates include the process of gene selection.
OK, so we have taken care of selection bias. But what if we repeat the process of gene selection for different numbers of genes, say 10, 50, 200, and 500, and then keep the number that leads to the smallest cross-validated error rate? We have a problem similar to selection bias; here the bias comes from selecting, a posteriori, the "best" number of genes based on the performance of each of a number of predictors on our data set.
To give an example, suppose we use DLDA, and we use either 10, 50, or 100 genes. And suppose the cross-validated error rates (cross-validated including variable selection) are 15%, 12%, and 20%. Now, we select the DLDA with 50 genes. But we cannot say that the expected error rate, when applied to a new sample, will be 12%; it will probably be higher. We cannot know how much of our low error rates is due to our "capitalizing on chance".
Thus, we need to add another layer of cross-validation (or bootstrap if we prefer). We need to evaluate the error rate of a prediction rule that is built by selecting among a set of rules the one with the smallest error. Because that is what we are doing: we are building, say, 3 prediction rules (one with 10 genes, one with 50, one with 100), and then choosing the rule with the smallest error rate.
If we need to add this extra layer of cross-validation, why do we use, to select the number of genes, the cross-validated error rate that accounts for "selection bias"? It certainly takes a lot more time. The reason is that, a priori, selecting the number of genes this way should lead to better predictors, since we base our choice on an estimate of the prediction error rate that accounts for selection bias.
The methods we have implemented return, as part of their final output, the cross-validated error rate of the complete process; in other words, we cross-validate the process of "building several predictors and then choosing the one with the smallest error rate".
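Putting the two previous sketches together, this extra layer of cross-validation could look roughly as follows (again an illustration with our own names, not the internal Tnasas code): an inner, selection-bias-corrected cross-validation chooses the number of genes within each outer training set, and the outer folds estimate the error of that complete procedure.

## Outer CV over the complete "build several predictors, pick the best" rule.

nested.cv.error <- function(X, y, sizes = c(10, 50, 100), K = 10) {
  folds <- sample(rep(seq_len(K), length.out = length(y)))
  errors <- 0
  for (k in seq_len(K)) {
    train <- folds != k
    ## inner CV (which itself redoes gene selection) to choose the best size
    inner <- sapply(sizes, function(s)
      cv.error(X[, train, drop = FALSE], y[train], n.genes = s, K = K))
    best <- sizes[which.min(inner)]
    ## refit on the whole outer training set with the chosen number of genes
    sel <- filter.genes(X[, train, drop = FALSE], y[train], best)
    pred <- class::knn(train = t(X[sel, train, drop = FALSE]),
                       test  = t(X[sel, !train, drop = FALSE]),
                       cl = y[train], k = 3)
    errors <- errors + sum(pred != y[!train])
  }
  errors / length(y)   # estimate of the error of the complete process
}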
So is all of this statisticians' paranoia? No, it is not. As we have just said, the references provided give some very sobering examples, and it is extremely easy to obtain "excellent predictors" from completely random data if one does not account for selection biases. More generally, the issue of variable selection is a very delicate one. It is well known in the statistical literature that most of the "usual" variable selection procedures, such as the (in)famous stepwise, backwards, and forward methods in, for example, linear regression, are severely biased (and, for example, lead to p-values that are not meaningful). This issue is elaborated upon beautifully by Frank Harrell Jr. in his book Regression modeling strategies; some of the main arguments can be found at this site. This is very relevant here, because some people think they can escape the problem by doing something like: "first, select 50 genes; then go to my favourite stats program ZZZZ, and do a logistic regression, selecting variables with backwards (or forward, or whatever) variable selection". This approach solves nothing and compounds the problem: we still have the selection bias from the preselection of genes, and we have added the nasty problems of stepwise variable selection on top. A very didactic exercise in these situations is to bootstrap the whole process (Regression modeling strategies elaborates on these issues); it is often amazing how many widely dissimilar models are obtained. Repeating the exercise a couple of times will turn most people into skeptics of the virtues of variable selection methods. These problems are particularly severe with microarray data sets, where we often have relatively few subjects (e.g., fewer than 100) but several thousand genes; in these settings, it is extremely easy to obtain "excellent predictors" from purely random data: we can capitalize on chance to build a fantastic logistic regression model using stepwise regression methods, a model that will fail miserably when we apply it to new samples.
If you use shrunken centroids, a somewhat similar procedure is carried out implicitly by the method itself.
The number of folds we use is 10 if possible. If there are not enough samples, we use as many folds as the data allow (so that there is at least one sample in every test set, and at least one training sample from each class in every training set). The cross-validation samples are selected so that the relative proportions of each class are the same (or as close to the same as possible) in all training and test sets.
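One simple way of building such stratified folds (an illustrative sketch, not necessarily the exact rule used by Tnasas) is to assign fold labels within each class:

## Fold labels are assigned class by class, so every fold keeps the class
## proportions as close as possible to those of the full data set.

stratified.folds <- function(y, K = 10) {
  folds <- integer(length(y))
  for (cl in levels(y)) {
    idx <- which(y == cl)
    folds[idx] <- sample(rep(seq_len(K), length.out = length(idx)))
  }
  folds
}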
Some procedures are faster than others. KNN and DLDA with ranking based on the F-ratio are relatively fast (< 3 minutes). SVM and random forest are slower than KNN and DLDA. The Wilcoxon and random forest rankings are slower than the F-ratio. Shrunken centroids is relatively fast.
The file with the covariates; generally the gene expression data. In this file, rows represent variables (generally genes), and columns represent subjects, or samples, or arrays.
The file for the covariates should be formatted as:
#Name   ge1    ge2    ge1    ge1    ge2
gene1   23.4   45.6   44     76     85.6
genW@   3      34     23     56     13
geneX#  23     25.6   29.4   13.2   1.98
These are the class labels (e.g., healthy or affected, or different types of cancer) that group the samples. Our predictor will try to predict these class labels.
Please note that we do not allow any data set with 3 or fewer cases in any class. Why? Because, on the one hand, any results from such data would be hard to believe; on the other hand, it could result in some cross-validation training sets having 0 elements from one of the classes.
Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever.
This is a simple example of a class labels file:
CL1 CL2 CL1 CL4 CL2 CL2 CL1 CL4
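For reference, data in the two formats above could be read into R roughly as follows (the file names are placeholders, and this is only a sketch of how such files might be read, not code from Tnasas itself):

## Covariates: tab-separated, genes as rows, first row holds the sample names.
## comment.char = "" keeps the header line that starts with "#Name".
X <- as.matrix(read.table("covariates.txt", header = TRUE, sep = "\t",
                          row.names = 1, comment.char = ""))

## Class labels: a single tab-separated row, one label per sample.
y <- factor(scan("labels.txt", what = character(), sep = "\t"))

stopifnot(ncol(X) == length(y))   # one class label per array/sample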
If you use any of the currently standard identifiers for your gene IDs for either human, mouse, or rat genomes, you can obtain additional information by clicking on the gene names in the output. This information is returned from IDClight based on that provided by our IDConverter tool.
Two forms of output are provided:
This plot shows the cross-validated error rate of the predictor when built using different numbers of genes. This already accounts for selection bias. The final model selected will be the one with the smallest cross-validated error rate.
The gray lines show the same type of error vs. number of genes curve, as obtained in each of the cross-validation samples.
For comparison, the plot shows the error rate we would achieve by always betting on the most frequent class (a dotted blue line) and the estimate of the error rate from the 10-fold cross-validation (a dotted red line), which is also returned in the Results.
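A plot of this kind could be reproduced in R along the following lines (a sketch only: err.by.fold, final.cv.error, and y are assumed objects holding the per-fold error rates, the overall 10-fold CV estimate, and the class labels; this is not Tnasas's actual plotting code):

sizes <- c(10, 50, 100, 500)                       # numbers of genes tried
## 'err.by.fold' assumed to be a folds x sizes matrix of error rates
matplot(sizes, t(err.by.fold), type = "l", lty = 1, col = "gray", log = "x",
        xlab = "Number of genes", ylab = "Error rate")
lines(sizes, colMeans(err.by.fold), lwd = 2)       # cross-validated error curve
abline(h = 1 - max(table(y)) / length(y),          # always bet on the most frequent class
       lty = 3, col = "blue")
abline(h = final.cv.error, lty = 3, col = "red")   # 10-fold CV estimate of the final rule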
Let's show an example:
Confusion Matrix:
              N    T    TotalError   RelativeErrorPerClass
Observed N    4    3    0.06         0.42857143
Observed T    1    42   0.02         0.02325581
The overall error rate is obtained as the number of misclassifications relative to the total number of predictions. The total number of misclassifications is 3 + 1 = 4, and the total number of predictions is 50. As said before, all of these are "out-of-bag" predictions. So the "Error rate" here is 4/50 = 0.08.
The "TotalError" column entries are 3/50 = 0.06 and 1/50 = 0.02, when the true, observed, classes are "N" and "T" respectively. Sure enough, the "Error rate" = 0.08 = 0.06 + 0.02.
The "RelativeErrorPerClass" give you a view of how well it is doing relative to the size of the class. It is 3/7 = 0.429 when the true class is "N" and 1/43 = 0.023 when the true class is "T". In other words, the overall error rate of the predictor is not bad (0.08), but the error rate for those of class "N" is not good (almost 50%); how can these two facts be reconciled? Notice that the size of class "N" is very small compared to that of "T" (7 vs. 43). And, as you can see, the predictor in this case can do a good job by emphasizing good predictions for the large class ("T").
The above comments are something to pay attention to with very unbalanced classes. An extreme example: if we have a situation where 99% of the cases are class "A" and 1% of the cases are of class "B", we can achieve a very low prediction error (1%) if we always assign every case to class "A".
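To make the arithmetic explicit, here is a tiny R check that reproduces the numbers in the example above (the object names are ours, not part of the Tnasas output):

cm <- matrix(c(4, 3,
               1, 42),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Observed N", "Observed T"), c("N", "T")))

## overall error rate: misclassifications over all predictions
(cm["Observed N", "T"] + cm["Observed T", "N"]) / sum(cm)      # 4/50 = 0.08

## relative error per class: misclassifications over the size of each class
cm["Observed N", "T"] / sum(cm["Observed N", ])                # 3/7  = 0.429
cm["Observed T", "N"] / sum(cm["Observed T", ])                # 1/43 = 0.023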
It is now possible to send the results to PaLS. PaLS "analyzes sets of lists of genes or single lists of genes. It filters those genes/clones/proteins that are referenced by a given percentage of PubMed references, Gene Ontology terms, KEGG pathways or Reactome pathways." (from PaLS's help). By sending your results to PaLS, it might be easier to make biological sense of them, because you are "annotating" your results with additional biological information.
Scroll to the bottom of the main output, where you will find the PaLS icon and the gene lists. When you click on any of the links, the corresponding list of genes will be sent to PaLS. There, you can configure the options as you want (please consult PaLS's help for details) and then submit the list. In PaLS, you can always go back and keep playing with the very same gene list, modifying the options.
For individual genes, recall that the names in the tables are clickable, and display additional information from IDClight.
Examples of several runs, one with fully commented results, are available here.
This program was developed by Juan M. Vaquerizas and Ramón Díaz-Uriarte, from the Bioinformatics Unit at CNIO. This tool is, essentially, a web interface to a set of R functions, plus a small piece of C++ code (for the DLDA part), written by Ramón. Some of these functions themselves call functions in the packages e1071 (by E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel), class (by W. Venables and B. Ripley), pamr (by T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu), supclust (by M. Dettling and M. Maechler), multtest (by Y. Ge and S. Dudoit) and randomForest (by A. Liaw and M. Wiener, with Fortran code by L. Breiman and A. Cutler). Our set of functions will (soon) be converted into an R package and released under the GPL.
We want to thank all these authors for the great tools that they have made available for all to use. If you find this useful, and since R and Bioconductor are developed by a team of volunteers, we suggest you consider making a donation to the R foundation for statistical computing.
Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time. Anybody can access those directories; nevertheless, the directory names are not trivial, so it is not easy for a third party to access your data.
In any case, you should keep in mind that communications between the client (your computer) and the server are not encrypted at all, so it is also possible for somebody else to look at your data while you are uploading or downloading them.
This software is experimental in nature and is supplied "AS IS", without
obligation by the authors or the CNIO to provide accompanying services or
support. The entire risk as to the quality and performance of the software is
with you. The authors expressly disclaim any and all warranties regarding the
software, whether express or implied, including but not limited to warranties
pertaining to merchantability or fitness for a particular purpose.
Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99: 6562--6566.
Breiman L (2001) Random forests. Machine Learning 45: 5--32 (Tech. report).
Breiman L (2003) Manual--Setting Up, Using, And Understanding Random Forests V4.0.
Díaz-Uriarte R, Alvarez de Andrés, S (2005) Gene selection and classification of microarray data using random forest. In review. (tech. report.)
Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97: 77--87. (tech. report.)
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16: 906--914.
Harrell FE Jr (2001) Regression modeling strategies. New York: Springer.
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. New York: Springer.
Kohavi R, John GH (1998) The wrapper approach. In: Liu H, Motoda H (eds) Feature Selection for Knowledge Discovery and Data Mining. Kluwer, pp. 33--50 (reprint).
Lee Y, Lee CK (2003) Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19: 1132--1139.
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News, 2: 18--22.
Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98: 15149--15154.
Ripley BD (1996) Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Romualdi C, Campanaro S, Campagna D, Celegato B, Cannata N, et al. (2003) Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Hum Mol Genet 12: 823--836.
Simon R, Radmacher MD, Dobbin K, McShane LM (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 95: 14--18.
Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 99: 6567--6572.