  1. Introduction and purpose
  2. Methods included
  3. Finding the important genes: "selection bias"
  4. What the program does
  5. Is the program fast?
  6. Usage
  7. Examples
  8. Authors and Acknowledgements
  9. Terms of use
  10. Privacy and Security
  11. Disclaimer
  12. References

Introduction and purpose

This is a web interface to help in the process of building a "good predictor." We have implemented a few strategies that seem to be relatively popular. However, as the name of the application suggests, "This is Not A Substitute for A Statistician".

We hope that, by making available a tool that builds simple yet powerful predictors (if we believe the literature; see below) and cross-validates the whole process, we can make people aware of the widespread problem of "selection bias". At the same time, we hope to make it easy for people to see that some data sets are "easy" and some are "hard": whatever method you use, with some data sets you can always do a great job, and with others the situation seems hopeless. Finally, Tnasas can be used as a benchmark against some (overly) optimistic claims that occasionally are attached to new methods and algorithms. This is a particularly important feature, since many new prediction methods are being proposed in the literature, often without careful comparisons with alternative methods; Tnasas offers a simple, effective way of benchmarking the performance of newly proposed methods. As a side issue, we hope it will be easy to see that with many data sets it is very hard to do a good predictive job with only a few genes (apparent success with few genes is often a mirage from selection bias).

In addition, we want to make people aware of the problems related to the instability of solutions: we often find many solutions that are equally good from the point of view of prediction error rate, but that share very few, if any, genes.

Tnasas allows you to combine several classification algorithms with different methods for gene selection.

Methods included

We have included several methods that have been shown to perform very well with microarray data (Dudoit et al., 2002; Romualdi et al., 2003; Díaz-Uriarte and Alvarez de Andrés, 2005): support vector machines (SVM), k-nearest neighbor (NN), diagonal linear discriminant analysis (DLDA), random forest, and shrunken centroids.

Full explanations of the methods used can be found in the references. What follows is a quick summary:

Variable selection: finding the "important genes"

What genes should we use when building the predictor? Often, researchers would like to build the predictor using only genes that are "relevant" for the prediction. In addition, using only "relevant" genes can lead to "better" predictions ("better" in the sense of, e.g., smaller variance). How should we select those genes?

A popular approach is to first select a set of genes with some relatively simple, fast method, and then feed these selected genes to our predictor. This is called, in the machine learning literature, the "filter approach" (e.g., Kohavi & John, 1998) because we first "filter" the predictor variables (in our case genes), keep only a subset, and then build the predictor. We provide three ways of ranking genes, and each can be used with any of the above class-prediction algorithms (except shrunken centroids, also known as PAM, which includes gene selection as part of the algorithm itself).

After ranking the genes, we examine the performance of the class prediction algorithm using different numbers of the best ranked genes and select the best performing predictor. In the current version of Tnasas we build the predictor using the best g = 2, 5, 10, 20, 35, 50, 75, 120, 200, 500, 1000, and 2000 genes, as well as the total number of genes.
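As an illustration of the filter step, the following Python sketch ranks genes by a one-way F-ratio and builds the list of gene counts to try. It is not the Tnasas code (which is a set of R functions); the function names are hypothetical, and the handling of data sets with fewer genes than a grid value is an assumption. It expects at least two classes.

```python
from statistics import mean

# Grid of gene counts quoted in the text; the total number of genes is appended.
GENE_COUNTS = [2, 5, 10, 20, 35, 50, 75, 120, 200, 500, 1000, 2000]


def f_ratio(values, labels):
    """One-way ANOVA F statistic for one gene across the classes in `labels`."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        # Perfect separation (or a constant gene): avoid dividing by zero.
        return float("inf") if between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (between / df_between) / (within / df_within)


def rank_genes(expr, labels):
    """Return gene indices ordered from largest to smallest F-ratio.

    `expr` has one row per gene and one column per sample.
    """
    scores = [f_ratio(row, labels) for row in expr]
    return sorted(range(len(expr)), key=lambda i: scores[i], reverse=True)


def candidate_sizes(n_genes):
    """Gene counts to try: grid values below n_genes, plus n_genes itself."""
    return [g for g in GENE_COUNTS if g < n_genes] + [n_genes]


# Tiny made-up data set: 3 genes, 4 samples.
expr = [
    [1.0, 1.1, 0.9, 1.0],  # gene 0: slight class difference
    [0.0, 0.2, 5.0, 5.1],  # gene 1: strong class difference
    [2.0, 3.0, 2.5, 3.5],  # gene 2: weak, noisy class difference
]
labels = ["A", "A", "B", "B"]
ranking = rank_genes(expr, labels)
sizes = candidate_sizes(len(expr))
```

Here gene 1 is ranked first, and only the grid values smaller than the number of genes are kept.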

Finding the important genes: "selection bias"

Suppose we select the first 50 genes with the F-ratio, as explained above; then we use, e.g., DLDA, and we report the error rate of our classifier using 10-fold cross-validation. This sounds familiar, … but is actually a bad idea: the error rate we just estimated can be severely biased downward.

This is the problem of selection bias, which has been discussed in the microarray literature by Ambroise & McLachlan (2002) and Simon et al. (2003) (this problem has been well known in statistics for a long time). Essentially, the problem is that we use all the subjects to do the filtering, and therefore the cross-validation of only the DLDA, with an already selected set of genes, cannot account for the effect of pre-selecting the genes. As just said, this can lead to severe underestimates of prediction error (and the references given provide several alarming examples). In addition, it is very easy to obtain "excellent" predictors with completely random data if we do not account for the preselection (we present some numerical examples with KNN and SVM in our R course).

A way to obtain better estimates of the prediction error is to use cross-validation over the whole process of "select-genes-then-build-predictor": leave aside a set of samples, and with the remaining samples do the gene selection and building of the predictor, and then predict the left out samples. This way, the left-out samples do not participate in the selection of the genes, and the "selection bias" problem is avoided.

In the procedures we have implemented, the problem of selection bias is taken into account: all the cross-validated error rates include the process of gene selection.
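The effect can be reproduced with a small simulation. The Python sketch below generates pure noise and compares two error estimates: one where genes are selected once using all samples, and one where selection is repeated inside every fold. It is only an illustration under stated assumptions: the classifier is a simple nearest-centroid rule (a stand-in, not DLDA), genes are ranked by the gap between class means (not the F-ratio), and leave-one-out is used instead of 10-fold cross-validation for brevity. All function names are hypothetical.

```python
import random
from statistics import mean


def top_genes(X, y, rows, k):
    """Indices of the k genes with the largest class-mean gap, using only `rows`."""
    def gap(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def predict(X, y, train, test_row, genes):
    """Nearest-centroid prediction for one sample, restricted to `genes`."""
    dist = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        centroid = [mean(X[i][j] for i in members) for j in genes]
        dist[c] = sum((X[test_row][j] - m) ** 2 for j, m in zip(genes, centroid))
    return min(dist, key=dist.get)


def loo_error(X, y, k, select_inside):
    """Leave-one-out error; genes are reselected per fold only if select_inside."""
    all_rows = list(range(len(y)))
    fixed = None if select_inside else top_genes(X, y, all_rows, k)
    wrong = 0
    for i in all_rows:
        train = [r for r in all_rows if r != i]
        genes = top_genes(X, y, train, k) if select_inside else fixed
        wrong += predict(X, y, train, i, genes) != y[i]
    return wrong / len(y)


rng = random.Random(1)
n_samples, n_genes = 20, 300
y = [0, 1] * (n_samples // 2)  # arbitrary labels, unrelated to the data
# Rows are samples here (the transpose of the covariates file layout).
X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]

biased = loo_error(X, y, k=10, select_inside=False)  # selection outside CV
honest = loo_error(X, y, k=10, select_inside=True)   # selection inside CV
print(f"selection outside CV: {biased:.2f}; selection inside CV: {honest:.2f}")
```

Since the data carry no signal, an honest estimate should hover around 50%; the estimate with selection done outside the cross-validation will typically look far better than that.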

Further potential biases: finding the best subset among subsets

OK, so we have taken care of selection bias. But what if we repeat the process of selecting a number of genes for different numbers of genes, say 10, 50, 200, and 500, and then keep the one that leads to the smallest cross-validated error rate? We have a similar problem to selection bias; here the bias comes from selecting, a posteriori, the "best" number of predictors based on the performance of each of a number of predictors with our data set.

To give an example, suppose we use DLDA, and we use either 10, 50, or 100 genes. And suppose the cross-validated error rates (cross-validated including variable selection) are 15%, 12%, and 20%. Now, we select the DLDA with 50 genes. But we cannot say that the expected error rate, when applied to a new sample, will be 12%; it will probably be higher. We cannot know how much of our low error rates is due to our "capitalizing on chance".

Thus, we need to add another layer of cross-validation (or bootstrap if we prefer). We need to evaluate the error rate of a prediction rule that is built by selecting among a set of rules the one with the smallest error. Because that is what we are doing: we are building, say, 3 prediction rules (one with 10 genes, one with 50, one with 100), and then choosing the rule with the smallest error rate.
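To get a rough feel for this optimism, the Python sketch below computes the expected value of the smallest observed error rate among several rules that all have the same true error rate. It rests on strong simplifying assumptions that we state loudly: each rule's error count is treated as binomial, and the rules are treated as independent (in practice they are built from the same data and are correlated). The function name is hypothetical.

```python
from math import comb


def expected_min_error(n_samples, true_error, n_rules):
    """Expected smallest observed error rate among n_rules independent rules.

    Assumes each rule's error count is Binomial(n_samples, true_error).
    """
    def tail(k):
        # P(error count >= k) for a single rule.
        return sum(
            comb(n_samples, i) * true_error**i * (1 - true_error) ** (n_samples - i)
            for i in range(k, n_samples + 1)
        )
    # E[min count] = sum over k >= 1 of P(min >= k) = sum of tail(k) ** n_rules.
    expected_count = sum(tail(k) ** n_rules for k in range(1, n_samples + 1))
    return expected_count / n_samples


one_rule = expected_min_error(50, 0.15, 1)     # equals the true error rate
best_of_three = expected_min_error(50, 0.15, 3)  # smaller than the true error rate
```

With a single rule the expected observed error equals the true error; as soon as we pick the best of several rules, the expected observed error of the winner falls below the true error, even though no rule is actually better.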

If we need to add this extra layer of cross-validation, why do we use, to select the number of genes, the cross-validated error rate accounting for "selection bias"? This certainly takes a lot more time. The reason is that, a priori, selecting the number of genes this way should lead to better predictors, since we base our choice on an estimate of prediction error rate that accounts for selection bias.

The methods we have implemented return, as part of their final output, the cross-validated error rate of the complete process; in other words, we cross-validate the process of "building several predictors and then choosing the one with the smallest error rate".

So is all of this statisticians' paranoia? No, it is not. As we have just said, the references provided give some very sobering examples. And it is extremely easy to obtain "excellent predictors" from completely random data if one does not account for selection biases. More generally, the issue of variable selection is a very delicate one. It is well known in the statistical literature that most of the "usual" variable selection procedures, such as the (in)famous stepwise, backwards, and forward methods in, for example, linear regression, are severely biased (and, for example, lead to p-values that are not meaningful). This issue is elaborated upon beautifully by Frank Harrell Jr. in his book Regression modeling strategies; some of the main arguments can be found at this site. This is very relevant here, because some people think they can escape this problem by doing something like: "first, select 50 genes; then go to my favourite stats program ZZZZ, and do a logistic regression, selecting variables with backwards (or forward, or whatever) variable selection". This approach solves nothing, and it compounds the problem: we have the problem of selection bias because of the preselection of genes, and we have added the nasty problem of using stepwise variable selection. A very didactic exercise in these situations is to bootstrap the whole process (Regression modeling strategies elaborates on these issues); it is often amazing how many widely dissimilar models are obtained. Repeating the exercise a couple of times will turn most people into skeptics of the virtues of variable selection methods.

These problems are particularly severe with microarray data sets, where we often have relatively few subjects (e.g., fewer than 100) but several thousands of genes; in these settings, it is extremely easy to obtain "excellent predictors" from purely random data. We can capitalize on chance to build a fantastic logistic regression model using stepwise regression methods; a logistic regression model that will fail miserably when we apply it to new samples …

What the program does

Essentially, this is what the program does.

To find the best number of predictors

  1. Draw a 10-fold cross-validation sample, and with each of the 10 "training sets":
    1. Rank the genes using one of the ranking methods above.
    2. For each g number of genes (where g is 2, 5, 10, 20, 35, 50, 75, 120, 200, 500, 1000, 2000 and the total number of genes):
      1. Build the predictor using those g genes.
      2. Predict the left-out sample.
  2. Compute the cross-validation error corresponding to each g number of genes.
  3. Select as the "best number of genes" gb the one that results in the smallest cross-validation error. If there are several equally good, choose the one corresponding to the smaller number of genes.
  4. Now run the gene ranking method on the complete sample, and select the top gb genes.

If using shrunken centroids, a somewhat similar procedure is used implicitly by the method.
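The steps above can be sketched in Python as follows. This is a minimal illustration, not the Tnasas implementation: the ranking function and the nearest-centroid classifier are simple placeholders for the methods offered by the program, folds are formed by plain interleaving rather than by class-balanced sampling, and all names are hypothetical.

```python
from collections import Counter
from statistics import mean

SIZES = [2, 5, 10, 20, 35, 50, 75, 120, 200, 500, 1000, 2000]


def rank_by_gap(X, y, rows):
    """Placeholder ranking: genes ordered by the spread of their class means."""
    classes = sorted(set(y[i] for i in rows))
    def score(j):
        means = [mean(X[i][j] for i in rows if y[i] == c) for c in classes]
        return max(means) - min(means)
    return sorted(range(len(X[0])), key=score, reverse=True)


def nearest_centroid(X, y, train, test, genes):
    """Placeholder classifier: assign each test row to the closest class centroid."""
    cents = {}
    for c in set(y[i] for i in train):
        members = [i for i in train if y[i] == c]
        cents[c] = [mean(X[i][j] for i in members) for j in genes]
    def closest(t):
        return min(
            cents,
            key=lambda c: sum((X[t][j] - m) ** 2 for j, m in zip(genes, cents[c])),
        )
    return [closest(t) for t in test]


def best_gene_set(X, y, rows, n_folds=10):
    """Pick the gene count with the lowest CV error, then rerank on all rows."""
    n_genes = len(X[0])
    sizes = [g for g in SIZES if g < n_genes] + [n_genes]
    folds = [rows[f::n_folds] for f in range(n_folds)]
    errors = Counter()  # misclassification counts per gene count (step 2)
    for test in folds:
        if not test:
            continue
        train = [r for r in rows if r not in test]
        ranking = rank_by_gap(X, y, train)  # step 1.1: rank on training rows only
        for g in sizes:  # step 1.2: build and predict for each g
            preds = nearest_centroid(X, y, train, test, ranking[:g])
            errors[g] += sum(p != y[t] for p, t in zip(preds, test))
    # Step 3: smallest error; ties go to the smaller number of genes.
    best_g = min(sizes, key=lambda g: (errors[g], g))
    # Step 4: rank on the complete sample and keep the top best_g genes.
    return rank_by_gap(X, y, rows)[:best_g]


# Toy data: rows are samples; only gene 2 separates the two classes.
y = ["a", "b"] * 6
X = [[0.0, 0.0, 5.0 if lab == "b" else -5.0, 0.0] for lab in y]
selected = best_gene_set(X, y, list(range(len(y))))
```

Because every sample is predicted exactly once, comparing error counts is equivalent to comparing error rates. In the toy data both candidate sizes give zero errors, so the tie rule keeps the smaller set.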

To evaluate the error rate of this procedure

Do a 10-fold cross-validation of the procedure above. So:
  1. Divide the whole data set into 10 approximately equal subsets.
  2. For each of the 10 subsets do:
    1. Leave aside this subset; these data left aside are "out-of-bag".
    2. With the other 9 subsets (the "in-bag" samples), use the procedure above to find the best number of genes, and train a predictor with those genes (but, again, only using the "in-bag" samples).
    3. Predict the out-of-bag samples with the predictor just found.
  3. At the end of the process, each sample has been in the "out-of-bag" set exactly once. Thus, for each sample we have a prediction where that sample has been out-of-bag. Using the out-of-bag predictions, and comparing them with the true class labels we obtain the "error rate".
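The outer loop can be sketched in Python as below. The key point is that the whole model-building procedure, including gene selection and the choice of the number of genes, is passed in as a function and only ever sees the in-bag samples. This is an illustration, not the Tnasas code: `build_predictor` is a hypothetical callable, the folds are formed by plain interleaving rather than class-balanced sampling, and the toy procedure at the end (always predict the most common in-bag class) exists only to make the example runnable.

```python
from collections import Counter


def outer_cv_error(labels, build_predictor, n_folds=10):
    """Cross-validate a whole model-building procedure.

    `build_predictor(in_bag)` receives the list of in-bag sample indices, does
    all selection and tuning on them only, and returns a function that maps a
    sample index to a predicted label.
    """
    n = len(labels)
    n_folds = min(n_folds, n)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]  # step 1
    out_of_bag_pred = {}
    for oob in folds:  # step 2
        held_out = set(oob)  # step 2.1: leave this subset aside
        in_bag = [i for i in range(n) if i not in held_out]
        predictor = build_predictor(in_bag)  # step 2.2: in-bag samples only
        for i in oob:  # step 2.3: predict the out-of-bag samples
            out_of_bag_pred[i] = predictor(i)
    # Step 3: every sample has been out-of-bag exactly once.
    wrong = sum(out_of_bag_pred[i] != labels[i] for i in range(n))
    return wrong / n


labels = ["tumor"] * 8 + ["normal"] * 4


def majority_rule(in_bag):
    """Toy procedure: always predict the most common in-bag label."""
    top = Counter(labels[i] for i in in_bag).most_common(1)[0][0]
    return lambda i: top


error = outer_cv_error(labels, majority_rule)
```

In this toy run the majority class is "tumor" in every in-bag set, so the four "normal" samples are the ones misclassified.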

Cross-validation: how many folds?

The number of folds we use is 10 if possible. If there are not enough samples, we use as many folds as possible given the data (so that there is at least one testing sample in each test set, and so that there is at least one training sample from each class in every training set). The cross-validation sets are selected so that the relative proportions of each class are the same (or as close to the same as possible) in all training and test sets.
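One simple way to obtain class-balanced folds is to deal the samples of each class out to the folds in turn, like cards. The Python sketch below does this; it is an assumption about how such folds could be built, not a description of the exact rule Tnasas uses to pick the number of folds, and the function name is hypothetical. With at least two samples per class and at least two folds, no fold contains a whole class, so every training set keeps at least one sample of each class.

```python
from collections import defaultdict


def stratified_folds(labels, max_folds=10):
    """Assign sample indices to folds with class proportions kept roughly equal.

    Assumed rule for the number of folds: max_folds when there are enough
    samples, otherwise one fold per sample.
    """
    n_folds = min(max_folds, len(labels))
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(n_folds)]
    slot = 0
    for members in by_class.values():
        for i in members:  # deal each class out across the folds in turn
            folds[slot % n_folds].append(i)
            slot += 1
    return folds


labels = ["A"] * 6 + ["B"] * 4
two_folds = stratified_folds(labels, max_folds=2)
ten_folds = stratified_folds(labels)
```

With two folds, each fold receives three "A" samples and two "B" samples, matching the 6:4 proportion of the full data set.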

Is the program fast?

Some procedures are faster than others. NN and DLDA with ranking based on the F-ratio are relatively fast (< 3 minutes). SVM and random forest are slower than NN and DLDA. The Wilcoxon and random forest ranking are slower than the F-ratio. Shrunken centroids is relatively fast.



Usage

Covariates file

The file with the covariates; generally the gene expression data. In this file, rows represent variables (generally genes), and columns represent subjects, or samples, or arrays.

The file for the covariates should be formatted as:

Class file

These are the class labels (e.g., healthy or affected, or different types of cancer) that group the samples. Our predictor will try to predict these class labels.

Please note that we do not allow any data set with 3 or fewer cases in any class. Why? Because, on the one hand, any results from such data would be hard to believe; on the other hand, that would result in some cross-validation samples having some training samples with 0 elements from one of the classes.

Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever.

This is a simple example of a class labels file:

CL1     CL2     CL1     CL4     CL2     CL2     CL1     CL4       
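If you want to check a class file before uploading it, a few lines of Python are enough. The sketch below is only a convenience illustration of the rules stated above (tab-separated labels, no missing values, no class with 3 or fewer cases); it is not part of Tnasas, and the function name is hypothetical. Note that the short example line above only illustrates the format: it has too few cases per class to pass the size check.

```python
from collections import Counter


def read_class_labels(text, min_per_class=4):
    """Parse a one-line, tab-separated class file and apply the size check."""
    # Drop only the final newline so that a stray leading or trailing tab
    # still shows up as an empty (missing) label.
    fields = text.rstrip("\r\n").split("\t")
    labels = [field.strip() for field in fields]
    if any(lab == "" for lab in labels):
        raise ValueError("missing values are not allowed in the class file")
    counts = Counter(labels)
    too_small = sorted(lab for lab, n in counts.items() if n < min_per_class)
    if too_small:
        raise ValueError(f"classes with 3 or fewer cases: {too_small}")
    return labels


good = "\t".join(["CL1"] * 4 + ["CL2"] * 5) + "\n"
labels = read_class_labels(good)
```

Labels are kept as plain strings, so integers, words, or any other tokens are treated alike.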

Type of gene identifier and species

If you use any of the currently standard identifiers for your gene IDs for either human, mouse, or rat genomes, you can obtain additional information by clicking on the gene names in the output. This information is returned from IDClight based on that provided by our IDConverter tool.


Two forms of output are provided:

Sending results to PaLS (New!!)

It is now possible to send the results to PaLS. PaLS "analyzes sets of lists of genes or single lists of genes. It filters those genes/clones/proteins that are referenced by a given percentage of PubMed references, Gene Ontology terms, KEGG pathways or Reactome pathways." (from PaLS's help). By sending your results to PaLS, it might be easier to make biological sense of your results, because you are "annotating" your results with additional biological information.

Scroll to the bottom of the main output, where you will find the PaLS icon and the gene lists. When you click on any of the links, the corresponding list of genes will be sent to PaLS. There, you can configure the options as you want (please consult PaLS's help for details) and then submit the list. In PaLS, you can always go back and keep playing with the very same gene list, modifying the options.

For individual genes, recall that the names in the tables are clickable, and display additional information from IDClight.


Examples

Examples of several runs, one with fully commented results, are available here.

Authors and Acknowledgements

This program was developed by Juan M. Vaquerizas and Ramón Díaz-Uriarte, from the Bioinformatics Unit at CNIO. This tool is, essentially, a web interface to a set of R functions, plus a small piece of C++ code (for the DLDA part), written by Ramón. Some of these functions themselves call functions in the packages e1071 (by E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel), class (by W. Venables and B. Ripley), pamr (by T. Hastie, R. Tibshirani, Balasubramanian Narasimhan, G. Chu), supclust (by M. Dettling and M. Maechler), multtest (by Y. Ge and S. Dudoit) and randomForest (by A. Liaw, M. Wiener, with Fortran code by L. Breiman and A. Cutler). Our set of functions themselves will (soon) be converted into an R package and released under the GPL.

We want to thank all these authors for the great tools that they have made available for all to use. If you find this useful, and since R and Bioconductor are developed by a team of volunteers, we suggest you consider making a donation to the R foundation for statistical computing.

Terms of use

Privacy and Security

Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time. Anybody can access those directories; nevertheless, the names of the directories are not trivial, so it is not easy for a third person to access your data.

In any case, you should keep in mind that communications between the client (your computer) and the server are not encrypted at all, so it is also possible for somebody else to look at your data while you are uploading or downloading them.


Disclaimer

This software is experimental in nature and is supplied "AS IS", without obligation by the authors or the CNIO to provide accompanying services or support. The entire risk as to the quality and performance of the software is with you. The authors expressly disclaim any and all warranties regarding the software, whether express or implied, including but not limited to warranties pertaining to merchantability or fitness for a particular purpose.


