This is a web interface to help in the process of building a "good predictor." We have implemented a few strategies that seem to be relatively popular. However, as the name of the application suggests, "This is Not A Substitute for A Statistician".
We hope that, by making available a tool that builds simple yet (if we believe the literature; see below) powerful predictors, AND cross-validates the whole process, we can make people aware of the widespread problem of "selection bias". At the same time, we hope to make it easy for people to see that some data sets are "easy" and some are "hard": whatever method you use, with some data sets you can always do a great job, whereas with others the situation seems hopeless. Finally, Tnasas can be used as a benchmark against some of the (overly) optimistic claims that are occasionally attached to new methods and algorithms. This is a particularly important feature, since many new predictor methods are proposed in the literature without careful comparisons with alternative methods; Tnasas offers a simple, effective way of comparing the performance of newly proposed methods and can, itself, become a benchmarking tool. As a side issue, we hope it will be easy to see that with many data sets it is very hard to do a good predictive job with only a few genes (apparent successes with few genes are often a mirage caused by selection bias).
In addition, we want to make people aware of the problems related to the instability of solutions: we often find many solutions that are equally good from the point of view of prediction error rate, but that share very few, if any, genes.
Tnasas allows you to combine several classification algorithms with different methods for gene selection.
We have included several methods that have been shown to perform very well with microarray data: support vector machines (SVM), k-nearest neighbor (KNN), diagonal linear discriminant analysis (DLDA), random forest, and shrunken centroids (Dudoit et al., 2002; Romualdi et al., 2003; Díaz-Uriarte and Alvarez de Andrés, 2005).
Full explanations of the methods used can be found in the references. What follows is a quick summary:
What genes should we use when building the predictor? Often, researchers would like to build the predictor using only genes that are "relevant" for the prediction. In addition, using only "relevant" genes can lead to "better" predictions ("better" in the sense of, e.g., smaller variance). How should we select those genes?
A popular approach is to first select a set of genes, with some relatively simple, fast method, and then feed these selected genes to our predictor. This is called, in the machine learning literature, the "filter approach" (e.g., Kohavi & John, 1998), because we first "filter" the predictor variables (in our case genes), keep only a subset, and then build the predictor. We provide three ways of ranking genes, and each can be used with any of the above class-prediction algorithms (except shrunken centroids, i.e. PAM, which includes gene selection as part of the algorithm itself).
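As an illustration only (this is not the actual Tnasas code; the function and object names, such as filter.genes, X, y, and n.genes, are our own), the filter approach could be sketched in R as follows: rank every gene with the F-ratio and keep the top-ranked ones before fitting any classifier.

## Illustrative sketch of the filter approach; not the code used internally
## by Tnasas. 'X' is a genes-by-samples matrix, 'y' a factor of class labels.

f.ratio <- function(x, y) {
  ## one-way ANOVA F statistic for a single gene
  anova(lm(x ~ y))[["F value"]][1]
}

filter.genes <- function(X, y, n.genes = 50) {
  scores <- apply(X, 1, f.ratio, y = y)               # rank every gene
  order(scores, decreasing = TRUE)[seq_len(n.genes)]  # indices of the top genes
}

## The selected genes are then fed to the chosen classifier, e.g.:
## sel <- filter.genes(X, y, n.genes = 50)
## fit <- e1071::svm(t(X[sel, ]), y)   # svm() expects samples as rows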
Suppose we select the first 50 genes with the F-ratio, as explained above; then we use, e.g., DLDA, and we report the error rate of our classifier using 10-fold cross-validation. This sounds familiar, but it is actually a bad idea: the error rate we have just estimated can be severely biased downwards.
This is the problem of selection bias, which has been discussed in the microarray literature by Ambroise & McLachlan (2002) and Simon et al. (2003) (the problem has been well known in statistics for a long time). Essentially, the problem is that we use all the subjects to do the filtering, and therefore the cross-validation of only the DLDA, with an already selected set of genes, cannot account for the effect of pre-selecting the genes. As just said, this can lead to severe underestimates of the prediction error (and the references given provide several alarming examples). In addition, it is very easy to obtain "excellent" predictors with completely random data if we do not account for the preselection (we present some numerical examples with KNN and SVM in our R course).
A way to obtain better estimates of the prediction error is to use cross-validation over the whole process of "select-genes-then-build-predictor": leave aside a set of samples, and with the remaining samples do the gene selection and building of the predictor, and then predict the left out samples. This way, the left-out samples do not participate in the selection of the genes, and the "selection bias" problem is avoided.
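A rough sketch of this idea, reusing the filter.genes() helper from above (again, these are our own illustrative names, with KNN standing in for whichever classifier is chosen), would be:

## Cross-validating the whole "select-genes-then-build-predictor" process:
## gene selection is redone inside each fold, so the left-out samples never
## influence which genes are kept.

cv.error <- function(X, y, n.genes = 50, K = 10) {
  folds <- sample(rep(seq_len(K), length.out = length(y)))
  errors <- 0
  for (k in seq_len(K)) {
    train <- folds != k
    sel <- filter.genes(X[, train, drop = FALSE], y[train], n.genes)
    pred <- class::knn(train = t(X[sel, train, drop = FALSE]),
                       test  = t(X[sel, !train, drop = FALSE]),
                       cl = y[train], k = 3)
    errors <- errors + sum(pred != y[!train])
  }
  errors / length(y)   # CV error rate that includes the gene selection step
}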
In the procedures we have implemented, the problem of selection bias is taken into account: all the cross-validated error rates include the process of gene selection.
OK, so we have taken care of selection bias. But what if we repeat the process of gene selection for different numbers of genes, say 10, 50, 200, and 500, and then keep the number that leads to the smallest cross-validated error rate? We have a problem similar to selection bias; here the bias comes from selecting, a posteriori, the "best" number of genes based on the performance of each of a number of predictors on our data set.
To give an example, suppose we use DLDA, and we use either 10, 50, or 100 genes. And suppose the cross-validated error rates (cross-validated including variable selection) are 15%, 12%, and 20%. Now, we select the DLDA with 50 genes. But we cannot say that the expected error rate, when applied to a new sample, will be 12%; it will probably be higher. We cannot know how much of our low error rates is due to our "capitalizing on chance".
Thus, we need to add another layer of cross-validation (or bootstrap if we prefer). We need to evaluate the error rate of a prediction rule that is built by selecting among a set of rules the one with the smallest error. Because that is what we are doing: we are building, say, 3 prediction rules (one with 10 genes, one with 50, one with 100), and then choosing the rule with the smallest error rate.
If we need to add this extra layer of cross-validation, why do we use, to select the number of genes, the cross-validated error rate that accounts for "selection bias"? It certainly takes a lot more time. The reason is that, a priori, selecting the number of genes this way should lead to better predictors, since we base our choice on an estimate of the prediction error rate that accounts for selection bias.
The methods we have implemented return, as part of their final output, the cross-validated error rate of the complete process; in other words, we cross-validate the process of "building several predictors and then choosing the one with the smallest error rate".
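Putting the two previous sketches together, this extra layer of cross-validation could look roughly as follows (again an illustration with our own names, not the internal Tnasas code): an inner, selection-bias-corrected cross-validation chooses the number of genes within each outer training set, and the outer folds estimate the error of that complete procedure.

## Outer CV over the complete "build several predictors, pick the best" rule.

nested.cv.error <- function(X, y, sizes = c(10, 50, 100), K = 10) {
  folds <- sample(rep(seq_len(K), length.out = length(y)))
  errors <- 0
  for (k in seq_len(K)) {
    train <- folds != k
    ## inner CV (which itself redoes gene selection) to choose the best size
    inner <- sapply(sizes, function(s)
      cv.error(X[, train, drop = FALSE], y[train], n.genes = s, K = K))
    best <- sizes[which.min(inner)]
    ## refit on the whole outer training set with the chosen number of genes
    sel <- filter.genes(X[, train, drop = FALSE], y[train], best)
    pred <- class::knn(train = t(X[sel, train, drop = FALSE]),
                       test  = t(X[sel, !train, drop = FALSE]),
                       cl = y[train], k = 3)
    errors <- errors + sum(pred != y[!train])
  }
  errors / length(y)   # estimate of the error of the complete process
}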
So is all of this statisticians' paranoia? No, it is not. As we have just said, the references provided give some very sobering examples, and it is extremely easy to obtain "excellent predictors" from completely random data if one does not account for selection biases. More generally, the issue of variable selection is a very delicate one. It is well known in the statistical literature that most of the "usual" variable selection procedures, such as the (in)famous stepwise, backwards, and forward methods in, for example, linear regression, are severely biased (and, for example, lead to p-values that are not meaningful). This issue is elaborated upon beautifully by Frank Harrell Jr. in his book Regression modeling strategies; some of the main arguments can be found at this site. This is very relevant here, because some people think they can escape the problem by doing something like: "first, select 50 genes; then go to my favourite stats program ZZZZ, and do a logistic regression, selecting variables with backwards (or forward, or whatever) variable selection". This approach solves nothing and compounds the problem: we still have the selection bias from the preselection of genes, and we have added the nasty problems of stepwise variable selection on top. A very didactic exercise in these situations is to bootstrap the whole process (Regression modeling strategies elaborates on these issues); it is often amazing how many widely dissimilar models are obtained. Repeating the exercise a couple of times will turn most people into skeptics of the virtues of variable selection methods. These problems are particularly severe with microarray data sets, where we often have relatively few subjects (e.g., fewer than 100) but several thousand genes; in these settings, it is extremely easy to obtain "excellent predictors" from purely random data: we can capitalize on chance to build a fantastic logistic regression model using stepwise regression methods, a model that will fail miserably when we apply it to new samples.
If you use shrunken centroids, a somewhat similar procedure is carried out implicitly by the method itself.
The number of folds we use is 10 if possible. If there are not enough samples, we use as many folds as the data allow (so that there is at least one sample in every test set, and at least one training sample from each class in every training set). The cross-validation samples are selected so that the relative proportions of each class are the same (or as close to the same as possible) in all training and test sets.
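One simple way of building such stratified folds (an illustrative sketch, not necessarily the exact rule used by Tnasas) is to assign fold labels within each class:

## Fold labels are assigned class by class, so every fold keeps the class
## proportions as close as possible to those of the full data set.

stratified.folds <- function(y, K = 10) {
  folds <- integer(length(y))
  for (cl in levels(y)) {
    idx <- which(y == cl)
    folds[idx] <- sample(rep(seq_len(K), length.out = length(idx)))
  }
  folds
}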
Some procedures are faster than others. KNN and DLDA with ranking based on the F-ratio are relatively fast (< 3 minutes). SVM and random forest are slower than KNN and DLDA. The Wilcoxon and random forest rankings are slower than the F-ratio. Shrunken centroids is relatively fast.
The file with the covariates; generally the gene expression data. In this file, rows represent variables (generally genes), and columns represent subjects, or samples, or arrays.
The file for the covariates should be formatted as:
#Name   ge1    ge2    ge1    ge1    ge2
gene1   23.4   45.6   44     76     85.6
genW@   3      34     23     56     13
geneX#  23     25.6   29.4   13.2   1.98
These are the class labels (e.g., healthy or affected, or different types of cancer) that group the samples. Our predictor will try to predict these class labels.
Please note that we do not allow any data set with 3 or fewer cases in any class. Why? Because, on the one hand, any results from such data would be hard to believe; on the other hand, it could result in some cross-validation training sets having 0 elements from one of the classes.
Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever.
This is a simple example of a class labels file:
CL1 CL2 CL1 CL4 CL2 CL2 CL1 CL4
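For reference, data in the two formats above could be read into R roughly as follows (the file names are placeholders, and this is only a sketch of how such files might be read, not code from Tnasas itself):

## Covariates: tab-separated, genes as rows, first row holds the sample names.
## comment.char = "" keeps the header line that starts with "#Name".
X <- as.matrix(read.table("covariates.txt", header = TRUE, sep = "\t",
                          row.names = 1, comment.char = ""))

## Class labels: a single tab-separated row, one label per sample.
y <- factor(scan("labels.txt", what = character(), sep = "\t"))

stopifnot(ncol(X) == length(y))   # one class label per array/sample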
If you use any of the currently standard identifiers for your gene IDs for either human, mouse, or rat genomes, you can obtain additional information by clicking on the gene names in the output. This information is returned from IDClight based on that provided by our IDConverter tool.
Two forms of output are provided:
This plot shows the cross-validated error rate of the predictor when built using different numbers of genes. This already accounts for selection bias. The final model selected will be the one with the smallest cross-validated error rate.
The gray lines show the same type of error vs. number of genes curve, as obtained in each of the cross-validation samples.
For comparison, the plot shows the error rate we would achieve by always betting on the most frequent class (a dotted blue line) and the estimate of the error rate from the 10-fold cross-validation (a dotted red line), which is also returned in the Results.
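A plot of this kind could be reproduced in R along the following lines (a sketch only: err.by.fold, final.cv.error, and y are assumed objects holding the per-fold error rates, the overall 10-fold CV estimate, and the class labels; this is not Tnasas's actual plotting code):

sizes <- c(10, 50, 100, 500)                       # numbers of genes tried
## 'err.by.fold' assumed to be a folds x sizes matrix of error rates
matplot(sizes, t(err.by.fold), type = "l", lty = 1, col = "gray", log = "x",
        xlab = "Number of genes", ylab = "Error rate")
lines(sizes, colMeans(err.by.fold), lwd = 2)       # cross-validated error curve
abline(h = 1 - max(table(y)) / length(y),          # always bet on the most frequent class
       lty = 3, col = "blue")
abline(h = final.cv.error, lty = 3, col = "red")   # 10-fold CV estimate of the final rule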
Let's show an example:
Confusion Matrix:
              N    T    TotalError   RelativeErrorPerClass
Observed N    4    3    0.06         0.42857143
Observed T    1    42   0.02         0.02325581
The overall error rate is obtained as the number of misclassifications relative to the total number of predictions. The total number of misclassifications is 3 + 1 = 4, and the total number of predictions is 50. As said before, all of these are "out-of-bag" predictions. So the "Error rate" here is 4/50 = 0.08.
The "TotalError" column entries are 3/50 = 0.06 and 1/50 = 0.02, when the true, observed, classes are "N" and "T" respectively. Sure enough, the "Error rate" = 0.08 = 0.06 + 0.02.
The "RelativeErrorPerClass" give you a view of how well it is doing relative to the size of the class. It is 3/7 = 0.429 when the true class is "N" and 1/43 = 0.023 when the true class is "T". In other words, the overall error rate of the predictor is not bad (0.08), but the error rate for those of class "N" is not good (almost 50%); how can these two facts be reconciled? Notice that the size of class "N" is very small compared to that of "T" (7 vs. 43). And, as you can see, the predictor in this case can do a good job by emphasizing good predictions for the large class ("T").
The above comments are something to pay attention to with very unbalanced classes. An extreme example: if we have a situation where 99% of the cases are class "A" and 1% of the cases are of class "B", we can achieve a very low prediction error (1%) if we always assign every case to class "A".
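To make the arithmetic explicit, here is a tiny R check that reproduces the numbers in the example above (the object names are ours, not part of the Tnasas output):

cm <- matrix(c(4, 3,
               1, 42),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Observed N", "Observed T"), c("N", "T")))

## overall error rate: misclassifications over all predictions
(cm["Observed N", "T"] + cm["Observed T", "N"]) / sum(cm)      # 4/50 = 0.08

## relative error per class: misclassifications over the size of each class
cm["Observed N", "T"] / sum(cm["Observed N", ])                # 3/7  = 0.429
cm["Observed T", "N"] / sum(cm["Observed T", ])                # 1/43 = 0.023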
It is now possible to send the results to PaLS. PaLS "analyzes sets of lists of genes or single lists of genes. It filters those genes/clones/proteins that are referenced by a given percentage of PubMed references, Gene Ontology terms, KEGG pathways or Reactome pathways." (from PaLS's help). By sending your results to PaLS, it might be easier to make biological sense of them, because you are "annotating" your results with additional biological information.
Scroll to the bottom of the main output, where you will find the PaLS icon and the gene lists. When you click on any of the links, the corresponding list of genes will be sent to PaLS. There, you can configure the options as you want (please consult PaLS's help for details) and then submit the list. In PaLS, you can always go back and keep playing with the very same gene list, modifying the options.
For individual genes, recall that the names in the tables are clickable, and display additional information from IDClight.
Examples of several runs, one with fully commented results, are available here.
This program was developed by Juan M. Vaquerizas and Ramón Díaz-Uriarte, from the Bioinformatics Unit at CNIO. This tool is, essentially, a web interface to a set of R functions, plus a small piece of C++ code (for the DLDA part), written by Ramón. Some of these functions themselves call functions in the packages e1071 (by E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel), class (by W. Venables and B. Ripley), pamr (by T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu), supclust (by M. Dettling and M. Maechler), multtest (by Y. Ge and S. Dudoit) and randomForest (by A. Liaw and M. Wiener, with Fortran code by L. Breiman and A. Cutler). Our set of functions will (soon) be converted into an R package and released under the GPL.
We want to thank all these authors for the great tools that they have made available for all to use. If you find this useful, and since R and Bioconductor are developed by a team of volunteers, we suggest you consider making a donation to the R foundation for statistical computing.
Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time. Anybody can access those directories; nevertheless, the directory names are not trivial, so it is not easy for a third party to access your data.
In any case, you should keep in mind that communications between the client (your computer) and the server are not encrypted at all, so it is also possible for somebody else to look at your data while you are uploading or downloading them.
This software is experimental in nature and is supplied "AS IS", without
obligation by the authors or the CNIO to provide accompanying services or
support. The entire risk as to the quality and performance of the software is
with you. The authors expressly disclaim any and all warranties regarding the
software, whether express or implied, including but not limited to warranties
pertaining to merchantability or fitness for a particular purpose.
Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99: 6562--6566.
Breiman L (2001) Random forests. Machine Learning 45: 5--32 (Tech. report).
Breiman L (2003) Manual--Setting Up, Using, And Understanding Random Forests V4.0.
Díaz-Uriarte R, Alvarez de Andrés, S (2005) Gene selection and classification of microarray data using random forest. In review. (tech. report.)
Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97: 77--87. (tech. report.)
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16: 906--914.
Harrell FE Jr (2001) Regression modeling strategies. New York: Springer.
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. New York: Springer.
Kohavi R, John GH (1998) The wrapper approach. In: Liu H, Motoda H (eds) Feature Selection for Knowledge Discovery and Data Mining. Kluwer, pp. 33--50 (reprint).
Lee Y, Lee CK (2003) Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19: 1132--1139.
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News, 2: 18--22.
Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98: 15149--15154.
Ripley BD (1996) Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Romualdi C, Campanaro S, Campagna D, Celegato B, Cannata N, et al. (2003) Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Hum Mol Genet 12: 823--836.
Simon R, Radmacher MD, Dobbin K, McShane LM (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 95: 14--18.
Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 99: 6567--6572.