Gene ranking by bootstrapped P-values
SIGKDD Explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining
Recent research has shown that it is possible to find genes involved in the pathogenesis of a particular condition on the basis of microarray experiments. Genes which are differentially expressed, for example between healthy and diseased tissues, are likely to be relevant to the disease under study. Some properties of microarray datasets make finding these genes challenging. This paper proposes a gene-ranking algorithm whose main novelty is the use of bootstrapped P-values. We present an analysis of the algorithm, showing how it takes account of small-sample variability in observed values of the test statistic in a way that conventional statistical tests cannot. Experimental results show that our algorithm outperforms the widely used two-sample t-test on challenging artificial data. Gene ranking is then performed on two well-known microarray datasets, with encouraging results. For example, a number of genes from one of the datasets, whose differential expression was subsequently confirmed by a more reliable biochemical analysis, are ranked higher by the bootstrapped algorithm than by the conventional t-test. This suggests that the proposed algorithm may be better able to exploit the limited data available to infer biologically useful information.
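The abstract does not spell out the procedure, so the following Python sketch shows only one plausible reading of gene ranking by bootstrapped P-values, not the authors' algorithm. It assumes that each group is resampled with replacement, that a Welch two-sample t statistic is computed on every resample, and that the median of the resulting P-values is used as the gene's score. The P-value itself uses a normal approximation rather than the exact t distribution, purely to keep the example dependency-free. All function names, the gene labels, and the toy expression values are hypothetical.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of expression values."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0.0:
        # Degenerate resample (all values identical within each group).
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def approx_p_value(t):
    """Two-sided P-value under a normal approximation (an assumption made
    for simplicity; an exact test would use the t distribution)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def bootstrapped_p(a, b, n_boot=200, rng=None):
    """Median P-value over n_boot bootstrap resamples of both groups.
    The median is an assumed summary; the abstract does not name one."""
    rng = rng or random.Random(0)
    p_values = []
    for _ in range(n_boot):
        resampled_a = [rng.choice(a) for _ in a]  # resample with replacement
        resampled_b = [rng.choice(b) for _ in b]
        p_values.append(approx_p_value(welch_t(resampled_a, resampled_b)))
    return statistics.median(p_values)


def rank_genes(healthy, diseased, n_boot=200, seed=0):
    """Return gene names ordered from smallest to largest bootstrapped P-value."""
    rng = random.Random(seed)
    scores = {g: bootstrapped_p(healthy[g], diseased[g], n_boot, rng) for g in healthy}
    return sorted(scores, key=scores.get)


# Toy data (hypothetical): one clearly shifted gene, one unchanged gene.
healthy = {
    "g_diff": [1.0, 1.2, 0.9, 1.1, 1.0, 0.95],
    "g_null": [1.0, 1.2, 0.9, 1.1, 1.0, 0.95],
}
diseased = {
    "g_diff": [3.0, 3.2, 2.9, 3.1, 3.3, 2.8],
    "g_null": [1.05, 0.9, 1.1, 1.0, 1.15, 0.95],
}
ranking = rank_genes(healthy, diseased)
```

Summarising a distribution of resampled P-values, rather than relying on the single P-value observed on the original sample, is one way a ranking could reflect small-sample variability in the test statistic; other summaries, such as a mean or an upper quantile, would fit the same skeleton.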
SNM gratefully acknowledges the support of the Biotechnology and Biological Sciences Research Council (BBSRC).
This is the final version of the article. Available from SIGKDD via the URL in this record.
Vol. 5 (2), pp. 14-20