t-stat for feature selection

If you don’t need to worry about speed (and with 155 columns you probably don’t), you can use the t.test function and apply it to every column.

Simulate some data first:

set.seed(1)
DF <- data.frame(y=rep(1:2, 50), x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
head(DF)

  y         x1          x2         x3
1 1 -0.6264538 -0.62036668  0.4094018
2 2  0.1836433  0.04211587  1.6888733
3 1 -0.8356286 -0.91092165  1.5865884
4 2  1.5952808  0.15802877 -0.3309078
5 1  0.3295078 -0.65458464 -2.2852355
6 2 -0.8204684  1.76728727  2.4976616

Then we can apply the t.test function to all but the first column using the formula argument.

group <- DF$y
lapply(DF[,-1], function(x) { t.test(x ~ group)$statistic })

which returns a list containing the test statistic for each column.
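For feature selection it can be handier to have the statistics in a named vector and to rank the columns by absolute value. The following is only a sketch on the simulated data from above; the names tstats and ranked are illustrative, and ranking by |t| is one possible selection rule rather than the only one.

```r
# Recreate the simulated data so this snippet runs on its own
set.seed(1)
DF <- data.frame(y = rep(1:2, 50), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
group <- DF$y

# sapply simplifies the per-column results into one numeric vector;
# unname() drops the inner "t" label so the vector is named by column
tstats <- sapply(DF[, -1], function(x) unname(t.test(x ~ group)$statistic))

# Rank the columns by the magnitude of the statistic, largest first
ranked <- sort(abs(tstats), decreasing = TRUE)
ranked
```

The columns at the top of ranked are the ones whose group means differ most relative to their noise. On purely random data like this, none of them should stand out.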

t.test computes a lot of extra information that you don’t need, so you could speed this up substantially by computing the statistic directly, but with this few columns it really isn’t necessary.
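For reference, a direct computation might look like the sketch below. It assumes exactly two groups and the Welch (unequal variance) statistic, which is what t.test uses by default. The helper name welch_t is hypothetical and not part of any package.

```r
# Recreate the simulated data so this snippet runs on its own
set.seed(1)
DF <- data.frame(y = rep(1:2, 50), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
group <- DF$y

# Welch t-statistic for every column at once (hypothetical helper)
welch_t <- function(X, g) {
  X <- as.matrix(X)
  lev <- sort(unique(g))
  stopifnot(length(lev) == 2)            # only two-group comparisons
  a <- X[g == lev[1], , drop = FALSE]    # rows in the first group
  b <- X[g == lev[2], , drop = FALSE]    # rows in the second group
  # standard error of the difference in means, without pooling variances
  se <- sqrt(apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
  (colMeans(a) - colMeans(b)) / se
}

fast <- welch_t(DF[, -1], group)

# Compare against t.test; the two should agree up to floating-point error
slow <- sapply(DF[, -1], function(x) unname(t.test(x ~ group)$statistic))
all.equal(fast, slow)
```

This skips the degrees of freedom, p-value and confidence interval that t.test also works out, which is where the time saving would come from on wider data.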
