Chapter 9 Basic Machine Learning
Note: I have rephrased some parts of the questions for clarity. These changes are bolded. Due to the random numbers, the exact values of the answers, despite the same seeds, might differ. So please be mindful of that.
First, upload necessary package(s).
library(dplyr) # uplaods the function filter() and %>%
library(rafalib) # important for plotting with base R
library(genefilter) # rowttests
#library(devtools) # allows download from github
library(GSE5859Subset) # subset of gene expression data
library(tissuesGeneExpression)
data(tissuesGeneExpression)
library(RColorBrewer)
library(caret) # for createFold
library(gplots) # for heatmap
library(class) # for knn
library(matrixStats)
9.2 Exercises
Question 1
Create a random matrix with no correlation in the following way:
set.seed(1)
= 10000
m = 24
n = matrix(rnorm(m*n),m,n)
x colnames(x)=1:n
Run hierarchical clustering on this data with the hclust function with default parameters to cluster the columns. Create a dendrogram. In the dendrogram, which pairs of samples are the furthest away from each other?
<- dist(t(x))
d <- hclust(d)
hc mypar()
plot(hc)
#7 and 23 - 141
# 19 and 14 - 143
# 1 and 16 - 142
# 17 and 18 - 142
The answer is B: 19 and 14. The answer might be different due to the random numbers.
Question 2
Set the seed at 1, set.seed(1) and replicate the creation of this matrix 100 times:
= 10000
m = 24
n = matrix(rnorm(m*n),m,n) x
then perform hierarchical clustering as in the solution to Question 1, and find the number of clusters if you use cuttree at height 143. This number is a random variable. Based on the Monte Carlo simulation, what is the standard error of this random variable?
set.seed(1)
<- replicate(100, {
res_list = 10000
m = 24
n = matrix(rnorm(m*n),m,n)
x <- dist(t(x))
d <- hclust(d)
hc <- cutree(hc, h=143)
hclusters <- length(unique(hclusters))
num_clus return(num_clus)
})popsd(res_list)
## [1] 0.8986657
Question 3
Run kmeans
with 4 centers for the blood RNA data:
library(GSE5859Subset)
data(GSE5859Subset)
Set the seed to 10, set.seed(10)
right before running kmeans
with 5 centers. Explore the relationship of clusters and information in sampleInfo. Which of the following best describes what you find?
<- kmeans(t(geneExpression), centers = 5)
km $cluster km
## GSM136508.CEL.gz GSM136530.CEL.gz GSM136517.CEL.gz GSM136576.CEL.gz
## 2 1 1 5
## GSM136566.CEL.gz GSM136574.CEL.gz GSM136575.CEL.gz GSM136569.CEL.gz
## 3 5 5 4
## GSM136568.CEL.gz GSM136559.CEL.gz GSM136565.CEL.gz GSM136573.CEL.gz
## 5 3 3 4
## GSM136523.CEL.gz GSM136509.CEL.gz GSM136727.CEL.gz GSM136510.CEL.gz
## 1 2 2 2
## GSM136515.CEL.gz GSM136522.CEL.gz GSM136507.CEL.gz GSM136524.CEL.gz
## 2 1 2 1
## GSM136514.CEL.gz GSM136563.CEL.gz GSM136564.CEL.gz GSM136572.CEL.gz
## 1 3 3 4
table(true = sampleInfo$group, cluster = km$cluster)
## cluster
## true 1 2 3 4 5
## 0 4 5 2 1 0
## 1 2 1 3 2 4
table(true = sampleInfo$date, cluster = km$cluster)
## cluster
## true 1 2 3 4 5
## 2005-06-10 0 1 0 0 0
## 2005-06-23 1 5 0 0 0
## 2005-06-27 5 0 0 0 0
## 2005-10-07 0 0 5 3 2
## 2005-10-28 0 0 0 0 2
The answer is C: Date is driving the clusters.
Question 4
Load the data:
library(GSE5859Subset)
data(GSE5859Subset)
Pick the 25 genes with the highest across sample variance. This function might help:
library(matrixStats)
Use heatmap.2
to make a heatmap showing the sampleInfo$group
with color, the date as labels, the rows labelled with chromosome, and scaling the rows. What do we learn from this heatmap?
<- colorRampPalette(brewer.pal(9, "GnBu"))(100)
hmcol = format( sampleInfo$date, "%m")
month <- rowVars(geneExpression)
rv <- order(-rv)[1:25]
idx <- palette(brewer.pal(8, "Dark2"))[as.fumeric(as.character(sampleInfo$group))]
cols
heatmap.2(geneExpression[idx,],
trace = 'none', labRow = geneAnnotation[idx,]$CHR,
col = hmcol, labCol = month,
ColSideColors = cols)
The correct answer is C: A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.
I thought this question was tricky because it was not straightforward for me to draw the heatmap with many inputs. First, the orange and green colors represent the sample group (0 or 1). This was established in the code cols <- palette(brewer.pal(8, "Dark2"))[as.fumeric(as.character(sampleInfo$group))]
which then acted as input for ColSideColors
in heatmap.2
function. labRow
refers to the labelling of row, which in this case is chromosome geneAnnotation[idx,]$CHR
. labCol
has nothing to do with color; it has to do with column labelling, which in this case is month
. col
represents color for the values of the data geneExpression[idx,]
.
Answer A is wrong because if the data were generated by norm
, the color distribution of the heatmap would be entirely random. Answer B is wrong because the colors in the row chr1
are more or less the same except for one column (one sample). Answer D is wrong chrY
genes are higher in June, not October.
Question 5
Create a large dataset of random data that is completely independent of sampleInfo$group
like this:
set.seed(17)
= nrow(geneExpression)
m = ncol(geneExpression)
n = matrix(rnorm(m*n),m,n)
x = factor(sampleInfo$g) g
Create two heatmaps with these data. Show the group g either with labels or colors. First, take the 50 genes with smallest p-values obtained with rowttests
. Then, take the 50 genes with largest standard deviations. Which of the following statements is true?
# p-value
<- rowttests(x, g)$p.value
pvals <- order(pvals)[1:50]
idx <- palette(brewer.pal(8, "Dark2"))[as.fumeric(as.character(sampleInfo$g))]
cols heatmap.2(x[idx,],
trace = 'none', labRow = geneAnnotation[idx,]$CHR,
col = hmcol, labCol = month,
ColSideColors = cols)
# std dev
<- genefilter::rowSds(x,g)
sds <- order(-sds)[1:50]
idx <- palette(brewer.pal(8, "Dark2"))[as.fumeric(as.character(sampleInfo$g))]
cols heatmap.2(x[idx,],
trace = 'none', labRow = geneAnnotation[idx,]$CHR,
col = hmcol, labCol = month,
ColSideColors = cols)
The answer is A: There is no relationship between g and x, but with 8,793 tests some will appear significant by chance. Selecting genes with the t-test gives us a deceiving result.
Recall that we have already selected smallest p-values from a dataset in which the null hypothesis is true. Therefore, we can see clusters that indicate that there is a significant difference between sample groups. However, this significance is not real because we know that the null hypothesis is true.
9.4 Exercises
Question 1
Generate some random data to imitate heights for men (0) and women (1):
= 10000
n set.seed(1)
= rnorm(n,176,7) #height in centimeters
men = rnorm(n,162,7) #height in centimeters
women = c(rep(0,n),rep(1,n))
y = round(c(men,women))
x ##mix it up
= sample(seq(along=y))
ind = y[ind]
y = x[ind] x
Using the data generated above, what is the \(E(Y|X=176)\) (proportion of females)?
mean(y[x==176])
## [1] 0.1049475
Question 2
Now make a plot of \(E(Y|X=x)\) for x=seq(160,178)
using the data generated in Question 1.
If you are predicting female or male based on height and want your probability of success to be larger than 0.5, what is the largest height where you predict female?
mypar()
plot(x,y)
<- seq(160,178)
x_list <- vector('double', length(x_list))
res for (i in seq_along(x_list)) {
<- mean(y[x==x_list[[i]]])
res[[i]]
}<- max(which(res > 0.5))
ind # answer x_list[ind]
## [1] 168
mean(y[x==168])
## [1] 0.5787172
9.8 Exercises
Question 1
Generate the following data:
= 10000
n set.seed(1)
= rnorm(n,176,7) #height in centimeters
men = rnorm(n,162,7) #height in centimeters
women = c(rep(0,n),rep(1,n))
y = round(c(men,women))
x ##mix it up
= sample(seq(along=y))
ind = y[ind]
y = x[ind] x
Set the seed at 5, set.seed(5)
and take a random sample of 250 from:
set.seed(5)
= 250
N = sample(length(y),N)
ind = y[ind]
Y = x[ind] X
Use loess to estimate \(f(x) = E(Y|X=x)\) using the default parameters. What is the predicted \(f(168)\)?
<- loess(Y~X)
fit <- seq(min(X),max(X),len=45)
newx <- predict(fit, newdata=data.frame(X=newx))
hat mypar()
plot(X,Y)
names(hat) <- round(newx,1)
lines(newx,hat)
'168'] hat[
## 168
## 0.5480233
Question 2
The loess estimate above is a random variable. We can compute standard errors for it. Here we use Monte Carlo to demonstrate that it is a random variable. Use Monte Carlo simulation to estimate the standard error of your estimate of \(f(168)\).
Set the seed to 5, set.seed(5)
and perform 10,000 simulations and report the SE of the loess based estimate.
set.seed(5)
<- 10000
B <- 250
N <- seq(min(X),max(X),len=45)
newx <- replicate(B, {
res = sample(length(y),N)
ind = y[ind]
Y = x[ind]
X <- loess(Y~X)
fit <- predict(fit, newdata=data.frame(X=newx))
hat names(hat) <- round(newx,1)
return(hat['168'])
})names(res) <- NULL
popsd(res)
## [1] 0.05618195
9.11 Exercises
Load the following dataset:
library(GSE5859Subset)
data(GSE5859Subset)
And define the outcome and predictors. To make the problem more dificult, we will only consider autosomal genes:
= factor(sampleInfo$group)
y = t(geneExpression)
X = which(geneAnnotation$CHR%in%c("chrX","chrY"))
out = X[,-out] X
Question 1
Use the createFold
function in the caret
package, set the seed to 1 set.seed(1)
and create 10 folds of y
. What are the 2nd entry in the fold 3?
library(caret)
set.seed(1)
<- createFolds(y, k = 10)
idx 3]][2] idx[[
## [1] 15
Question 2
We are going to use kNN. We are going to consider a smaller set of predictors by using filtering gene using t-tests. Specifically, we will perform a t-test and select the \(m\) genes with the smallest p-values.
Let m = 8 and k = 5 and train kNN by leaving out the second fold idx[[2]]
. How many mistakes do we make on the test set? Remember it is indispensable that you perform
the t-test on the training data. Use all 10 folds, keep k = 5. Hint: be careful about indexing.
<- 8 # number of genes
m <- rowttests(t(X[-idx[[2]],]),y[-idx[[2]]])$p.value
pvals <- order(pvals)[1:m]
ind
<- knn(train = X[-idx[[2]],ind],
pred test = X[idx[[2]],ind],
cl = y[-idx[[2]]], k = 5)
sum(pred != y[idx[[2]]])
## [1] 1
When performing rowttests
it is important to exclude test data. Indexing becomes quite tricky because you need to separate training and test data.
Question 3
Now run through all 5 folds. What is our error rate (total number of errors / total predictions)?
<- length(idx)
n_fold <- vector('double', n_fold)
res <- 8
m for (i in seq(n_fold)) {
<- rowttests(t(X[-idx[[i]],]),y[-idx[[i]]])$p.value
pvals <- order(pvals)[1:m]
ind <- knn(train = X[-idx[[i]],ind],
pred test = X[idx[[i]],ind],
cl = y[-idx[[i]]], k = 5)
<- sum(pred != y[idx[[i]]])
res[[i]]
}sum(res)/length(y)
## [1] 0.375
Question 4
Now we are going to select the best values of k and m. Use the expand.grid
function to try out the following values:
=2^c(1:11)
ms=seq(1,9,2)
ks= expand.grid(k=ks,m=ms) params
Now use sapply
or a for
-loop to obtain error rates for each of these pairs of parameters. Which pair of parameters minimizes the error rate? Like previously, use the same p-values that you selected from the training data alone, and loop over 10 folds.
= 2^c(1:11)
ms = seq(1,9,2)
ks = expand.grid(k=ks, m=ms)
params <- length(idx)
n_fold = vector('double',nrow(params))
error_rate_avg for (j in seq(nrow(params))) {
for (i in seq(n_fold)) {
<- rowttests(t(X[-idx[[i]],]),y[-idx[[i]]])$p.value
pvals <- order(pvals)[1:params[j,][[2]]]
ind <- knn(train = X[-idx[[i]],ind],
pred test = X[idx[[i]],ind],
cl = y[-idx[[i]]], k = params[j,][[1]])
<- sum(pred != y[idx[[i]]])
res[[i]]
}<- sum(res)/length(y)
error_rate_avg[[j]]
}<- which(error_rate_avg == min(error_rate_avg))
ind # answer params[ind,]
## k m
## 47 3 1024
min(error_rate_avg) # minimum error rate
## [1] 0.2083333
Question 5
Repeat Question 4, but now perform the t-test filtering before the cross validation. Note how this biases the entire result and gives us much lower estimated error rates. What is the minimum error rate?
= 2^c(1:11)
ms = seq(1,9,2)
ks = expand.grid(k=ks, m=ms)
params <- length(idx)
n_fold = vector('double',nrow(params))
error_rate_avg for (j in seq(nrow(params))) {
for (i in seq(n_fold)) {
<- rowttests(t(X),y)$p.value
pvals <- order(pvals)[1:params[j,][[2]]]
ind <- knn(train = X[-idx[[i]],ind],
pred test = X[idx[[i]],ind],
cl = y[-idx[[i]]], k = params[j,][[1]])
<- sum(pred != y[idx[[i]]])
res[[i]]
}<- sum(res)/length(y)
error_rate_avg[[j]]
}min(error_rate_avg) # minimum error rate
## [1] 0.08333333
mean(error_rate_avg) # mean error rate
## [1] 0.1833333
The error rate is much lower than the one in Question 4 because we did not filter out p values from the test data in this case. This is not a correct practice. The practice shown in Question 4 is correct.
Question 6
Repeat Question 3, but now, instead of sampleInfo$group
, use
= factor(as.numeric(format( sampleInfo$date, "%m")=="06")) y
What is the minimum error rate now?
= 2^c(1:11)
ms = seq(1,9,2)
ks = expand.grid(k=ks, m=ms)
params <- length(idx)
n_fold = vector('double',nrow(params))
error_rate_avg for (j in seq(nrow(params))) {
for (i in seq(n_fold)) {
<- rowttests(t(X[-idx[[i]],]),y[-idx[[i]]])$p.value
pvals <- order(pvals)[1:params[j,][[2]]]
ind <- knn(train = X[-idx[[i]],ind],
pred test = X[idx[[i]],ind],
cl = y[-idx[[i]]], k = params[j,][[1]])
<- sum(pred != y[idx[[i]]])
res[[i]]
}<- sum(res)/length(y)
error_rate_avg[[j]]
}min(error_rate_avg) # minimum error rate
## [1] 0
mean(error_rate_avg) # mean error rate
## [1] 0.06060606