Chapter 10 Batch Effects
Note: I have rephrased some parts of the questions for clarity. These changes are bolded. Due to the random numbers, the exact values of the answers, despite the same seeds, might differ. So please be mindful of that.
First, upload necessary package(s).
library(dplyr) # uplaods the function filter() and %>%
library(rafalib) # important for plotting with base R
library(genefilter) # rowttests
library(Biobase) ##available from Bioconductor
library(genefilter)
library(GSE5859) ##available from github
data(GSE5859)
library(GSE5859Subset) # subset of gene expression data
library(tissuesGeneExpression)
data(tissuesGeneExpression)
library(qvalue)
library(RColorBrewer)
10.3 Exercises
Load the admissions data from the dagdata
package (which is available from the genomicsclass repository):
library(dagdata)
data(admissions)
Familiarize yourself with this table: admissions
Question 1
Let’s compute the proportion of men who were accepted:
= which(admissions$Gender==1)
index = sum(admissions$Number[index] * admissions$Percent[index]/100)
accepted= sum(admissions$Number[index])
applied /applied accepted
## [1] 0.4451951
What is the proportion of women that were accepted?
<- admissions$Gender == 1
index <- admissions[index,]
men <- admissions[!index,]
women <- sum(men$Percent/100 * men$Number)
menYes <- sum((1 - men$Percent/100) * men$Number)
menNo <- sum(women$Percent/100 * women$Number)
womenYes <- sum((1-women$Percent/100) * women$Number)
womenNo
<- matrix(c(menYes,womenYes,menNo,womenNo),2,2)
tab
print(womenYes/(womenYes+womenNo))
## [1] 0.3033351
Question 2
Now that we have observed different acceptance rates between genders, test for the significance of this result. If you perform an independence test, what is the p-value?
chisq.test(tab)$p.value
## [1] 9.139492e-22
Question 3
We can quantify how “hard” a major is by using the percent of students that were accepted. Compute the percent that were accepted (regardless of gender) to each major and call this vector H
. Which is the hardest major?
<- admissions %>% mutate(accepted = Number * Percent/100)
admissions1 <- cbind(admissions1[1:6,c(1,2,5)], admissions1[7:12,c(2,5)])
admissions2 colnames(admissions2) <- c('major','number_m','accepted_m',
'number_f', 'accepted_f')
<- admissions2 %>% group_by(major) %>%
admission3 transmute(total_number= number_m+number_f,
total_accepted= accepted_m + accepted_f,
proportion = total_accepted/total_number)
<- admission3 %>% dplyr::select(major,proportion)
H H
## # A tibble: 6 x 2
## # Groups: major [6]
## major proportion
## <fct> <dbl>
## 1 A 0.643
## 2 B 0.632
## 3 C 0.351
## 4 D 0.339
## 5 E 0.253
## 6 F 0.0648
Major F is the hardest major.
Question 4
What proportion is accepted for this major?
$major == 'F', ] H[H
## # A tibble: 1 x 2
## # Groups: major [1]
## major proportion
## <fct> <dbl>
## 1 F 0.0648
Question 5
For men, what is the correlation between the number of applications across majors and H
?
cor.test(men$Number,H$proportion)$estimate
## cor
## 0.7647567
Question 6
For women, what is the correlation between the number of applications across majors and H
?
cor.test(women$Number,H$proportion)$estimate
## cor
## -0.6743393
Question 7
Given the answers to the above, which best explains the differences in admission percentages when we combine majors?
The answer is C: There is confounding between gender and preference for “hard” majors: females are more likely to apply to harder majors.
10.6 Exercises
library(Biobase)
library(GSE5859)
data(GSE5859)
= exprs(e)
geneExpression = pData(e) sampleInfo
Familiarize yourself with the sampleInfo
table. Note that some samples were processed at different times. This is an extraneous variable and should not affect the values in geneExpression
. However, as we have seen in previous analyses, it does appear to have an effect so we will explore this here. You can extract the year from each date like this:
= format(sampleInfo$date,"%y") year
Note that ethnic group and year is almost perfectly confounded:
table(year,sampleInfo$ethnicity)
##
## year ASN CEU HAN
## 02 0 32 0
## 03 0 54 0
## 04 0 13 0
## 05 80 3 0
## 06 2 0 24
Question 1
For how many of these years do we have more than one ethnicity represented?
table(year,sampleInfo$ethnicity)
##
## year ASN CEU HAN
## 02 0 32 0
## 03 0 54 0
## 04 0 13 0
## 05 80 3 0
## 06 2 0 24
Two of the years have more than one ethnicity represented.
Question 2
Repeat the above exercise, but now, instead of year, consider the month as well. Specifically, instead of the year variable defined above use:
= format(sampleInfo$date,"%m%y") month.year
For what proportion of these month.year
values do we have more than one ethnicity represented?
= format(sampleInfo$date,"%m%y")
month.year <- table(month.year, sampleInfo$ethnicity)
tab <- vector('double', nrow(tab))
res for (i in seq_along(res)) {
<- length(unique(tab[i,]))
res[[i]]
}mean(res==3)
## [1] 0.04761905
Question 3
Perform a t-test (use rowttests
) comparing CEU samples processed in 2002 to those processed in 2003. Then use the qvalue
package to obtain q-values for each gene. How many genes have q-values < 0.05?
<- sampleInfo$ethnicity
eth <- which(eth == 'CEU' & year %in% c('02','03'))
ind <- rowttests(geneExpression[,ind], factor(year[ind]))$p.value
pvals
<- qvalue(pvals)$qvalues
qvals sum(qvals < 0.05)
## [1] 4308
Question 4
What is the estimate of pi0
provided by qvalue
?
qvalue(pvals)$pi0
## [1] 0.3628642
Question 5
Now perform a t-test (use rowttests
) comparing CEU samples processed in 2003 to those processed in 2004. Then use the qvalue
package to obtain q-values for each gene. How many genes have q-values less than 0.05?
<- which(eth == 'CEU' & year %in% c('03','04'))
ind <- rowttests(geneExpression[,ind], factor(year[ind]))$p.value
pvals <- qvalue(pvals)$qvalues
qvals sum(qvals < 0.05)
## [1] 2463
Question 6
Now we are going to compare ethnicities as was done in the original publication in which these data were first presented. Use the qvalue function to compare the ASN population to the CEU population. Once again, use the qvalue function to obtain q-values. How many genes have q-values < 0.05?
<- which(eth %in% c('ASN','CEU'))
ind <- rowttests(geneExpression[,ind], factor(eth[ind]))$p.value
pvals <- qvalue(pvals)$qvalues
qvals sum(qvals < 0.05)
## [1] 7217
Question 7
Over 80% of genes are called differentially expressed between ethnic groups. However, due to the confounding with processing date, we need to confirm these differences are actually due to ethnicity.This will not be easy due to the almost perfect confounding. However, above we noted that two groups were represented in 2005. Just like we stratified by majors to remove the “major effect” in our admissions example, here we can stratify by year and perform a t-test comparing ASN and CEU, but only for samples processed in 2005. How many genes have q-values < 0.05?
<- which(eth %in% c('ASN','CEU')& year == '05')
ind <- rowttests(geneExpression[,ind], factor(eth[ind]))$p.value
pvals <- qvalue(pvals)$qvalues
qvals sum(qvals < 0.05)
## [1] 560
table(sampleInfo$ethnicity[ind])
##
## ASN CEU HAN
## 80 3 0
Question 8
To provide a more balanced comparison, we repeat the analysis, but now taking 3 random CEU samples from 2002. Repeat the analysis above, but comparing the ASN from 2005 to three random CEU samples from 2002. Set the seed at 3,
set.seed(3)
. How many genes have q-values < 0.05?
set.seed(3)
<- which(eth =='CEU' & year == '02')
ind_ceu <- sample(ind_ceu,3)
ind_ceu2 <- which(eth == 'ASN' & year =='05')
ind_asn <- c(ind_ceu2,ind_asn)
ind
<- rowttests(geneExpression[,ind], factor(eth[ind]))$p.value
pvals <- qvalue(pvals)$qvalues
qvals sum(qvals < 0.05)
## [1] 3695
10.9 Exercises
library(GSE5859Subset)
data(GSE5859Subset)
= sampleInfo$group
sex = factor( format(sampleInfo$date,"%m"))
month table( sampleInfo$group, month)
## month
## 06 10
## 0 9 3
## 1 3 9
Question 1
Using the functions rowttests
and qvalue
compare the two groups. Because this is a smaller dataset which decreases our power, we will use the more lenient FDR
cut-off of 10%. How many gene have q-values less than 0.1?
<- rowttests(geneExpression, factor(sex))$p.val
pvals <- qvalue(pvals)$qvalues
qvals sum(qvals<0.1)
## [1] 59
Question 2
Note that sampleInfo$group
here presents males and females. Thus, we expect differences to be in on chrY and, for genes that escape inactivation, chrX. We do not expect many autosomal genes to be different between males and females. This gives us an opportunity to evaluate false and true positives with experimental data. For example, we evaluate results using the proportion genes of the list that
are on chrX or chrY.
For the list calculated in Question 1, what proportion of this list is on chrX or chrY?
<- (qvals<0.1) # qvalue index for significance
ind_qval <- geneAnnotation$CHR
chr <- (chr[ind_qval] %in% c('chrX','chrY'))
ind_xy
sum(ind_xy)/sum(ind_qval)
## [1] 0.3389831
Question 3
We can also check how many of the chromosomes X and Y genes we detected as different. How many are on Y?
<- which(chr[ind_qval] %in% c('chrX','chrY'))
chr_ind_xy length(chr[ind_qval][chr_ind_xy] == 'chrY')
## [1] 20
Question 4
Now for the autosomal genes (not on chrX and chrY) for which q-value < 0.1, perform a t-test comparing samples processed in June to those processed in October. What proportion of these have p-values <0.05? Hint: Be careful about indexing.
<- rowttests(geneExpression, factor(sex))$p.val
pvals <- qvalue(pvals)$qvalues
qvals <- which(qvals < 0.1)
ind
<- geneExpression[ind[!chr[ind] %in% c('chrX','chrY')],]
gene_dat_non_xy <- rowttests(gene_dat_non_xy, factor(month))$p.val
pvals mean(pvals < 0.05)
## [1] 0.8717949
Question 5
The above result shows that the great majority of the autosomal genes show differences due to processing data. This provides further evidence that confounding is resulting in false positives. So we are going to try to model the month effect to better estimate the sex effect. We are going to use a linear model. Which of the following creates the appropriate design matrix?
<- factor(month)
batch model.matrix(~sex+batch) # answer D
## (Intercept) sex batch10
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
## 7 1 1 1
## 8 1 1 1
## 9 1 1 1
## 10 1 1 1
## 11 1 1 1
## 12 1 1 1
## 13 1 0 0
## 14 1 0 0
## 15 1 0 0
## 16 1 0 0
## 17 1 0 0
## 18 1 0 0
## 19 1 0 0
## 20 1 0 0
## 21 1 0 0
## 22 1 0 1
## 23 1 0 1
## 24 1 0 1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$batch
## [1] "contr.treatment"
Question 6
Now use the X defined above, to fit a regression model using lm for each gene. You can obtain p-values for estimated parameters using summary
. Here is an example
= model.matrix(~sex+month)
X = 234
i = geneExpression[i,]
y = lm(y~X)
fit summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.33316105 0.06098459 153.0412990 1.714250e-33
## Xsex -0.04281057 0.09220004 -0.4643227 6.471942e-01
## Xmonth10 0.25352885 0.09220004 2.7497696 1.200505e-02
How many of the q-values for the group comparison are now <0.1? Note the big drop from what we obtained without the correction (Question 1).
<- model.matrix(~sex+month)
X
<- t(sapply(1:nrow(geneExpression), function(i) {
res <- lm(geneExpression[i,]~X)
fit return(summary(fit)$coef[2,c(1,4)])
}))<- data.frame(res)
res names(res) <- c('est','pvals')
<- qvalue(res$pvals)$qvalues
qvals sum(qvals <0.1)
## [1] 17
Question 7
With this new list, what proportion of these are chrX and chrY? Notice the big improvement.
<- (qvals<0.1)
ind_qval <- geneAnnotation$CHR
chr <- (chr[ind_qval] %in% c('chrX','chrY')) # index for chrX and chrY
ind_xy
sum(ind_xy)/sum(ind_qval)
## [1] 0.8823529
Question 8
How many on Y or X?
sum(ind_xy)
## [1] 15
Question 9
Now from the linear model above, extract the p-values related to the coeffcient representing the October versus June differences using the same linear model. How many of the q-values for the month comparison are now <0.1? This approach is basically the approach implemented by Combat.
<- model.matrix(~sex+month)
X
<- t(sapply(1:nrow(geneExpression), function(i) {
res <- lm(geneExpression[i,]~X)
fit return(summary(fit)$coef[3,c(1,4)])
}))
<- data.frame(res)
res names(res) <- c('est','pvals')
<- qvalue(res$pvals)$qvalues
qvals sum(qvals <0.1)
## [1] 3170
10.11 Exercises
Question 1
Suppose you want to make an MA-plot of the first two samples y = geneExpression[,1:2]
. Which of the following projections gives us the projection of y so that column 2 versus column 1 is an MA plot?
\[ \, A. y\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \\ \end{pmatrix} \\ B. y\begin{pmatrix} 1 & 1 \\ 1 & -1 \\ \end{pmatrix} \\ C. \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ \end{pmatrix} y \\ D. \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ \end{pmatrix} y^T \]
To be honest, according to my attempt, there’s no correct answer among the provided answer choices. Please let me know if I am incorrect (through github or email); I will make sure to put your name in one of the contributors to this project. Here’s my attempt.
= geneExpression[,1:2]
y = (y[,1]+y[,2])/2 # average
avg = z2 = (y[,1]-y[,2]) # difference
diff = cbind(avg,diff)
z head(z) # the actual values of average and difference
## avg diff
## 1007_s_at 6.472712 0.14248432
## 1053_at 7.405128 0.28316044
## 117_at 5.226584 0.35207594
## 121_at 7.800149 0.18478972
## 1255_g_at 3.232791 0.01997527
## 1294_at 7.311012 0.44148316
# Now let's try each answer and compare to head(z).
# answer A - avg = sqrt(2) x z[,1], diff = z[,2] / sqrt(2)
head(y %*% matrix(c(1,1,1,-1),2,2))
## [,1] [,2]
## 1007_s_at 12.945424 0.14248432
## 1053_at 14.810255 0.28316044
## 117_at 10.453168 0.35207594
## 121_at 15.600297 0.18478972
## 1255_g_at 6.465583 0.01997527
## 1294_at 14.622024 0.44148316
# answer B - avg = 2 x z[1,2], diff = z[,2]
head(y %*% matrix(c(1,1,1,-1),2,2))
## [,1] [,2]
## 1007_s_at 12.945424 0.14248432
## 1053_at 14.810255 0.28316044
## 117_at 10.453168 0.35207594
## 121_at 15.600297 0.18478972
## 1255_g_at 6.465583 0.01997527
## 1294_at 14.622024 0.44148316
# answer C - non-comfortable array, can't multiply
# answer D - avg = 2 x z[1,2], diff = z[,2]
head(t(matrix(c(1,1,1,-1),2,2) %*% t(y)))
## [,1] [,2]
## 1007_s_at 12.945424 0.14248432
## 1053_at 14.810255 0.28316044
## 117_at 10.453168 0.35207594
## 121_at 15.600297 0.18478972
## 1255_g_at 6.465583 0.01997527
## 1294_at 14.622024 0.44148316
So none of the choices are equal to head(z)
. Instead here is my answer that transforms y
correctly.
<- y %*% matrix(c(1/2,1/2,1,-1),2,2)
ans any(!near(z, ans)) # they are identical
## [1] FALSE
Therefore, my answer is:
\[ y\begin{pmatrix} 1/2 & 1 \\ 1/2 & -1 \\ \end{pmatrix} \\ \]
Question 2
Say \(Y\) is \(M\)x\(N\) in the \(SVD Y = UDV^T\) which of the following is not correct?
Answer is C: \(D\) are the coordinates of the projection \(U^TY\).
Question 3
Define
= geneExpression - rowMeans(geneExpression) y
Compute and plot an image of the correlation for each sample. Make two image plots of these correlations. In the first one, plot the correlation as image. In the second, order the samples by date and then plot an image of the correlation. The only difference in these plots is the order in which the samples are plotted. Based on these plots, which of the following would you say is true?
<- ncol(y)
n =cor(y)
cors=colorRampPalette(rev(brewer.pal(11,"RdBu")))(100)
cols
image(1:n, 1:n, cors, col=cols, xlab = 'samples', ylab = 'samples',
zlim = c(-1,1)) # correlation as image
<- order(sampleInfo$date)
ind
<- y[,ind]
y2 <- ncol(y)
n2 =cor(y2)
cors2image(1:n2, 1:n2, cors2, col=cols, xlab = 'samples', ylab = 'samples',
zlim = c(-1,1)) # correlation as image after sorting the date. some samples are more correlated within a span of date than others.
The answer is D: The fact that in the plot ordered by month we see two groups mainly driven by month, and within these we see subgroups driven by date, seems to suggest date more than month per se are the hidden factors.
Question 4
Based on the correlation plots above, we could argue that there are at least two hidden factors. Using PCA estimate these two factors. Specifically, apply the svd
to y
and use the first two PCs as estimates. Which command gives us these estimates?
<- svd(y)
s head(s$v[,1:2])
## [,1] [,2]
## [1,] -0.232970658 -0.01693543
## [2,] -0.034229085 -0.13955710
## [3,] -0.083264418 -0.19476886
## [4,] 0.306299241 -0.07357253
## [5,] 0.003861145 0.17918969
## [6,] 0.353302375 -0.09079626
The answer is B: pcs = svd(y)$v[,1:2]
.
Question 5
Plot each of the estimated factors ordered by the date. Use color to denote month. The first factor is clearly related to date. Which of the following appears to be most different according to this factor?
To be honest, I do not know what the question is actually asking for. I also think this is one of the most difficult questions in the book. It can possibly ask either of these two things: 1) which date appears to be most different in the function of PC1? 2) which dates are most different from each other in the function of PC1? To answer both of these questions, we need to visualize each date in a plot where the x-axis is PC1 and y-axis is PC2. My code below also displays the range of PC1 for each date blue.
<- svd(y)
s <- which(sampleInfo$date == '2005-06-10')
ind_0610 <- which(sampleInfo$date == '2005-06-23')
ind_0623 <- which(sampleInfo$date == '2005-06-27')
ind_0627 <- which(sampleInfo$date == '2005-10-07')
ind_1007 <- which(sampleInfo$date == '2005-10-28')
ind_1028
<- list(ind_0610, ind_0623, ind_0627, ind_1007, ind_1028)
index_list <- vector('double',length(index_list))
res_range mypar(2,3)
for (i in seq_along(index_list)) {
plot(s$v[,1][index_list[[i]]], s$v[,2][index_list[[i]]],
xlab = 'PC1', ylab = 'PC2', xlim = c(-0.4,.4),
ylim = c(-0.4,0.6),
main = paste0(unique(sampleInfo$date[index_list[[i]]])))
<- max(s$v[,1][index_list[[i]]])-min(s$v[,1][index_list[[i]]])
res_range[[i]] text(0.1,0.1, signif(res_range[[i]],2), pos=2,col = 'blue')
}
The plot shows data that are spread in the function of PC1. The blue numbers show the range of PC1 that each date’s data span. Oct 7th has a highest difference in the function of PC1, with a range of 0.38
. However, the range of Oct 28th is merely 0.075
. Also they do not seem to be much different from each other, primarily because Oct 7th covers a wide range of PC1. This rules out answer choice B.
June 10th and 23th seem to be not much different from each other. Also, the range of June 10th is 0
because it only has one point. This rules out answer choice C. Answer D is eliminated because there is no June 15th.
Answer choice A seems to be most appropriate because both June 23th and 27th data seem to be spread in the function of PC1. Also they do seem to be quite different from each other in terms of their location in the PC1 x-axis.
Question 6
Use the svd
function to obtain the principal components (PCs) for our detrended gene expression data y
. How many PCs explain more than 10% of the variability?
= geneExpression - rowMeans(geneExpression)
y <- svd(y)
s sum((s$d^2)/sum(s$d^2) > 0.1)
## [1] 2
Question 7
Which PC most correlates (negative or positive correlation) with month?
<- vector('double',ncol(s$v))
res <- format(sampleInfo$date,"%m")
month <- as.numeric(month)
month
for (i in seq_along(res)) {
<- cor(month, s$v[,i])
res[[i]]
}<- which(abs(res)==max(abs(res)))
ind ind
## [1] 1
Question 8
What is this correlation (in absolute value)?
abs(res[ind])
## [1] 0.8297915
Question 9
Which PC most correlates (negative or positive correlation) with sex?
<- vector('double',ncol(s$v))
res <- sampleInfo$group
sex
for (i in seq_along(res)) {
<- cor(sex, s$v[,i])
res[[i]]
}<- which(abs(res)==max(abs(res)))
ind ind
## [1] 1
Question 10
What is this correlation (in absolute value)?
abs(res[ind])
## [1] 0.6236858
Question 11
Now instead of using month, which we have shown does not quite describe the batch (Question 5), add the two estimated factors s$v[,1:2]
to the linear model we used
above. Apply this model to each gene and compute q-values for the sex difference. How many q-values < 0.1 for the sex comparison?
= geneExpression - rowMeans(geneExpression)
y <- svd(y)
s <- model.matrix(~sex + s$v[,1:2])
X <- sapply(1:nrow(y), function(i) {
res <- lm(y[i,]~X)
fit return(summary(fit)$coef[2,4])
})
<- qvalue(res)$qvalues
qvals sum(qvals < 0.1)
## [1] 14
Question 12
What proportion of the genes are on chromosomes X and Y?
<- geneAnnotation$CHR
chr <- qvals < 0.1
ind mean(chr[ind] %in% c('chrX','chrY'))
## [1] 1
10.13 Exercises
library(sva)
library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)
Question 1
In a previous section we estimated factors using PCA, but we noted that the first factor was correlated with our outcome of interest:
<- svd(geneExpression-rowMeans(geneExpression))
s cor(sampleInfo$group,s$v[,1])
## [1] 0.6236858
The svafit
function estimates factors, but downweighs the genes that appear to correlate with the outcome of interest. It also tries to estimate the number of factors and returns
the estimated factors like this:
= sampleInfo$group
sex = model.matrix(~sex)
mod = sva(geneExpression,mod) svafit
## Number of significant surrogate variables is: 5
## Iteration (out of 5 ):1 2 3 4 5
head(svafit$sv)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.26862626 -0.03838109 0.15306742 -0.3007374 0.210098158
## [2,] -0.06132157 -0.15755769 -0.03538763 0.0851655 -0.063869257
## [3,] -0.12161818 -0.21766433 -0.12624414 0.2443445 -0.099004174
## [4,] 0.30660574 -0.09657648 -0.32034135 -0.1680430 0.620260643
## [5,] -0.01850853 0.18648507 0.17931970 0.4244993 -0.007840835
## [6,] 0.36062840 -0.08542758 -0.10726746 0.1074114 0.033204590
The resulting estimated factors are not that different from the PCs.
for(i in 1:ncol(svafit$sv)){
print( cor(s$v[,i],svafit$sv[,i]) )
}
## [1] 0.9944755
## [1] 0.9964202
## [1] -0.9915972
## [1] -0.9896597
## [1] -0.9518427
Now fit a linear model to each gene that instead of month includes these factors in the model. Use the qvalue
function. How many genes have q-value < 0.1?
<- geneExpression - rowMeans(geneExpression)
y <- model.matrix(~sex + svafit$sv)
X
<- sapply(1:nrow(y), function(i) {
res <- lm(y[i,]~X)
fit return(summary(fit)$coef[2,4])
})
<- qvalue(res)$qvalues
qvals sum(qvals < 0.1)
## [1] 13
Question 2
How many of these genes are from chrY
or chrX
?
<- geneAnnotation$CHR
chr <- qvals < 0.1
ind sum(chr[ind] %in% c('chrX','chrY'))
## [1] 12