Chapter 7 Statistical Models
Note: I have rephrased some parts of the questions for clarity; these changes are bolded. Because of random number generation, the exact values of the answers may differ slightly even with the same seeds, so please be mindful of that.
First, load the necessary packages.
library(dplyr)   # provides filter() and %>%
library(rafalib) # mypar() and other helpers for plotting with base R
library(limma)
7.5 Exercises
Question 1
The probability of conceiving a girl is 0.49. What is the probability that a family with 4 children has 2 girls and 2 boys (you can assume that the outcomes are independent)?
p <- 0.49
dbinom(2,4,p=p)
## [1] 0.3747001
dbinom and pbinom are different: pbinom is cumulative, whereas dbinom gives the probability of an individual value. Here is a demonstration: pbinom(2,4,p) equals the sum of dbinom at 0, 1, and 2.
p <- 0.49
dbinom(2,4,p=p)
## [1] 0.3747001
pbinom(2,4,p=p)
## [1] 0.702348
dbinom(0,4,p=p) + dbinom(1,4,p=p) + dbinom(2,4,p=p)
## [1] 0.702348
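As a cross-check outside R, the same relationship can be sketched in Python using only the standard library (math.comb requires Python 3.8+); the function names dbinom and pbinom simply mirror the R calls.

```python
from math import comb

def dbinom(k, n, p):
    # P(X = k) for X ~ Binomial(n, p): an individual probability mass
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pbinom(k, n, p):
    # P(X <= k): cumulative, i.e. the sum of the individual masses
    return sum(dbinom(j, n, p) for j in range(k + 1))

p = 0.49
print(dbinom(2, 4, p))  # ~0.3747001
print(pbinom(2, 4, p))  # ~0.702348, the sum dbinom(0)+dbinom(1)+dbinom(2)
```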
Question 2
What is the probability that a family with 10 children has 6 girls and 4 boys (assume no twins)?
p <- 0.49
dbinom(6,10,p=p)
## [1] 0.1966421
Question 3
The genome has 3 billion bases. About 20% are C, 20% are G, 30% are T, and 30% are A. Suppose you take a random interval of 20 bases, what is the probability that the GC-content (proportion of Gs or Cs) is strictly above 0.5 in this interval?
p <- 0.4
pbinom(20,20,p) - pbinom(10,20,p)
## [1] 0.1275212
Question 4
The probability of winning the lottery is 1 in 175,223,510. If 20,000,000 people buy a ticket, what is the probability that more than one person wins?
p <- 1/175223510
N <- 20000000
1 - ppois(1,N*p)
## [1] 0.006038879
1 - pbinom(1,N,p)
## [1] 0.006038879
Since the Poisson distribution arises as the limit of the binomial when N is large and p is small, both distributions give essentially the same value here.
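To see this agreement numerically, here is a sketch in plain Python (standard library only) computing both tails with the probability stated in the question, 1 in 175,223,510:

```python
from math import exp

p = 1 / 175223510
N = 20000000
lam = N * p  # Poisson rate: expected number of winners

# Poisson tail: P(X > 1) = 1 - P(X = 0) - P(X = 1)
pois_tail = 1 - exp(-lam) * (1 + lam)

# Exact binomial tail, written directly to avoid summing 20 million terms
binom_tail = 1 - (1 - p)**N - N * p * (1 - p)**(N - 1)

print(pois_tail)                     # ~0.006038879
print(abs(pois_tail - binom_tail))   # the two agree to many digits
```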
Question 5
The genome has 3 billion bases. About 20% are C, 20% are G, 30% are T, and 30% are A. Suppose you take a random interval of 20 bases, what is the exact probability that the GC-content (proportion of Gs or Cs) is greater than 0.35 and smaller or equal to 0.45 in this interval?
p <- 0.4
pbinom(0.45*20, 20, p) - pbinom(0.35*20, 20, p)
## [1] 0.3394443
Question 6
For the question above, what is the normal approximation to the probability?
p <- 0.4
N <- 20
a <- (0.45*20 - N*p) / sqrt(N*p*(1-p))
b <- (0.35*20 - N*p) / sqrt(N*p*(1-p))
approx <- pnorm(a) - pnorm(b)
approx
## [1] 0.3519231
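The same calculation can be sketched in Python with the standard library, where the normal CDF comes from math.erf; the exact value from Question 5 is included for comparison.

```python
from math import comb, erf, sqrt

def pnorm(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def dbinom(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

N, p = 20, 0.4
sd = sqrt(N * p * (1 - p))
approx = pnorm((0.45*N - N*p) / sd) - pnorm((0.35*N - N*p) / sd)
exact = dbinom(8, N, p) + dbinom(9, N, p)  # counts in (7, 9], i.e. 8 or 9
print(approx)  # ~0.3519231
print(exact)   # ~0.3394443
```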
Question 7
Repeat Question 5, but using an interval of 1000 bases. What is the difference (in absolute value) between the normal approximation and the exact probability of the GC-content being greater than 0.35 and less than or equal to 0.45?
N <- 1000
p <- 0.4
a <- (0.45*N - N*p) / sqrt(N*p*(1-p))
b <- (0.35*N - N*p) / sqrt(N*p*(1-p))
approx <- pnorm(a) - pnorm(b)

exact <- pbinom(0.45*N,N,p) - pbinom(0.35*N,N,p)
exact - approx
## [1] 9.728752e-06
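The shrinking gap between exact and approximate can be reproduced in a standard-library Python sketch; at N = 1000 the binomial sum is still exact with Python's big integers.

```python
from math import comb, erf, sqrt

def pbinom(k, n, p):
    # exact cumulative binomial probability P(X <= k)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def pnorm(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

N, p = 1000, 0.4
sd = sqrt(N * p * (1 - p))
exact = pbinom(450, N, p) - pbinom(350, N, p)
approx = pnorm((450 - N * p) / sd) - pnorm((350 - N * p) / sd)
print(exact - approx)  # on the order of 1e-5: the approximation is now excellent
```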
Question 8
The Cs in our genomes can be methylated or unmethylated. Suppose we have a large (millions) group of cells in which a proportion p of the Cs of interest are methylated. We break up the DNA of these cells and randomly select pieces and end up with \(N\) pieces that contain the C we care about. This means that the probability of seeing \(k\) methylated Cs is binomial:
exact = dbinom(k,N,p)
We can approximate this with the normal distribution:
a <- (k+0.5 - N*p)/sqrt(N*p*(1-p))
b <- (k-0.5 - N*p)/sqrt(N*p*(1-p))
approx = pnorm(a) - pnorm(b)
Compute the difference approx - exact for:
N <- c(5,10,50,100,500)
p <- seq(0,1,0.25)
Compare the approximation and exact probability of the proportion of Cs being \(p\), \(k = 1,...,N-1\) plotting the exact versus the approximation for each \(p\) and \(N\) combination. Which statement is false?
Ns <- c(5,10,50,100,500)
Ps <- seq(0,1,0.25)
mypar(5,5)
for (i in seq_along(Ns)) {
  n <- Ns[[i]]
  k <- 1:(n-1) # k = 1,...,N-1
  for (j in seq_along(Ps)) {
    p <- Ps[[j]]
    exact = dbinom(k, n, p)
    a = (k+0.5 - n*p)/sqrt(n*p*(1-p))
    b = (k-0.5 - n*p)/sqrt(n*p*(1-p))
    approx = pnorm(a) - pnorm(b)
    qqplot(exact, approx, xlab='exact', ylab='approx',
           main = paste0('N=',n,' ','p=',p))
    abline(0,1)
  }
}
The answer is C: the statement "when N is 100 all approximations are spot on" is false. When p is close to 0 or 1, the normal approximation breaks down even at N = 100.
Question 9
We saw in the previous question that when p is very small, the normal approximation breaks down. If N is very large, then we can use the Poisson approximation.
Earlier we computed 1 or more people winning the lottery when the probability of winning was 1 in 175,223,510 and 20,000,000 people bought a ticket. Using the binomial, we can compute the probability of exactly two people winning to be:
N <- 20000000
p <- 1/175223510
dbinom(2,N,p)
## [1] 0.005811321
If we were to use the normal approximation, we would greatly underestimate this:
a <- (2+0.5 - N*p)/sqrt(N*p*(1-p))
b <- (2-0.5 - N*p)/sqrt(N*p*(1-p))
pnorm(a) - pnorm(b)
## [1] 2.04756e-05
To use the Poisson approximation here, use the rate \(\lambda = Np\) representing the number of people per 20,000,000 that win the lottery. Note how much better the approximation is:
dpois(2,N*p)
## [1] 0.005811321
In this case, it is practically the same because N is very large and Np is not 0. These are the assumptions needed for the Poisson to work. What is the Poisson approximation for more than one person winning?
1 - ppois(1,N*p)
## [1] 0.006038879
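The three-way comparison from this question (exact binomial, Poisson, and continuity-corrected normal, for exactly two winners) can be sketched in plain Python with the standard library:

```python
from math import comb, erf, exp, sqrt

def pnorm(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

N = 20000000
p = 1 / 175223510
lam = N * p

binom_2 = comb(N, 2) * p**2 * (1 - p)**(N - 2)  # exact, like dbinom(2,N,p)
pois_2 = exp(-lam) * lam**2 / 2                 # like dpois(2, N*p)
sd = sqrt(N * p * (1 - p))
normal_2 = pnorm((2.5 - lam) / sd) - pnorm((1.5 - lam) / sd)

print(binom_2)   # ~0.005811321
print(pois_2)    # ~0.005811321: practically identical
print(normal_2)  # ~2e-05: the normal approximation grossly underestimates
```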
Question 10
Write a function that takes \(\lambda\) and the vector of counts as input and returns the log-likelihood. Compute this log-likelihood for lambdas = seq(0,15,len=300) and make a plot. What value of lambdas maximizes the log-likelihood?
library(dagdata)
data(hcmv)
mypar()
plot(locations,rep(1,length(locations)),ylab="",yaxt="n")
breaks=seq(0,4000*round(max(locations)/4000),4000)
tmp=cut(locations,breaks)
counts=as.numeric(table(tmp))
hist(counts)
probs <- dpois(counts,4)
likelihood <- prod(probs)
likelihood
## [1] 1.177527e-62
logprobs <- dpois(counts,4,log=T)
loglikelihood <- sum(logprobs)
loglikelihood
## [1] -142.5969
loglike = function(lambdas) {
  logprobs <- dpois(counts,lambdas,log=T)
  loglikelihood <- sum(logprobs)
  return(loglikelihood)
}
lambdas <- seq(0,15,len=300)
log_res <- sapply(lambdas,loglike) # log-likelihood at each lambda
plot(lambdas, log_res, type="l", xlab="lambda", ylab="log-likelihood")
optim_lambda <- lambdas[which(log_res == max(log_res))]
Question 11
The point of collecting this dataset was to try to determine if there is a region of the genome that has a higher palindrome rate than expected. We can create a plot and see the counts per location:
library(dagdata)
data(hcmv)
breaks=seq(0,4000*round(max(locations)/4000),4000)
tmp=cut(locations,breaks)
counts=as.numeric(table(tmp))
binLocation=(breaks[-1]+breaks[-length(breaks)])/2
plot(binLocation,counts,type="l")
What is the center of the bin with the highest count?
binLocation[which(counts == max(counts))]
## [1] 94000
Question 12
What is the maximum count?
max(counts)
## [1] 14
Question 13
Once we have identified the location with the largest palindrome count, we want to know if we could see a value this big by chance. If X is a Poisson random variable with rate:
lambda = mean(counts[-which.max(counts)])
What is the probability of seeing a count of 14 or more?
1 - ppois(13, lambda)
## [1] 0.00069799
We use 1 - ppois(13, lambda) because the distribution is discrete: ppois(13, lambda) is the probability of a count of 13 or fewer, so subtracting it from 1 leaves exactly the probability of a count of 14 or more.
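The discrete-tail point can be illustrated in a short standard-library Python sketch; the rate of 5 below is purely hypothetical for illustration, since in the text lambda is estimated from the data.

```python
from math import exp, factorial

def dpois(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return exp(-lam) * lam**k / factorial(k)

def ppois(k, lam):
    # P(X <= k)
    return sum(dpois(j, lam) for j in range(k + 1))

lam = 5.0  # hypothetical rate, for illustration only
# P(X >= 14): we subtract the CDF at 13, not at 14, so that 14 itself is included
tail = 1 - ppois(13, lam)
print(tail)
```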
Question 14
So we obtain a p-value smaller than 0.001 for a count of 14. Why is it problematic to report this p-value as strong evidence of a location that is different?
The answer is B: We selected the highest region out of 57 and need to adjust for multiple testing. Answer A is wrong because there is nothing inherently wrong with using an approximation; we use the normal approximation in t-tests to get p-values, for example. Answer B is correct because the p-value was computed for the maximum of 57 counts rather than for a pre-specified location, so it must be corrected for multiple testing (e.g., with Bonferroni's procedure). Answer C is wrong because a p-value is itself a random variable, contrary to what that choice implies. Answer D is wrong because effect size is irrelevant here.
Question 15
Use the Bonferroni correction to determine the p-value cut-off that guarantees a FWER of 0.05. What is this p-value cutoff?
0.05/length(counts)
## [1] 0.000877193
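The union bound behind this cutoff can be written out in a few lines of Python; the 57 below is length(counts), the number of bins tested.

```python
n_tests = 57   # length(counts): the number of bins, i.e. tests performed
alpha = 0.05
cutoff = alpha / n_tests
print(cutoff)  # ~0.000877193

# Union bound: P(any of 57 true nulls rejected) <= 57 * cutoff = 0.05,
# so testing each bin at this cutoff guarantees FWER <= 0.05.
```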
Question 16
Create a qq-plot to see if our Poisson model is a good fit:
ps <- (seq(along=counts) - 0.5)/length(counts)
lambda <- mean( counts[ -which.max(counts)])
poisq <- qpois(ps,lambda)
poisq plot(poisq,sort(counts))
abline(0,1)
How would you characterize this qq-plot - A) Poisson is a terrible approximation. - B) Poisson is a very good approximation except for one point that we actually think is a region of interest. - C) There are too many 1s in the data. - D) A normal distribution provides a better approximation.
The answer is B. You can check whether the palindrome counts are well approximated by the normal distribution.
qqnorm(sort(counts))
Question 17
Load the data from the tissuesGeneExpression library:
library(tissuesGeneExpression)
data("tissuesGeneExpression")
Then select the columns related to endometrium:
library(genefilter)
y = e[,which(tissue=="endometrium")]
This will give you a matrix y with 15 samples. Compute the across sample variance for the first three samples. Then make a qq-plot to see if the data follow a normal distribution. Which of the following is true? - A) With the exception of a handful of outliers, the data follow a normal distribution. - B) The variance does not follow a normal distribution, but taking the square root fixes this. - C) The normal distribution is not usable here: the left tail is over estimated and the right tail is underestimated. - D) The normal distribution fits the data almost perfectly.
vars <- rowVars(y[,1:3])
length(vars)
## [1] 22215
mypar(1,2)
qqnorm(sqrt(vars)) # choice B is false
qqnorm(vars) # choices A and D are false
The answer is C.
Question 18
Now fit an F-distribution with 14 degrees of freedom using the fitFDist function in the limma package. What are df2 and scale?
library(limma)
res <- fitFDist(vars,14)
res
## $scale
## [1] 0.01139807
##
## $df2
## [1] 1.217793
Question 19
Now create a qq-plot of the observed sample variances versus the F-distribution quantiles. Which of the following best describes the qq-plot?
pf <- (seq(along=vars)-0.5)/length(vars)
theory <- qf(pf,14,res$df2) # theoretical quantiles from the F distribution
mypar(1,2)
qqplot(theory, sort(vars), xlab = 'theory', ylab = 'obs') # F approximation vs variances from the data
qqnorm(sort(vars)) # normal approximation vs variances from the data
The answer is D: if we exclude the highest 0.1% of the data, the F-distribution provides a good fit. Actually, I do not think this answer is entirely correct, but it is the most appropriate choice: even after excluding the top 0.1%, there are still more outliers to remove. Here is a demonstration.
vars_sort <- sort(vars)
vars_excl <- vars_sort[1:18000] # 18000 out of 22215 kept = 81%
pf_excl <- (seq(along=vars_excl)-0.5)/length(vars_excl)
theory <- qf(pf_excl,14,res$df2)

mypar(2,2)
qqplot(theory, vars_excl, xlab = 'theory', ylab = 'obs',
       main = 'all data compared with F-approximation')
qqplot(vars_excl, theory, xlim = c(0,0.2),
       ylim = c(0,100), main = 'zoomed in')
qqnorm(vars_excl, main = 'normal qqplot') # normal approximation with filtered variances (81% kept)
Even when I keep only 81% of the values, the F-distribution still shows departures in the upper tail.
7.7 Exercises
Question 1
A test for cystic fibrosis has an accuracy of 99%. Specifically, we mean that: \[ \begin{align*} Prob(+|D) = 0.99, Prob(-|no\ D) = 0.99 \end{align*} \] The cystic fibrosis rate in the general population is 1 in 3,900, \(Prob(D) = 0.00025\). If we select a random person and they test positive, what is the probability that they have cystic fibrosis, \(Prob(D|+)\)? Hint: use Bayes' rule.
(0.99*0.00025)/(0.99*0.00025 + 0.01*(1-0.00025))
## [1] 0.02415813
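The same Bayes-rule arithmetic, written out with named quantities in a short Python sketch:

```python
sens = 0.99       # Prob(+|D): test sensitivity
spec = 0.99       # Prob(-|no D): test specificity
prior = 0.00025   # Prob(D): disease rate in the population

# Bayes rule: Prob(D|+) = Prob(+|D) Prob(D) / Prob(+)
posterior = sens * prior / (sens * prior + (1 - spec) * (1 - prior))
print(posterior)  # ~0.024: despite a 99% accurate test, most positives are false
```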
Question 2
First download some baseball statistics.
tmpfile <- tempfile()
tmpdir <- tempdir()
download.file("http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip",tmpfile)
##this shows us the files
filenames <- unzip(tmpfile,list=TRUE)
players <- read.csv(unzip(tmpfile,files="Batting.csv",exdir=tmpdir),as.is=TRUE)
unlink(tmpdir)
file.remove(tmpfile)
## [1] TRUE
We will use the dplyr package to obtain data from 2010, 2011, and 2012 with more than 500 at bats (AB > 500).
dat <- filter(players,yearID>=2010, yearID <=2012) %>% mutate(AVG=H/AB) %>% filter(AB>500)
What is the average of these batting averages?
mean(dat$AVG,na.rm=T)
## [1] 0.2753465
Question 3
What is the standard deviation of these batting averages?
sd(dat$AVG,na.rm=T)
## [1] 0.02741713
Question 4
Use exploratory data analysis to decide which of the following distributions approximates our AVG:
mypar(1,2)
hist(dat$AVG)
qqnorm(dat$AVG)
The answer is A: Normal.
Question 5
It is April and after 20 at bats, Jos Iglesias is batting .450 (which is very good). We can think of this as a binomial distribution with 20 trials, with probability of success p. Our sample estimate of p is .450. What is our estimate of standard deviation? Hint: This is the sum that is binomial divided by 20.
p <- 0.45
n <- 20
sqrt(p*(1-p)/n)
## [1] 0.111243
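A Python sketch of the hint: the observed average is a binomial sum divided by n, so its standard deviation is sqrt(p(1-p)/n).

```python
from math import sqrt

p_hat = 0.45  # observed batting average, treated as a binomial proportion
n = 20        # number of at bats

# SD of the average of n Bernoulli(p) trials: sqrt(p(1-p)/n)
se = sqrt(p_hat * (1 - p_hat) / n)
print(se)  # ~0.111243
```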
Question 6
The Binomial is approximated by the normal, so our sampling distribution is approximately normal with mean \(Y\) = 0.45 and SD \(\sigma\) = 0.11. Earlier we used a baseball database to determine that our prior distribution is normal with mean \(\mu\) = 0.275 and SD \(\tau\) = 0.027. We also saw that the posterior mean prediction is \(B\mu + (1-B)Y\), where \(B = \sigma^2/(\sigma^2+\tau^2)\).
What is your Bayes prediction for the batting average going forward?
B <- (0.11^2)/(0.11^2 + 0.027^2)
post_u <- 0.275*B + (1-B)*0.45
post_u
## [1] 0.2849443
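The shrinkage formula can be cross-checked in a few lines of Python; because the sampling SD is much larger than the prior SD, the prediction is pulled heavily toward the league average.

```python
sigma = 0.11   # SD of the sampling distribution of the observed average
tau = 0.027    # SD of the prior distribution of true averages
Y, mu = 0.45, 0.275

B = sigma**2 / (sigma**2 + tau**2)  # weight placed on the prior mean
posterior_mean = B * mu + (1 - B) * Y
print(B)               # ~0.94: sigma >> tau, so we shrink heavily toward mu
print(posterior_mean)  # ~0.285
```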
7.9 Exercises
library(Biobase)
library(SpikeInSubset)
data(rma95)
y <- exprs(rma95)
pData(rma95)
## 37777_at 684_at 1597_at 38734_at 39058_at 36311_at
## 1521a99hpp_av06 0.00 0.25 0.5 1 2 4
## 1532a99hpp_av04 0.00 0.25 0.5 1 2 4
## 2353a99hpp_av08 0.00 0.25 0.5 1 2 4
## 1521b99hpp_av06 0.25 0.50 1.0 2 4 8
## 1532b99hpp_av04 0.25 0.50 1.0 2 4 8
## 2353b99hpp_av08r 0.25 0.50 1.0 2 4 8
## 36889_at 1024_at 36202_at 36085_at 40322_at 407_at
## 1521a99hpp_av06 8 16 32 64 128 0.00
## 1532a99hpp_av04 8 16 32 64 128 0.00
## 2353a99hpp_av08 8 16 32 64 128 0.00
## 1521b99hpp_av06 16 32 64 128 256 0.25
## 1532b99hpp_av04 16 32 64 128 256 0.25
## 2353b99hpp_av08r 16 32 64 128 256 0.25
## 1091_at 1708_at 33818_at 546_at
## 1521a99hpp_av06 512 1024 256 32
## 1532a99hpp_av04 512 1024 256 32
## 2353a99hpp_av08 512 1024 256 32
## 1521b99hpp_av06 1024 0 512 64
## 1532b99hpp_av04 1024 0 512 64
## 2353b99hpp_av08r 1024 0 512 64
g <- factor(rep(0:1,each=3))
spike <- rownames(y) %in% colnames(pData(rma95))
Question 1
Only these 16 genes are differentially expressed, since the six samples differ only due to sampling (they all come from the same background pool of RNA). Perform a t-test on each gene using the rowttests function.
What proportion of genes with a p-value < 0.01 (no multiple comparison correction) are not part of the artificially added 16 (false positives)?
pval <- rowttests(y,g)$p.value
head(pval < 0.01)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
nrow(y[spike,])
## [1] 16
sum(pval[spike] < 0.01) # 11
## [1] 11
(nrow(y[pval < 0.01,]) - sum(pval[spike] < 0.01)) / nrow(y[pval < 0.01,])
## [1] 0.7608696
Question 2
Now compute the within group sample standard deviation for each gene (you can use group 1). Based on the p-value cut-off, split the genes into true positives, false positives, true negatives and false negatives. Create a boxplot comparing the sample SDs for each group. Which of the following best describes the boxplot?
calls <- pval < 0.01          # significance calls
sds_y <- rowSds(y[, g==0])    # within-group SD for each gene (group 1)
fp <- sds_y[calls & !spike]   # false positives
fn <- sds_y[!calls & spike]   # false negatives
tn <- sds_y[!calls & !spike]  # true negatives
tp <- sds_y[calls & spike]    # true positives
res <- list(tp,fp,tn,fn)
names(res) <- c('tp','fp','tn','fn')
boxplot(res, ylim = c(0,1))
The answer is D: The false positives have smaller standard deviation.
Question 3
In the previous two questions, we observed results consistent with the fact that the random variability associated with the sample standard deviation leads to t-statistics that are large by chance.
The sample standard deviation we use in the t-test is an estimate and with just a pair of triplicate samples, the variability associated with the denominator in the t-test can be large.
The following steps perform the basic limma analysis. We specify coef=2 because we are interested in the difference between groups, not the intercept. The eBayes step uses a hierarchical model that provides a new estimate of the gene-specific standard error.
library(limma)
fit <- lmFit(y, design=model.matrix(~ g))
colnames(coef(fit))
## [1] "(Intercept)" "g1"
fit <- eBayes(fit)
Here we compute the original, sample-based estimate and the new, hierarchical-model-based estimate of each gene's standard deviation:
sampleSD = fit$sigma
posteriorSD = sqrt(fit$s2.post)
Which best describes what the hierarchical model does?
hist(sampleSD, xlim = c(0,.5))
hist(posteriorSD, xlim = c(0,.5))
mean(sampleSD)
## [1] 0.1241865
The answer is A: it moves all estimates of the standard deviation closer to 0.12, the mean of the sample SDs.
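A sketch of the kind of shrinkage eBayes performs: limma's moderated variance is a weighted average of a prior variance and the sample variance (Smyth 2004). The values of d0 and s0 below are hypothetical, chosen only to illustrate the pull toward a common value near 0.12.

```python
d0 = 4.0         # hypothetical prior degrees of freedom
s0_sq = 0.12**2  # hypothetical prior variance (near the mean sample SD of 0.12)
d = 4.0          # residual df per gene: 6 samples - 2 fitted parameters

def posterior_var(s_sq):
    # weighted average of the prior variance and the gene's sample variance
    return (d0 * s0_sq + d * s_sq) / (d0 + d)

print(posterior_var(0.01**2))  # a tiny sample variance is pulled up toward 0.12^2
print(posterior_var(0.40**2))  # a large sample variance is pulled down toward 0.12^2
```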
Question 4
Use these new estimates of standard deviation in the denominator of the t-test and compute p-values. You can do it like this:
library(limma)
fit = lmFit(y, design=model.matrix(~ g))
fit = eBayes(fit)
##second coefficient relates to differences between the groups
pvals = fit$p.value[,2]
What proportion of genes with a p-value < 0.01 (no multiple comparison correction) are not part of the artificially added 16 (false positives)?
pvals <- fit$p.value[,2]
(nrow(y[pvals < 0.01,]) - sum(pvals[spike] < 0.01)) / nrow(y[pvals < 0.01,])
## [1] 0.6486486