library(tidyverse)
library(microbenchmark)
First, we read the data and see what kind of variables we have.
snpdata = readRDS(file = "gwas_allele_counts.julia.2008.rds")
head(snpdata)
## # A tibble: 6 x 5
## rsid chromosome chromosome_posit~ genome_position contingency_tab~
## <chr> <int> <int> <dbl> <list>
## 1 rs3934834 1 995669 995669 <dbl [2 x 2]>
## 2 rs3737728 1 1011278 1011278 <dbl [2 x 2]>
## 3 rs6687776 1 1020428 1020428 <dbl [2 x 2]>
## 4 rs9651273 1 1021403 1021403 <dbl [2 x 2]>
## 5 rs4970405 1 1038818 1038818 <dbl [2 x 2]>
## 6 rs12726255 1 1039813 1039813 <dbl [2 x 2]>
This dataset is described in the paper “Genome-wide association study of rheumatoid arthritis in the Spanish population: KLF12 as a risk locus for rheumatoid arthritis susceptibility” by Julia et al.
The first three columns are easy:
rsid = SNP id
chromosome = the name of the chromosome
chromosome_position = the position of the SNP within the chromosome
We have taken all the chromosomes and glued them together, from chromosome 1 to chromosome 22, and defined a new coordinate system that starts at 1 (chromosome 1) and ends at roughly 3 gigabases (chromosome 22). This position is stored in the column called "genome_position". Finally, we counted the alleles in cases and controls and stored each 2 x 2 contingency table in the list column called "contingency_table".
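The dataset does not include the code that built genome_position, but a minimal sketch that should approximately reproduce it (assuming each chromosome is offset by the summed lengths of all preceding ones, and using the largest observed SNP position as a stand-in for each chromosome's length) looks like this:
# approximate each chromosome's length by its largest observed SNP position
chrom_lengths = tapply(snpdata$chromosome_position, snpdata$chromosome, max)
# offset of each chromosome = total length of all preceding chromosomes
offsets = c(0, cumsum(as.numeric(chrom_lengths)))[seq_along(chrom_lengths)]
# per-SNP genome-wide coordinate (hypothetical name)
genome_position_sketch = snpdata$chromosome_position + offsets[snpdata$chromosome]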
table(snpdata$chromosome)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701
## 13 14 15 16 17 18 19 20 21 22
## 10470 9032 8085 8271 7639 9518 5394 7121 5029 4970
barplot(table(snpdata$chromosome), col = "deeppink4")
We can see that the number of SNPs tends to decrease with chromosome number. This is likely because the chromosomes are numbered roughly in order of decreasing length, and the longer a chromosome is, the more SNPs it tends to carry.
Now we will look at the position of SNPs in the genome in more detail.
hist(snpdata$chromosome_position, col = "deeppink4",
xlab = "Chromosome position",
main="Position of SNPs in the genome")
This tendency can also be explained by chromosome size. All chromosomes are at least a few tens of millions of bp long, but only the longest ones stretch to, say, \(1.5 \times 10^8\) bp and beyond.
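As a quick check (a sketch; it uses the largest observed SNP position per chromosome as a proxy for its length), we can count how many chromosomes extend past 150 Mb:
# how many chromosomes have SNPs beyond 150 Mb?
chrom_max = tapply(snpdata$chromosome_position, snpdata$chromosome, max)
sum(chrom_max > 1.5e8)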
To illustrate this, let’s make a table of the chromosome positions per chromosome and make a plot.
x = table(snpdata$chromosome, snpdata$chromosome_position%/%10^7)
barplot(x, col=rainbow(25), xlab="Chromosome position / 10^7", ylab="Number of SNPs")
nbreaks = round(max(snpdata$genome_position) / 10^7)
hist(snpdata$genome_position, col="deeppink4", breaks = nbreaks, border = NA,
xlab = "Genome position", main = "Histogram of genome position")
On average, approximately 1000 SNPs per 10-megabase bin:
x = table(snpdata$genome_position %/% 10^7)
mean(x)
## [1] 998.7402
median(x)
## 64
## 1015
hist(x, breaks=30, col="deeppink4", main="Histogram of the number of SNPs per 10 megabases",
     xlab="Number of SNPs per 10 megabases")
We will look at SNP number 1 and SNP number 2059.
M = snpdata$contingency_table[[1]] # extracting the contingency table corresponding to SNP 1
M2 = snpdata$contingency_table[[2059]]
M
## allele1 allele2
## cases 143 657
## controls 136 664
addmargins(M) # adding the row and column sums
## allele1 allele2 Sum
## cases 143 657 800
## controls 136 664 800
## Sum 279 1321 1600
M2
## allele1 allele2
## cases 400 400
## controls 280 520
addmargins(M2)
## allele1 allele2 Sum
## cases 400 400 800
## controls 280 520 800
## Sum 680 920 1600
a = M[1,1] # getting the value from the first row and first column
c = M[2,1] # getting the value from the second row and first column
a+c
## [1] 279
a = M[1,1]
b = M[1,2]
c = M[2,1]
d = M[2,2]
OR = (a*d)/(b*c)
OR
## [1] 1.062673
SE = sqrt(1/a + 1/b + 1/c + 1/d)
SE
## [1] 0.1318106
low99 = exp(log(OR) - 2.58 * SE)
high99 = exp(log(OR) + 2.58 * SE)
low99
## [1] 0.7563254
high99
## [1] 1.493107
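For reference, what we just did is Woolf's logit method for an odds-ratio confidence interval. For a 2 x 2 table with cells \(a\), \(b\), \(c\), \(d\):
\[ OR = \frac{ad}{bc}, \qquad SE_{\ln OR} = \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}, \qquad CI_{99\%} = \exp\left(\ln OR \pm 2.58 \cdot SE_{\ln OR}\right), \]
where 2.58 is the 0.995 quantile of the standard normal distribution.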
Yes, the confidence interval overlaps 1.0. Based on these results we cannot conclude that there is an association between the variables - there might be a slight positive effect, a slight negative effect, or none at all, so more data would be needed to pin down what is going on.
a = M2[1,1]
b = M2[1,2]
c = M2[2,1]
d = M2[2,2]
OR = (a*d)/(b*c)
SE = sqrt(1/a + 1/b + 1/c + 1/d)
low99 = exp(log(OR) - 2.58 * SE)
high99 = exp(log(OR) + 2.58 * SE)
cat(low99, "<", OR, "<", high99, "\n")
## 1.425806 < 1.857143 < 2.418969
No: in this case the entire confidence interval for the odds ratio lies above 1, which suggests an association between the variables (here, between carrying the allele and having the disease, rheumatoid arthritis).
mosaicplot(snpdata$contingency_table[[1]], col=c("firebrick", "goldenrod1"), cex.axis = 1.2,
sub = "Allele", dir = c("h","v"), ylab = "Group",
main ="Contingency table for SNP 1")
mosaicplot(snpdata$contingency_table[[2059]], col=c("firebrick", "goldenrod1"), cex.axis = 1.2,
sub = "Allele", dir = c("h","v"), ylab = "Group", main=paste("SNP 2059\n",
round(low99,2), "<", round(OR,2), "<", round(high99,2)))
We can also perform Fisher’s exact test for these SNPs.
fisher.test(M)
##
## Fisher's Exact Test for Count Data
##
## data: M
## p-value = 0.6927
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.813807 1.387945
## sample estimates:
## odds ratio
## 1.0626
fisher.test(M2)
##
## Fisher's Exact Test for Count Data
##
## data: M2
## p-value = 1.637e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.511641 2.281964
## sample estimates:
## odds ratio
## 1.856437
As expected, we get a high p-value for SNP 1; this makes sense, as its confidence interval overlapped 1, so we cannot reject the null hypothesis of no association (OR = 1). In contrast, the p-value for SNP 2059 is far below 0.05, so there we can reject the null hypothesis of no association. The mosaic plots support these results visually.
# first, we create three vectors with one entry per row of snpdata - one per SNP
r1 = rep(NA, nrow(snpdata))
r2 = rep(NA, nrow(snpdata))
r3 = rep(NA, nrow(snpdata))
# then we loop over the entire dataset
for (i in 1:length(r1)) {
  # we extract the i-th contingency table
  M = snpdata$contingency_table[[i]]
  # then we compute the odds ratio and its 99% CI for this table
  a = M[1,1]
  b = M[1,2]
  c = M[2,1]
  d = M[2,2]
  OR = (a*d)/(b*c)
  SE = sqrt(1/a + 1/b + 1/c + 1/d)
  low99 = exp(log(OR) - 2.58 * SE)
  high99 = exp(log(OR) + 2.58 * SE)
  # then we store the computed values at the i-th position of our vectors
  r1[i] = OR
  r2[i] = low99
  r3[i] = high99
}
# finally, we append these vectors as new columns to our dataset
snpdata$oddsratio = r1
snpdata$low99 = r2
snpdata$high99 = r3
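For completeness, the same computation can be written more compactly by wrapping the per-table arithmetic in a helper function (a sketch with a made-up name, or_ci; it reproduces the loop above) and applying it over the list column:
# a helper (hypothetical) that returns the OR and its 99% CI for one 2x2 table
or_ci = function(M, z = 2.58) {
  a = M[1,1]; b = M[1,2]; c = M[2,1]; d = M[2,2]
  OR = (a*d)/(b*c)
  SE = sqrt(1/a + 1/b + 1/c + 1/d)
  c(oddsratio = OR, low99 = exp(log(OR) - z*SE), high99 = exp(log(OR) + z*SE))
}
# sapply gives a 3 x N matrix; transpose to get one row per SNP
res = t(sapply(snpdata$contingency_table, or_ci))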
plot(oddsratio ~ genome_position, data = snpdata, col = chromosome,
ylab = "Odds ratio", xlab = "Genome position",
main = "OR as function of genome position, colored by chromosome")
In this case, we want the subset of the data where either the lower bound of the confidence interval (low99) is above 1 or the upper bound (high99) is below 1; these are the conditions we pass to the subset function. The | symbol between them means "or", so we keep all data points whose CI lies entirely above or entirely below 1. If we used & ("and") instead of |, we would get no data points at all, since the two conditions are mutually exclusive.
plot(oddsratio ~ genome_position, data = subset(x=snpdata, subset = low99 > 1.00 | high99 < 1.00),
col = chromosome, ylab = "Odds ratio", xlab = "Genome position",
main = "OR as function of genome position, colored by chromosome")
A gentle introduction to ggplot: an entirely different way of plotting in R, provided by the "ggplot2" package from the "tidyverse". It's pretty awesome, so if you're interested in plotting, definitely look it up.
plot1 = ggplot(data = snpdata %>% filter(low99 > 1.0 | high99 < 1.0)) +
geom_point(mapping = aes(x=genome_position, y=oddsratio, color=factor(chromosome))) +
ggtitle("OR as function of genome position, colored by chromosome") +
xlab("Genome position") +
ylab("Odds ratio")
plot(plot1)
sum(snpdata$low99 > 1.0 | snpdata$high99 < 1.0)
## [1] 2764
df = subset(x = snpdata, subset = low99 > 1.0 | high99 < 1.0)
x = table(df$chromosome)
x
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 205 267 216 151 153 262 158 121 158 129 116 114 90 79 72 89 73 106
## 19 20 21 22
## 38 55 54 58
For example, it turns out that 121 SNPs on chromosome 8 have a 99% CI entirely above or below 1.0. There are 16470 SNPs on chromosome 8 in total, so 121 / 16470 ≈ 0.73%.
x = table(df$chromosome)
y = table(snpdata$chromosome)
sort(x/y)
##
## 19 8 20 12 13 11
## 0.007044865 0.007346691 0.007723634 0.008320561 0.008595989 0.008726397
## 14 5 4 15 10 17
## 0.008746678 0.008768913 0.008805178 0.008905380 0.009048822 0.009556225
## 1 7 21 16 9 3
## 0.009567368 0.010405005 0.010737721 0.010760488 0.010921407 0.010939478
## 18 2 22 6
## 0.011136793 0.011586530 0.011670020 0.013839003
barplot(sort(x/y), col = "deeppink4", main ="Proportion of SNPs where the 99% CI doesn't overlap 1")
We can see that chromosome 6 has the highest percentage of such SNPs.
df = subset(x = snpdata, subset = chromosome==6 & (low99 > 1.0 | high99 < 1.0))
plot(oddsratio ~ chromosome_position, data = df, col = "deeppink4",
     ylab = "Odds ratio", xlab = "Chromosome position",
     main = "OR as function of position on chromosome 6")
ggplot(data = df) +
  geom_point(mapping = aes(x=chromosome_position, y=oddsratio), col = "deeppink4") +
  geom_hline(yintercept = 1) +
  ggtitle("OR as function of position on chromosome 6") +
  ylab("Odds ratio") +
  xlab("Chromosome position")
They seem to cluster together around 30 megabases.
table(df$chromosome_position %/% 1000000)
##
## 0 1 2 6 8 9 10 14 15 16 17 18 22 23 24 25 26 27
## 1 3 1 3 2 3 2 1 1 2 6 1 5 1 3 1 1 3
## 28 29 30 31 32 33 34 37 40 41 42 44 45 46 47 48 52 53
## 6 1 7 39 15 7 3 1 1 2 1 1 2 1 2 2 1 1
## 55 63 64 68 70 76 78 79 83 85 87 91 92 93 94 95 96 97
## 1 1 2 3 9 1 1 1 2 2 2 1 2 3 1 1 10 2
## 100 107 108 109 114 116 122 123 125 127 128 129 130 133 134 137 138 139
## 1 1 1 1 1 10 1 4 2 4 1 2 2 3 2 2 4 1
## 141 144 148 149 151 152 154 155 156 157 158 159 161 162 163 164 168 169
## 1 1 4 3 1 1 1 7 5 3 2 1 1 1 1 3 3 2
We see 39 SNPs between 31 and 32 Mb.
Let’s look at SNP number 2059.
Observed = snpdata$contingency_table[[2059]] # we extract the contingency table
# then we extract the values we need from the contingency table
a = Observed[1,1]
b = Observed[1,2]
c = Observed[2,1]
d = Observed[2,2]
n = a+b+c+d # number of observations in total
# calculating the probabilities for the rows and columns
Pr_row1 = (a+b)/n
Pr_row2 = (c+d)/n
Pr_col1 = (a+c)/n
Pr_col2 = (b+d)/n
# calculating the expected counts for each cell
E_a = Pr_row1 * Pr_col1 * n
E_b = Pr_row1 * Pr_col2 * n
E_c = Pr_row2 * Pr_col1 * n
E_d = Pr_row2 * Pr_col2 * n
# then we create a table from our calculated values
# c() concatenates the two numbers
# rbind() "glues together" the two rows of those concatenated numbers
# as.table() converts this structure into a table
Expected = as.table(rbind(c(E_a, E_b), c(E_c, E_d)))
colnames(Expected) = c("allele1", "allele2")
rownames(Expected) = c("cases", "controls")
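As an aside, the four cell-by-cell formulas can be collapsed into a single expression, since each expected count is (row sum \(\times\) column sum) / total - a sketch:
# expected counts for all cells at once: outer product of the margins / n
Expected2 = outer(rowSums(Observed), colSums(Observed)) / sum(Observed)
all.equal(as.vector(Expected2), as.vector(Expected)) # should be TRUE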
mosaicplot(Observed, col=c("firebrick", "goldenrod1"), cex.axis = 1.2,
sub = "Allele", dir = c("h","v"), ylab = "Group", main="SNP 2059 - observed")
mosaicplot(Expected, col=c("firebrick", "goldenrod1"), cex.axis = 1.2,
sub = "Allele", dir = c("h","v"), ylab = "Group", main="SNP 2059 - expected")
# then we can do the entire chi^2 test by doing calculations with our tables
# this is going to subtract the corresponding cell values:
Observed-Expected
## allele1 allele2
## cases 60 -60
## controls -60 60
# so we can calculate chi^2 accordingly:
chisquare = sum((Observed-Expected)^2/Expected)
df = (nrow(Observed)-1)*(ncol(Observed) - 1) # Page 249 in Analysis of biological data, 2nd edition
p = pchisq(q = chisquare, df = df, lower.tail = FALSE)
cat("X-square:", chisquare, "df =", df, "p =", p, "\n")
## X-square: 36.82864 df = 1 p = 1.289812e-09
#### Built-in test in R ####
chisq.test(Observed, correct = F)
##
## Pearson's Chi-squared test
##
## data: Observed
## X-squared = 36.829, df = 1, p-value = 1.29e-09
Up to rounding in the printed output, they are identical.
You can find information about this on page 251 in Analysis of biological data, 2nd edition.
# notice that the formula for chi^2 changes a bit - we take the absolute value of each observed-minus-expected difference and subtract 1/2 before squaring
chisquare = sum((abs(Observed-Expected)-1/2)^2/Expected)
df = (nrow(Observed)-1)*(ncol(Observed) - 1)
p = pchisq(q = chisquare, df = df, lower.tail = FALSE)
cat("X-square:", chisquare, "df =", df, "p =", p, "\n")
## X-square: 36.21739 df = 1 p = 1.764885e-09
# Compare with the built-in R version of continuity correction - we just need to specify correct = T
chisq.test(Observed, correct = T)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Observed
## X-squared = 36.217, df = 1, p-value = 1.765e-09
We get the same results.
# we can start by creating a new data structure with one row per observation
M = rbind(rbind(rbind(
  # one row for each case carrying allele 1
  data.frame("Group"=rep("cases", a), "Allele" = rep("1", a)),
  # one row for each control carrying allele 1
  data.frame("Group"=rep("controls", c), "Allele" = rep("1", c))),
  # and the same for the cases and controls carrying allele 2
  data.frame("Group"=rep("cases", b), "Allele" = rep("2", b))),
  data.frame("Group"=rep("controls", d), "Allele" = rep("2", d)))
# now we have a data frame with one row per individual
head(M)
## Group Allele
## 1 cases 1
## 2 cases 1
## 3 cases 1
## 4 cases 1
## 5 cases 1
## 6 cases 1
# we can turn it into a table and we get the original data back
addmargins(table(M))
## Allele
## Group 1 2 Sum
## cases 400 400 800
## controls 280 520 800
## Sum 680 920 1600
# let's randomly shuffle one of the columns in our data frame - the sample() function will do this for us
M$Group = sample(M$Group)
head(M)
## Group Allele
## 1 controls 1
## 2 controls 1
## 3 controls 1
## 4 controls 1
## 5 cases 1
## 6 controls 1
addmargins(table(M))
## Allele
## Group 1 2 Sum
## cases 332 468 800
## controls 348 452 800
## Sum 680 920 1600
# let's do this again - we get a different outcome due to the randomness of the shuffling
M$Group = sample(M$Group)
head(M)
## Group Allele
## 1 cases 1
## 2 controls 1
## 3 controls 1
## 4 cases 1
## 5 controls 1
## 6 controls 1
addmargins(table(M))
## Allele
## Group 1 2 Sum
## cases 343 457 800
## controls 337 463 800
## Sum 680 920 1600
# building tables this way is quite cumbersome - R has a simpler way of doing it, the r2dtable() function
# we just need the row and column sums
rs = rowSums(Observed)
cs = colSums(Observed)
# we pass these to the r2dtable() function, along with the number of simulated tables we want - here, 5
r2dtable(n=5, r = rs, c = cs)
## [[1]]
## [,1] [,2]
## [1,] 337 463
## [2,] 343 457
##
## [[2]]
## [,1] [,2]
## [1,] 328 472
## [2,] 352 448
##
## [[3]]
## [,1] [,2]
## [1,] 358 442
## [2,] 322 478
##
## [[4]]
## [,1] [,2]
## [1,] 345 455
## [2,] 335 465
##
## [[5]]
## [,1] [,2]
## [1,] 344 456
## [2,] 336 464
# so now we just do this one million times
simulations = 1000000
# Simulate a lot of tables
set.seed(0)
simulated_tables = r2dtable(n = simulations,r = rs, c = cs)
# Check the first two tables
addmargins(simulated_tables[[1]])
## Sum
## 324 476 800
## 356 444 800
## Sum 680 920 1600
addmargins(simulated_tables[[2]])
## Sum
## 337 463 800
## 343 457 800
## Sum 680 920 1600
addmargins(Observed)
## allele1 allele2 Sum
## cases 400 400 800
## controls 280 520 800
## Sum 680 920 1600
# now we want to do a chi^2 test for all of them
# first we create a vector that is as long as many simulations we have
sim_chisquares = rep(NA, simulations)
# then we loop through all the tables and calculate the chi square for each of them
for (j in 1:simulations) {
  sim_chisquares[j] = sum(((simulated_tables[[j]]-Expected)^2)/Expected)
}
# let's calculate the observed chi^2 as well
chisquare = sum((Observed-Expected)^2/Expected)
df = (nrow(Observed)-1)*(ncol(Observed) - 1)
# now, for plotting the theoretical chi^2 distribution, we create breaks that go from 0 to the maximum value of our simulated chi^2 values, by 0.5
breaks = seq(0,ceiling(max(sim_chisquares)), by = 0.5)
# and calculate the midpoints for each bin
midpoints = 1/2 * (breaks[2:length(breaks)] + breaks[1:(length(breaks)-1)])
# for each bin, we calculate the expected fraction of chi^2 values falling in it - and multiply by the total number of simulations to get expected counts
# we have the midpoints on the x axis
x = midpoints
# and the probabilities on the y axis
y = (pchisq(q = breaks[2:length(breaks)], df=1) - pchisq(q=breaks[1:(length(breaks)-1)], df=1))*length(sim_chisquares)
# now we put the simulated chi^2 values on a histogram
hist(sim_chisquares, breaks=breaks, xlim = c(0, 40), col = "deeppink4",
     main = "Simulated chi^2 values vs. theoretical distribution",
     xlab = "Simulated chi^2 values")
# add line with the observed value
abline(v=chisquare)
# add the real chi square distribution
lines(y ~ x, lwd=2)
sum(sim_chisquares >= chisquare)
## [1] 0
sum(sim_chisquares >= chisquare) / length(sim_chisquares)
## [1] 0
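None of the one million simulated tables produced a chi^2 value as extreme as the observed one, so the naive empirical p-value is 0. A common convention is to add a pseudo-count to both numerator and denominator, which bounds the estimate below by \(1/(B+1)\) - this also appears to be what chisq.test() reports with simulate.p.value = TRUE (note the 1e-06 below):
# empirical p-value with a pseudo-count - never exactly zero
(sum(sim_chisquares >= chisquare) + 1) / (length(sim_chisquares) + 1)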
chisq.test(Observed, correct = F, simulate.p.value = TRUE, B = 1000000)
##
## Pearson's Chi-squared test with simulated p-value (based on 1e+06
## replicates)
##
## data: Observed
## X-squared = 36.829, df = NA, p-value = 1e-06
The chi^2 value matches the one we obtained by hand, and we again get a very small p-value, so we arrive at the same conclusion using both methods.
fisher.test(Observed)
##
## Fisher's Exact Test for Count Data
##
## data: Observed
## p-value = 1.637e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.511641 2.281964
## sample estimates:
## odds ratio
## 1.856437
x = fisher.test(Observed) # we store the results of the test in a variable
names(x)
## [1] "p.value" "conf.int" "estimate" "null.value" "alternative"
## [6] "method" "data.name"
x$conf.int
## [1] 1.511641 2.281964
## attr(,"conf.level")
## [1] 0.95
x$estimate # estimate for the true odds ratio
## odds ratio
## 1.856437
x$p.value
## [1] 1.636953e-09
microbenchmark(chisq.test(Observed))
## Unit: microseconds
## expr min lq mean median uq max neval
## chisq.test(Observed) 65.186 66.502 80.80606 68.489 76.4655 257.969 100
microbenchmark(chisq.test(Observed, correct = F))
## Unit: microseconds
## expr min lq mean median uq
## chisq.test(Observed, correct = F) 58.92 60.7465 70.524 61.5305 66.6115
## max neval
## 222.352 100
microbenchmark(chisq.test(Observed, simulate.p.value = TRUE))
## Unit: microseconds
## expr min lq mean
## chisq.test(Observed, simulate.p.value = TRUE) 437.924 448.669 528.3793
## median uq max neval
## 462.7065 534.875 1025.747 100
microbenchmark(chisq.test(Observed, simulate.p.value = TRUE, B = 100000))
## Unit: milliseconds
## expr min
## chisq.test(Observed, simulate.p.value = TRUE, B = 1e+05) 16.75983
## lq mean median uq max neval
## 17.68245 18.56991 18.22254 18.78174 44.61221 100
microbenchmark(fisher.test(Observed))
## Unit: milliseconds
## expr min lq mean median uq
## fisher.test(Observed) 1.109719 1.27573 2.021988 1.737481 1.956997
## max neval
## 17.20881 100
On its own, chisq.test() is the fastest, so it's a good choice whenever we can satisfy the assumptions of the chi^2 test. However, fisher.test() wins if we compare it to running a chi^2 test with a simulated p-value.
# compute a Fisher's exact test p-value for every SNP
x = rep(NA, nrow(snpdata))
for (i in 1:length(x)) {
  result = fisher.test(snpdata$contingency_table[[i]])
  x[i] = result$p.value
}
snpdata$p.value = x
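The same loop can be written as a one-liner with sapply (a sketch, equivalent to the loop above). Either way, with ~280,000 tables at roughly 1-2 milliseconds per fisher.test() call (see the benchmarks above), expect a runtime of several minutes:
# apply fisher.test over the list column, keeping only the p-values
snpdata$p.value = sapply(snpdata$contingency_table,
                         function(M) fisher.test(M)$p.value)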
plot(x = snpdata$genome_position, y=-log10(snpdata$p.value),
ylab = "-log10(p-value)", xlab = "Genome position")
ggplot(data = snpdata) +
geom_point(mapping = aes(x=genome_position, y=-log10(p.value), color=factor(chromosome))) +
xlab("Genome position") +
ylab("-log10(p-value)")
# we get the indices of the 1000 SNPs with the smallest p-values and use them for plotting
i = order(snpdata$p.value)[1:1000]
plot(x = snpdata$genome_position[i], y=-log10(snpdata$p.value[i]),
ylab = "-log10(p-value)", xlab = "Genome position")
Let’s adjust the p-values and add them to our data.
# Bonferroni
snpdata$bonferroni = p.adjust(p = snpdata$p.value, method = "bonferroni")
# False discovery rate
snpdata$fdr = p.adjust(p = snpdata$p.value, method = "fdr")
# the different adjustment methods
p.adjust.methods
## [1] "holm" "hochberg" "hommel" "bonferroni" "BH"
## [6] "BY" "fdr" "none"
sum(snpdata$p.value <= 0.05)
## [1] 9914
0.05*nrow(snpdata)
## [1] 14032.3
sum(snpdata$bonferroni <= 0.05)
## [1] 9
# another way of doing this:
sum(snpdata$p.value <= 0.05/nrow(snpdata) )
## [1] 9
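The two approaches agree because the Bonferroni adjustment simply multiplies each p-value by the number of tests \(m\) (capping at 1), so \(p \cdot m \le 0.05\) is the same condition as \(p \le 0.05/m\). A quick check:
# Bonferroni by hand: p-value times the number of tests, capped at 1
m = nrow(snpdata)
all.equal(pmin(snpdata$p.value * m, 1), snpdata$bonferroni) # should be TRUE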
snpdata[(snpdata$p.value <= 0.05/nrow(snpdata)),]
## # A tibble: 9 x 11
## rsid chromosome chromosome_positi~ genome_position contingency_tab~
## <chr> <int> <int> <dbl> <list>
## 1 rs3806308 1 20015453 20015453 <dbl [2 x 2]>
## 2 rs116795~ 2 2385495 249526800 <dbl [2 x 2]>
## 3 rs2597515 3 13502134 503336259 <dbl [2 x 2]>
## 4 rs4371616 4 14985834 704118331 <dbl [2 x 2]>
## 5 rs1924466 6 17880743 1078765480 <dbl [2 x 2]>
## 6 rs701308 7 83400100 1315014970 <dbl [2 x 2]>
## 7 rs2462692 10 7184261 1683985726 <dbl [2 x 2]>
## 8 rs985035 13 83160898 2161969337 <dbl [2 x 2]>
## 9 rs1975242 15 31588499 2330850156 <dbl [2 x 2]>
## # ... with 6 more variables: oddsratio <dbl>, low99 <dbl>, high99 <dbl>,
## # p.value <dbl>, bonferroni <dbl>, fdr <dbl>
After Bonferroni correction, only 9 SNPs remain - a drastic cut from the 9914 nominally significant SNPs we had before correction. However, this is a good thing: in a GWAS we are looking for disease-associated alleles, and a list of thousands is not very helpful (especially since many of them only appear significant because of the inflated type I error that comes with multiple testing).
sum(snpdata$fdr <= 0.1)
## [1] 72
sum(snpdata$fdr <= 0.2)
## [1] 166
For the rest of the exercise we will focus on significant SNPs with a false discovery rate (FDR) of at most 0.2.
df = subset(x = snpdata, subset = fdr <= 0.2)
table(snpdata$chromosome)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701
## 13 14 15 16 17 18 19 20 21 22
## 10470 9032 8085 8271 7639 9518 5394 7121 5029 4970
table(df$chromosome)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 15 18 14 8 6 23 7 3 13 9 5 3 3 7 6 3 2 11 1 3 3 3
table(df$chromosome) / table(snpdata$chromosome)
##
## 1 2 3 4 5
## 0.0007000513 0.0007811144 0.0007090403 0.0004664995 0.0003438790
## 6 7 8 9 10
## 0.0012148743 0.0004609812 0.0001821494 0.0008985968 0.0006313131
## 11 12 13 14 15
## 0.0003761378 0.0002189621 0.0002865330 0.0007750221 0.0007421150
## 16 17 18 19 20
## 0.0003627131 0.0002618144 0.0011557050 0.0001853912 0.0004212891
## 21 22
## 0.0005965401 0.0006036217
barplot(table(df$chromosome)/table(snpdata$chromosome), col="deeppink4")
barplot(sort(table(df$chromosome)/table(snpdata$chromosome)), col="deeppink4")
Chromosome 6 has the highest fraction of significant SNPs.
table(snpdata$chromosome==6)
##
## FALSE TRUE
## 261714 18932
table(df$chromosome==6)
##
## FALSE TRUE
## 143 23
So a SNP can be significant or not, and it can be on chromosome 6 or not.
You can think of this as tossing a coin n times: you get x heads (a "head" meaning that the SNP lands on chromosome 6), and each toss comes up heads with probability p (which we estimate from how all SNPs are distributed across chromosomes).
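Using the numbers from the test output below, under \(H_0\) we would expect about \(n \cdot p = 166 \times 0.0675 \approx 11.2\) of the significant SNPs to fall on chromosome 6; we observe 23.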
# p is the probability that a random SNP is on chromosome 6
p = sum(snpdata$chromosome==6) / nrow(snpdata)
# x is how many significant SNPs we actually observe on chromosome 6
x = sum(df$chromosome==6)
# n is the number of significant SNPs we have in total (out of which each falls on chromosome 6 with probability p)
n = nrow(df)
binom.test(x = x, n = n, p = p)
##
## Exact binomial test
##
## data: x and n
## number of successes = 23, number of trials = 166, p-value =
## 0.0009543
## alternative hypothesis: true probability of success is not equal to 0.06745865
## 95 percent confidence interval:
## 0.0899025 0.2006129
## sample estimates:
## probability of success
## 0.1385542
# we reject the null hypothesis - the number of significant SNPs on chromosome 6 does seem to differ from what we would expect
x = snpdata$chromosome==6 # being on chromosome 6 or not
y = snpdata$fdr <= 0.2 # being significant or not
# we put all this data into a table
z = table(x,y)
addmargins(z)
## y
## x FALSE TRUE Sum
## FALSE 261571 143 261714
## TRUE 18909 23 18932
## Sum 280480 166 280646
# and we can easily perform a chi^2 test
chisq.test(z)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: z
## X-squared = 12.239, df = 1, p-value = 0.000468
# we can also perform a Fisher's test
fisher.test(z)
##
## Fisher's Exact Test for Count Data
##
## data: z
## p-value = 0.0009508
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.366136 3.472255
## sample estimates:
## odds ratio
## 2.224882
All methods lead to the same conclusion: more significant SNPs fall on chromosome 6 than we would expect by chance.
table(snpdata$chromosome)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701
## 13 14 15 16 17 18 19 20 21 22
## 10470 9032 8085 8271 7639 9518 5394 7121 5029 4970
barplot(table(snpdata$chromosome), col = "deeppink4")
# first, we create a new data frame with the size of each chromosome,
# using the largest observed SNP position as the chromosome size
chromosome_sizes = aggregate(x = snpdata$chromosome_position,
                             by = list(chr = snpdata$chromosome), FUN = max)
colnames(chromosome_sizes) = c("chromosome", "size")
# convert to numeric to avoid integer overflow when summing
chromosome_sizes$size = as.numeric(chromosome_sizes$size)
chromosome_sizes$size/sum(as.numeric(chromosome_sizes$size))
## [1] 0.08624786 0.08469542 0.06955154 0.06670047 0.06303436 0.05958174
## [7] 0.05542261 0.05103705 0.04890244 0.04721178 0.04691498 0.04616643
## [13] 0.03982168 0.03711252 0.03497343 0.03094383 0.02744198 0.02656317
## [19] 0.02225672 0.02176843 0.01637046 0.01728109
# table of the number of SNPs per chromosome
x = table(snpdata$chromosome)
x
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701
## 13 14 15 16 17 18 19 20 21 22
## 10470 9032 8085 8271 7639 9518 5394 7121 5029 4970
# we divide each count by the corresponding chromosome size, and sort the result
sort(x / chromosome_sizes$size)
##
## 15 19 14 1 4
## 8.067608e-05 8.457711e-05 8.493104e-05 8.669939e-05 8.972488e-05
## 13 16 2 7 5
## 9.175508e-05 9.327965e-05 9.495131e-05 9.561605e-05 9.659870e-05
## 17 11 3 22 9
## 9.714582e-05 9.888137e-05 9.907256e-05 1.003664e-04 1.032407e-04
## 12 10 21 6 8
## 1.035688e-04 1.053781e-04 1.072072e-04 1.108885e-04 1.126188e-04
## 20 18
## 1.141607e-04 1.250457e-04
barplot(x / chromosome_sizes$size, col = "deeppink4")
# we can plot the number of SNPs per chromosome
pd = as.data.frame(table(snpdata$chromosome))
names(pd) = c("Chromosome", "Frequency")
ggplot(pd, aes(x=Chromosome, y=Frequency)) +
geom_col(fill="deeppink4") +
ylab("Number of SNPs")
# and the chromosome size per chromosome as well
pd = as.data.frame(chromosome_sizes)
names(pd) = c("Chromosome", "Frequency")
ggplot(pd, aes(x=Chromosome, y=Frequency)) +
geom_col(fill="deeppink4") +
ylab("Chromosome size")
# finally, the SNP "concentration" - so the number of SNPs divided by the chromosome size
pd = as.data.frame(x / chromosome_sizes$size)
names(pd) = c("Chromosome", "Frequency")
ggplot(pd, aes(x=Chromosome, y=Frequency)) +
geom_col(fill="deeppink4") +
ylab("Density of SNPs")
Chromosome 18 is the one with the largest concentration of SNPs.
\(H_0:\) The number of SNPs per megabase on this chromosome is the same as in the rest of the genome.
\(H_A:\) The number of SNPs per megabase on this chromosome is NOT the same as in the rest of the genome.
We can do a binomial test to test this null hypothesis.
# we get the size of chromosome 18
size18 = chromosome_sizes$size[chromosome_sizes$chromosome==18]
# then the total size of the genome
totalsize = sum(chromosome_sizes$size)
# p will be the proportion of the genome that is chromosome 18
p = size18 / totalsize
# x is the number of SNPs we observe on chromosome 18
x = sum(snpdata$chromosome==18)
# n is the total number of SNPs
n = nrow(snpdata)
p*n # expected number of SNPs on chromosome 18 - it's less than what we observe
## [1] 7454.846
# with these, we can do a binomial test
binom.test(x = x, n = n, p = p)
##
## Exact binomial test
##
## data: x and n
## number of successes = 9518, number of trials = 280650, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.02656317
## 95 percent confidence interval:
## 0.03324796 0.03459091
## sample estimates:
## probability of success
## 0.03391461
We reject the null hypothesis: the number of SNPs per megabase on chromosome 18 is not the same as in the rest of the genome.
Actually, we cherry-picked a little bit above. The real hypotheses should be:
\(H_0:\) The number of SNPs per base is the same for all chromosomes.
\(H_A:\) The number of SNPs per base is NOT the same for all chromosomes.
We can use a chi^2 test.
# the number of SNPs per chromosome
snps = table(snpdata$chromosome)
# "not SNPs" - we subtract the number of SNPs (each of length 1) from the total length of the chromosome
not_snps = chromosome_sizes$size - snps
# we put these together in a table
z = rbind(snps, not_snps)
round(addmargins(z)/1000, 2)
## 1 2 3 4 5 6
## snps 21.43 23.04 19.75 17.15 17.45 18.93
## not_snps 247119.88 242669.78 199278.63 191111.55 180606.10 170711.20
## Sum 247141.30 242692.82 199298.37 191128.70 180623.54 170730.13
## 7 8 9 10 11 12
## snps 15.19 16.47 14.47 14.26 13.29 13.7
## not_snps 158797.06 146229.04 140114.37 135270.04 134420.52 132275.2
## Sum 158812.25 146245.51 140128.84 135284.29 134433.81 132288.9
## 13 14 15 16 17 18 19
## snps 10.47 9.03 8.09 8.27 7.64 9.52 5.39
## not_snps 114097.65 106336.07 100207.50 88660.59 78626.73 76106.63 63770.72
## Sum 114108.12 106345.10 100215.58 88668.86 78634.37 76116.15 63776.12
## 20 21 22 Sum
## snps 7.12 5.03 4.97 280.65
## not_snps 62369.84 46904.15 49513.59 2865196.78
## Sum 62376.96 46909.18 49518.56 2865477.42
# and do a chi^2 test
r = chisq.test(z)
r
##
## Pearson's Chi-squared test
##
## data: z
## X-squared = 2714.7, df = 21, p-value < 2.2e-16
# we can also check what would be the expected number of SNPs and not SNPs per chromosome
r$expected
## 1 2 3 4 5
## snps 24205.12 23769.43 19519.36 18719.22 17690.34
## not_snps 247117099.88 242669050.57 199278852.64 191109977.78 180605852.66
## 6 7 8 9 10
## snps 16721.38 15554.13 14323.34 13724.27 13249.8
## not_snps 170713411.62 158796692.87 146231188.66 140115111.73 135271043.2
## 11 12 13 14 15
## snps 13166.5 12956.42 11175.79 10415.48 9.815154e+03
## not_snps 134420645.5 132275912.58 114096945.21 106334681.52 1.002058e+08
## 16 17 18 19 20
## snps 8684.263 7701.481 7454.846 6246.258 6109.224
## not_snps 88660171.737 78626664.519 76108697.154 63769871.742 62370848.776
## 21 22
## snps 4594.303 4849.867
## not_snps 46904580.697 49513709.133
We reject the null hypothesis: the number of SNPs per base is not the same for all chromosomes.
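To see which chromosomes drive this result, one option (a sketch) is to inspect the Pearson residuals \((O-E)/\sqrt{E}\) that chisq.test() has already computed for us - large positive values mark chromosomes with more SNPs than their size predicts:
# standardized (Pearson) residuals for the "snps" row, sorted
sort(round(r$residuals["snps", ], 1))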