In this tutorial we will show how to re-analyze the hemagglutination inhibition (HAI) assay data from one study in ImmPort using R.
The data stored in ImmPort can be viewed and accessed through the web site. The database content is also available for download in MySQL and Tab Separated Value (TSV) files. To download data from ImmPort you must first register for a free account.
In ImmPort, each study has a study accession beginning with the letters SDY, followed by a number. For this example, we will focus on the study SDY212.
Study accession SDY212
Title: Apoptosis and other immune biomarkers predict influenza vaccine responsiveness
Principal Investigator: Mark M. Davis
Description: In an effort to identify benchmarks of immunological health, influenza vaccination was used in 30 young (20 to 30 years) and 59 older subjects (60 to 89 years) as models for strong and weak immune responses, respectively.
The data from this study can be downloaded here. If the page opens with an error, please make sure to log into ImmPort.
The hemagglutination inhibition (HAI) assay can be used to determine the level of antibodies to influenza virus in a serum sample.
A good explanation of the HAI assay is available on the virology blog. In short, a fixed amount of virus is added to each well of a 96-well plate. From each serum sample a two-fold dilution series is prepared and the series is added to the plate. A fixed amount of red blood cells are then added and the plate is incubated. In the absence of viral antibodies, the virus will cross-link the cells and form a visible agglutination. Antibodies to the virus present in the serum will prevent the virus from attaching to the red blood cells and this inhibition of hemagglutination is used to quantify the antiviral antibodies. The weakest dilution of serum that produces hemagglutination is used as the measure of antibody concentration. If 1:40 is the weakest dilution that prevents hemagglutination, the HAI titer is 40.
Seasonal influenza vaccine contains antigens representing three (trivalent vaccine) or four (quadrivalent vaccine) influenza virus strains: * Influenza type A subtype H1N1 virus strain * Influenza type A subtype H3N2 virus strain * One or two influenza type B virus strains
A hemagglutination inhibition assay (HAI) antibody titer of 40 or more is generally associated with a 50% reduction in risk of influenza illness. The World Health Organization defines effective and seropositive influenza vaccine responses as a 28-day post-vaccine antibody HAI titer greater than or equal to 40. The seroprotection rate is defined as the proportion of participants with HAI titers of 40 or more.
Seroconversion/Seroresponders are defined as subjects with a fourfold or greater increase in antibody titer after vaccination. Consequently, the Seroconversion rate is the proportion of subjects with a fourfold or greater increase in HAI antibody titer after vaccination relative to pre-vaccination.
Libraries that are not available yet in the R distribution need to be installed first. In the following commands, four required libraries are installed. Run these 4 commands if you do not have these packages installed.
install.packages(“scales”)
install.packages(“ggplot2”)
install.packages(“reshape”)
install.packages(“RMySQL”)
The four libraries then need to be loaded as in the following code snippet.
library(scales)
## Warning: package 'scales' was built under R version 3.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(reshape)
library(RMySQL)
## Loading required package: DBI
The data we will be using for this tutorial is contained in the SDY212-DR14_MySQL.zip file.
An alternative to using the TSV files to analyze the study data in R is a MySQL database. At the moment, the ImmPort MySQL database can not be made publicly available to users not registered with ImmPort. In order to run this part of the code, a personal copy of the database needs to be installed. ImmPort provides the public MySQL database as one download containing the data of all studies or as separate download for each study. The queries in this section are developed to work with either of these options.
The table and column documentation for the public ImmPort database schema as well as the full Data Model are available on the ImmPort web site.
The next step is to connect and authenticate to the MySQL database. The connection parameters need to be adjusted to connect to a personal installation of the ImmPort MySQL database. The object con
is created that will be used to direct commands to the database engine.
con <- dbConnect(MySQL(), user="user", password="password",dbname="dbname",host="host")
Query the required information about the subjects and the HAI assay results.
hai.data.SDY212 <- dbGetQuery(con,"SELECT hr.subject_accession, hr.arm_accession, hr.study_time_collected, hr.study_time_collected_unit, hr.value_preferred, hr.virus_strain_reported, s.race, s.gender, a2s.min_subject_age, a2s.age_unit
FROM hai_result hr, subject s, arm_2_subject a2s
WHERE hr.study_accession = 'SDY212'
AND hr.subject_accession = s.subject_accession
AND a2s.subject_accession = s.subject_accession;")
hai.data.SDY212 <- cbind(hai.data.SDY212,ARM_NAME=NA)
hai.data.SDY212[which(hai.data.SDY212$arm_accession == "ARM894"),"ARM_NAME"] <- "Young"
hai.data.SDY212[which(hai.data.SDY212$arm_accession == "ARM895"),"ARM_NAME"] <- "Old"
Check if the age unit is the same for all subjects. It is uniquely ‘Years’ for all subjects in this study.
unique(hai.data.SDY212[,"age_unit"])
## [1] "Years"
Build a data frame with subject information and separate columns for pre-vaccination HAI titer (Day 0), post-vaccination HAI titer (Day 28) and their fold change (post titer/pre titer).
hai.SDY212.prepost <- cast(hai.data.SDY212, subject_accession + min_subject_age + gender + race + virus_strain_reported + arm_accession + ARM_NAME ~ study_time_collected, value='value_preferred')
colnames(hai.SDY212.prepost)[colnames(hai.SDY212.prepost)%in%c("0","28")] <- c("Day0","Day28")
hai.results.SDY212 <- cbind(hai.SDY212.prepost, fold.change= hai.SDY212.prepost[,"Day28"] / hai.SDY212.prepost[,"Day0"])
In order for this created data frame to be used for the later analysis steps (starting with the section Explore the study data) on the main page of this tutorial, the column names, expect ‘Day0’, ‘Day28’ and ‘fold.change’, need to be changed to upper case letters.
colnames(hai.results.SDY212) <- sapply(colnames(hai.results.SDY212),function(x){
if(x %in% c('Day0', 'Day28', 'fold.change')){
return(x)
}else{
return(toupper(x))
}})
When working with a new data set, it is usefull to get an overview of the data. Now that we built our data table, we will look first at the subject data. We show some descriptive statistics and some simple plots of the data, just to get a feel for the distribution of age, gender and race. Plotting is done using the ggplot2 package, which is based on the Grammar of Graphics (by Leland Wilkinson) to describe data graphics. A great page to learn about ggplot2 and as a look-up resource can be found in the Cookbook for R.
We should use the hai.results.SDY212
table to explore the subject characteristics in this study. However, hai.results.SDY212
contains three rows per subject, one for each virus strain. So, we create a new matrix containing only one row for each subject.
subjects.HAI <- unique(hai.results.SDY212[,c("SUBJECT_ACCESSION", "MIN_SUBJECT_AGE", "GENDER", "RACE","ARM_NAME")])
First, we get an overview of how many subjects per categorical variable (arm, gender, race) we have in our data set.
table(subjects.HAI$ARM_NAME)
##
## Old Young
## 60 29
table(subjects.HAI$GENDER)
##
## Female Male
## 54 35
table(subjects.HAI$RACE)
##
## American Indian or Alaska Native Asian
## 1 8
## Other White
## 7 73
As a second step, we create a few plots that show us the demographic data visually.
ggplot(data=subjects.HAI, aes(x=ARM_NAME,fill=GENDER)) + geom_bar(stat="count", position=position_dodge(), colour="black")
ggplot(data=subjects.HAI, aes(x=ARM_NAME,fill=RACE)) + geom_bar(stat="count", position=position_dodge(), colour="black")
ggplot(data=subjects.HAI, aes(x=GENDER,y=MIN_SUBJECT_AGE)) + geom_boxplot() + geom_point(aes(color=RACE),position=position_jitter(width=0.25,height=.1)) + theme(axis.text.x=element_text(angle=50, vjust=1.2,hjust=1.2)) + facet_wrap(~ ARM_NAME,ncol=2)
Plot of post (Day 28) vs. pre (Day 0) vaccination HAI titer. Colored by virus strain.
This plot shows the typical triangular shape. If a subject already has a certain HAI titer for one strain before vaccination (x axis), it will most likely not go down after vaccination (y axis). This produces the triangular shape seen in this plot.
ggplot(hai.results.SDY212, aes(x=Day0, y=Day28, color=VIRUS_STRAIN_REPORTED)) +
scale_y_continuous(trans=log2_trans()) + scale_x_continuous(trans=log2_trans()) +
geom_point(position=position_jitter(width=0.1,height=.1)) +
ylab("HAI titer post vaccination") + xlab("HAI titer pre vaccination")
Plot of HAI response (fold change) vs. pre (Day 0) vaccination HAI titer.
ggplot(hai.results.SDY212, aes(x=Day0, y=fold.change, color=VIRUS_STRAIN_REPORTED)) +
scale_y_continuous(trans=log2_trans()) + scale_x_continuous(trans=log2_trans()) +
geom_point(position=position_jitter(width=0.1,height=.1)) +
geom_hline(aes(yintercept=4), colour="#BB0000", linetype="dashed") +
ylab("HAI response (post/pre)") + xlab("HAI titer pre vaccination") +
annotate("text", label = "High responders", x = 900, y = 5, size = 3.5) + annotate("text", label = "Low responders", x = 900, y = 3.1, size = 3.5)
Plotting boxplots of the HAI responses by virus strain, we can see that the response varies by virus type.
ggplot(hai.results.SDY212, aes(x=VIRUS_STRAIN_REPORTED, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=VIRUS_STRAIN_REPORTED),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)")
Showing the response data per age group (arm) points to the idea that the HAI response also varies by arm.
ggplot(hai.results.SDY212, aes(x=ARM_NAME, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=ARM_NAME),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)")
Next, we focussed on the combination of these two factors. Do young people have a higher response for each of the three virus strains?
ggplot(hai.results.SDY212, aes(x=ARM_NAME, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=VIRUS_STRAIN_REPORTED),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)") + facet_wrap(~ VIRUS_STRAIN_REPORTED)
In this figure, it appears that the response to the H3N2 strain is independent of age. However, for the H1N1 and B strains, younger subjects respond stronger than older subjects. To test this, we run Wilcoxon rank sum tests.
wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="B" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"], hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="B" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])
##
## Wilcoxon rank sum test with continuity correction
##
## data: hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == "B" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and "B" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 1112, p-value = 0.02183
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H1N1" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"], hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H1N1" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])
##
## Wilcoxon rank sum test with continuity correction
##
## data: hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == "H1N1" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and "H1N1" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 1190, p-value = 0.003155
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H3N2" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"], hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H3N2" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])
##
## Wilcoxon rank sum test with continuity correction
##
## data: hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == "H3N2" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and "H3N2" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 926, p-value = 0.6221
## alternative hypothesis: true location shift is not equal to 0
The results of these statistical tests shows that the HAI response is significantly different between the age groups for the B and H1N1 strains, however not for the H3N2 strain.