In this tutorial we will show how to re-analyze the hemagglutination inhibition (HAI) assay data from one study in ImmPort using R.
The data stored in ImmPort can be viewed and accessed through the web site. The database content is also available for download in MySQL and Tab Separated Value (TSV) files. To download data from ImmPort you must first register for a free account.
In ImmPort, each study has a study accession beginning with the letters SDY, followed by a number. For this example, we will focus on the study SDY212.
Study accession SDY212
Title: Apoptosis and other immune biomarkers predict influenza vaccine responsiveness
Principal Investigator: Mark M. Davis
Description: In an effort to identify benchmarks of immunological health, influenza vaccination was used in 30 young (20 to 30 years) and 59 older subjects (60 to 89 years) as models for strong and weak immune responses, respectively.
The data from this study can be downloaded here. If the page opens with an error, please make sure to log into ImmPort.
The hemagglutination inhibition (HAI) assay can be used to determine the level of antibodies to influenza virus in a serum sample.
A good explanation of the HAI assay is available on the virology blog. In short, a fixed amount of virus is added to each well of a 96-well plate. From each serum sample a two-fold dilution series is prepared and the series is added to the plate. A fixed amount of red blood cells are then added and the plate is incubated. In the absence of viral antibodies, the virus will cross-link the cells and form a visible agglutination. Antibodies to the virus present in the serum will prevent the virus from attaching to the red blood cells and this inhibition of hemagglutination is used to quantify the antiviral antibodies. The weakest dilution of serum that produces hemagglutination is used as the measure of antibody concentration. If 1:40 is the weakest dilution that prevents hemagglutination, the HAI titer is 40.
Seasonal influenza vaccine contains antigens representing three (trivalent vaccine) or four (quadrivalent vaccine) influenza virus strains: * Influenza type A subtype H1N1 virus strain * Influenza type A subtype H3N2 virus strain * One or two influenza type B virus strains
A hemagglutination inhibition assay (HAI) antibody titer of 40 or more is generally associated with a 50% reduction in risk of influenza illness. The World Health Organization defines effective and seropositive influenza vaccine responses as a 28-day post-vaccine antibody HAI titer greater than or equal to 40. The seroprotection rate is defined as the proportion of participants with HAI titers of 40 or more.
Seroconversion/Seroresponders are defined as subjects with a fourfold or greater increase in antibody titer after vaccination. Consequently, the Seroconversion rate is the proportion of subjects with a fourfold or greater increase in HAI antibody titer after vaccination relative to pre-vaccination.
Libraries that are not available yet in the R distribution need to be installed first. In the following commands, three required libraries are installed. Run these 3 lines if you need to install these packages.
install.packages(“scales”)
install.packages(“ggplot2”)
install.packages(“reshape”)
The three libraries then need to be loaded as in the following code snippet.
library(scales)
## Warning: package 'scales' was built under R version 3.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(reshape)
The data we will be using for this tutorial is contained in the SDY212-DR12_Tab.zip file. The text files in this zip file organize the data based on the ImmPort MySQL schema. Each text file represents a table in the MySQL schema. Details on the tables is available in the Table Definitions.
If you would like to use MySQL data, there is a second version of this analysis that uses MySQL rather than the Tab delimited files.
The following files will be used: * subjects.txt - general subject demographic information. * arm_2_subject.txt - mapping from subjects to study arms/cohorts. * arm_or_cohort.txt - study arm names and descriptions
If your path is not correct you will get an error message “No such file or directory”. If you receive this error message running the below code you have two options. You can modify the path (“./Data/SDY212/Tab”) within the code to agree with your directory structure. Conversely, you can modify your directory structure to agree with the code.
subject <- read.table(file=paste(getwd(),"/Data/SDY212/Tab/subject.txt",sep=""), header=TRUE, sep="\t", stringsAsFactors=FALSE)
arm_2_subject <- read.table(file=paste(getwd(),"/Data/SDY212/Tab/arm_2_subject.txt",sep="") ,header=TRUE, sep="\t", stringsAsFactors=FALSE)
arm_or_cohort <- read.table(file=paste(getwd(),"/Data/SDY212/Tab/arm_or_cohort.txt",sep=""), header=TRUE, sep="\t", stringsAsFactors=FALSE)
View the first 3 rows of each table that was read in to understand what information they contain.
subject[1:3,]
## SUBJECT_ACCESSION ANCESTRAL_POPULATION
## 1 SUB134268 NA
## 2 SUB134304 NA
## 3 SUB134309 NA
## DESCRIPTION
## 1 This subject record was used to consolidate duplicate subject records: SUB134268 Fluzone single-dose syringe|49281-008-50, SUB119337 Fluzone single-dose syringe|49281-008-50, SUB134181 Fluzone single-dose syringe|49281-009-50
## 2 This subject record was used to consolidate duplicate subject records: SUB134304 Fluzone single-dose syringe|49281-008-50, SUB116502 Fluzone single-dose syringe|49281-011-50, SUB134137 Fluzone single-dose syringe|49281-010-50, SUB134214 Fluzone single-dose syringe|49281-009-50
## 3 This subject record was used to consolidate duplicate subject records: SUB134309 Fluzone single-dose syringe|49281-008-50, SUB119368 Fluzone single-dose syringe|49281-008-50, SUB134140 Fluzone single-dose syringe|49281-010-50, SUB134218 Fluzone single-dose syringe|49281-009-50
## ETHNICITY GENDER RACE RACE_SPECIFY SPECIES STRAIN
## 1 Not Hispanic or Latino Female White Homo sapiens NA
## 2 Not Hispanic or Latino Female Asian Homo sapiens NA
## 3 Not Hispanic or Latino Male White Homo sapiens NA
## STRAIN_CHARACTERISTICS WORKSPACE_ID
## 1 NA 2883
## 2 NA 2883
## 3 NA 2883
arm_2_subject[1:3,]
## ARM_ACCESSION SUBJECT_ACCESSION AGE_EVENT AGE_EVENT_SPECIFY
## 1 ARM894 SUB134323 Age at Study Day 0 NA
## 2 ARM894 SUB134252 Age at Study Day 0 NA
## 3 ARM895 SUB134256 Age at Study Day 0 NA
## AGE_UNIT MAX_SUBJECT_AGE MIN_SUBJECT_AGE SUBJECT_PHENOTYPE
## 1 Years 23.82 23.82 Non-twin
## 2 Years 24.15 24.15 Non-twin
## 3 Years 83.89 83.89 Non-twin
arm_or_cohort
## ARM_ACCESSION
## 1 ARM895
## 2 ARM894
## DESCRIPTION NAME
## 1 Older participants aged 60 to 89 years, vaccinated with Fluzone Cohort_2
## 2 Young participants aged 20 to 30 years, vaccinated with Fluzone Cohort_1
## STUDY_ACCESSION TYPE WORKSPACE_ID
## 1 SDY212 Experimental 2883
## 2 SDY212 Experimental 2883
Check if the age unit is the same for all subjects. It is uniquely ‘Years’ for all subjects in this study.
unique(arm_2_subject[,"AGE_UNIT"])
## [1] "Years"
We can merge the information from these three tables into one data frame because the first two tables that get combined (arm_or_cohort and arm_2_subject) both contain the ARM_ACCESSION column. The last table (subject) can be merged with the result of the merging of the first two tables using the SUBJECT_ACCESSION and WORKSPACE_ID columns.
merged1 <- merge(arm_or_cohort, arm_2_subject, by=intersect(colnames(arm_or_cohort),colnames(arm_2_subject)),all=TRUE)
subjects.merged <- merge(merged1, subject, by=c("SUBJECT_ACCESSION","WORKSPACE_ID"),all=TRUE)
Next, we drop columns not needed for further analysis at this time. Also, a column with shorter arm names is added.
subjects.merged <- subjects.merged[,c('SUBJECT_ACCESSION', 'ARM_ACCESSION', 'DESCRIPTION.x', 'GENDER', 'RACE', 'MIN_SUBJECT_AGE')]
subjects.merged <- cbind(subjects.merged,ARM_NAME=NA)
subjects.merged[which(subjects.merged$ARM_ACCESSION == "ARM894"),"ARM_NAME"] <- "Young"
subjects.merged[which(subjects.merged$ARM_ACCESSION == "ARM895"),"ARM_NAME"] <- "Old"
hai_result <- read.table(file=paste(getwd(), "/Data/SDY212/Tab/hai_result.txt",sep=""), header=TRUE, sep="\t", stringsAsFactors=FALSE)
head(hai_result)
## RESULT_ID ARM_ACCESSION BIOSAMPLE_ACCESSION COMMENTS
## 1 7026 ARM895 BS697116 NA
## 2 6588 ARM895 BS694690 NA
## 3 6710 ARM895 BS694651 NA
## 4 6763 ARM895 BS694682 NA
## 5 6939 ARM895 BS697087 NA
## 6 6562 ARM895 BS694683 NA
## EXPERIMENT_ACCESSION EXPSAMPLE_ACCESSION STUDY_ACCESSION
## 1 EXP13382 ES760524 SDY212
## 2 EXP13382 ES742127 SDY212
## 3 EXP13382 ES742011 SDY212
## 4 EXP13382 ES742102 SDY212
## 5 EXP13382 ES760437 SDY212
## 6 EXP13382 ES742107 SDY212
## STUDY_TIME_COLLECTED STUDY_TIME_COLLECTED_UNIT SUBJECT_ACCESSION
## 1 28 Days SUB134316
## 2 0 Days SUB134280
## 3 0 Days SUB134239
## 4 0 Days SUB134272
## 5 28 Days SUB134287
## 6 0 Days SUB134273
## UNIT_PREFERRED UNIT_REPORTED VALUE_PREFERRED VALUE_REPORTED
## 1 NA NA 80 80
## 2 NA NA 10 10
## 3 NA NA 40 40
## 4 NA NA 10 10
## 5 NA NA 80 80
## 6 NA NA 10 10
## VIRUS_STRAIN_PREFERRED VIRUS_STRAIN_REPORTED WORKSPACE_ID
## 1 A/Brisbane/59/2007 H1N1 2883
## 2 B/Florida/4/2006 B 2883
## 3 A/Brisbane/10/2007 H3N2 2883
## 4 A/Brisbane/59/2007 H1N1 2883
## 5 A/Brisbane/59/2007 H1N1 2883
## 6 A/Brisbane/10/2007 H3N2 2883
Drop columns not necessary for this analysis and merge in information from the subjects data frame.
hai.result.subset <- hai_result[,c("ARM_ACCESSION", "SUBJECT_ACCESSION", "STUDY_TIME_COLLECTED", "VALUE_PREFERRED", "VIRUS_STRAIN_REPORTED")]
hai.result.merged <- merge(hai.result.subset, subjects.merged, by=c("SUBJECT_ACCESSION","ARM_ACCESSION"))
hai.results <- cast(hai.result.merged, SUBJECT_ACCESSION + MIN_SUBJECT_AGE + GENDER + RACE + VIRUS_STRAIN_REPORTED + ARM_ACCESSION + ARM_NAME ~ STUDY_TIME_COLLECTED, value='VALUE_PREFERRED')
colnames(hai.results)[which(colnames(hai.results)=="0")] <- "Day0"
colnames(hai.results)[which(colnames(hai.results)=="28")] <- "Day28"
Add column with fold change values: Day 28 HAI titer / Day 0 HAI titer. These fold change values represents the “HAI response” of each subject to each of the three virus strains.
hai.results.SDY212 <- cbind(hai.results, fold.change= hai.results[,"Day28"] / hai.results[,"Day0"])
When working with a new data set, it is usefull to get an overview of the data. Now that we built our data table, we will look first at the subject data. We show some descriptive statistics and some simple plots of the data, just to get a feel for the distribution of age, gender and race. Plotting is done using the ggplot2 package, which is based on the Grammar of Graphics (by Leland Wilkinson) to describe data graphics. A great page to learn about ggplot2 and as a look-up resource can be found in the Cookbook for R.
We are intersted in the subjects for which HAI results are available. If we would use the subjects.merged
table, we would get an overview of all subjects in the study but two of these subjects don’t have HAI data. This can be seen by counting how many unique subjects (SUBJECT_ACCESSION) are in each of the data tables.
length(unique(subjects.merged[,"SUBJECT_ACCESSION"]))
## [1] 91
length(unique(hai.results.SDY212[,"SUBJECT_ACCESSION"]))
## [1] 89
So, we should use the hai.results.SDY212
table to explore the subject characteristics in this study. However, hai.results.SDY212
contains three rows per subject, one for each virus strain. So, we create a new matrix containing only one row for each subject.
subjects.HAI <- unique(hai.results.SDY212[,c("SUBJECT_ACCESSION", "MIN_SUBJECT_AGE", "GENDER", "RACE","ARM_NAME")])
First, we get an overview of how many subjects per categorical variable (arm, gender, race) we have in our data set.
table(subjects.HAI$ARM_NAME)
##
## Old Young
## 60 29
table(subjects.HAI$GENDER)
##
## Female Male
## 54 35
table(subjects.HAI$RACE)
##
## American Indian or Alaska Native Asian
## 1 8
## Other White
## 7 73
As a second step, we create a few plots that show us the demographic data visually.
ggplot(data=subjects.HAI, aes(x=ARM_NAME,fill=GENDER)) + geom_bar(stat="count", position=position_dodge(), colour="black")
ggplot(data=subjects.HAI, aes(x=ARM_NAME,fill=RACE)) + geom_bar(stat="count", position=position_dodge(), colour="black")
ggplot(data=subjects.HAI, aes(x=GENDER,y=MIN_SUBJECT_AGE)) + geom_boxplot() + geom_point(aes(color=RACE),position=position_jitter(width=0.25,height=.1)) + theme(axis.text.x=element_text(angle=50, vjust=1.2,hjust=1.2)) + facet_wrap(~ ARM_NAME,ncol=2)
Plot of post (Day 28) vs. pre (Day 0) vaccination HAI titer. Colored by virus strain.
This plot shows the typical triangular shape. If a subject already has a certain HAI titer for one strain before vaccination (x axis), it will most likely not go down after vaccination (y axis). This produces the triangular shape seen in this plot.
ggplot(hai.results.SDY212, aes(x=Day0, y=Day28, color=VIRUS_STRAIN_REPORTED)) +
scale_y_continuous(trans=log2_trans()) + scale_x_continuous(trans=log2_trans()) +
geom_point(position=position_jitter(width=0.1,height=.1)) +
ylab("HAI titer post vaccination") + xlab("HAI titer pre vaccination")
Plot of HAI response (fold change) vs. pre (Day 0) vaccination HAI titer.
ggplot(hai.results.SDY212, aes(x=Day0, y=fold.change, color=VIRUS_STRAIN_REPORTED)) +
scale_y_continuous(trans=log2_trans()) + scale_x_continuous(trans=log2_trans()) +
geom_point(position=position_jitter(width=0.1,height=.1)) +
geom_hline(aes(yintercept=4), colour="#BB0000", linetype="dashed") +
ylab("HAI response (post/pre)") + xlab("HAI titer pre vaccination") +
annotate("text", label = "High responders", x = 900, y = 5, size = 3.5) + annotate("text", label = "Low responders", x = 900, y = 3.1, size = 3.5)
Plotting boxplots of the HAI responses by virus strain, we can see that the response varies by virus type.
ggplot(hai.results.SDY212, aes(x=VIRUS_STRAIN_REPORTED, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=VIRUS_STRAIN_REPORTED),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)")
Showing the response data per age group (arm) points to the idea that the HAI response also varies by arm.
ggplot(hai.results.SDY212, aes(x=ARM_NAME, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=ARM_NAME),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)")
Next, we focussed on the combination of these two factors. Do young people have a higher response for each of the three virus strains?
ggplot(hai.results.SDY212, aes(x=ARM_NAME, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=VIRUS_STRAIN_REPORTED),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)") + facet_wrap(~ VIRUS_STRAIN_REPORTED)
In this figure, it appears that the response to the H3N2 strain is independent of age. However, for the H1N1 and B strains, younger subjects respond stronger than older subjects. To test this, we run Wilcoxon rank sum tests.
wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="B" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"], hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="B" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])
##
## Wilcoxon rank sum test with continuity correction
##
## data: hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == "B" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and "B" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 1112, p-value = 0.02183
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H1N1" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"], hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H1N1" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])
##
## Wilcoxon rank sum test with continuity correction
##
## data: hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == "H1N1" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and "H1N1" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 1190, p-value = 0.003155
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H3N2" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"], hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H3N2" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])
##
## Wilcoxon rank sum test with continuity correction
##
## data: hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] == "H3N2" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and "H3N2" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 926, p-value = 0.6221
## alternative hypothesis: true location shift is not equal to 0
The results of these statistical tests shows that the HAI response is significantly different between the age groups for the B and H1N1 strains, however not for the H3N2 strain.