Example re-analysis of HAI (hemagglutination inhibition) data from one influenza vaccination study (SDY212) from ImmPort in R

Introduction

In this tutorial we will show how to re-analyze the hemagglutination inhibition (HAI) assay data from one study in ImmPort using R.

The data stored in ImmPort can be viewed and accessed through the web site. The database content is also available for download in MySQL and Tab Separated Value (TSV) files. To download data from ImmPort you must first register for a free account.

The study

In ImmPort, each study has a study accession beginning with the letters SDY, followed by a number. For this example, we will focus on the study SDY212.

Study accession SDY212

Title: Apoptosis and other immune biomarkers predict influenza vaccine responsiveness

Principal Investigator: Mark M. Davis

Description: In an effort to identify benchmarks of immunological health, influenza vaccination was used in 30 young (20 to 30 years) and 59 older subjects (60 to 89 years) as models for strong and weak immune responses, respectively.

The data from this study can be downloaded here. If the page opens with an error, please make sure to log into ImmPort.

Influenza hemagglutination inhibition assay

The hemagglutination inhibition (HAI) assay can be used to determine the level of antibodies to influenza virus in a serum sample.

A good explanation of the HAI assay is available on the virology blog. In short, a fixed amount of virus is added to each well of a 96-well plate. From each serum sample a two-fold dilution series is prepared and the series is added to the plate. A fixed amount of red blood cells are then added and the plate is incubated. In the absence of viral antibodies, the virus will cross-link the cells and form a visible agglutination. Antibodies to the virus present in the serum will prevent the virus from attaching to the red blood cells and this inhibition of hemagglutination is used to quantify the antiviral antibodies. The weakest dilution of serum that produces hemagglutination is used as the measure of antibody concentration. If 1:40 is the weakest dilution that prevents hemagglutination, the HAI titer is 40.

Influenza vaccine

Seasonal influenza vaccine contains antigens representing three (trivalent vaccine) or four (quadrivalent vaccine) influenza virus strains: * Influenza type A subtype H1N1 virus strain * Influenza type A subtype H3N2 virus strain * One or two influenza type B virus strains

Seroprotection and seroconversion

A hemagglutination inhibition assay (HAI) antibody titer of 40 or more is generally associated with a 50% reduction in risk of influenza illness. The World Health Organization defines effective and seropositive influenza vaccine responses as a 28-day post-vaccine antibody HAI titer greater than or equal to 40. The seroprotection rate is defined as the proportion of participants with HAI titers of 40 or more.

Seroconversion/Seroresponders are defined as subjects with a fourfold or greater increase in antibody titer after vaccination. Consequently, the Seroconversion rate is the proportion of subjects with a fourfold or greater increase in HAI antibody titer after vaccination relative to pre-vaccination.

Analysis

Load R libraries used for this analysis

Libraries that are not available yet in the R distribution need to be installed first. In the following commands, three required libraries are installed. Run these 3 lines if you need to install these packages.

install.packages(“scales”)
install.packages(“ggplot2”)
install.packages(“reshape”)

The three libraries then need to be loaded as in the following code snippet.

library(scales)

## Warning: package 'scales' was built under R version 3.3.2

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

library(reshape)

Loading in the information from TSV files and merging the contents

The data we will be using for this tutorial is contained in the SDY212-DR12_Tab.zip file. The text files in this zip file organize the data based on the ImmPort MySQL schema. Each text file represents a table in the MySQL schema. Details on the tables is available in the Table Definitions.

If you would like to use MySQL data, there is a second version of this analysis that uses MySQL rather than the Tab delimited files.

The following files will be used: * subjects.txt - general subject demographic information. * arm_2_subject.txt - mapping from subjects to study arms/cohorts. * arm_or_cohort.txt - study arm names and descriptions

Reading in the raw information from TSV files

If your path is not correct you will get an error message “No such file or directory”. If you receive this error message running the below code you have two options. You can modify the path (“./Data/SDY212/Tab”) within the code to agree with your directory structure. Conversely, you can modify your directory structure to agree with the code.

subject <- read.table(file=paste(getwd(),"/Data/SDY212/Tab/subject.txt",sep=""), header=TRUE, sep="\t", stringsAsFactors=FALSE)
arm_2_subject <- read.table(file=paste(getwd(),"/Data/SDY212/Tab/arm_2_subject.txt",sep="") ,header=TRUE, sep="\t", stringsAsFactors=FALSE)
arm_or_cohort <- read.table(file=paste(getwd(),"/Data/SDY212/Tab/arm_or_cohort.txt",sep=""), header=TRUE, sep="\t", stringsAsFactors=FALSE)

View the first 3 rows of each table that was read in to understand what information they contain.

subject[1:3,]

##   SUBJECT_ACCESSION ANCESTRAL_POPULATION
## 1         SUB134268                   NA
## 2         SUB134304                   NA
## 3         SUB134309                   NA
##                                                                                                                                                                                                                                                                             DESCRIPTION
## 1                                                     This subject record was used to consolidate duplicate subject records: SUB134268 Fluzone single-dose syringe|49281-008-50, SUB119337 Fluzone single-dose syringe|49281-008-50, SUB134181 Fluzone single-dose syringe|49281-009-50
## 2 This subject record was used to consolidate duplicate subject records: SUB134304 Fluzone single-dose syringe|49281-008-50, SUB116502 Fluzone single-dose syringe|49281-011-50, SUB134137 Fluzone single-dose syringe|49281-010-50, SUB134214 Fluzone single-dose syringe|49281-009-50
## 3 This subject record was used to consolidate duplicate subject records: SUB134309 Fluzone single-dose syringe|49281-008-50, SUB119368 Fluzone single-dose syringe|49281-008-50, SUB134140 Fluzone single-dose syringe|49281-010-50, SUB134218 Fluzone single-dose syringe|49281-009-50
##                ETHNICITY GENDER  RACE RACE_SPECIFY      SPECIES STRAIN
## 1 Not Hispanic or Latino Female White              Homo sapiens     NA
## 2 Not Hispanic or Latino Female Asian              Homo sapiens     NA
## 3 Not Hispanic or Latino   Male White              Homo sapiens     NA
##   STRAIN_CHARACTERISTICS WORKSPACE_ID
## 1                     NA         2883
## 2                     NA         2883
## 3                     NA         2883

arm_2_subject[1:3,]

##   ARM_ACCESSION SUBJECT_ACCESSION          AGE_EVENT AGE_EVENT_SPECIFY
## 1        ARM894         SUB134323 Age at Study Day 0                NA
## 2        ARM894         SUB134252 Age at Study Day 0                NA
## 3        ARM895         SUB134256 Age at Study Day 0                NA
##   AGE_UNIT MAX_SUBJECT_AGE MIN_SUBJECT_AGE SUBJECT_PHENOTYPE
## 1    Years           23.82           23.82          Non-twin
## 2    Years           24.15           24.15          Non-twin
## 3    Years           83.89           83.89          Non-twin

arm_or_cohort

##   ARM_ACCESSION
## 1        ARM895
## 2        ARM894
##                                                       DESCRIPTION     NAME
## 1 Older participants aged 60 to 89 years, vaccinated with Fluzone Cohort_2
## 2 Young participants aged 20 to 30 years, vaccinated with Fluzone Cohort_1
##   STUDY_ACCESSION         TYPE WORKSPACE_ID
## 1          SDY212 Experimental         2883
## 2          SDY212 Experimental         2883

Check if the age unit is the same for all subjects. It is uniquely ‘Years’ for all subjects in this study.

unique(arm_2_subject[,"AGE_UNIT"])

## [1] "Years"

Merge the subject, arm_2_subjects and arm_or_cohort information into one table

We can merge the information from these three tables into one data frame because the first two tables that get combined (arm_or_cohort and arm_2_subject) both contain the ARM_ACCESSION column. The last table (subject) can be merged with the result of the merging of the first two tables using the SUBJECT_ACCESSION and WORKSPACE_ID columns.

merged1 <- merge(arm_or_cohort, arm_2_subject, by=intersect(colnames(arm_or_cohort),colnames(arm_2_subject)),all=TRUE)
subjects.merged <- merge(merged1, subject, by=c("SUBJECT_ACCESSION","WORKSPACE_ID"),all=TRUE)

Next, we drop columns not needed for further analysis at this time. Also, a column with shorter arm names is added.

subjects.merged <- subjects.merged[,c('SUBJECT_ACCESSION', 'ARM_ACCESSION', 'DESCRIPTION.x', 'GENDER', 'RACE', 'MIN_SUBJECT_AGE')]
subjects.merged <- cbind(subjects.merged,ARM_NAME=NA)
subjects.merged[which(subjects.merged$ARM_ACCESSION == "ARM894"),"ARM_NAME"] <- "Young"
subjects.merged[which(subjects.merged$ARM_ACCESSION == "ARM895"),"ARM_NAME"] <- "Old"

Read in the hai_result.txt data and merge with the subject information

hai_result <- read.table(file=paste(getwd(), "/Data/SDY212/Tab/hai_result.txt",sep=""), header=TRUE, sep="\t", stringsAsFactors=FALSE)
head(hai_result)

##   RESULT_ID ARM_ACCESSION BIOSAMPLE_ACCESSION COMMENTS
## 1      7026        ARM895            BS697116       NA
## 2      6588        ARM895            BS694690       NA
## 3      6710        ARM895            BS694651       NA
## 4      6763        ARM895            BS694682       NA
## 5      6939        ARM895            BS697087       NA
## 6      6562        ARM895            BS694683       NA
##   EXPERIMENT_ACCESSION EXPSAMPLE_ACCESSION STUDY_ACCESSION
## 1             EXP13382            ES760524          SDY212
## 2             EXP13382            ES742127          SDY212
## 3             EXP13382            ES742011          SDY212
## 4             EXP13382            ES742102          SDY212
## 5             EXP13382            ES760437          SDY212
## 6             EXP13382            ES742107          SDY212
##   STUDY_TIME_COLLECTED STUDY_TIME_COLLECTED_UNIT SUBJECT_ACCESSION
## 1                   28                      Days         SUB134316
## 2                    0                      Days         SUB134280
## 3                    0                      Days         SUB134239
## 4                    0                      Days         SUB134272
## 5                   28                      Days         SUB134287
## 6                    0                      Days         SUB134273
##   UNIT_PREFERRED UNIT_REPORTED VALUE_PREFERRED VALUE_REPORTED
## 1             NA            NA              80             80
## 2             NA            NA              10             10
## 3             NA            NA              40             40
## 4             NA            NA              10             10
## 5             NA            NA              80             80
## 6             NA            NA              10             10
##   VIRUS_STRAIN_PREFERRED VIRUS_STRAIN_REPORTED WORKSPACE_ID
## 1     A/Brisbane/59/2007                  H1N1         2883
## 2       B/Florida/4/2006                     B         2883
## 3     A/Brisbane/10/2007                  H3N2         2883
## 4     A/Brisbane/59/2007                  H1N1         2883
## 5     A/Brisbane/59/2007                  H1N1         2883
## 6     A/Brisbane/10/2007                  H3N2         2883

Drop columns not necessary for this analysis and merge in information from the subjects data frame.

hai.result.subset <- hai_result[,c("ARM_ACCESSION", "SUBJECT_ACCESSION", "STUDY_TIME_COLLECTED", "VALUE_PREFERRED", "VIRUS_STRAIN_REPORTED")]
hai.result.merged <- merge(hai.result.subset, subjects.merged, by=c("SUBJECT_ACCESSION","ARM_ACCESSION"))
hai.results <- cast(hai.result.merged, SUBJECT_ACCESSION + MIN_SUBJECT_AGE + GENDER + RACE + VIRUS_STRAIN_REPORTED + ARM_ACCESSION + ARM_NAME ~ STUDY_TIME_COLLECTED, value='VALUE_PREFERRED')
colnames(hai.results)[which(colnames(hai.results)=="0")] <- "Day0"
colnames(hai.results)[which(colnames(hai.results)=="28")] <- "Day28"

Add column with fold change values: Day 28 HAI titer / Day 0 HAI titer. These fold change values represents the “HAI response” of each subject to each of the three virus strains.

hai.results.SDY212 <- cbind(hai.results, fold.change= hai.results[,"Day28"] / hai.results[,"Day0"])

Explore the study data

When working with a new data set, it is usefull to get an overview of the data. Now that we built our data table, we will look first at the subject data. We show some descriptive statistics and some simple plots of the data, just to get a feel for the distribution of age, gender and race. Plotting is done using the ggplot2 package, which is based on the Grammar of Graphics (by Leland Wilkinson) to describe data graphics. A great page to learn about ggplot2 and as a look-up resource can be found in the Cookbook for R.

Table of subjects with HAI data

We are intersted in the subjects for which HAI results are available. If we would use the subjects.merged table, we would get an overview of all subjects in the study but two of these subjects don’t have HAI data. This can be seen by counting how many unique subjects (SUBJECT_ACCESSION) are in each of the data tables.

length(unique(subjects.merged[,"SUBJECT_ACCESSION"]))

## [1] 91

length(unique(hai.results.SDY212[,"SUBJECT_ACCESSION"]))

## [1] 89

So, we should use the hai.results.SDY212 table to explore the subject characteristics in this study. However, hai.results.SDY212 contains three rows per subject, one for each virus strain. So, we create a new matrix containing only one row for each subject.

subjects.HAI <- unique(hai.results.SDY212[,c("SUBJECT_ACCESSION", "MIN_SUBJECT_AGE", "GENDER", "RACE","ARM_NAME")])

Descriptive statistics

First, we get an overview of how many subjects per categorical variable (arm, gender, race) we have in our data set.

table(subjects.HAI$ARM_NAME)

## 
##   Old Young 
##    60    29

table(subjects.HAI$GENDER)

## 
## Female   Male 
##     54     35

table(subjects.HAI$RACE)

## 
## American Indian or Alaska Native                            Asian 
##                                1                                8 
##                            Other                            White 
##                                7                               73

Chart descriptive statistics

As a second step, we create a few plots that show us the demographic data visually.

ggplot(data=subjects.HAI, aes(x=ARM_NAME,fill=GENDER)) + geom_bar(stat="count", position=position_dodge(), colour="black")

ggplot(data=subjects.HAI, aes(x=ARM_NAME,fill=RACE)) + geom_bar(stat="count", position=position_dodge(), colour="black")

ggplot(data=subjects.HAI, aes(x=GENDER,y=MIN_SUBJECT_AGE)) + geom_boxplot() + geom_point(aes(color=RACE),position=position_jitter(width=0.25,height=.1)) + theme(axis.text.x=element_text(angle=50, vjust=1.2,hjust=1.2)) + facet_wrap(~ ARM_NAME,ncol=2)

Analyzing the HAI results

Initial plotting of data

Plot of post (Day 28) vs. pre (Day 0) vaccination HAI titer. Colored by virus strain.

This plot shows the typical triangular shape. If a subject already has a certain HAI titer for one strain before vaccination (x axis), it will most likely not go down after vaccination (y axis). This produces the triangular shape seen in this plot.

ggplot(hai.results.SDY212, aes(x=Day0, y=Day28, color=VIRUS_STRAIN_REPORTED)) + 
        scale_y_continuous(trans=log2_trans()) + scale_x_continuous(trans=log2_trans()) +
            geom_point(position=position_jitter(width=0.1,height=.1)) +
            ylab("HAI titer post vaccination") + xlab("HAI titer pre vaccination")

Plot of HAI response (fold change) vs. pre (Day 0) vaccination HAI titer.

ggplot(hai.results.SDY212, aes(x=Day0, y=fold.change, color=VIRUS_STRAIN_REPORTED)) + 
        scale_y_continuous(trans=log2_trans()) + scale_x_continuous(trans=log2_trans()) +
            geom_point(position=position_jitter(width=0.1,height=.1)) +
            geom_hline(aes(yintercept=4), colour="#BB0000", linetype="dashed") +
            ylab("HAI response (post/pre)") + xlab("HAI titer pre vaccination") +
      annotate("text", label = "High responders", x = 900, y = 5, size = 3.5) + annotate("text", label = "Low responders", x = 900, y = 3.1, size = 3.5)

Examining if subject age and virus strain influence the influenza vaccine response

Plotting boxplots of the HAI responses by virus strain, we can see that the response varies by virus type.

ggplot(hai.results.SDY212, aes(x=VIRUS_STRAIN_REPORTED, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=VIRUS_STRAIN_REPORTED),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)")

Showing the response data per age group (arm) points to the idea that the HAI response also varies by arm.

ggplot(hai.results.SDY212, aes(x=ARM_NAME, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=ARM_NAME),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)")

Next, we focussed on the combination of these two factors. Do young people have a higher response for each of the three virus strains?

ggplot(hai.results.SDY212, aes(x=ARM_NAME, y=fold.change)) + geom_boxplot() + xlab("") + theme(legend.position="none") + geom_point(aes(color=VIRUS_STRAIN_REPORTED),position=position_jitter(width=0.25,height=.2)) + scale_y_continuous(trans=log2_trans())+ ylab("HAI response (Post vaccination/Pre vaccination)") + facet_wrap(~ VIRUS_STRAIN_REPORTED)

In this figure, it appears that the response to the H3N2 strain is independent of age. However, for the H1N1 and B strains, younger subjects respond stronger than older subjects. To test this, we run Wilcoxon rank sum tests.

wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="B" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"],  hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="B" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] ==  and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] ==     "B" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and     "B" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 1112, p-value = 0.02183
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H1N1" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"],  hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H1N1" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] ==  and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] ==     "H1N1" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and     "H1N1" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 1190, p-value = 0.003155
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H3N2" & hai.results.SDY212[,"ARM_NAME"]=="Young"),"fold.change"],  hai.results.SDY212[which(hai.results.SDY212[,"VIRUS_STRAIN_REPORTED"]=="H3N2" & hai.results.SDY212[,"ARM_NAME"]=="Old"),"fold.change"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] ==  and hai.results.SDY212[which(hai.results.SDY212[, "VIRUS_STRAIN_REPORTED"] ==     "H3N2" & hai.results.SDY212[, "ARM_NAME"] == "Young"), "fold.change"] and     "H3N2" & hai.results.SDY212[, "ARM_NAME"] == "Old"), "fold.change"]
## W = 926, p-value = 0.6221
## alternative hypothesis: true location shift is not equal to 0

The results of these statistical tests shows that the HAI response is significantly different between the age groups for the B and H1N1 strains, however not for the H3N2 strain.