Example re-analysis of hemagglutination inhibition data from an ImmPort Study

Version 2.0, last modified April 2017 Python 3.6 Numpy 1.12.1 Pandas 0.19.2 Plotly 1.12.9 Scipy 0.19.0

What is ImmPort ?

The Immunology Database and Analysis Portal (ImmPort) project provides tools and services for the collection, QC, curation, and sharing of immunological data. The purpose of ImmPort is to provide a long term, sustainable source for promoting the re-use of these data provided by NIAID DAIT and DMID funded investigators.

The study chosen for this tutorial is one of 229 currently available for download as of DR20.

SDY212
Title: Apoptosis and other immune biomarkers predict influenza vaccine (TIV 2008) responsiveness
Principal Investigator: Mark M. Davis
Description: In an effort to indentify benchmarks of immunological health, influenza vaccination was used in 30 young (20 to 30 years) and 59 older subjects (60 to 89 years) as models for strong and weak immune responses, respectively.

We will focus on the HAI results for this tutorial.

Pre-requisites:

Study packages can be downloaded from ImmPort after free registration. Study download packages are available in several formats. For this tutorial we will be using tab-separated text files (meaning a "tab" character is used as the column separator), found in the following zip archive:
SDYxxx-DRnn_Tab.zip where xxx is the study accession and nn the data release version.
Details on the tables included in this archive are available here.

To get the archive, go to the study page, in our case SDY212, and click on 'download'. If you haven't registered to ImmPort yet, you will need to create an account (it only takes a minute!). You might need to install Aspera if you haven't already.
Once logged in, you should see a list of directories and files. Look for SDY212-DR20_Tab.zip and click on it to download.

For this tutorial, we will also use python packages: Numpy, Scipy, Pandas and Plotly. Each of these packages should be installed for this notebook to run smoothly. For instructions on how to do this with conda, check out this tutorial.

Additional Resources

Jupyter notebooks can be used for data analysis and as a teaching tool. There are a multitude of very good tutorials online about how to use notebooks, below are a selected few:

  • Teaching Numerical Methods with IPython Notebooks by D.I. Ketcheson and A. Ahmadia at the 2014 SciPy conference.
    Part 1 Part 2 Part 3

  • Statistcial Data Analysis in Python by C. Fonnesbeck at the 2013 SciPy conference.
    Part 1 Part 2 Part 3 Part 4.

  • Exploring Machine Learning with Scikit-learn by J. Vanderplas and O. Grisel at the 2014 PyCon.
    Part 1

A more extensive list of interesting tutorials can be found here.

Tutorial

Load in the Python libraries used for this analysis

In [1]:
import numpy as np
import pandas as pd
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from scipy import stats

Explore the Overall Study Design and Subject Characteristics

In this section you will be introduced to some of the content available for download from ImmPort, which can be used for re-analysis of the study. We demonstrate here how to load in information from tab-separated text files, merge their content and generate simple descriptive statistics.
The files needed for this section can be found in the SDY212-DR20_Tab.zip archive. Once this archive is unzipped, you should have a new folder called SDY212-DR20, itself containing a 'Tab' folder. This is where you will find the following files:

  • subjects.txt - general subject demographic information.
  • arm_2_subject.txt - mapping from subjects to study arms/cohorts.
  • arm_or_cohort - study arm names and descriptions

Load in study files

To load the files we'll be using in memory, you can either type in the path to your files, or use the input() function, which in a Jupyter notebook will print a question, and capture your answer in a text box as a string. Either line of code works, though note that tab-complete will work in a Jupyter notebook if you hit the tab key within double quotes, but not with the input() function text box.
In the three code cells below, don't forget to change the path to where your files are located!

In [2]:
#subject_file = input("Enter the path to subject.txt: ")
subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/subject.txt"
In [3]:
#arm_2_subject_file = input("Enter the path to arm_2_subject.txt: ")
arm_2_subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_2_subject.txt"
In [4]:
#arm_or_cohort_file = input("Enter the path to arm_or_cohort.txt: ")
arm_or_cohort_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_or_cohort.txt"

The files are read in as pandas dataframe with the pd.read_table() function:

In [5]:
subjects = pd.read_table(subject_file, sep="\t")
arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t")
arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")

Review file content

The pandas method df.head(), where 'df' indicates a dataframe, returns the first 5 rows of a dataframe object and is a good first step to look briefly at the data you will be working with:

In [6]:
subjects.head()
Out[6]:
SUBJECT_ACCESSION ANCESTRAL_POPULATION DESCRIPTION ETHNICITY GENDER RACE RACE_SPECIFY SPECIES STRAIN STRAIN_CHARACTERISTICS WORKSPACE_ID
0 SUB134268 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female White NaN Homo sapiens NaN NaN 2883
1 SUB134304 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female Asian NaN Homo sapiens NaN NaN 2883
2 SUB134309 NaN This subject record was used to consolidate du... Not Hispanic or Latino Male White NaN Homo sapiens NaN NaN 2883
3 SUB134324 NaN This subject record was used to consolidate du... Not Hispanic or Latino Male Other White, Asian Homo sapiens NaN NaN 2883
4 SUB134240 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female White NaN Homo sapiens NaN NaN 2883
In [7]:
arm_2_subject.head()
Out[7]:
ARM_ACCESSION SUBJECT_ACCESSION AGE_EVENT AGE_EVENT_SPECIFY AGE_UNIT MAX_SUBJECT_AGE MIN_SUBJECT_AGE SUBJECT_PHENOTYPE
0 ARM894 SUB134323 Age at Study Day 0 NaN Years 23.82 23.82 Non-twin
1 ARM894 SUB134252 Age at Study Day 0 NaN Years 24.15 24.15 Non-twin
2 ARM895 SUB134256 Age at Study Day 0 NaN Years 83.89 83.89 Non-twin
3 ARM895 SUB134262 Age at Study Day 0 NaN Years 84.17 84.17 Non-twin
4 ARM895 SUB134265 Age at Study Day 0 NaN Years 85.82 85.82 Non-twin
In [8]:
arm_or_cohort.head()
Out[8]:
ARM_ACCESSION DESCRIPTION NAME STUDY_ACCESSION TYPE WORKSPACE_ID
0 ARM895 Older participants aged 60 to 89 years, vaccin... Cohort_2 SDY212 Experimental 2883
1 ARM894 Young participants aged 20 to 30 years, vaccin... Cohort_1 SDY212 Experimental 2883

Pandas dataframes column headings are also accessible as follows:

In [9]:
subjects.columns
Out[9]:
Index(['SUBJECT_ACCESSION', 'ANCESTRAL_POPULATION', 'DESCRIPTION', 'ETHNICITY',
       'GENDER', 'RACE', 'RACE_SPECIFY', 'SPECIES', 'STRAIN',
       'STRAIN_CHARACTERISTICS', 'WORKSPACE_ID'],
      dtype='object')
In [10]:
arm_2_subject.columns
Out[10]:
Index(['ARM_ACCESSION', 'SUBJECT_ACCESSION', 'AGE_EVENT', 'AGE_EVENT_SPECIFY',
       'AGE_UNIT', 'MAX_SUBJECT_AGE', 'MIN_SUBJECT_AGE', 'SUBJECT_PHENOTYPE'],
      dtype='object')
In [11]:
arm_or_cohort.columns
Out[11]:
Index(['ARM_ACCESSION', 'DESCRIPTION', 'NAME', 'STUDY_ACCESSION', 'TYPE',
       'WORKSPACE_ID'],
      dtype='object')

To get an idea of the number of rows and/or columns in a table, you can use shape. The first dimension reported by shape is the number of rows and the second one is the number of columns.
Note here that the number of rows reported for the subjects and arms_2_subject files match. In these two files, each row corresponds to a subject. The fact that they have the same number of rows is a good indicator that there should be a one to one mapping between the two files.

In [12]:
subjects.shape
Out[12]:
(91, 11)
In [13]:
arm_2_subject.shape
Out[13]:
(91, 8)
In [14]:
arm_or_cohort.shape
Out[14]:
(2, 6)

To make sure that the set of subject accessions in each table contains only one instance of each subject, you can take advantage of the built-in Python function set(), and of the built-in Pandas access to column content through column header. We can compare the number of unique elements returned by set() to the number of elements including potential duplicates. If they are the same, there are no duplicates.

In [15]:
if len(set(subjects.SUBJECT_ACCESSION)) != len(subjects.SUBJECT_ACCESSION):
    print("some subject accessions are duplicated in subject.txt")
In [16]:
if len(set(arm_2_subject.SUBJECT_ACCESSION)) != len(arm_2_subject.SUBJECT_ACCESSION):
    print("some subject accessions are duplicated in arm_2_subject.txt")

Of course, to determine beyond any doubt that the set of subject accessions are the same in both tables, you need to compare them. The pandas df.isin() method checks if all elements from one set exist in another given set. The resulting structure contains boolean ('True' or 'False') which can then be counted with df.sum(). Here, we expect the sum to be the same than the number of subject accessions (in other words, the number of rows in subjects), because 'True' counts as 1.

In [17]:
df = subjects.SUBJECT_ACCESSION.isin(arm_2_subject.SUBJECT_ACCESSION)
df.sum()
Out[17]:
91

*Side note: ImmPort data is de-identified and curated such that accessions are unique identifiers*

An alternative to df.head() is to review the content by looking at a single row in each file with df.ix. This makes looking at many columns a bit easier.

In [18]:
subjects.ix[0]
Out[18]:
SUBJECT_ACCESSION                                                 SUB134268
ANCESTRAL_POPULATION                                                    NaN
DESCRIPTION               This subject record was used to consolidate du...
ETHNICITY                                            Not Hispanic or Latino
GENDER                                                               Female
RACE                                                                  White
RACE_SPECIFY                                                            NaN
SPECIES                                                        Homo sapiens
STRAIN                                                                  NaN
STRAIN_CHARACTERISTICS                                                  NaN
WORKSPACE_ID                                                           2883
Name: 0, dtype: object
In [19]:
arm_2_subject.ix[10]
Out[19]:
ARM_ACCESSION                    ARM894
SUBJECT_ACCESSION             SUB134326
AGE_EVENT            Age at Study Day 0
AGE_EVENT_SPECIFY                   NaN
AGE_UNIT                          Years
MAX_SUBJECT_AGE                   25.39
MIN_SUBJECT_AGE                   25.39
SUBJECT_PHENOTYPE              Non-twin
Name: 10, dtype: object

A pandas dataframe content can also be accessed by using the headers. Note that while ix returns the full content of a row (one value per column), the following returns all rows for the specified columns (one value per row):

In [20]:
arm_or_cohort[['ARM_ACCESSION','NAME','DESCRIPTION']]
Out[20]:
ARM_ACCESSION NAME DESCRIPTION
0 ARM895 Cohort_2 Older participants aged 60 to 89 years, vaccin...
1 ARM894 Cohort_1 Young participants aged 20 to 30 years, vaccin...

In this particular case, the names of each arm aren't very descriptive, We can add a column to the dataframe with a more meaningful name:

In [21]:
arm_or_cohort['ARM_NAME'] = ['Old','Young']
In [22]:
arm_or_cohort[['ARM_ACCESSION','NAME','ARM_NAME','DESCRIPTION']]
Out[22]:
ARM_ACCESSION NAME ARM_NAME DESCRIPTION
0 ARM895 Cohort_2 Old Older participants aged 60 to 89 years, vaccin...
1 ARM894 Cohort_1 Young Young participants aged 20 to 30 years, vaccin...

Merge data into one table

The goal of this section is to create one dataframe containing only the information we want to proceed. We've seen above that subjects and arm_2_subject contain information pertaining to the same sets of subject accessions, so we know we can merge them together along subject accession with little issues.
The other table we have, arm_or_cohort, contains information about each arm mentioned in the mapping table, arm_2_subject. We can merge them together thanks to the arm accession.
There are several methods in pandas to merge dataframes together. Here, we use the function pd.merge(), which does all the hard work of merging gracefully together the dataframes.

In [23]:
# as a reminder:
print(subjects.columns, arm_or_cohort.columns, arm_2_subject.columns, sep="\n")
Index(['SUBJECT_ACCESSION', 'ANCESTRAL_POPULATION', 'DESCRIPTION', 'ETHNICITY',
       'GENDER', 'RACE', 'RACE_SPECIFY', 'SPECIES', 'STRAIN',
       'STRAIN_CHARACTERISTICS', 'WORKSPACE_ID'],
      dtype='object')
Index(['ARM_ACCESSION', 'DESCRIPTION', 'NAME', 'STUDY_ACCESSION', 'TYPE',
       'WORKSPACE_ID', 'ARM_NAME'],
      dtype='object')
Index(['ARM_ACCESSION', 'SUBJECT_ACCESSION', 'AGE_EVENT', 'AGE_EVENT_SPECIFY',
       'AGE_UNIT', 'MAX_SUBJECT_AGE', 'MIN_SUBJECT_AGE', 'SUBJECT_PHENOTYPE'],
      dtype='object')
In [24]:
subjects_merged_1 = pd.merge(subjects, arm_2_subject, left_on='SUBJECT_ACCESSION', right_on='SUBJECT_ACCESSION')
subjects_merged_2 = pd.merge(subjects_merged_1, arm_or_cohort, left_on='ARM_ACCESSION',right_on='ARM_ACCESSION')
subjects_merged_2.ix[0]
Out[24]:
SUBJECT_ACCESSION                                                 SUB134268
ANCESTRAL_POPULATION                                                    NaN
DESCRIPTION_x             This subject record was used to consolidate du...
ETHNICITY                                            Not Hispanic or Latino
GENDER                                                               Female
RACE                                                                  White
RACE_SPECIFY                                                            NaN
SPECIES                                                        Homo sapiens
STRAIN                                                                  NaN
STRAIN_CHARACTERISTICS                                                  NaN
WORKSPACE_ID_x                                                         2883
ARM_ACCESSION                                                        ARM894
AGE_EVENT                                                Age at Study Day 0
AGE_EVENT_SPECIFY                                                       NaN
AGE_UNIT                                                              Years
MAX_SUBJECT_AGE                                                       29.71
MIN_SUBJECT_AGE                                                       29.71
SUBJECT_PHENOTYPE                                                  Non-twin
DESCRIPTION_y             Young participants aged 20 to 30 years, vaccin...
NAME                                                               Cohort_1
STUDY_ACCESSION                                                      SDY212
TYPE                                                           Experimental
WORKSPACE_ID_y                                                         2883
ARM_NAME                                                              Young
Name: 0, dtype: object

Note in the list of columns above that pandas modified the name of some columns by appending a '_x' or '_y'.

The new dataframe created, subjects_merged contains all columns from all 3 tables, most of which are irrelevant at this time. We can drop the unnecessary columns by creating a new dataframe with only the columns of interest as follows:

In [25]:
subjects_merged = subjects_merged_2[['SUBJECT_ACCESSION','ARM_ACCESSION','ARM_NAME','GENDER','RACE','MIN_SUBJECT_AGE']]
subjects_merged.shape
Out[25]:
(91, 6)
In [26]:
subjects_merged.ix[0]
Out[26]:
SUBJECT_ACCESSION    SUB134268
ARM_ACCESSION           ARM894
ARM_NAME                 Young
GENDER                  Female
RACE                     White
MIN_SUBJECT_AGE          29.71
Name: 0, dtype: object

Simple descriptive statistics

Below we show some descriptive statistics just to get a feel for the distribution of age, gender and race.

We are going to use two pandas methods here, chained by a dot operator. The object returned by the first function is taken as an argument by the next function. For instance below, the df.groupby() method groups rows by values contained in the specified column. It returns an object on which the next function, df.count(), operates. The df.count() method returns the number of observations in a given column.

  • How many subjects are in each arm?
In [27]:
subjects_merged.groupby('ARM_NAME').count()['SUBJECT_ACCESSION']
Out[27]:
ARM_NAME
Old      61
Young    30
Name: SUBJECT_ACCESSION, dtype: int64
  • How many subjects in each gender?
In [28]:
subjects_merged.groupby('GENDER').count()['SUBJECT_ACCESSION']
Out[28]:
GENDER
Female    54
Male      37
Name: SUBJECT_ACCESSION, dtype: int64
  • What is the race distribution overall?
In [29]:
subjects_merged.groupby('RACE').count()['SUBJECT_ACCESSION']
Out[29]:
RACE
American Indian or Alaska Native     1
Asian                                8
Other                                7
White                               75
Name: SUBJECT_ACCESSION, dtype: int64
  • What is the age distribution in each arm?

df.describe() is a very useful method which returns standard summary statistics (count, mean, median, standard dev, quartiles) on a given series of numbers.
Note in the following example that the only column in our dataframe containing numbers is MIN_SUBJECT_AGE, so it is the only one returned.

In [30]:
subjects_by_arm = subjects_merged.groupby('ARM_NAME')
subjects_by_arm.describe()
Out[30]:
MIN_SUBJECT_AGE
ARM_NAME
Old count 61.000000
mean 77.151803
std 9.558915
min 61.200000
25% 68.080000
50% 78.750000
75% 85.170000
max 90.000000
Young count 30.000000
mean 25.403000
std 2.670702
min 20.680000
25% 23.882500
50% 25.050000
75% 27.115000
max 30.750000

Simple Descriptive Plots

In this version of the tutorial, we will be using the Plotly library for graphs instead of MatPlotLib. For more advanced users, there is a way to feed directly pandas dataframe into Plotly by using yet another library.

Plotly generates interactive graphs. By default, plots are saved to a server in a Plotly account that you have to set up. To enable local generation of your plots instead, use the following line:

In [31]:
plotly.offline.init_notebook_mode()
  • Count of subjects by arm and gender

For this plot, we want to look at the counts of each gender in each arm. The df.crosstab() method lets us do that by computing a simple cross-tabulation of specified factors:

In [32]:
arm_counts_gender = pd.crosstab(subjects_merged.ARM_NAME, subjects_merged.GENDER)
arm_counts_gender.head()
Out[32]:
GENDER Female Male
ARM_NAME
Old 40 21
Young 14 16

We can now use this new object as input to generate a simple bar plot.

Plotly is a graphing library which generates interactive plots. You can zoom in, zoom out, see information about the data when mousing over the graph, etc. You can also save the plot as a png by clicking on the camera icon in the Plotly toolbar at the top right of each plot.
Plotly has a typical syntax to code for plots. Each call to make a plot takes in two things, data and layout. Data contains the actual data to plot, along with details such as how to plot it, what to call it, how to color it, etc. Layout is an object containing the plot configuration - chart title, legend font, location, color, etc.
Plotly made a cheat sheet of their most common plots for quick reference. The following plots will be barplots.

In [33]:
# define traces:
trace1 = go.Bar(
    x = arm_counts_gender.index,
    y = arm_counts_gender.Female,
    name = "Female",
    marker = {
        "color" : "orange"
    }
)
trace2 = go.Bar(
    x = arm_counts_gender.index,
    y = arm_counts_gender.Male,
    name = "Male",
    marker = {
        "color" : "teal"
    }
)
# store traces into a single object:
data1 = [trace1, trace2]

# define layout:
layout1 = go.Layout(
    barmode = 'group',
    bargroupgap = 0.1,
    title = "Counts of Subjects by Arm and Gender",
    xaxis = {
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis = {
        "title" : "Counts",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
)

# create a figure object:
fig1 = go.Figure(data=data1, layout=layout1)

# call the plot:
plotly.offline.iplot(fig1)
  • Count of subjects by arm and race

For this plot, we want to look at the distribution of race in each arm.

In [34]:
arm_counts_race = pd.crosstab(subjects_merged.ARM_NAME, subjects_merged.RACE)
arm_counts_race.head()
Out[34]:
RACE American Indian or Alaska Native Asian Other White
ARM_NAME
Old 1 1 0 59
Young 0 7 7 16

This time let's use a loop so we don't have to type in (or copy and paste) for each race. This method has one advantage: you might not now how many races there are in your data. Since we're using a loop, let's also set up a color palette so we don't have to hard-code the color everytime (though we could also use the default Plotly palette by not declaring any color).

The colors below are from the standard HTML color palette. We could get fancier and use RGB values too. More info on html color names here

In [35]:
palette = ['orange','teal','purple','maroon','lavender']
In [36]:
# defining traces. 
# 1. set up the data object
data2 = []
i = 0
# 2. loop over races:
for race in arm_counts_race:
    # set up a trace object
    trace = go.Bar(
        x = arm_counts_race.index,
        y = arm_counts_race[race],
        name = race,
        marker = {
            "color": palette[i]
        }
    )
    # increment i to change color at each iteration of the loop
    i += 1
    data2.append(trace)
    
# define layout:
layout2 = go.Layout(
    barmode = 'group',
    bargroupgap = 0.1,
    title = "Counts of Subjects by Arm and Race",
    xaxis = {
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis = {
        "title" : "Counts",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
)

# create a figure object:
fig2 = go.Figure(data=data2, layout=layout2)

# call the plot:
plotly.offline.iplot(fig2)    
  • Age at enrollment by arm and gender

For this plot, we want to look at the distribution of age in each of or arms, separated by gender. Just as for the previous plots, we need to massage the data into an object that we can use to plot. Namely, we want each the MIN_SUBJECT_AGE series of values for each arm, for each gender.
One way to do this is to subset the subjects_merged dataframe into one for each gender, and plot along ARM_NAME:

In [37]:
subjects_female = subjects_merged[subjects_merged['GENDER']=="Female"]
subjects_male = subjects_merged[subjects_merged['GENDER']=="Male"]
subjects_female.head()
Out[37]:
SUBJECT_ACCESSION ARM_ACCESSION ARM_NAME GENDER RACE MIN_SUBJECT_AGE
0 SUB134268 ARM894 Young Female White 29.71
1 SUB134304 ARM894 Young Female Asian 24.59
8 SUB134249 ARM894 Young Female Other 26.01
12 SUB134323 ARM894 Young Female Other 23.82
14 SUB134251 ARM894 Young Female White 29.23

For this plot, we'll do a boxplot.

In [38]:
# define traces for each group:
fem_bp = go.Box(
    y = subjects_female.MIN_SUBJECT_AGE,
    x = subjects_female.ARM_NAME,
    name = "Female",
    jitter = 0.5,
    pointpos = -2,
    boxpoints = "all",
    marker = {
        "color" : palette[0],
        "size" : 4
    }
)
male_bp = go.Box(
    y = subjects_male.MIN_SUBJECT_AGE,
    x = subjects_male.ARM_NAME,
    name = "Male",
    jitter = 0.5,
    pointpos = -2,
    boxpoints = "all",
    marker = {
        "color" : palette[1],
        "size" : 4
    }
)

# store traces in a list:
data3 = [fem_bp, male_bp]

# define layout:
layout3 = go.Layout(
    boxmode = "group",
    boxgroupgap = 0.5,
    title = "Age at Enrollment by Arm and Gender",
    xaxis = {
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis = {
        "title" : "Age at Enrollment",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
)

# create a figure object:
fig3 = go.Figure(data=data3, layout=layout3)

# call the plot:
plotly.offline.iplot(fig3)    

Re-analysis of the HAI Results

HAI is a standard assay to measure viral concentration in a sample.

Data Exploration

The HAI results are contained in hai_result.txt. Let's read the table in memory first.
Don't forget to change the path to where the file is on your system.

In [39]:
#hai_result_file = input("Enter the path to subject.txt: ")
hai_result_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/hai_result.txt"
In [40]:
hai_result_tmp = pd.read_table(hai_result_file)
hai_result_tmp.columns
Out[40]:
Index(['RESULT_ID', 'ARM_ACCESSION', 'BIOSAMPLE_ACCESSION', 'COMMENTS',
       'EXPERIMENT_ACCESSION', 'EXPSAMPLE_ACCESSION', 'STUDY_ACCESSION',
       'STUDY_TIME_COLLECTED', 'STUDY_TIME_COLLECTED_UNIT',
       'SUBJECT_ACCESSION', 'UNIT_PREFERRED', 'UNIT_REPORTED',
       'VALUE_PREFERRED', 'VALUE_REPORTED', 'VIRUS_STRAIN_PREFERRED',
       'VIRUS_STRAIN_REPORTED', 'WORKSPACE_ID'],
      dtype='object')

We don't need all these columns for further analysis, so let's trim the table:

In [41]:
hai_result = hai_result_tmp[['SUBJECT_ACCESSION','STUDY_TIME_COLLECTED','VALUE_PREFERRED','VIRUS_STRAIN_REPORTED']]
hai_result.head()
Out[41]:
SUBJECT_ACCESSION STUDY_TIME_COLLECTED VALUE_PREFERRED VIRUS_STRAIN_REPORTED
0 SUB134316 28 80 H1N1
1 SUB134280 0 10 B
2 SUB134239 0 40 H3N2
3 SUB134272 0 10 H1N1
4 SUB134287 28 80 H1N1
In [42]:
hai_result.shape
Out[42]:
(534, 4)

Note here that there are a LOT more lines than the number of subjects above. There are several results per subject accession. Before going forward, let's check how many subjects are represented in this file:

In [43]:
len(set(hai_result.SUBJECT_ACCESSION))
Out[43]:
89

Let's explore this table before we move on. We've seen that there are several records per subject accession. STUDY_TIME_COLLECTED indicates when the samples were collected for analysis. Typically, samples are collected at Day 0 of the study - which in our case also happens to be day subjects received the flu vaccine, and some time later - here, the first 5 rows seem to indicate 28 days later. Let's check that our assumption that there are only 2 values in STUDY_TIME_COLLECTED is accurate:

In [44]:
set(hai_result.STUDY_TIME_COLLECTED)
Out[44]:
{0, 28}

VIRUS_STRAIN_REPORTED indicates which virus strain was used for the hemagglutination inhibition assay. Just as we did with STUDY_TIME_COLLECTED, we can check how many unique values there are in this column:

In [45]:
set(hai_result.VIRUS_STRAIN_REPORTED)
Out[45]:
{'B', 'H1N1', 'H3N2'}
In [46]:
# sanity check
534 / 89
Out[46]:
6.0

It seems then that unless there is missing data, we should expect 3 results per study time collected for each subject. Let's go with this assumption for now. To make further analysis easier, we can reorganize the data so that VALUE_PREFERRED for day 0 and day 28 are in two separate columns. The end result is going to be a table with a row for each combination of SUBJECT_ACCESSION and VIRUS_STRAIN_REPORTED, and values at Day 0 and Day 28.

In [47]:
# 1. subset data from day 0.
study_day_0 = hai_result[hai_result['STUDY_TIME_COLLECTED'] == 0].copy()
# 2. to avoid later confusion, rename the VALUE_PREFERRED column to 'Day0'
# and drop the now pointless STUDY_TIME_COLLECTED column.
study_day_0.rename(columns = {"VALUE_PREFERRED" : "Day0"}, inplace=True)
study_day_0.drop(['STUDY_TIME_COLLECTED'], axis=1, inplace=True)
study_day_0.head()
Out[47]:
SUBJECT_ACCESSION Day0 VIRUS_STRAIN_REPORTED
1 SUB134280 10 B
2 SUB134239 40 H3N2
3 SUB134272 10 H1N1
5 SUB134273 10 H3N2
6 SUB134317 40 H1N1
In [48]:
# 3. subset data from day 28.
study_day_28 = hai_result[hai_result['STUDY_TIME_COLLECTED'] == 28].copy()
# 4. to avoid later confusion, rename the VALUE_PREFERRED column to 'Day28'
# and drop the now pointless STUDY_TIME_COLLECTED column.
study_day_28.rename(columns={'VALUE_PREFERRED':'Day28'}, inplace=True)
study_day_28.drop(['STUDY_TIME_COLLECTED'], axis=1, inplace=True)
study_day_28.head()
Out[48]:
SUBJECT_ACCESSION Day28 VIRUS_STRAIN_REPORTED
0 SUB134316 80 H1N1
4 SUB134287 80 H1N1
7 SUB134236 160 H3N2
8 SUB134266 160 H3N2
9 SUB134290 5 B
In [49]:
study_day_0.shape
Out[49]:
(267, 3)
In [50]:
study_day_28.shape
Out[50]:
(267, 3)

Our assumption seems to be holding up so far, since there is an identical number of records for each study day. Now we can merge the two tables into one dataframe:

In [51]:
join_columns = ['SUBJECT_ACCESSION','VIRUS_STRAIN_REPORTED']
hai_values = pd.merge(study_day_0, study_day_28, left_on=join_columns, right_on=join_columns)
hai_values.head()
Out[51]:
SUBJECT_ACCESSION Day0 VIRUS_STRAIN_REPORTED Day28
0 SUB134280 10 B 10
1 SUB134239 40 H3N2 320
2 SUB134272 10 H1N1 10
3 SUB134273 10 H3N2 320
4 SUB134317 40 H1N1 80

We can make sure that our assumption of three tests per subjects per day collected was right by checking the shape of hai_analysis. If we still have 267 rows, it means that the set of unique subject accession/virus strain was the same in both study_day_0 and study_day_28.

In [52]:
hai_values.shape
Out[52]:
(267, 4)

Re-analysis of the data

Let's start by calculating the fold change between day 0 and day 28 and convert it to log2 values to improve visualization later.

In [53]:
hai_values['FOLD_CHANGE'] = hai_values['Day28'] / hai_values['Day0']
hai_values['FOLD_CHANGE_LOG2'] = np.log2(hai_values['FOLD_CHANGE'])
# Let's add a categorical column here for future boxplot
hai_values['response'] = np.where(hai_values.FOLD_CHANGE_LOG2>=2, "High","Low")
hai_values.head()
Out[53]:
SUBJECT_ACCESSION Day0 VIRUS_STRAIN_REPORTED Day28 FOLD_CHANGE FOLD_CHANGE_LOG2 response
0 SUB134280 10 B 10 1.0 0.0 Low
1 SUB134239 40 H3N2 320 8.0 3.0 High
2 SUB134272 10 H1N1 10 1.0 0.0 Low
3 SUB134273 10 H3N2 320 32.0 5.0 High
4 SUB134317 40 H1N1 80 2.0 1.0 Low

Now that we have fold changes, we can integrate the subject information to the HAI results.

You might have noticed that here is a difference of 2 subjects between the list. We could use the same method as in the first section to figure out which, or we can just merge the hai_result table with subjects_merged, which will result in a table containing only the subjects in common to both.

In [54]:
hai_analysis = pd.merge(hai_values, subjects_merged, left_on='SUBJECT_ACCESSION', right_on='SUBJECT_ACCESSION')
hai_analysis.head()
Out[54]:
SUBJECT_ACCESSION Day0 VIRUS_STRAIN_REPORTED Day28 FOLD_CHANGE FOLD_CHANGE_LOG2 response ARM_ACCESSION ARM_NAME GENDER RACE MIN_SUBJECT_AGE
0 SUB134280 10 B 10 1.0 0.0 Low ARM895 Old Female White 77.27
1 SUB134280 20 H1N1 20 1.0 0.0 Low ARM895 Old Female White 77.27
2 SUB134280 40 H3N2 640 16.0 4.0 High ARM895 Old Female White 77.27
3 SUB134239 40 H3N2 320 8.0 3.0 High ARM895 Old Female White 61.84
4 SUB134239 20 H1N1 80 4.0 2.0 High ARM895 Old Female White 61.84

We can check how many subjects we still have:

In [55]:
len(set(hai_analysis.SUBJECT_ACCESSION))
Out[55]:
89

Visualization of the data

  • Plot of post (Day 28) vs. pre (Day 0) vaccination HAI titer, colored by virus strain.

We're going to generate a scatter plot to look at pre vs. post vaccination titers. Because it's a scatter plot, overlap between plotted dots is likely and makes seeing all the data at once difficult. One way around it is to use transparency in the colors, which is what we'll do in the following plot.

First, just as for the previous plots we need to group the data in order to generate plots.

In [56]:
hai_by_virus = hai_analysis.groupby('VIRUS_STRAIN_REPORTED')

To implement the transparency, we can set up a new palette in which the colors are defined with rgba() -- Red, Green, Blue, Alpha, where alpha determines saturation.

In [57]:
rgba_palette = ["rgba(255,165,0,0.5)", 
                "rgba(0,128,128,0.5)",
                "rgba(128,0,128,0.5)",
                "rgba(128,0,0,0.5)",
                "rgba(230,230,250,0.5)"]

We can now use the Plotly scatter plot function to generate traces.
To iterate over the virus strains in the groupby object hai_by_virus, we can use a for loop as described here.

In [58]:
# set up an empty list to store traces:
data4 = []
# set up counter for changing color at each iteration of the loop
j = 0

for virus_strain, group in hai_by_virus:
    # Define traces for each virus strain:
    virus_trace = go.Scatter(
        x = group.Day0,
        y = group.Day28,
        name = virus_strain + " Strain",
        mode = "markers",
        marker = {
            "size" : 10,
            "color" : rgba_palette[j],
            "line" : {
                "width": 2,
                "color" : palette[j]
            }
        }
    )
    # increment counter!
    j += 1
    # add trace to data
    data4.append(virus_trace)

# define layout for this plot
layout4 = go.Layout(
    title = "Pre vs. Post-Vaccination HAI Titers",
    xaxis = {
        "title": "HAI Titer - Pre-Vaccination",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 12,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis ={
        "title": "HAI Titer - Post-Vaccination",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 12,
            "color" : "rgb(107, 107, 107)"
        }
    },
)

# create a figure object:
fig4 = go.Figure(data=data4, layout=layout4)

# call the plot:
plotly.offline.iplot(fig4)    

Even with the transparency, there are too many data points stacked together to be able to see all of them. An alternative to using transparency is to introduce a jitter to separate the dots on the plot (though this is discouraged by some statisticians).
Jitter is built-in in Plotly's boxplots as we've seen above, but for some reason hasn't made it to scatter plots yet. We can use instead one of the following two functions to modify values slightly and create a jitter for dots stacked on top of each other.

In [59]:
def rand_jitter(arr):
    stdev = .001*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

# for log value of data
def rand_jitter_log(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

To separate the data further, we can also try plotting the log values.

In [60]:
# set up an empty list to store traces:
data5 = []
# set up counter for changing color at each iteration of the loop
k = 0

for v_strain, grp in hai_by_virus:
    # Define traces for each virus strain:
    v_trace = go.Scatter(
        x = rand_jitter_log(np.log2(grp.Day0)),
        y = rand_jitter_log(np.log2(grp.Day28)),
        name = v_strain + " Strain",
        mode = "markers",
        marker = {
            "size" : 10,
            "color" : rgba_palette[k],
            "line" : {
                "width": 2,
                "color" : palette[k]
            }
        }
    )
    # increment counter!
    k += 1
    # add trace to data
    data5.append(v_trace)

# we can re-use layout4 since this is basically the same plot.
# create a figure object:
fig5 = go.Figure(data=data5, layout=layout4)

# call the plot:
plotly.offline.iplot(fig5)    

Yet another way to try and see every data point is to plot each category separately in subplots. In Plotly, one way to generate subplots is to change the configuration of the layout, more specifically to re-define what portion of the screen will be dedicated to each trace.
Plotly traces have attributes called xaxis and yaxis, which specify where the range of x and y values respectively will be for that given trace. By default, they are not defined:

In [61]:
print(data5[0]['xaxis'], data5[0]['yaxis'], data5[1]['xaxis'], data5[1]['yaxis'], data5[2]['xaxis'], data5[2]['yaxis'])
None None None None None None

Let's give these attributes a value. We don't need to change the first trace as the default value will map to xaxis by default.

In [62]:
# add axis name:
data5[1]['yaxis'] = 'y2'
data5[1]['xaxis'] = 'x2'
data5[2]['yaxis'] = 'y3'
data5[2]['xaxis'] = 'x3'

Now we can set up a new layout. We need to define domains for our new axes, add a title for the plot and axes names.

In [63]:
layout5 = go.Layout(
    title = "Pre vs. Post-Vaccination HAI Titers",
    xaxis = {
        "domain" : [0, 0.3],
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 12,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis ={
        "title": "HAI Titer - Post-Vaccination",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 12,
            "color" : "rgb(107, 107, 107)"
        }
    },
    xaxis2 = {
        "domain" : [0.35, 0.65],
        "title": "HAI Titer - Pre-Vaccination"
    },
    xaxis3 = {"domain" : [0.7, 1]},
    yaxis2 = {"anchor" : 'x2'},
    yaxis3 = {"anchor" : 'x3'}
)


# create a figure object:
fig5_2 = go.Figure(data=data5, layout=layout5)

# call the plot:
plotly.offline.iplot(fig5_2)    

Conclusion: These plots show a pretty typical triangular shape, seen when subjects already have a certain HAI titer for one strain before vaccination (x axis). HAI titer will most likely not go down after vaccination (y axis).

  • Plot of HAI response vs. pre (Day 0) vaccination HAI titer

Now let's look at the HAI fold changes compared to pre-vaccination titers.

In [64]:
# set up an empty list to store traces:
data6 = []
# set up counter for changing color at each iteration of the loop
m = 0
for vstrain, gp in hai_by_virus:
    # Define traces for each virus strain:
    vtrace = go.Scatter(
        x = rand_jitter_log(np.log2(gp.Day0)),
        y = rand_jitter_log(gp.FOLD_CHANGE_LOG2),
        name = vstrain + " Strain",
        mode = "markers",
        marker = {
            "size" : 10,
            "color" : rgba_palette[m],
            "line" : {
                "width": 2,
                "color" : palette[m]
            }
        }
    )
    # increment counter!
    m += 1
    # add trace to data
    data6.append(vtrace)

# add annotation to the plot
note = go.Scatter(
    x = [10.3, 10.3],
    y = [1, 3],
    mode='text',
    textfont = {"color" : "rgb(100,100,100)"},
    text = ["Low Response", "High Response"],
    showlegend=False
)
data6.append(note)

# define layout for this plot
layout6 = go.Layout(
    title = "HAI Response vs. Pre-Vaccination HAI Titers",
    xaxis = {
        "title": "HAI Titer - Pre-Vaccination",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 12,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis ={
        "title": "HAI Response (Post/Pre-Vaccination)",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 12,
            "color" : "rgb(107, 107, 107)"
        }
    },
    # add a horizontal line to separate high and low responders
    shapes = [{
            'type': 'line',
            'x0': 2,
            'y0': 2,
            'x1': 11,
            'y1': 2,
            'line': {
                'color': 'rgb(100, 100, 100)',
                'width': 2,
                'dash': 'dot',
            },
        },
    ]
     
)

# create a figure object:
fig6 = go.Figure(data=data6, layout=layout6)

# call the plot:
plotly.offline.iplot(fig6)

The scatter plot here doesn't allow us to see clearly whether or not there is a difference between responses depending on the virus strain. Let's investigate this further. We can now take advantage of the categorical column response created above to group the data by virus strain and type of response. The pandas GroupBy.size() method returns each group size, and df.unstack() returns a dataframe nicer to look at, and easier to work with:

In [65]:
hai_by_virus_by_response = hai_analysis.groupby(['VIRUS_STRAIN_REPORTED', 'response']).size().unstack()

hai_by_virus_by_response
Out[65]:
response High Low
VIRUS_STRAIN_REPORTED
B 19 70
H1N1 27 62
H3N2 55 34

Conclusion: There are more subjects showing a low response to the B and H1N1 strains, while the opposite is true for H3N2.

Now let's examine whether the age of the subjects and the virus strain influence the influenza vaccine response.

  • HAI response per virus strain
In [66]:
# define traces for each group:
data7 = []
n = 0
for virus, groups in hai_by_virus:
    boxplot = go.Box(
            y = groups.FOLD_CHANGE_LOG2,
            name = virus + " Strain",
            jitter = 0.5,
            pointpos = -2,
            boxpoints = "all",
            marker = {
                "color" : palette[n],
                "size" : 4
            }
    )
    n += 1
    data7.append(boxplot)
    

# define layout:
layout7 = go.Layout(
    boxmode = "group",
    boxgroupgap = 0.5,
    title = "HAI Response per Virus Strain",
    xaxis = {
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis = {
        "title" : "HAI Response (Post/Pre-Vaccination)",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
)

# create a figure object:
fig7 = go.Figure(data=data7, layout=layout7)

# call the plot:
plotly.offline.iplot(fig7)    

Conclusion: The HAI responses seem to vary by virus types.

  • HAI response by age group
In [67]:
# define traces for each group:
old_vs_young = go.Box(
    y = hai_analysis.FOLD_CHANGE_LOG2,
    x = hai_analysis.ARM_NAME,
    jitter = 0.5,
    pointpos = -2,
    boxpoints = "all",
    marker = {
        "color" : palette[3],
        "size" : 4
    }
)

data8 = [old_vs_young]

# define layout:
layout8 = go.Layout(
    boxmode = "group",
    boxgroupgap = 0.5,
    title = "HAI Response per Arm",
    xaxis = {
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis = {
        "title" : "HAI Response (Post/Pre-Vaccination)",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
)

# create a figure object:
fig8 = go.Figure(data=data8, layout=layout8)

# call the plot:
plotly.offline.iplot(fig8)    

Conclusion: The HAI responses seem to vary by age group (arm)

  • Response varies by 2 factors: Strain and Age

Do younger people have a higher response for each of the three virus strains ?

In [68]:
# define traces for each group:
hai_by_age = hai_analysis.groupby('ARM_NAME')
data9 = []
p = 2
for age, data in hai_by_age:
    bp_age = go.Box(
            y = data.FOLD_CHANGE_LOG2,
            x = data.VIRUS_STRAIN_REPORTED,
            name = age,
            jitter = 0.5,
            pointpos = -2,
            boxpoints = "all",
            marker = {
                "color" : palette[p],
                "size" : 4
            }
    )
    p += 1
    data9.append(bp_age)
    

# define layout:
layout9 = go.Layout(
    boxmode = "group",
    boxgroupgap = 0.5,
    title = "HAI Response by Arm",
    xaxis = {
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
    yaxis = {
        "title" : "HAI Response (Post/Pre-Vaccination)",
        "titlefont" : {
            "size" : 16,
            "color" : "rgb(107, 107, 107)"
        },
        "tickfont" : {
            "size" : 14,
            "color" : "rgb(107, 107, 107)"
        }
    },
)

# create a figure object:
fig9 = go.Figure(data=data9, layout=layout9)

# call the plot:
plotly.offline.iplot(fig9)    

Conclusion: It seems the response to the H3N2 strain is independent from the age group, while young people tend to respond on average more strongly than older ones to the H1N1 and B Strains.

To test our conclusion we can compare the HAI reponse values between arms with a Wilcoxon rank-sum tests, also called a Mann-Whitney U test, thanks to a function from the scipy package called mannwhitneyu. The first two arguments to this function are the two lists being compared. The third argument indicates whether or not to apply a correction for continuity, and the last one indicates that the p-value should represent a one or two-sided hypothesis.

The first step is to get the sets of fold changes for each arm, for each virus strain, which we can do thanks to pandas dataframes indexing and slicing capabilities. For the first test for instance, we want all values of the HAI response - in other words the content of FOLD_CHANGE_LOG2 - for all subjects with ARM_NAME 'Young' and VIRUS_STRAIN_REPORTED 'B'. This is what this first line does: t1_B is a list containing the HAI response values.

In [69]:
t1_B = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Young') & (hai_analysis.VIRUS_STRAIN_REPORTED=='B')]
t2_B = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Old') & (hai_analysis.VIRUS_STRAIN_REPORTED=='B')]

t1_H1N1 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Young') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H1N1')]
t2_H1N1 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Old') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H1N1')]

t1_H3N2 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Young') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H3N2')]
t2_H3N2 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Old') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H3N2')]
In [70]:
u_stat_B, p_val_B = stats.mannwhitneyu(t1_B, t2_B, True, 'two-sided')
u_stat_H1N1, p_val_H1N1 = stats.mannwhitneyu(t1_H1N1, t2_H1N1, True, 'two-sided')
u_stat_H3N2, p_val_H3N2 = stats.mannwhitneyu(t1_H3N2, t2_H3N2, True, 'two-sided')

print("--B Strain--\nU statistics: " + str(u_stat_B) + "\np-value: " + str(p_val_B))
print("\n\n--H1N1 Strain--\nU statistics: " + str(u_stat_H1N1) + "\np-value: " + str(p_val_H1N1))
print("\n\n--H3N2 Strain--\nU statistics: " + str(u_stat_H3N2) + "\np-value: " + str(p_val_H3N2))
--B Strain--
U statistics: 1112.0
p-value: 0.0218306583033


--H1N1 Strain--
U statistics: 1190.0
p-value: 0.00315539046574


--H3N2 Strain--
U statistics: 926.0
p-value: 0.622074450053

Conclusion: The results of this statistical tests show that the HAI response is significantly different between the age groups at a 0.05 level for the B and H1N1 strains, but not for the H3N2 strain