Version 1.0, last modified November 2017
Python 3.6
pandas 0.21.0
seaborn 0.8.1
matplotlib 2.1.0
This is a tutorial on using Python tools to prepare ImmPort study information for analysis. This tutorial should NOT be considered a real scientific analysis of this study; it is ONLY intended to show how to prepare data for real analysis. The tutorial uses SDY736 as the example study, so we start by downloading the SDY736-DR24_Tab.zip file from the ImmPort Data Browser site.
The Data Release study packages available for download contain many types of data that you can explore. The data in the SDY736-DR24_Tab.zip package was extracted from a MySQL database, and each file in the package contains the data from one table.
An overview of the ImmPort data model is available here, and the table definitions are available here
For this analysis we started by creating a top-level directory named SDY736. Below the SDY736 directory, three directories were created: data, downloads and notebooks. The SDY736-DR24_Tab.zip file was downloaded from the ImmPort Data Browser web site to the downloads directory.
The following commands were used to unzip the package and move its contents to the data directory:
cd downloads
unzip SDY736-DR24_Tab.zip
cd SDY736-DR24_Tab/Tab
mv * ../../../data
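The same unpacking step can also be done from Python, which helps on systems without the unzip command. The sketch below is only an illustration: the helper name extract_tab_files is hypothetical, and it assumes, like the shell commands above, that every file in the archive should land directly in the data directory.

```python
import os
import shutil
import zipfile


def extract_tab_files(zip_path, data_dir):
    """Copy every file in the archive into data_dir, dropping the archive's folders.

    Hypothetical helper, not part of any ImmPort tooling.
    """
    os.makedirs(data_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            name = os.path.basename(member.filename)
            if not name:  # directory entries end with "/" and have no base name
                continue
            target = os.path.join(data_dir, name)
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(name)
    return sorted(extracted)


# Example call (paths are assumptions based on the directory layout above):
# extract_tab_files("downloads/SDY736-DR24_Tab.zip", "data")
```

The function returns the sorted list of file names it wrote, which is a quick way to confirm that the expected tables are present.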
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
META_DATA_DIR="../data/Tab"
In the ImmPort model a study can have one or more arm_or_cohort records, and a subject is assigned to an arm_or_cohort through the arm_2_subject table. Read in the study, arm_or_cohort and arm_2_subject data, so we can build a DataFrame that includes the study_accession, arm_accession and subject_accession. For the last step in this section, we will merge the subject demographic information into the study_design DataFrame.
study = pd.read_table(META_DATA_DIR + "/study.txt",sep="\t")
arm_or_cohort = pd.read_table(META_DATA_DIR + "/arm_or_cohort.txt",sep="\t")
arm_2_subject = pd.read_table(META_DATA_DIR + "/arm_2_subject.txt",sep="\t")
subject = pd.read_table(META_DATA_DIR + "/subject.txt",sep="\t")
# Uncomment a line below to review DataFrame information
#study.head()
#arm_or_cohort.head()
#arm_2_subject.head()
#subject.head()
# Uncomment a line below to review the columns in a DataFrame
#print("study: ",study.columns)
#print("arm_or_cohort: ",arm_or_cohort.columns)
#print("arm_2_subject: ",arm_2_subject.columns)
#print("subject: ",subject.columns)
In some cases the ARM_NAME can be quite long, so it will not be included when we later merge tables together. Refer to the table below to look up the ARM_NAME for each arm.
arm_or_cohort
In this section, we will merge table information together to simplify downstream analysis, plus remove columns not necessary for this tutorial. At the end of this process we will have a DataFrame study_design containing the merged content of the arm_or_cohort table, the arm_2_subject table and subject table.
In the ImmPort data model a subject may be assigned to multiple studies, and their age information may change over time. For this reason when a subject is assigned to an arm within a study, the age and phenotype information is contained in the arm_2_subject table.
# Uncomment a line below to review the number of rows and columns in a DataFrame
#print("study: ",study.shape)
#print("arm_or_cohort: ",arm_or_cohort.shape)
#print("arm_2_subject: ",arm_2_subject.shape)
#print("subject: ",subject.shape)
arm_or_cohort_short = arm_or_cohort[['STUDY_ACCESSION','ARM_ACCESSION']]
study_arm_subject = pd.merge(arm_or_cohort_short,arm_2_subject,
left_on='ARM_ACCESSION', right_on='ARM_ACCESSION')
study_design = pd.merge(study_arm_subject,subject,
left_on='SUBJECT_ACCESSION',right_on='SUBJECT_ACCESSION')
study_design.head()
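An inner merge silently drops rows whose key has no match, so it can be worth checking the result. The sketch below repeats the two merges on small invented tables (the accession values and the _demo names are made up for illustration) and uses the validate argument of pd.merge, available from pandas 0.21.0, to assert the expected key relationship.

```python
import pandas as pd

# Toy stand-ins for the three ImmPort tables (all values are invented).
arm_demo = pd.DataFrame({"STUDY_ACCESSION": ["SDY000", "SDY000"],
                         "ARM_ACCESSION": ["ARM1", "ARM2"]})
arm_2_subject_demo = pd.DataFrame({"ARM_ACCESSION": ["ARM1", "ARM1", "ARM2"],
                                   "SUBJECT_ACCESSION": ["SUB1", "SUB2", "SUB3"],
                                   "MIN_SUBJECT_AGE": [30.0, 41.0, 52.0]})
subject_demo = pd.DataFrame({"SUBJECT_ACCESSION": ["SUB1", "SUB2", "SUB3"],
                             "GENDER": ["Female", "Male", "Female"]})

# on= is shorthand for identical left_on/right_on keys.
# validate= raises MergeError if the keys do not have the stated relationship.
arm_subject_demo = pd.merge(arm_demo, arm_2_subject_demo,
                            on="ARM_ACCESSION", validate="one_to_many")
design_demo = pd.merge(arm_subject_demo, subject_demo,
                       on="SUBJECT_ACCESSION", validate="many_to_one")

# Every arm assignment should survive both merges.
assert len(design_demo) == len(arm_2_subject_demo)
print(design_demo)
```

Comparing the row count of the merged frame with that of the arm_2_subject table is a cheap check that no subject was lost along the way.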
arm_or_cohort[['ARM_ACCESSION','NAME']]
study_design.groupby('ARM_ACCESSION').count()['SUBJECT_ACCESSION']
study_design.groupby('GENDER').count()['SUBJECT_ACCESSION']
pd.crosstab(study_design.ARM_ACCESSION,study_design.GENDER)
sns.boxplot(x='ARM_ACCESSION',y='MIN_SUBJECT_AGE',data=study_design);
sns.boxplot(x='ARM_ACCESSION',y='MIN_SUBJECT_AGE',hue='GENDER',data=study_design);
Let's start by looking at the flow cytometry derived data. The data is parsed and is available in the fcs_analyzed_result.txt file. This file can get quite wide so we can change the default parameters for pandas DataFrame display in order to be able to see the table.
pd.options.display.max_columns = 30
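Setting pd.options.display.max_columns changes the display for the rest of the session. If the wider display is only needed for one table, pd.option_context applies the setting temporarily. A minimal sketch on an invented 40-column frame:

```python
import pandas as pd

# A single-row frame with 40 columns, invented for illustration.
wide_demo = pd.DataFrame([list(range(40))],
                         columns=["col%d" % i for i in range(40)])

before = pd.get_option("display.max_columns")

# The option is changed only inside the with block.
with pd.option_context("display.max_columns", 30):
    inside = pd.get_option("display.max_columns")
    print(wide_demo)

# On exit, the previous value is restored.
after = pd.get_option("display.max_columns")
```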
fcs_analyzed_result = pd.read_table(META_DATA_DIR + '/fcs_analyzed_result.txt',sep="\t")
fcs_analyzed_result.head()
fcs_analyzed_result.groupby('STUDY_TIME_COLLECTED').count()['SUBJECT_ACCESSION']
We can focus on a subset of the information by filtering out columns that are not of interest. Similarly, most of the measures are taken at Study Day 0, and we can filter out results obtained at any other STUDY_TIME_COLLECTED.
keep = ['ARM_ACCESSION','SUBJECT_ACCESSION','POPULATION_DEFNITION_REPORTED',
'POPULATION_NAME_REPORTED','POPULATION_STAT_UNIT_REPORTED',
'POPULATION_STATISTIC_REPORTED','STUDY_TIME_COLLECTED']
fcs_analyzed_result_short = fcs_analyzed_result[fcs_analyzed_result['STUDY_TIME_COLLECTED']==0][keep]
fcs_analyzed_result_short.groupby('STUDY_TIME_COLLECTED').count()['SUBJECT_ACCESSION']
fcs_analyzed_result_short.groupby('POPULATION_DEFNITION_REPORTED').count()['SUBJECT_ACCESSION']
fcs_analyzed_result_short.head(10)
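The row filter and column selection above can also be written as a single .loc call, which keeps both choices visible in one place. The sketch below uses a small invented table; the COMMENTS column and the _demo names are hypothetical.

```python
import pandas as pd

# Toy stand-in for fcs_analyzed_result (all values are invented).
fcs_demo = pd.DataFrame({
    "ARM_ACCESSION": ["ARM1", "ARM1", "ARM2", "ARM2"],
    "SUBJECT_ACCESSION": ["SUB1", "SUB1", "SUB2", "SUB2"],
    "STUDY_TIME_COLLECTED": [0, 28, 0, 28],
    "POPULATION_STATISTIC_REPORTED": [120.0, 130.0, 95.0, 99.0],
    "COMMENTS": ["a", "b", "c", "d"],  # hypothetical column we want to drop
})

keep_demo = ["ARM_ACCESSION", "SUBJECT_ACCESSION",
             "STUDY_TIME_COLLECTED", "POPULATION_STATISTIC_REPORTED"]

# Boolean mask selects the rows, the list of names selects the columns.
day0_mask = fcs_demo["STUDY_TIME_COLLECTED"] == 0
day0_demo = fcs_demo.loc[day0_mask, keep_demo].copy()
print(day0_demo)
```

The trailing .copy() makes the subset an independent DataFrame, which avoids SettingWithCopyWarning if columns are later added or modified.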
For ease of manipulation, we can subset each reported population into its own DataFrame and look at how the data are distributed across the arms of the study.
name = 'CD3+, CD4-, CD8+, CD28int, CD95low'
column_keep = ['ARM_ACCESSION','SUBJECT_ACCESSION','POPULATION_STAT_UNIT_REPORTED','POPULATION_STATISTIC_REPORTED']
allCD3pos = fcs_analyzed_result_short[fcs_analyzed_result_short['POPULATION_DEFNITION_REPORTED']==name][column_keep]
name = 'CD3-'
allCD3neg = fcs_analyzed_result_short[fcs_analyzed_result_short['POPULATION_DEFNITION_REPORTED']==name][column_keep]
allCD3pos.shape
allCD3neg.shape
arm_or_cohort[['ARM_ACCESSION','NAME']]
In the case of SDY736, the flow cytometry (FCS) derived analysis results were provided in two formats: cell counts and percentages. Both are captured in the same table and need to be separated before further analysis.
pd.crosstab(allCD3pos.ARM_ACCESSION, allCD3pos.POPULATION_STAT_UNIT_REPORTED)
pd.crosstab(allCD3neg.ARM_ACCESSION, allCD3neg.POPULATION_STAT_UNIT_REPORTED)
name = 'cells/ul'
CD3pos = allCD3pos[allCD3pos['POPULATION_STAT_UNIT_REPORTED']==name]
CD3neg = allCD3neg[allCD3neg['POPULATION_STAT_UNIT_REPORTED']==name]
name = 'percentage'
pCD3pos = allCD3pos[allCD3pos['POPULATION_STAT_UNIT_REPORTED']==name]
pCD3neg = allCD3neg[allCD3neg['POPULATION_STAT_UNIT_REPORTED']==name]
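When a table mixes several units, one alternative to filtering once per unit is to split it in a single pass with groupby. This is a sketch on invented values, not the tutorial's own method; the _demo names are hypothetical.

```python
import pandas as pd

# Toy table mixing two units in one column (all values are invented).
unit_demo = pd.DataFrame({
    "ARM_ACCESSION": ["ARM1", "ARM1", "ARM2", "ARM2"],
    "POPULATION_STAT_UNIT_REPORTED": ["cells/ul", "percentage",
                                      "cells/ul", "percentage"],
    "POPULATION_STATISTIC_REPORTED": [120.0, 12.5, 95.0, 9.0],
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# so a dict comprehension gives one DataFrame per unit.
by_unit = {unit: frame
           for unit, frame in unit_demo.groupby("POPULATION_STAT_UNIT_REPORTED")}

counts_demo = by_unit["cells/ul"]
percent_demo = by_unit["percentage"]
print(sorted(by_unit))
```

Because the keys come from the data itself, an unexpected unit string shows up as an extra dictionary key rather than being silently dropped.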
There are now four DataFrames: one for each combination of the two measures (cell counts and percentages) and the two reported cell populations, which differ in their CD3 status.
The POPULATION_STATISTIC_REPORTED column contains the reported value of the assay. We can look at the mean of these values (cell counts or percentages) in each group.
CD3pos.groupby('ARM_ACCESSION').mean()['POPULATION_STATISTIC_REPORTED']
pCD3pos.groupby('ARM_ACCESSION').mean()['POPULATION_STATISTIC_REPORTED']
CD3neg.groupby('ARM_ACCESSION').mean()['POPULATION_STATISTIC_REPORTED']
pCD3neg.groupby('ARM_ACCESSION').mean()['POPULATION_STATISTIC_REPORTED']
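Selecting the value column before aggregating restricts the computation to that column, which avoids averaging unrelated columns; newer pandas releases may also raise an error when mean() meets text columns. The sketch below, on invented values, also shows agg for getting several summary statistics per arm at once.

```python
import pandas as pd

# Toy stand-in for one of the four DataFrames (all values are invented).
stat_demo = pd.DataFrame({
    "ARM_ACCESSION": ["ARM1", "ARM1", "ARM2"],
    "SUBJECT_ACCESSION": ["SUB1", "SUB2", "SUB3"],
    "POPULATION_STATISTIC_REPORTED": [10.0, 20.0, 30.0],
})

grouped = stat_demo.groupby("ARM_ACCESSION")["POPULATION_STATISTIC_REPORTED"]

# Mean per arm, computed on the value column only.
means_demo = grouped.mean()

# Several statistics per arm in one table.
summary_demo = grouped.agg(["count", "mean", "median"])
print(summary_demo)
```

Reporting the count next to the mean makes it easier to notice arms with very few measurements.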
#fcs_header = pd.read_table(META_DATA_DIR + '/fcs_header.txt',sep="\t")
#fcs_header
CD3p_CMVp = CD3pos[CD3pos['ARM_ACCESSION']=='ARM3054']['POPULATION_STATISTIC_REPORTED']
CD3p_CMVn = CD3pos[CD3pos['ARM_ACCESSION']=='ARM3055']['POPULATION_STATISTIC_REPORTED']
CD3n_CMVp = CD3neg[CD3neg['ARM_ACCESSION']=='ARM3054']['POPULATION_STATISTIC_REPORTED']
CD3n_CMVn = CD3neg[CD3neg['ARM_ACCESSION']=='ARM3055']['POPULATION_STATISTIC_REPORTED']
pCD3p_CMVp = pCD3pos[pCD3pos['ARM_ACCESSION']=='ARM3054']['POPULATION_STATISTIC_REPORTED']
pCD3p_CMVn = pCD3pos[pCD3pos['ARM_ACCESSION']=='ARM3055']['POPULATION_STATISTIC_REPORTED']
pCD3n_CMVp = pCD3neg[pCD3neg['ARM_ACCESSION']=='ARM3054']['POPULATION_STATISTIC_REPORTED']
pCD3n_CMVn = pCD3neg[pCD3neg['ARM_ACCESSION']=='ARM3055']['POPULATION_STATISTIC_REPORTED']
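The eight variables above follow one pattern: one Series of reported values per combination of population and arm. As an alternative sketch, the same structure can be built as a dictionary keyed by (population, arm) using a two-column groupby. The data and the _demo names below are invented for illustration.

```python
import pandas as pd

# Toy table with two populations and two arms (all values are invented).
group_demo = pd.DataFrame({
    "POPULATION_DEFNITION_REPORTED": ["CD3-", "CD3-", "CD3-", "CD8+", "CD8+"],
    "ARM_ACCESSION": ["ARM1", "ARM1", "ARM2", "ARM1", "ARM2"],
    "POPULATION_STATISTIC_REPORTED": [5.0, 7.0, 6.0, 1.0, 2.0],
})

# Grouping on two columns yields tuple keys: (population, arm).
values_demo = {
    key: grp["POPULATION_STATISTIC_REPORTED"]
    for key, grp in group_demo.groupby(["POPULATION_DEFNITION_REPORTED",
                                        "ARM_ACCESSION"])
}

for key in sorted(values_demo):
    print(key, list(values_demo[key]))
```

A dictionary scales to more populations or arms without new variable names, at the cost of slightly longer lookups such as values_demo[("CD3-", "ARM1")].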
plt.style.use('ggplot')
# density=True replaces normed=True, which is deprecated as of matplotlib 2.1
kwargs = dict(histtype='stepfilled',alpha=0.3,density=True,bins=40)
plt.hist(CD3p_CMVp,**kwargs,label='CD3+, CMV+', color='blue')
plt.hist(CD3p_CMVn,**kwargs,label='CD3+, CMV-', color='red')
plt.legend();
plt.style.use('fivethirtyeight')
# density=True replaces normed=True, which is deprecated as of matplotlib 2.1
kwargs = dict(histtype='stepfilled',alpha=0.3,density=True,bins=40)
plt.hist(CD3p_CMVp,**kwargs,label='CD3+, CMV+')
plt.hist(CD3p_CMVn,**kwargs,label='CD3+, CMV-')
plt.legend();
sns.kdeplot(CD3p_CMVp,shade=True,label='CD3+, CMV+',color="blue")
sns.kdeplot(CD3p_CMVn,shade=True,label='CD3+, CMV-', color="red")
plt.xlabel("cells/ul");
sns.kdeplot(pCD3p_CMVp,shade=True,label='CD3+, CMV+',color="blue")
sns.kdeplot(pCD3p_CMVn,shade=True,label='CD3+, CMV-', color="red");
plt.xlabel("percentage");
sns.kdeplot(CD3n_CMVp,shade=True,label='CD3-, CMV+', color="orange")
sns.kdeplot(CD3n_CMVn,shade=True,label='CD3-, CMV-', color="teal")
plt.xlabel("cells/ul");
sns.kdeplot(pCD3n_CMVp,shade=True,label='CD3-, CMV+', color="orange")
sns.kdeplot(pCD3n_CMVn,shade=True,label='CD3-, CMV-', color="teal")
plt.xlabel("percentage");
plt.style.use('ggplot')
g = sns.factorplot('ARM_ACCESSION','POPULATION_STATISTIC_REPORTED',data=CD3pos,kind="box")
g.set_xticklabels(['CMV+', 'CMV-']).set_axis_labels("CD3+", "cells/ul");
plt.style.use('ggplot')
g = sns.factorplot('ARM_ACCESSION','POPULATION_STATISTIC_REPORTED',data=pCD3pos,kind="box")
g.set_xticklabels(['CMV+', 'CMV-']).set_axis_labels("CD3+", "percentages");
plt.style.use('ggplot')
g = sns.factorplot('ARM_ACCESSION','POPULATION_STATISTIC_REPORTED',data=CD3neg,kind="box", palette=["orange","teal"])
g.set_xticklabels(['CMV+', 'CMV-']).set_axis_labels("CD3-", "cells/ul");
plt.style.use('ggplot')
g = sns.factorplot('ARM_ACCESSION','POPULATION_STATISTIC_REPORTED',data=pCD3neg,kind="box",palette=["orange","teal"])
g.set_xticklabels(['CMV+', 'CMV-']).set_axis_labels("CD3-", "percentages");