A. ANOVA and other stat¶

Author: Shanshan Yang, Bioinformatics Core, ASU Time: Dec 11, 2017

import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

datafile="A.anova.csv"
data=pd.read_csv(datafile)

data.head()
data

change column data type¶

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 3 columns):
cellline         19 non-null object
Phenotype        19 non-null float64
mutation_type    19 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 528.0+ bytes

data['mutation_type']=data['mutation_type'].astype('category')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 3 columns):
cellline         19 non-null object
Phenotype        19 non-null float64
mutation_type    19 non-null category
dtypes: category(1), float64(1), object(1)
memory usage: 499.0+ bytes

create a boxplot¶

data.boxplot('Phenotype',by='mutation_type')

<matplotlib.axes._subplots.AxesSubplot at 0x111952550>

ANOVA¶

wildtype = data['Phenotype'][data['mutation_type'] == 0]
stop = data['Phenotype'][data['mutation_type'] == 1]
mutation = data['Phenotype'][data['mutation_type'] == 2]
#print wildtype

from scipy import stats
 
F, p = stats.f_oneway(wildtype,stop,mutation)
print "ANOVA P-value for p53 mutation", p

ANOVA P-value for p53 mutation 0.0173000919169

Tukey's range test (multiple means comparison)¶

Tukey's range test, also known as the Tukey's test, Tukey method, Tukey's honest significance test, Tukey's HSD (honest significant difference) test, is a single-step multiple comparison procedure and statistical test. It can be used on raw data or in conjunction with an ANOVA (post-hoc analysis) to find means that are significantly different from each other.

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

mc=MultiComparison(data['Phenotype'],data['mutation_type'])
Tukey_result=mc.tukeyhsd()
print Tukey_result

Multiple Comparison of Means - Tukey HSD,FWER=0.05
=================================================
group1 group2  meandiff   lower    upper   reject
-------------------------------------------------
  0      1     -72.9021 -225.0812  79.277  False 
  0      2    -141.7095 -256.7461 -26.6729  True 
  1      2     -68.8074 -197.4222 59.8074  False 
-------------------------------------------------

OLS¶

Linear models allows estimation by ordinary least squares (OLS).

import statsmodels.api as sm
from statsmodels.formula.api import ols
 
mod = ols('Phenotype ~ mutation_type',
                data=data).fit()
                
aov_table = sm.stats.anova_lm(mod, typ=2)
print aov_table
print mod.summary()

/Users/shanshanyang/anaconda/envs/gl-env/lib/python2.7/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

                     sum_sq    df         F  PR(>F)
mutation_type  63059.633366   2.0  5.284155  0.0173
Residual       95469.770093  16.0       NaN     NaN
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Phenotype   R-squared:                       0.398
Model:                            OLS   Adj. R-squared:                  0.323
Method:                 Least Squares   F-statistic:                     5.284
Date:                Wed, 13 Dec 2017   Prob (F-statistic):             0.0173
Time:                        14:23:17   Log-Likelihood:                -107.92
No. Observations:                  19   AIC:                             221.8
Df Residuals:                      16   BIC:                             224.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            179.8493     38.623      4.657      0.000      97.973     261.726
mutation_type[T.1]   -72.9021     58.997     -1.236      0.234    -197.971      52.166
mutation_type[T.2]  -141.7095     44.598     -3.178      0.006    -236.252     -47.167
==============================================================================
Omnibus:                        0.302   Durbin-Watson:                   1.117
Prob(Omnibus):                  0.860   Jarque-Bera (JB):                0.399
Skew:                           0.247   Prob(JB):                        0.819
Kurtosis:                       2.491   Cond. No.                         5.09
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

/Users/shanshanyang/anaconda/envs/gl-env/lib/python2.7/site-packages/scipy/stats/stats.py:1334: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=19
  "anyway, n=%i" % int(n))

T-test¶

from scipy.stats import ttest_ind, ttest_ind_from_stats
t,p=ttest_ind(wildtype,stop, equal_var=False)
print 't-test p-value for tumor subtype',p

t-test p-value for tumor subtype 0.48455905737

from scipy.stats import ttest_ind, ttest_ind_from_stats
t,p=ttest_ind(wildtype,mutation, equal_var=False)
print 't-test p-value for tumor subtype',p

t-test p-value for tumor subtype 0.106616090795

	cellline	Phenotype	mutation_type
0	cellline1	1.006932	2
1	cellline2	1.127197	2
2	cellline3	8.203515	2
3	cellline4	8.871560	2
4	cellline5	9.141132	1
5	cellline6	9.571941	2
6	cellline7	10.964782	2
7	cellline8	35.892193	2
8	cellline9	41.879357	0
9	cellline10	44.055486	2
10	cellline11	49.090788	2
11	cellline12	63.386971	1
12	cellline13	75.683290	2
13	cellline14	102.801630	2
14	cellline15	105.925373	0
15	cellline16	110.407862	2
16	cellline17	248.313311	1
17	cellline18	281.190083	0
18	cellline19	290.402265	0