Evaluating logistic regression using deviance in Spark

Evaluating a logistic regression model with the likelihood ratio test and deviance in Spark

For a project I wanted to test whether my logistic regression model was better than a null model. In R this is very simple to do using the deviance. However, I have enormous amounts of data and I use Spark to handle it. I wanted to run the same test I did in R, but now in Spark. I could not find much online, so when I succeeded in getting the same results, I decided to put this online in case anyone else is looking for it.

The example dataset I use here consists of four variables:

  • GRE, Graduate Record Exam scores (treated as continuous)
  • GPA, Grade Point Average (treated as continuous)
  • Rank, the prestige of the undergraduate institution (treated as categorical, 1 to 4)
  • Admit, admission into graduate school; this is the binary response variable (0/1)

These data can be downloaded from: http://www.ats.ucla.edu/stat/data/binary.csv

Using R

Below you can find an example of fitting a generalized linear model on the dataset. The summary shows the null deviance and the p-value for the null model versus the fitted model. Below this example you can find the code to do the same in PySpark.

# Read in the data
data = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Convert rank to a factor to indicate it should be treated as categorical
data$rank <- factor(data$rank)
# Build the logistic regression model
model = glm(admit ~ gre + gpa + rank, data = data, family = 'binomial')
summary(model)
################################################################################################
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4
################################################################################################
#get p-value null model versus actual model
with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
################################################################################################
7.578194e-08

From the model summary we can see that gre, gpa and rank levels 2, 3 and 4 are statistically significant.

“We may also wish to see measures of how well our model fits. This can be particularly useful when comparing competing models. The output produced by summary(model) included indices of fit (shown below the coefficients), including the null and deviance residuals and the AIC. One measure of model fit is the significance of the overall model. This test asks whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The test statistic is the difference between the residual deviance for the model with predictors and the null model. The test statistic is distributed chi-squared with degrees of freedom equal to the differences in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model).” (https://stats.idre.ucla.edu/r/dae/logit-regression/)

To find the difference in deviance for the two models (i.e., the test statistic) we can do:

with(model, null.deviance - deviance)

The degrees of freedom for the difference between the two models is equal to the number of predictor variables in the model, and can be obtained using:

with(model, df.null - df.residual)

Finally, the p-value can be obtained using:

with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))

which results in 7.578194e-08.
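The arithmetic behind that p-value is simple. As a quick check (in Python, for consistency with the Spark section below) using the rounded deviances and degrees of freedom from the R summary above:

```python
# Deviances and degrees of freedom taken from the R summary above
null_deviance, residual_deviance = 499.98, 458.52
df_null, df_residual = 399, 394

# Likelihood-ratio test statistic and its degrees of freedom
test_statistic = null_deviance - residual_deviance  # ~41.46
df = df_null - df_residual                          # 5, one per predictor column
print(test_statistic, df)
```

The statistic 41.46 is then compared against a chi-squared distribution with 5 degrees of freedom.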

Using Spark

Load data and show

In [3]:
# Necessary imports
import pandas as pd
from pyspark.sql import SQLContext

# Initialize the SQL context (assumes a SparkContext `sc` already exists)
sqlContext = SQLContext(sc)

# Load the data and transform to Spark dataframe
df = pd.read_csv('https://stats.idre.ucla.edu/stat/data/binary.csv')
df[['admit']] = df[['admit']].astype(float)    # regression algorithm will not work unless float
df = sqlContext.createDataFrame(df)
In [4]:
df.show(5)
print("Nr of rows: {}".format(df.count()))
+-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
|  0.0|380|3.61|   3|
|  1.0|660|3.67|   3|
|  1.0|800| 4.0|   1|
|  1.0|640|3.19|   4|
|  0.0|520|2.93|   4|
+-----+---+----+----+
only showing top 5 rows

Nr of rows: 400
In [8]:
df.groupBy('admit').count().show()
+-----+-----+
|admit|count|
+-----+-----+
|  0.0|  273|
|  1.0|  127|
+-----+-----+
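These two class counts are all we need to check the null deviance by hand: a null (intercept-only) model predicts the overall admission rate 127/400 for every row, and its deviance is minus twice its log-likelihood. A minimal sketch using only the standard library:

```python
from math import log

# Class counts from the groupBy above
n_admit, n_reject = 127, 273
n = n_admit + n_reject

# Null model: predict the overall admission rate for every observation
p = n_admit / n   # 0.3175

# Deviance of the intercept-only model: -2 * log-likelihood
null_deviance = -2 * (n_admit * log(p) + n_reject * log(1 - p))
print(null_deviance)   # ~499.98, matching the nullDeviance Spark reports below
```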

Pre-processing and pipeline

In [11]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# indexer for categorical features
rank_indexer = StringIndexer(inputCol = 'rank', outputCol = 'rank_ind', handleInvalid="skip")
# encoder for categorical features (note: this is the Spark 2.x API; in Spark 3.x,
# OneHotEncoder takes inputCols/outputCols lists and must be fit as an Estimator)
rank_encoder = OneHotEncoder(inputCol = 'rank_ind', outputCol = 'rank_enc')

assembler = VectorAssembler(inputCols=['gre','gpa','rank_enc'], outputCol="features")

from pyspark.ml.regression import GeneralizedLinearRegression
glr = GeneralizedLinearRegression(family="binomial", link = 'logit', maxIter=10, labelCol="admit", featuresCol="features")

stages = [rank_indexer] + [rank_encoder] + [assembler] + [glr]

from pyspark.ml import Pipeline
final_pipeline = Pipeline(
    stages = stages
)
In [18]:
pipelineModel = final_pipeline.fit(df)
glr = pipelineModel.stages[3]
transformed_data = pipelineModel.transform(df)
In [19]:
print('Model Intercept:', glr.intercept)
print('Model coefficients:', glr.coefficients)
Model Intercept: -3.989979073330664
Model coefficients: [0.0023,0.804,-0.6754,-1.3402,-1.5515]
In [7]:
transformed_data.show(5)
+-----+---+----+----+--------+-------------+--------------------+
|admit|gre| gpa|rank|rank_ind|     rank_enc|            features|
+-----+---+----+----+--------+-------------+--------------------+
|  0.0|380|3.61|   3|     1.0|(3,[1],[1.0])|[380.0,3.61,0.0,1...|
|  1.0|660|3.67|   3|     1.0|(3,[1],[1.0])|[660.0,3.67,0.0,1...|
|  1.0|800| 4.0|   1|     3.0|    (3,[],[])|(5,[0,1],[800.0,4...|
|  1.0|640|3.19|   4|     2.0|(3,[2],[1.0])|[640.0,3.19,0.0,0...|
|  0.0|520|2.93|   4|     2.0|(3,[2],[1.0])|[520.0,2.93,0.0,0...|
+-----+---+----+----+--------+-------------+--------------------+
only showing top 5 rows

Check summary

In [20]:
summary = glr.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
Coefficient Standard Errors: [0.0010939976579436175, 0.3318193045590884, 0.31648966326451633, 0.34530642335869066, 0.41783163745144813, 1.1399509620258108]
T Values: [2.069863467930688, 2.4231186619732323, -2.1341705792111876, -3.881201813253692, -3.713131170203388, -3.500132204143289]
P Values: [0.038465131844474865, 0.01538789939973162, 0.03282882008806931, 0.00010394154054749194, 0.00020471071816796638, 0.00046502746691179375]
Dispersion: 1.0
Null Deviance: 499.976517555
Residual Degree Of Freedom Null: 399
Deviance: 458.517492476
Residual Degree Of Freedom: 394
AIC: 470.517492476

You can see that all of the values match the R results. This lines up because StringIndexer orders the rank levels by descending frequency, so rank 1 (the least frequent level) gets the last index, and OneHotEncoder drops the last index by default; rank 1 therefore serves as the reference level, just as in R.

Let’s calculate whether our model is better than a null model

In [22]:
# what is the critical value at the 5% significance level?
from scipy.stats import chi2
df_diff = 5   # number of predictor variables (399 - 394 degrees of freedom)
crit = chi2.ppf(q=0.95, df=df_diff)
print("Critical value:", crit)

# what is the p-value?
p_value = 1 - chi2.cdf(x=summary.nullDeviance - summary.deviance, df=df_diff)
print("P value:", p_value)   # same as in R!
Critical value: 11.070497693516351
P value: 7.5781942276975656e-08

Yes, it is: the test statistic (499.98 - 458.52 ≈ 41.46) is far above the critical value of 11.07. And again, the result is the same as in R.
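As a final aside, if SciPy happens not to be available on the cluster driver, the upper-tail chi-square probability has a closed form for odd degrees of freedom. A sketch using only the standard library, with the formula written out specifically for df = 5 (our case):

```python
from math import erfc, exp, pi, sqrt

def chi2_sf_df5(x):
    """Upper-tail probability P(X > x) for a chi-square with 5 degrees of freedom.

    Closed form for df = 5: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * (1 + x/3).
    """
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2) * (1 + x / 3)

# Deviances as reported by the Spark summary above
stat = 499.976517555 - 458.517492476   # ~41.459
print(chi2_sf_df5(stat))               # ~7.578e-08, the same p-value as R and scipy
```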