Drivers of Health Care Costs

An Exploratory Data Analysis using the Healthcare Challenge Data Set

Xue Li

December 28, 2021

A multiple linear regression with random effects is fitted to an unbalanced panel data set to identify the major factors contributing to healthcare costs for patients hospitalized for a certain condition.

Research Methodology

  1. Importing and Merging Data Sets
  2. Data Cleaning
  3. Regression Model
  4. Residual Diagnostics
  5. Regression Diagnostics

Importing & Merging Datasets

  • The four data sets are first merged by matching patient_id and bill_id.

  • In the merged data set, each unique patient_id may be matched to several bill_ids with the same admission date. These observations are collapsed further to give the total bill incurred for each hospitalization event, leaving 3,000 unique patient ids and 3,400 unique id-time observations.
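
The merge-then-collapse step described above can be sketched in base R as follows. This is only an illustration on toy data: the data frames `bill_ids` and `bill_amounts` and the columns `date_of_admission` and `amount` are assumed names, not taken from the actual data set.

```r
# Toy stand-ins for two of the source tables (all names and values are hypothetical).
bill_ids <- data.frame(
  bill_id = 1:5,
  patient_id = c("a", "a", "a", "b", "b"),
  date_of_admission = as.Date(c("2011-01-01", "2011-01-01", "2012-06-30",
                                "2011-03-15", "2011-03-15"))
)
bill_amounts <- data.frame(
  bill_id = 1:5,
  amount = c(100, 250, 400, 80, 20)
)

# Step 1: attach each bill amount to its patient and admission date.
bills <- merge(bill_ids, bill_amounts, by = "bill_id")

# Step 2: collapse to one row per patient and admission date,
# summing all bills that belong to the same hospitalization event.
admissions <- aggregate(amount ~ patient_id + date_of_admission,
                        data = bills, FUN = sum)
names(admissions)[names(admissions) == "amount"] <- "total_bill"

n_patients <- length(unique(admissions$patient_id))  # unique patient ids
n_obs <- nrow(admissions)                            # unique id-time observations
```

In this toy example, patient "a" has two admissions and patient "b" has one, so the collapsed data has more id-time rows than patients, which is what makes the panel unbalanced.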

Data Cleaning

  • Inconsistent entries in variables such as gender, resident_status and medical_history_3 are cleaned.

  • Additional continuous variables are created for the regression model:

    1. bmi - body mass index, a standardized measure for comparison that accounts for both height and weight
    2. medication - standardized scale (0 - 1) on number of preop_medications taken
    3. sym_count - standardized scale (0 - 1) on number of symptoms experienced
    4. comorbid - standardized scale (0 - 1) on number of medical histories applicable
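
One possible way to build these derived variables is sketched below. It assumes weight in kilograms, height in centimetres, and 0/1 indicator columns, and it takes the 0 - 1 scale to be the share of indicators that are set; the slides do not state the exact scaling used, and all column names here are hypothetical.

```r
# Hypothetical patient records; column names are illustrative only.
clinical <- data.frame(
  weight = c(70, 85, 54),          # kilograms (assumed unit)
  height = c(175, 160, 150),       # centimetres (assumed unit)
  preop_medication_1 = c(1, 0, 1),
  preop_medication_2 = c(1, 0, 0),
  symptom_1 = c(0, 1, 1),
  symptom_2 = c(0, 1, 1),
  symptom_3 = c(1, 1, 0)
)

# Body mass index: weight in kg divided by squared height in metres.
clinical$bmi <- clinical$weight / (clinical$height / 100)^2

# Share of 0/1 indicators that are set, which always lies between 0 and 1.
share_present <- function(df, prefix) {
  cols <- grep(paste0("^", prefix), names(df), value = TRUE)
  unname(rowMeans(df[, cols, drop = FALSE]))
}
clinical$medication <- share_present(clinical, "preop_medication_")
clinical$sym_count  <- share_present(clinical, "symptom_")
```

A comorbid variable could be built the same way from the medical history indicators. Putting the counts on a common 0 - 1 scale means each coefficient compares a patient with none of the indicators to a patient with all of them.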

Correlation Table between Variables

Regression Model

  • The plm package is used to fit multiple linear regression models on the panel data, with individual fixed effects and with random effects.

  • The Hausman test is then used to choose between fixed and random effects; under the null hypothesis, the random effects model is preferred.
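
The workflow above can be sketched with plm on a simulated unbalanced panel. The data, the index names `id` and `time`, and the reduced two-regressor formula are assumptions made for illustration; they are not the models reported in the tables.

```r
library(plm)

# Simulated unbalanced panel (all values are made up for illustration).
set.seed(1)
n_id   <- 200
visits <- sample(1:3, n_id, replace = TRUE)      # 1 to 3 admissions per patient
id     <- rep(seq_len(n_id), times = visits)
time   <- sequence(visits)                       # admission counter within patient
n_obs  <- length(id)

patient_effect <- rnorm(n_id)[id]                # unobserved, uncorrelated with regressors
toy <- data.frame(
  id = id,
  time = time,
  sym_count = runif(n_obs),
  comorbid  = runif(n_obs)
)
toy$log_total_bill <- 8 + 1.2 * toy$sym_count + 0.6 * toy$comorbid +
  patient_effect + rnorm(n_obs, sd = 0.3)

model_formula <- log_total_bill ~ sym_count + comorbid
fe <- plm(model_formula, data = toy, index = c("id", "time"), model = "within")
re <- plm(model_formula, data = toy, index = c("id", "time"), model = "random")

# Hausman test: under the null hypothesis the random effects model is preferred.
hausman <- phtest(fe, re)
print(hausman)
```

Because the simulated patient effect is drawn independently of the regressors, the random effects assumption holds here by construction. Time-invariant regressors such as gender or race cannot be estimated in the within model, which is why the fixed effects tables report fewer coefficients.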

Individual Fixed Effects Model - Total Bill as Outcome Variable

Term        Raw Estimate  Robust S.E.  T Value  P Value  Raw Estimate 2.5%  Raw Estimate 97.5%
bmi         292.846       212.158      1.380    0.168    -352.421           938.114
comorbid    14645.946     1059.273     13.826   0.000    11832.715          17459.176
sym_count   23392.564     857.009      27.296   0.000    21352.864          25432.265
medication  3188.338      866.957      3.678    0.000    1121.654           5255.021

Random Effects Model - Total Bill as Outcome Variable

Term                   Raw Estimate  Robust S.E.  T Value  P Value  Raw Estimate 2.5%  Raw Estimate 97.5%
(Intercept)            -22609.775    1125.024     -20.097  0.000    -24730.253         -20489.297
gender                 1273.798      233.574      5.454    0.000    814.137            1733.460
age                    224.668       8.748        25.681   0.000    208.965            240.371
as.factor(race)Indian  4097.974      391.305      10.473   0.000    3315.259           4880.689
as.factor(race)Malay   10304.238     365.727      28.175   0.000    9726.741           10881.734
as.factor(race)Others  2449.738      554.500      4.418    0.000    1419.387           3480.090
bmi                    332.370       28.361       11.719   0.000    277.102            387.637
comorbid               14060.735     753.794      18.653   0.000    12603.719          15517.751
sym_count              23994.405     486.033      49.368   0.000    23014.970          24973.840
medication             1012.538      546.013      1.854    0.064    -1.322             2026.397

Random Effects Model - Log(Total Bill) as Outcome Variable

Term                   Raw Estimate  Robust S.E.  T Value  P Value  Exp(Estimate) 2.5%  Exp(Estimate) 97.5%
(Intercept)            7.758         0.043        181.335  0.000    2158.972            2538.562
gender                 0.062         0.009        6.953    0.000    1.045               1.083
age                    0.010         0.000        32.491   0.000    1.010               1.011
as.factor(race)Indian  0.201         0.015        13.326   0.000    1.187               1.260
as.factor(race)Malay   0.436         0.011        40.540   0.000    1.514               1.582
as.factor(race)Others  0.102         0.021        4.941    0.000    1.065               1.152
bmi                    0.016         0.001        14.483   0.000    1.014               1.018
comorbid               0.659         0.029        22.488   0.000    1.824               2.047
sym_count              1.232         0.020        62.135   0.000    3.298               3.562
medication             0.057         0.021        2.760    0.006    1.018               1.102
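
With a log outcome, an exponentiated coefficient reads as a multiplicative effect on the total bill. The sketch below rebuilds the exponentiated interval for the gender row from its rounded estimate and robust standard error. It assumes a normal-approximation 95% interval; the slides do not say how the intervals were computed, so rows may differ slightly after rounding.

```r
# Gender row of the table above: estimate and robust S.E. on the log scale.
estimate  <- 0.062
robust_se <- 0.009

z <- qnorm(0.975)                              # about 1.96 for a 95% interval
ci_log <- estimate + c(-1, 1) * z * robust_se  # interval on the log scale
ci_exp <- exp(ci_log)                          # multiplicative effect on total bill

round(ci_exp, 3)                     # 1.045 1.083, in line with the table
round(100 * (exp(estimate) - 1), 1)  # about 6.4 (percent difference in total bill)
```

Read this way, the gender indicator is associated with a total bill that is roughly 4.5% to 8.3% higher, holding the other regressors fixed.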

Pooled OLS Model for Comparison - Log(Total Bill) as Outcome Variable

Term                   Raw Estimate  Robust S.E.  T Value  P Value  Exp(Estimate) 2.5%  Exp(Estimate) 97.5%
(Intercept)            7.765         0.042        186.455  0.000    2177.662            2550.141
gender                 0.063         0.009        7.373    0.000    1.047               1.083
age                    0.010         0.000        34.201   0.000    1.010               1.011
as.factor(race)Indian  0.197         0.014        13.852   0.000    1.184               1.253
as.factor(race)Malay   0.438         0.010        41.823   0.000    1.517               1.583
as.factor(race)Others  0.101         0.020        5.148    0.000    1.066               1.149
bmi                    0.015         0.001        14.964   0.000    1.014               1.018
comorbid               0.656         0.030        21.721   0.000    1.817               2.045
sym_count              1.230         0.020        60.814   0.000    3.288               3.556
medication             0.050         0.021        2.351    0.019    1.009               1.095

Individual Fixed Effects Model for Comparison - Log(Total Bill) as Outcome Variable

Term        Raw Estimate  Robust S.E.  T Value  P Value  Exp(Estimate) 2.5%  Exp(Estimate) 97.5%
bmi         0.029         0.011        2.705    0.007    1.000               1.059
comorbid    0.687         0.048        14.375   0.000    1.756               2.249
sym_count   1.217         0.033        37.121   0.000    3.087               3.693
medication  0.140         0.035        4.054    0.000    1.051               1.260

Random Effects Model with Lab Result Variables - Log(Total Bill) as Outcome Variable

Term                   Raw Estimate  Robust S.E.  T Value  P Value  Exp(Estimate) 2.5%  Exp(Estimate) 97.5%
(Intercept)            7.738         0.078        99.136   0.000    1972.949            2664.881
gender                 0.062         0.009        6.936    0.000    1.045               1.083
age                    0.010         0.000        32.480   0.000    1.010               1.011
as.factor(race)Indian  0.201         0.015        13.281   0.000    1.187               1.260
as.factor(race)Malay   0.436         0.011        40.508   0.000    1.514               1.582
as.factor(race)Others  0.102         0.021        4.900    0.000    1.064               1.151
bmi                    0.016         0.001        14.475   0.000    1.014               1.018
comorbid               0.659         0.029        22.481   0.000    1.825               2.048
sym_count              1.232         0.020        62.073   0.000    3.299               3.562
medication             0.058         0.021        2.794    0.005    1.019               1.103
lab_result_1           0.000         0.002        -0.017   0.986    0.995               1.005
lab_result_2           0.000         0.002        -0.127   0.899    0.996               1.003
lab_result_3           0.000         0.000        0.967    0.333    1.000               1.001

Correlation between No. of Comorbidities and Log(Total Bill)

Correlation between No. of Symptoms and Log(Total Bill)

Correlation between No. of Medications taken and Log(Total Bill)

Comparison of Regression Models

Residual Diagnostics

Distribution of Residuals

The residuals are approximately normally distributed, so the usual inference on the model estimates (standard errors, t tests and confidence intervals) can reasonably be relied upon.

Q-Q Plot of Residuals

The residuals lie reasonably close to the straight reference line, except at the lower and upper tails, where the extreme residuals are larger in magnitude than expected under normality.

Residual Plot of PLM Regression with R.E.

The points form an unstructured cloud centered at zero, which suggests that the residuals are independent of the fitted values and is consistent with the homoskedasticity assumption of linear regression.
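
The three diagnostic plots can be produced with base R graphics along the following lines. This is a sketch on simulated data, with a plain `lm` fit standing in for the panel model; for a plm fit, `residuals()` supplies the residuals in the same way.

```r
# Simulated data and a plain lm fit stand in for the panel model here.
set.seed(42)
x <- runif(500)
y <- 8 + 1.2 * x + rnorm(500, sd = 0.3)
fit <- lm(y ~ x)

res           <- residuals(fit)
fitted_values <- fitted(fit)

# 1. Distribution of residuals: is the histogram roughly bell-shaped?
hist(res, breaks = 30, main = "Distribution of Residuals", xlab = "Residuals")

# 2. Q-Q plot: do the points follow the reference line, including the tails?
qqnorm(res)
qqline(res)

# 3. Residuals against fitted values: is the cloud unstructured and centered at zero?
plot(fitted_values, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```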

Regression Diagnostics

Fixed or Random Effects: Hausman Test

Since the null hypothesis cannot be rejected (p-value = 0.2572), the random effects model is preferred.

## 
##  Hausman Test
## 
## data:  log_total_bill ~ gender + age + as.factor(race) + bmi + comorbid +  ...
## chisq = 5.3069, df = 4, p-value = 0.2572
## alternative hypothesis: one model is inconsistent
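
As a quick check on the output above, the p-value can be recovered from the reported statistic and degrees of freedom using the upper tail of the chi-squared distribution. Only base R is used; the 5% significance level is an assumed convention.

```r
chisq <- 5.3069   # test statistic reported by the Hausman test
df    <- 4        # degrees of freedom reported by the Hausman test

# Upper-tail probability of a chi-squared distribution with 4 degrees of freedom.
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p_value, 4)   # 0.2572

p_value > 0.05      # TRUE: the null hypothesis is not rejected at the 5% level
```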

Lagrange Multiplier Test for Random Effects

The null hypothesis of the LM test is that the variance across individuals is zero.

Since the null hypothesis is rejected in favor of the alternative, there are significant differences across units (i.e. a panel effect exists).

## 
##  Lagrange Multiplier Test - (Breusch-Pagan) for unbalanced panels
## 
## data:  log_total_bill ~ gender + age + as.factor(race) + bmi + comorbid +  ...
## chisq = 50.108, df = 1, p-value = 1.455e-12
## alternative hypothesis: significant effects
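
A test of this kind can be run with `plmtest()` on a pooled model. The sketch below uses a simulated unbalanced panel with a patient-level effect built in, so the test is expected to reject; the data and the single-regressor formula are assumptions for illustration only.

```r
library(plm)

# Simulated unbalanced panel with a patient-level effect (made-up values).
set.seed(2)
n_id   <- 150
visits <- sample(1:3, n_id, replace = TRUE)
id     <- rep(seq_len(n_id), times = visits)
toy <- data.frame(id = id, time = sequence(visits), sym_count = runif(length(id)))
toy$log_total_bill <- 8 + 1.2 * toy$sym_count +
  rnorm(n_id)[id] +                      # patient-level effect
  rnorm(nrow(toy), sd = 0.3)             # idiosyncratic noise

# The LM test is computed from the residuals of the pooled OLS model.
pooled <- plm(log_total_bill ~ sym_count, data = toy,
              index = c("id", "time"), model = "pooling")
lm_test <- plmtest(pooled, effect = "individual", type = "bp")
print(lm_test)
```

A small p-value indicates that pooled OLS ignores real between-patient variation, which supports using a panel model instead.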

Discussion Points

  • Specific domain knowledge is needed to assess whether there could be other confounders, and hence whether causal inferences can be made
  • Alternative model? PGLM with a Gamma distribution and log link, which performs satisfactorily for distributions with long right tails
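
To illustrate the second point, a Gamma regression with a log link can be fitted in base R with `glm()`. This sketch is a pooled simplification on simulated data and ignores the panel structure that a PGLM would add; the variable names and parameter values are made up.

```r
# Simulated right-skewed bills whose mean depends on sym_count (made-up values).
set.seed(3)
n <- 2000
sym_count <- runif(n)
mu <- exp(8 + 1.2 * sym_count)              # mean bill on the original scale
shape <- 2
total_bill <- rgamma(n, shape = shape, rate = shape / mu)  # Gamma with mean mu

# Gamma GLM with log link: models the mean of total_bill directly,
# so no log transformation of the outcome is needed.
gamma_fit <- glm(total_bill ~ sym_count, family = Gamma(link = "log"))
coef(gamma_fit)   # coefficients are on the log scale; exp() gives multiplicative effects
```

One design difference from the log-linear models: the Gamma GLM models the log of the mean bill rather than the mean of the log bill, so predictions on the original scale need no retransformation.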