Topic: regression analysis


EXERCISE

Download the file with Dutch reimbursed healthcare costs per municipality in 2018: vektis2018_extended.xlsx. This file has two worksheets, one with metadata and one with the data. To import the data into R: openxlsx::read.xlsx("datafiles/vektis2018_extended.xlsx", sheet = "data").

Draw a random sample of 100 observations from this data set, for example with the sample_n() function from the dplyr package (dplyr is part of the tidyverse).
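
A minimal sketch of the import and sampling step, assuming the openxlsx and dplyr packages are installed and the workbook sits at the path given above. The object names vektis and vektis_sample, the seed value, and the made-up stand-in rows used when the file is absent are illustrative assumptions, not part of the exercise.

```r
library(dplyr)

# Fix the seed so the same rows are drawn on every run (the value is arbitrary)
set.seed(2018)

path <- "datafiles/vektis2018_extended.xlsx"  # path as given in the exercise

if (file.exists(path)) {
  # Read the worksheet named "data"
  vektis <- openxlsx::read.xlsx(path, sheet = "data")
} else {
  # Assumption: made-up stand-in rows so the sketch also runs without the file
  vektis <- data.frame(AGE_AVERAGE = rnorm(300, mean = 42, sd = 3))
}

# Draw 100 rows at random, without replacement
vektis_sample <- sample_n(vektis, size = 100)

# Base R alternative: sample() on the row indices
# vektis_sample <- vektis[sample(nrow(vektis), 100), , drop = FALSE]

print(nrow(vektis_sample))
```

Setting a seed is optional, but it makes the sample, and therefore every later result, reproducible.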

  1. Create a correlation matrix with the numerical variables in the data set. Which variable has the highest correlation with COSTS_PER_INSURED_YEAR?
  2. Comment on what can be seen in the correlation matrix.
  3. MODEL1: Generate a linear regression model with COSTS_PER_INSURED_YEAR as response variable (Y-variable) and AGE_AVERAGE as explanatory variable (X-variable).
  4. MODEL2: Generate a linear regression model with COSTS_PER_INSURED_YEAR as response variable (Y-variable) and AGE_MEDIAN as explanatory variable (X-variable).
  5. Compare MODEL1 with MODEL2: which model is more useful for explaining the variation in COSTS_PER_INSURED_YEAR across the municipalities?
  6. MODEL3: Generate a multiple linear regression model with COSTS_PER_INSURED_YEAR as response variable (Y-variable) and a couple of explanatory variables; use the correlation matrix for making a selection of features (explanatory variables) in the model.
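
The workflow in steps 1 to 6 could be sketched roughly as follows. To keep the example runnable without the workbook, it uses a simulated stand-in for the 100-row sample: all values are made up, only the three column names mentioned in the exercise are reused, and MUNICIPALITY_ID and OTHER_VAR are hypothetical columns. Comparing the models by R-squared is one common choice, not the only one.

```r
# Simulated stand-in for the 100-row sample; every value here is made up
set.seed(42)
n <- 100
age_avg <- rnorm(n, mean = 42, sd = 3)
vektis_sample <- data.frame(
  MUNICIPALITY_ID = paste0("M", seq_len(n)),        # hypothetical non-numeric column
  AGE_AVERAGE     = age_avg,
  AGE_MEDIAN      = age_avg + rnorm(n, sd = 1.5),
  OTHER_VAR       = rnorm(n)                        # hypothetical extra variable
)
vektis_sample$COSTS_PER_INSURED_YEAR <-
  200 + 55 * vektis_sample$AGE_AVERAGE + 40 * vektis_sample$OTHER_VAR +
  rnorm(n, sd = 60)

# Steps 1-2: correlation matrix of the numeric columns only
num_vars <- vektis_sample[sapply(vektis_sample, is.numeric)]
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Correlations with the response, strongest first (the response itself dropped)
cor_y <- cor_mat[, "COSTS_PER_INSURED_YEAR"]
cor_y <- cor_y[names(cor_y) != "COSTS_PER_INSURED_YEAR"]
print(cor_y[order(abs(cor_y), decreasing = TRUE)])

# Steps 3-4: simple linear regressions
model1 <- lm(COSTS_PER_INSURED_YEAR ~ AGE_AVERAGE, data = vektis_sample)
model2 <- lm(COSTS_PER_INSURED_YEAR ~ AGE_MEDIAN,  data = vektis_sample)

# Step 5: compare the share of variation explained by each model
print(c(model1 = summary(model1)$r.squared,
        model2 = summary(model2)$r.squared))

# Step 6: multiple regression with predictors picked from the correlation matrix
model3 <- lm(COSTS_PER_INSURED_YEAR ~ AGE_AVERAGE + OTHER_VAR,
             data = vektis_sample)
print(summary(model3)$adj.r.squared)
```

When selecting predictors for MODEL3, it can help to avoid combining variables that are strongly correlated with each other (here AGE_AVERAGE and AGE_MEDIAN), since they carry largely the same information. Adjusted R-squared is used for MODEL3 because plain R-squared never decreases when a predictor is added.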