A health insurance company can only be profitable if it collects more in revenue than it spends on the medical care of its beneficiaries. However, although some conditions are more common in certain segments of the population, medical costs are difficult to predict because most of the spending comes from rare conditions. This use case demonstrates how one can accurately predict medical care costs for each individual. The example analysis can be found here.
The data set is a list of individuals with information about their age, sex, BMI, number of children, smoking habit, region and the healthcare costs they claimed that year. Table 1 shows some example rows from the data set.
Before building the model, let's explore the data using Correlational Analysis to look at the correlation between the target (charges) and the remaining variables.
Choose Correlational Analytics > Enter your data source
After clicking the Run button, a series of plots is shown. The first chart shows the correlations of the selected variables with charges, sorted in descending order.
The following plots show the distribution of charges broken down by each compared variable. The figure below shows the distributions of charges for smokers and non-smokers.
This plot is a scatter plot of age against charges. The green line shows a positive correlation between age and charges.
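For readers who want to see what such a ranking involves, here is a minimal sketch in plain Python. It assumes the chart uses the Pearson correlation coefficient and that categorical columns such as smoker are encoded as 1/0 beforehand; the text does not confirm either. The function names and the toy values are hypothetical and are not rows from the actual data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_correlations(columns, target):
    """Return (name, r) pairs for every non-target column, largest r first."""
    scores = [
        (name, pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy values for illustration only (assumption: smoker yes/no encoded as 1/0).
toy = {
    "age": [19, 28, 33, 45, 62],
    "bmi": [27.9, 33.0, 22.7, 30.1, 26.3],
    "smoker": [1, 0, 0, 1, 0],
    "charges": [16000.0, 4400.0, 5200.0, 30000.0, 14000.0],
}

ranking = rank_correlations(toy, "charges")
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

The sort is on the signed coefficient, which matches "descending order" as described; sorting on the absolute value instead would rank strong negative correlations alongside strong positive ones.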
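A trend line like this can be reproduced with an ordinary least-squares fit. The sketch below assumes that is how the line is drawn, which the text does not state; the function name and the toy ages and charges are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values for illustration only, not taken from the data set.
ages = [20, 30, 40, 50, 60]
charges = [3000.0, 5000.0, 7000.0, 9500.0, 12000.0]

slope, intercept = fit_line(ages, charges)
# A positive slope corresponds to a positive correlation between age and charges.
print(f"charges ~ {slope:.1f} * age + {intercept:.1f}")
```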
After exploring correlations between different variables, we move on to building a predictive model to predict the cost for each client. As charges is a continuous value, we choose Regression (rather than Classification, which is for categorical values) to build a predictive model for it. After selecting Regression from Choose Analysis, we set charges as the predicted target and age, bmi, children, region, sex and smoker as predictors. To optimize for accuracy, we set Optimize for quality and a training time limit of 2 hours.
Now click Run and wait for the analysis to finish. When it's done, we will see the performance of the trained model and the importance of each predictor.
The performance shows that the model has strong predictive power (R2 = 0.87), meaning it explains about 87% of the variance in charges. On average, the predicted cost differs from the real cost by $2,432.3 (Mean Absolute Error).
To use the trained model with new data, select the Live Model tab, where you can generate predictions for a new data set or interactively enter predictor values in a form and generate predictions on the fly.
You can also use an API to integrate the model into your existing application (web app, mobile app, etc.). Click on the Live API tab and all the details of the API are shown:
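To make the two metrics concrete, here is a short sketch of how R2 and Mean Absolute Error are conventionally computed from actual and predicted values. These are the standard definitions; the function names and the toy numbers are illustrative and are not results from the trained model.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]

print(mean_absolute_error(actual, predicted))
print(r_squared(actual, predicted))
```

An R2 of 1 means the predictions match the actual values exactly, while 0 means the model does no better than always predicting the mean.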
To make API calls, you first need to generate API tokens by following instructions on the page.
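As a rough sketch of what a call from your own application might look like, the snippet below builds (but does not send) an HTTP request with Python's standard library. Everything specific here is an assumption: the endpoint URL, the bearer-token header scheme, and the JSON payload shape are placeholders, and the real values are the ones shown on the Live API tab. Only the predictor names come from the analysis above.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with the values shown on the Live API tab.
API_URL = "https://api.example.com/v1/predict"
API_TOKEN = "YOUR_API_TOKEN"


def build_prediction_request(record, url=API_URL, token=API_TOKEN):
    """Build a JSON POST request for one individual's predictor values."""
    body = json.dumps(record).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    # Assumption: the token is sent as a bearer token.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Illustrative predictor values; the field names match the predictors above.
record = {
    "age": 40,
    "sex": "female",
    "bmi": 27.5,
    "children": 2,
    "smoker": "no",
    "region": "southwest",
}

request = build_prediction_request(record)
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```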
The above example shows how one can build a predictive model to predict health care costs within a couple of hours. If you want to learn more, book a call with us to see how it can apply to your data and business.