All State Claim Severity

10 min readMay 2, 2021

1.Business Problem

Allstate is an largest insurance company in the United States of America. This insurance industry has massive scale, and serves as an important function by helping individuals and companies management risk. This company offers the insurance. The main objective of this case study (or) problem statement is to predict the severity loss value of an insurance claim using machine learning Regression techniques. Allstate is currently developing automated methods of predicting the cost, and hence severity of claims.

2.Use of Machine Learning Algorithm:

Machine Learning is changing the way of data extraction and interpretation that involves high value prediction and guides for better decision and smart action in real time without human intervention. It is continuously evolving field and is much important for data scientists for making crucial decisions and also helps in analyzing the large chunks of data.

3.Source of DATA:

Allstate Claims Severity

How severe is an insurance claim?

www.kaggle.com

3.1 Overview of the Data

There are 2 datasets available in the competition train.csv and test.csv.

a.Train.csv: This csv contains 1,88,318 data points and 132 features including target variable “Loss”

This 132 features has combination of categorical features and continuous features.

b.Test.csv: This csv contains 1,25,546 data points and 131 features. With the help of these data need to predict the loss.

The data includes 116 categorical features and 14 continuous features

4.Performance Metrics:

4.1 MAE(Mean Absolute Error)

MAE is the average of absolute difference between actual value and predicted value.

It is robust to outliers. Since the dataset contains some outliers it will handle those outliers. MAE can be easily interpretable as it is simple and easy to calculate.

5.Exploratory Data Analysis:

separating cataegeorical and continuous features.

By separating categorical and continuous features we have 116 categorical features and 14 continuous features.

To find unique number of categories in each categorical features we will use bar plot to plot 116 features.

Observations:

We can draw a conclusion from the above plots that most of the features had minimum of 2 features and maximum of 326 .
As observed 116th cat has maximum features.
So overall we can see we have some categorical features with high cardinality.

5.1 Univariate analysis on continuous features by using distplot, boxplot, violin plot.

A. Distplot:

Observations:

We can say that from the above distributions each of the continuous features vary a lot of spikes in each of the plots.so there is no uniformity in PDF of these continuous features.

B. Boxplot:

observations:

1.From the above box plot we can see some outliers in cont7,cont9,cont10.

2.we can observe that the cont13,cont14 features have high variance.

3.We can see most of the mean values for continuous features are lie around 0.5.All values lie in between 0to1.

5.2 Applying polynomial transformations on continuous features.

Observations:

when we apply the polynomial transformation to our continuous features our distributions are almost similar to our original distributions, so by adding polynomial features we cannot get any important features.
our plots lie in between 0 to1.cont11,cont12 distributions are same. so we can drop any of the feature.

5.3 Target Feature Analysis[loss]:

Observations:

By above plot we can say that the our target feature follows log-normal distribution to convert into normal distribution or Gaussian distribution by applying log to our loss feature. The loss feature has right skewed distribution.

5.3.1 Applying Log to our Loss feature:

Observations:

Here even by applying log to the loss feature it is very clear that it does not follow Gaussian distribution. Let’s see by applying shift to the loss feature.

5.3.2 Applying Log to our loss feature and adding some shift:

Observations:

By applying some shift our loss feature follows the Gaussian Distribution. when we apply models we will add some shift to our loss feature.

5.4 Doing bi-variate analysis on important features:

Feature Importance using Random Forest Regressor.

5.4.1 Categorical feature analysis on top important features by Count Plot:

Observations:

we can say that some levels are very high and some levels are very rare means have only one level.

Summary:

from the above analysis we observed that most of the categorical features had only maximum of 2 categories.
Analyzing the every continuous feature by distributions.
There is no significance of applying polynomial features since the polynomial features replicate the original data.
Loss feature is having log-normal distribution. So to convert log-normal distribution into normal distribution we can apply log to the Loss feature and a shift to have a perfectly gaussian distributed plot.
Applying RandomForestRegressor to find feature importance for bi-variate analysis.
By applying countplot we can see how many unique levels are there in a category.

6.Feature Engineering :

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

6.1 PCA Feature Engineering:

An important machine learning method for dimensionality reduction is called Principal Component Analysis.
It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.
Principle Component Analysis (PCA) is a common feature extraction method in data science.
PCA finds the eigenvectors of a covariance matrix with the highest eigenvalues and then uses those to project the data into a new subspace of equal or less dimensions

6.2 SVD feature Engineering:

SVD is a technique from linear algebra that can be used to automatically perform dimensionality reduction.
SVD is a data decomposition approach similar to principal component analysis (PCA).
It has many applications in signal processing and statistics, such as feature extraction of a signal, matrix approximation, and pattern recognition.
We can use SVD to calculate a projection of a dataset and select a number of dimensions or principal components of the projection to use as input to a model.
The scikit-learn library provides the TruncatedSVD class that can be fit on a dataset and used to transform a training dataset and any additional dataset in the future.

7.Modelling :

7.1 Linear Regression :

Applying Linear Regression on train and test data set

7.2 Ridge Regression:

Applying Ridge Regression with hyperparameter tuning on train and test data set

7.3 Lasso Regression :

Applying Lasso Regression with hyperparameter tuning on train and test data set

7.4 KNN regression :

Applying KNN Regression with hyperparameter tuning on train and test data set.

7.5 Decision Tree Regressor :

Applying Decision Tree Regression with hyperparameter tuning on train and test data set.

7.6 Random forest Regressor :

Applying RandomForest Regression with hyperparameter tuning on train and test data set

7.7 XgBoost Regressor :

Applying XgBoost Regression with hyperparameter tuning on train and test data set.

7.8 AdaBoost Regressor :

Applying AdaBoost Regression on train and test data set.

7.9 SVR :

Applying SVR Regression with hyperparameter tuning on train and test data set.

7.10 Implementing custom Ensemble model:

1.Split your whole data into train and test(80–20)

2. Now in the 80% train set, split the train set into D1 and D2.(50–50).

2.1)Now from this D1 do sampling with replacement to create d1,d2,d3….dk (k samples).

2.2)Now create ‘k’ models and train each of these models with each of these k samples.

3) Now pass the D2 set to each of these k models, now you will get k predictions for D2, from each of these models.

4) Now using these k predictions create a new dataset, and for D2, you already know it’s corresponding target values, so now you train a meta model with these k predictions.

5) Now for model evaluation, you have can use the 20% data that you have kept as the test set. Pass that test set to each of the base models and you will get ‘k’ predictions. Now you create a new dataset with these k predictions and pass it to your metamodel and you will get the final prediction. Now using this final prediction as well as the targets for the test set, you can calculate the models performance score.

Here I trained with the Decision tree ,XGboost and Adaboost as base model. also trained with different base models with different sample ratio. I found that the XGBoost model as base model performance is good with base models as 50 with sample ratio as 0.9.

7.11 MLP:

8.Models Comparison

To decrease our MAE we will do some of the preprocessing on the Categorical features.

Handling mismatch level in categorical features:

It means Handling categorical features having diff unique values in train and test data.
For example consider cat1 feature has unique categories A,B in train set but in test set has A,C here in test set C is replaced by NAN.
So finally we have train set A,B and test set A,NAN This NAN will be replaced by “-1”.