Introduction

In a world full of uncertainties, there are only a few ways we as individuals can make sure our assets are protected from natural and accidental disasters. One such measure is buying insurance.

Today we have insurance for everything from property, automobiles and health to liability. We pay handsome premiums to keep coverage as high as possible so that we are compensated for a loss, if and when it occurs.

Providers, payers and customers are all parties to an agreement shaped by external factors, and those factors can be manipulated by bad actors, creating liabilities for everyone involved. Such manipulation is commonly termed fraud.

In this case study we look at a dataset to understand the correlation between premium amounts and claim amounts, the potential to anticipate fraud, and an automated way to keep the data up to date and clean.

Let’s dive in.

Initial Preparations

To begin understanding the problems in the insurance domain, we went looking for a dataset. As with any protected information, production data is not easy to obtain, especially for an exploratory analysis like the one in this case study.

We focused on getting a relevant dataset from a reliable source like Kaggle.com. The entire analysis is driven by the findings from this dataset.

Let’s outline a typical approach.

Our Approach

For a typical data analysis project, there are multiple ways to approach the data collection and initial processing phase of the project.

With tools like SQL and Python, it is easy to set up the data extraction process directly from the source either by scraping the website or through an API.

In our case, it was easier to download the dataset locally from the website and reference it in the project, since we did not plan to set up an extraction pipeline against the source, just a one-off data download for analysis purposes.
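As a quick sanity check after the download, the raw files can be previewed directly with pandas before any modeling (a minimal sketch; the ../data path simply mirrors where we keep the downloaded CSVs):

import pandas as pd

# Peek at one of the downloaded files to confirm it reads cleanly
df = pd.read_csv("../data/insurance_data.csv")
print(df.shape)   # number of rows and columns
print(df.head())  # first few records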

Objective

The objective of the analysis is to understand any patterns of insurance fraud emerging from the dataset.

Tools & Techniques

The choice of tooling depends on the quantity of data and the type of analysis to be done on it. The objective of the analysis needs to be scoped out so that resources are used efficiently and the tools are selected appropriately.

To align with the objective, the dataset needs to be cleaned and modeled to effectively understand the data and prepare for analysis.

The tools we went ahead with are open source and fit the purpose. We wanted to try new analytical tooling and chose DuckDB for quick storage of the data and DBT for modeling.

As DBT works well with many open source orchestration platforms like Airflow, it seemed the perfect choice in case we need to do recurring data ingestion.
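DBT talks to DuckDB through the dbt-duckdb adapter. A minimal profiles.yml for this kind of setup could look roughly like the following (the profile name and thread count are illustrative, not prescribed by the project):

insurance_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb          # dbt-duckdb adapter
      path: insurance.db    # DuckDB database file used throughout the project
      threads: 4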

For visualization we wanted a platform that would be easily extensible and accessible, and we went with Streamlit.

Data Ingestion

The dataset for the project was downloaded from Kaggle, but it still needs to be ingested into the database of our choice.

In this case, we need to write an ingestion script to load the dataset into DuckDB. We create a script file called ingestion.py that loads the data from the CSV files into DuckDB.

import pandas as pd
import duckdb
import os

# File paths
DATA_DIR = '../data'
DUCKDB_FILE = 'insurance.db'


def load_csv_to_duckdb(table_name, csv_file):
    """
    Load a CSV file into DuckDB.
    Args:
        table_name (str): Name of the table in DuckDB.
        csv_file (str): Path to the CSV file.
    """
    try:
        print(f"Loading {csv_file} into DuckDB table {table_name}...")

        # Read CSV into a Pandas DataFrame
        df = pd.read_csv(csv_file)

        # Connect to DuckDB
        con = duckdb.connect(DUCKDB_FILE)

        # Write DataFrame to DuckDB table
        con.execute(f"CREATE TABLE IF NOT EXISTS {table_name} AS SELECT * FROM df")
        print(f"Data loaded successfully into {table_name}!")

        # Optional: Display first few rows
        print(con.execute(f"SELECT * FROM {table_name} LIMIT 5").fetchdf())

        # Close the connection so the database file is released
        con.close()
    except Exception as e:
        print(f"Error loading {csv_file} to DuckDB: {e}")


def main():
    # File-to-table mapping
    files = {
        "employees": os.path.join(DATA_DIR, "employee_data.csv"),
        "insurance": os.path.join(DATA_DIR, "insurance_data.csv"),
        "vendors": os.path.join(DATA_DIR, "vendor_data.csv")
    }

    # Load each CSV into DuckDB
    for table_name, file_path in files.items():
        if os.path.exists(file_path):
            load_csv_to_duckdb(table_name, file_path)
        else:
            print(f"File {file_path} not found!")


if __name__ == "__main__":
    main()

The dataset comes in three different CSV files -

  • employee_data.csv
  • insurance_data.csv
  • vendor_data.csv

Each of these data files is stored as a table in DuckDB so we can easily query the data for initial exploration and understanding. In DuckDB, the database file can be named filename.duckdb or filename.db; in our case we stored it as insurance.db.
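Once the script has run, a quick way to confirm that the tables landed in insurance.db is to reconnect and list them, for example from a Python shell:

import duckdb

con = duckdb.connect("insurance.db")

# List the tables created by the ingestion script
print(con.execute("SHOW TABLES").fetchdf())

# Spot-check the row count of one of them
print(con.execute("SELECT COUNT(*) AS row_count FROM insurance").fetchdf())

con.close()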

Understanding the Data

The dataset is divided into three files representing the employee, vendor and insurance data. The employee data file covers all the agents who have worked on providing insurance policies and processing claims. The vendor data file covers all the vendors licensed to provide insurance policies to customers. Finally, the insurance data file holds all the insurance records processed by the agents and vendors. It also contains all the claim details, which help us understand claims processing and spot any anomalies.

The best way to make sense of the data is to load it into a database. Each of the files can be saved as a separate table, while more tables are created to aggregate information as needed for analysis and/or visualization. We are using DuckDB with DBT to create our staging tables.

We are relying on DBT to model the data. Our approach is to create a staging area for loading the initial files into respective tables and then modeling the data warehouse from the staging tables.

  • DBT SQL for creating employee table - stg_employee_data
with employee_data as (
    select * from employees
)
select
    AGENT_ID,
    AGENT_NAME,
    DATE_OF_JOINING,
    ADDRESS_LINE1 as STREET_NAME,
    ADDRESS_LINE2 as APARTMENT_NAME,
    CITY,
    STATE,
    POSTAL_CODE,
    EMP_ROUTING_NUMBER as EMPLOYEE_ROUTING_NUMBER,
    EMP_ACCT_NUMBER as EMPLOYEE_ACCOUNT_NUMBER
from employee_data
  • DBT SQL for creating vendor table - stg_vendor_data
with vendor_data as (
    select * from vendors
)
select
    VENDOR_ID,
    VENDOR_NAME,
    ADDRESS_LINE1 as STREET_NAME,
    ADDRESS_LINE2 as APARTMENT_NAME,
    CITY,
    STATE,
    POSTAL_CODE
from vendor_data
  • DBT SQL for creating insurance table - stg_insurance_data
with insurance_data as (
    select * from insurance
)
select
    TXN_DATE_TIME,
    TRANSACTION_ID,
    CUSTOMER_ID,
    POLICY_NUMBER,
    POLICY_EFF_DT as POLICY_EFFECTIVE_FROM,
    LOSS_DT as LOSS_DATE,
    REPORT_DT as REPORTING_DATE,
    INSURANCE_TYPE,
    PREMIUM_AMOUNT,
    CLAIM_AMOUNT,
    CUSTOMER_NAME,
    ADDRESS_LINE1 as STREET_NAME,
    ADDRESS_LINE2 as APARTMENT_NAME,
    CITY,
    STATE,
    POSTAL_CODE,
    SSN,
    MARITAL_STATUS,
    AGE,
    TENURE,
    EMPLOYMENT_STATUS,
    NO_OF_FAMILY_MEMBERS,
    RISK_SEGMENTATION,
    HOUSE_TYPE,
    SOCIAL_CLASS,
    ROUTING_NUMBER,
    ACCT_NUMBER,
    CUSTOMER_EDUCATION_LEVEL,
    CLAIM_STATUS,
    INCIDENT_SEVERITY,
    AUTHORITY_CONTACTED,
    ANY_INJURY,
    POLICE_REPORT_AVAILABLE,
    INCIDENT_STATE,
    INCIDENT_CITY,
    INCIDENT_HOUR_OF_THE_DAY,
    AGENT_ID,
    VENDOR_ID
from insurance_data
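The staging models above reference the raw tables (employees, insurance, vendors) directly by name. An optional refinement, sketched below under the assumption that the raw tables sit in DuckDB's default main schema, is to declare them as DBT sources so the models can select from {{ source('main', 'employees') }} and DBT can track the lineage:

version: 2

sources:
  - name: main
    schema: main
    tables:
      - name: employees
      - name: insurance
      - name: vendors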

Data Preparation

Once the data is loaded into the staging tables, it is advisable to perform initial quality checks and remove any data that might skew the analysis or modeling.

With DBT, we can define these initial checks (and more) in a YAML file and make use of its built-in tests against specific columns of the tables. Checks can also be created to define relationships between columns of two or more tables.

Here we have kept it simple and check for null values in the columns deemed required for our analysis.

version: 2

models:
  - name: stg_employee_data
    columns:
      - name: AGENT_ID
        tests:
          - not_null
  - name: stg_insurance_data
    columns:
      - name: POLICY_NUMBER
        tests:
          - not_null
  - name: stg_vendor_data
    columns:
      - name: VENDOR_ID
        tests:
          - not_null

Data Modeling & Transformation

The staging tables allow us to work with the data using simple SQL queries. In DuckDB it is easy to switch between a SQL table and a data frame, alternating between Python and SQL for querying the data.
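For example, a query result can be pulled into pandas with fetchdf(), reshaped in Python, and then queried again with SQL, since DuckDB can scan local data frames by name. A small illustration, assuming the insurance.db file and the staging model from earlier (with its upper-case column headers):

import duckdb

con = duckdb.connect("insurance.db")

# Pull a SQL result into a pandas DataFrame
claims_df = con.execute("SELECT * FROM stg_insurance_data").fetchdf()

# Work on the DataFrame in Python
claims_df["claim_to_premium"] = claims_df["CLAIM_AMOUNT"] / claims_df["PREMIUM_AMOUNT"]

# ...and query the DataFrame straight back with SQL
print(con.execute("SELECT AVG(claim_to_premium) AS avg_ratio FROM claims_df").fetchdf())

con.close()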

Here we are using DBT to create a data mart of sorts, storing information about the different dimensions of the data. We also create some aggregate tables to answer business queries and provide clarity on questions the business may have about the data.

Here is one such table, where we count incidents by state and severity -

{{
  config(
    materialized='table'
  )
}}

with fct_incident_state_severity_count as (
     select
     incident_state,
     incident_severity,
     COUNT(*) as incident_count
     from dim_claim_details
     group by incident_state, incident_severity
     order by incident_state, incident_severity
)

select *
from fct_incident_state_severity_count
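Another aggregate referenced later in the validation file is a claims summary by insurance type. A sketch of that model, assuming dim_claim_details carries insurance_type and claim_amount columns, could follow the same pattern:

{{
  config(
    materialized='table'
  )
}}

with fct_claims_summary_by_type as (
     select
     insurance_type,
     count(*) as claim_count,
     sum(claim_amount) as total_claim_amount
     from dim_claim_details
     group by insurance_type
)

select *
from fct_claims_summary_by_type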

Data Validation

The data mart, with its multitude of information derived from the staging tables, needs to be validated. Just as we used a YAML file to validate the staging tables, we use a similar YAML file with validation checks for the data mart.

version: 2

models:
  - name: dim_employee
    description: "Dimension Table for Employees/Agents"
    columns:
      - name: agent_id
        description: "Column to identify the employee"
        tests:
          - not_null
          - unique
      - name: employee_id
        description: "Unique ID for each employee"
        tests:
          - not_null
          - unique
  - name: dim_customer_details
    description: "Dimension Table for Customers"
    columns:
      - name: customer_details_id
        description: "Unique ID for the customer details table"
        tests:
          - not_null
          - unique
      - name: customer_id
        description: "Unique ID for the customer"
        tests:
          - not_null
          - unique
  - name: dim_vendor
    description: "Dimension Table for Vendors or Insurance Providers"
    columns:
      - name: vendor_id
        description: "Unique ID for the vendor"
        tests:
          - not_null
          - unique
      - name: id
        description: "Unique ID for the vendor table"
        tests:
          - not_null
          - unique
  - name: dim_claim_details
    description: "Dimension Table for Claims"
    columns:
      - name: claim_details_id
        description: "Unique ID for the claim details table"
        tests:
          - not_null
          - unique
      - name: agent_id
        description: "Agent ID connected to the employee table"
        tests:
          - relationships:
              to: ref('dim_employee')
              field: agent_id
      - name: vendor_id
        description: "Vendor ID connected to the vendor table"
        tests:
          - relationships:
              to: ref('dim_vendor')
              field: vendor_id
  - name: fct_claims_mobile
    description: "Fact Table to get the count of all claims for mobile insurance only"
    columns:
      - name: insurance_type
        description: "Mobile insurance type"
        tests:
          - accepted_values:
              values: ['Mobile']
  - name: fct_claims_summary_by_type
    description: "Fact Table to get the count of all claims by type"
    columns:
      - name: insurance_type
        description: "All types of insurance"
        tests:
          - accepted_values:
              values: ['Health', 'Life', 'Mobile', 'Motor', 'Property', 'Travel']
  - name: fct_incident_state_severity_count
    description: "Fact table to get the count of incidents per state and severity"
    columns:
      - name: incident_state
        description: "State of the incident"
        tests:
          - not_null
      - name: incident_severity
        description: "Severity of the incident"
        tests:
          - not_null

In DBT, besides attaching tests to specific models and columns as we have done above, you can also define reusable generic tests in the tests folder that can be applied across multiple tables.
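As a rough sketch of such a reusable test (the test name and logic here are our own, not part of the project), a generic test that fails whenever a column holds non-positive values could be written as:

-- tests/generic/positive_value.sql
{% test positive_value(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} <= 0

{% endtest %}

It could then be attached to a column such as CLAIM_AMOUNT in the YAML file by listing - positive_value under its tests, next to - not_null.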

Data Visualization & Analysis

Before we get deep into the analysis, it helps to visualize the data to spot any anomalies or outliers.

While DuckDB allows for easy analysis and visualization in a Jupyter notebook, we decided to try a more user-friendly way to combine analysis and visualization using a Python library called Streamlit.

Streamlit allowed us to visualize our models and data by way of simple SQL queries and provided an easy way to play around with the data by creating multiple analytical queries.
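A stripped-down version of the kind of app we used looks roughly like this (a sketch, assuming the insurance.db file and the fct_incident_state_severity_count model built earlier):

import duckdb
import streamlit as st

st.title("Insurance Claims Explorer")

# Read-only connection so the app does not lock the database file
con = duckdb.connect("insurance.db", read_only=True)

# Let the user run any SQL against the DuckDB models
query = st.text_area("SQL query", "select * from fct_incident_state_severity_count")
df = con.execute(query).fetchdf()

# Normalize column names so the chart options are predictable
df.columns = [c.lower() for c in df.columns]

# Show the raw result and a simple chart of incident counts by state
st.dataframe(df)
st.bar_chart(df, x="incident_state", y="incident_count")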

Here is a snapshot of querying a DuckDB table with Streamlit -

[Screenshot: a SQL query against a DuckDB table returning rows in Streamlit]

The same query then provides an option to create a visualization like the one below, which can be adjusted with different combinations of the table's columns -

[Screenshot: a Streamlit chart built from the query result, with selectable columns]

Data Analysis & Visualization

Now that we have everything set up, we can answer some questions that might have been troubling our business heads.

We looked into some common outliers and anomalies we could find just by querying the DBT models we created earlier and doing some visualizations with the Streamlit library.

Now let's have a look at some visualizations from the analysis that answer some business questions.

Let's have a look at the claims where the reported incident has no injuries listed.

[Chart: claim amounts for claims where no injury was reported]

In the above plot, we can see that the any injury column has a binary value of 0 or 1, representing whether the claim was reported with an injury or not. If we dig deeper into the claim amounts, we should be able to identify some high-value claims where there are no injuries. These are the claims we would want to investigate further to find out what insurance types they are reported against in the dataset.
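One way to pull those claims out of the staging model is a query along these lines (the 10,000 cut-off is an arbitrary threshold for illustration):

select
    insurance_type,
    count(*) as claim_count,
    avg(claim_amount) as avg_claim_amount
from stg_insurance_data
where any_injury = 0
  and claim_amount > 10000
group by insurance_type
order by avg_claim_amount desc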

Let's have a look at the distribution of claims reported within 200 days of the policy activation date.

[Chart: distribution of claims reported within 200 days of the policy effective date]

In the above plot, the distribution of claims reported within 200 days of the policy taking effect tells us that there are agents signing off on a high number of claims within that window. These can look like suspicious transactions and can be flagged for further review.
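The underlying check is a date difference between the policy effective date and the reporting date, grouped by agent. Assuming those columns are typed as dates (or cast accordingly), it can be expressed in DuckDB as:

select
    agent_id,
    count(*) as claims_within_200_days
from stg_insurance_data
where date_diff('day', policy_effective_from, reporting_date) <= 200
group by agent_id
order by claims_within_200_days desc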

Let's have a look at the agents with the most claims processed.

[Bar chart: agents with the highest number of claims processed in a 30-day window]

The above bar chart shows the agents who have processed the highest number of claims in 30 days. The business will definitely need to look into agents processing an unusual number of claims in a short window of time and cross-check them against policies started within the last year to understand the reason for the increased claim processing.

Let's have a look at the unusually high claim amounts against low premium amounts that were placed in the high risk segment.

[Chart: high claim amounts against low premium amounts in the high risk segment]

Usually, a high claim amount against a low premium amount would not be flagged on its own, but in this case the risk segmentation was marked as high, so the premium amounts should have been high as well.
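A query in this spirit, with the high-risk label and the 5x claim-to-premium ratio both being assumptions made purely for illustration, might look like:

select
    policy_number,
    premium_amount,
    claim_amount,
    risk_segmentation
from stg_insurance_data
where upper(risk_segmentation) in ('H', 'HIGH')
  and claim_amount > 5 * premium_amount
order by claim_amount desc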

In this case, the business will need to revisit the underwriting on these policies so that the pricing clearly reflects the risk and similar situations are avoided in the future.

Next Steps

The above analysis covers only a few of the anomalies that can be detected in the data. The data modeling here will help businesses create visualizations of their own and request more such views and tables through custom queries to support their data analysis and, in turn, decision making.

Our next steps include improving the modeling so that business questions can be answered more easily. As this is a dataset aimed at detecting insurance fraud, we can also run predictive analysis, training a model to predict fraud on any future data.

For future datasets, we should be able to create a schedule with Airflow so that ingestion and transformation can be carried out efficiently on a regular basis.
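A scheduled run could be as simple as an Airflow DAG that re-runs the ingestion script and then the DBT build once a day. A minimal sketch (the paths and cadence are placeholders, and the DAG uses the Airflow 2.x style schedule argument):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="insurance_analytics_refresh",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # placeholder cadence (Airflow 2.4+)
    catchup=False,
) as dag:
    # Reload the CSV files into DuckDB
    ingest = BashOperator(
        task_id="ingest_csv_to_duckdb",
        bash_command="python /opt/project/ingestion.py",  # placeholder path
    )

    # Rebuild and test the DBT models
    transform = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/project && dbt build",  # placeholder path
    )

    ingest >> transform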