What Do Data Scientists Do in Commercial Insurance?
March 18th, 2021
I recently celebrated my 7-year work anniversary, which made me want to write a post about what it’s like to be a data scientist in insurance.
The insurance industry typically has a hard time attracting young talent. I’m sure most college graduates interested in data science would be more eager to join a FAANG company or a cool startup. As a result, companies, including mine, run special rotation programs for new graduates where they learn more about insurance and get to experience working in different capacities, such as underwriting, claims, analytics, etc.
There are two main areas in insurance — Life and Property & Casualty (P&C). The P&C market is further split into personal and commercial lines. Some people think that personal lines are slightly ahead in innovation and lead the digital transformation. You have probably seen personal auto commercials on TV that promise you discounts for safe driving and provide you with a device that tracks your driving patterns. Others let you snap a photo of an accident to quickly estimate how much money you can expect to get back.
There is an enormous effort spent on digital innovation among commercial carriers, with companies abandoning legacy systems, focusing on data collection, improving data access and storage, automating processes with the help of robotics, and embracing cloud architecture and advanced analytics.
I think one of the reasons personal lines had a bit of a head start is that they have more data than commercial carriers. As a data scientist you obviously want a lot of data, both because more data tends to yield more accurate results and because certain methodologies don’t do very well on small datasets. That said, in the vast majority of applications, you will have enough data to train Machine Learning algorithms. You may have to get creative with splitting data for validation purposes, using n-fold Cross Validation, for instance, as opposed to a simple Train/Test split.
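To make the cross-validation point concrete, here’s a minimal sketch of what k-fold splitting does, written out in plain Python (in practice you’d just use scikit-learn’s `KFold` or `cross_val_score`; the dataset size here is illustrative):

```python
# A minimal sketch of k-fold cross-validation index splitting.
# Every sample lands in exactly one test fold, so a small dataset
# is used to the fullest instead of sacrificing a fixed test chunk.

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 10 samples, 5 folds -> five 8/2 train/test splits
folds = list(kfold_indices(10, 5))
```

You’d train a model on each `train_idx`, score it on the matching `test_idx`, and average the five scores.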
We truly find ourselves spending 70%+ of our time on accessing data and trying to figure out how to construct a final dataset out of it because our data may be all over the place. This means you’ll need to be highly proficient with SQL.
You will also most likely need to research and figure out how to use external data sources. Some external data sources can be accessed via an API, so the ‘requests’ package in Python may become your best friend.
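As an illustration of that workflow (the vendor fields and payload below are entirely made up), a common pattern is to pull a JSON payload from an external API and flatten the nested parts into rows you can join to internal data:

```python
import json

# Hypothetical example: flattening a nested JSON response from an
# external data vendor into flat rows. In practice you'd fetch it with
# something like:
#   resp = requests.get(url, params=..., headers=...)
#   payload = resp.json()
# A canned payload stands in here so the sketch is self-contained.

payload = json.loads("""
{
  "business_id": "B-1001",
  "locations": [
    {"zip": "60601", "construction": "masonry", "sprinklered": true},
    {"zip": "60611", "construction": "frame", "sprinklered": false}
  ]
}
""")

# One row per location, carrying the business id down to each row
rows = [
    {"business_id": payload["business_id"], **loc}
    for loc in payload["locations"]
]
```

From here the rows can be loaded into a table and joined back to policy-level data with SQL.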
Sometimes, you even have to get creative and extract data from images and documents to construct a dataset which involves the next level of data engineering skills, far more complex than basic SQL, i.e. using Machine Learning and Deep Learning to construct a data set to do Machine Learning and Deep Learning.
You would have been just fine 7 years ago knowing only classical statistical techniques like GLM and basic ML methodologies; however, these days you’ll find yourself solving problems with more advanced ML, DL models and NLP. We use a lot more than GLMs even though they certainly have their place especially in pricing.
Why are GLMs so popular? Insurance is a regulated industry which means that very often we need to file our rates and explain how we arrived at them; because of that, you may not be able to use an uninterpretable Machine Learning algorithm. Yes, there are various cool Machine Learning Interpretability methods that you could use like Shapley values to explain how results were generated. Unfortunately, the industry is not fully sold on it yet, so you’ll still see interpretable models used when results are heavily scrutinized.
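For intuition about what Shapley values are, here is a toy, exact computation in plain Python. Real tooling (e.g. the `shap` package) approximates this for real models; the additive “model” and per-feature contributions below are made up:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating all coalitions.

    Only feasible for a handful of features, but it shows exactly
    what SHAP-style approximations are estimating."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # marginal contribution of p to coalition S
                total += weight * (value_fn(set(S) | {p}) - value_fn(set(S)))
        phi[p] = total
    return phi

# Toy "model": predicted claim cost as an additive function of which
# features are known. For an additive game the Shapley values recover
# each feature's contribution exactly.
contrib = {"vehicle_age": 120.0, "territory": 80.0, "prior_claims": 200.0}
v = lambda S: sum(contrib[f] for f in S)

phi = shapley_values(list(contrib), v)
```

The attractive property for a regulated setting is that the attributions always sum to the model’s total prediction.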
You probably won’t see many GLMs with a Gaussian distribution, but you should most certainly get familiar with Poisson, Gamma and Tweedie.
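The reason these particular distributions come up is the frequency/severity structure of insurance losses: Poisson for claim counts, Gamma for claim sizes, and Tweedie for their compound, the pure premium. A back-of-the-envelope sketch of the arithmetic, with made-up numbers:

```python
# Back-of-the-envelope pure premium: the quantity a frequency/severity
# (or Tweedie) GLM is ultimately modelling. Numbers are illustrative.

exposures = 500.0            # e.g. car-years in a segment
claim_amounts = [1200.0, 800.0, 4500.0, 300.0, 2200.0]

frequency = len(claim_amounts) / exposures          # claims per exposure (Poisson-like)
severity = sum(claim_amounts) / len(claim_amounts)  # average claim size (Gamma-like)
pure_premium = frequency * severity                 # expected loss per exposure

# Sanity check: equivalently, total losses / exposures
assert abs(pure_premium - sum(claim_amounts) / exposures) < 1e-9
```

In a GLM you’d model frequency and severity (or pure premium directly, via Tweedie) as functions of rating variables rather than segment averages, but the decomposition is the same.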
As you’ll see below, there are data science applications other than pricing where you get to use machine learning, deep learning and text analytics.
When I first started, we used SAS for data mining and model training. While my team switched toward Open Source tools a while ago, there are some companies that still rely on SAS to some extent.
Knowledge of Python and R is an absolute requirement for a Data Scientist role in insurance. Expensive software doesn’t help the combined ratio (that’s how we measure profitability), and almost 100% of modelling tasks can be achieved with Open Source tools.
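For readers outside the industry, the combined ratio is simple arithmetic (the figures below are illustrative):

```python
# Combined ratio: (incurred losses + expenses) / earned premium.
# Below 100% means an underwriting profit. Numbers are made up.

earned_premium = 10_000_000.0
incurred_losses = 6_500_000.0   # claims paid plus reserves
expenses = 3_000_000.0          # commissions, salaries, software, ...

loss_ratio = incurred_losses / earned_premium
expense_ratio = expenses / earned_premium
combined_ratio = loss_ratio + expense_ratio   # 0.95 here -> profitable
```

Every dollar spent on software licenses lands in the expense ratio, which is one reason open-source tooling is an easy sell.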
In most cases, you don’t need to be an expert in Data Structures and Algorithms, and if you can solve programming questions on Leetcode, good for you... You probably won’t see any during a Data Science interview. That said, a lot of companies are implementing MLOps practices and bringing people with DevOps background to data science teams to improve and speed up the process of bringing a model into production.
You’ll need to have solid SQL skills to work for an insurance company. While many companies are exploring and slowly moving in the Big Data direction, you can often train a model locally. In some cases, especially in companies that use a lot of external sources, you may need to spin up a virtual machine or be comfortable with distributed computing to process large datasets.
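When a dataset doesn’t fit in memory but doesn’t warrant a cluster, the basic pattern is streaming or chunked processing. A minimal standard-library sketch, with a tiny in-memory CSV standing in for a big file (pandas’ `read_csv(..., chunksize=...)` follows the same idea):

```python
import csv
import io

# Streaming aggregation: process a file row by row so memory use stays
# flat no matter how large the file is. A small in-memory CSV stands in
# for a big file here; swap io.StringIO for open("big_file.csv").

data = io.StringIO(
    "policy_id,premium\n"
    "P1,500\n"
    "P2,750\n"
    "P3,1250\n"
)

total = 0.0
count = 0
for row in csv.DictReader(data):   # never holds the whole file in memory
    total += float(row["premium"])
    count += 1

avg_premium = total / count
```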
It doesn't hurt to get comfortable with non-tabular data formats, such as JSON. While we use NoSQL databases for some purposes, it’s not an absolute requirement at the moment, although I see them gaining popularity in upcoming years.
7 years ago you could have gotten a data science job in insurance with only SQL and R. These days you need to be proficient in Python and/or R, have some understanding of Big Data and have prior exposure to one of the cloud giants (AWS, Azure or GCP).
Examples of Data Science projects in insurance
Claims is where many insurance carriers start their analytics journey; it’s also where I started mine as a data scientist. Here are some examples of problems we try to solve:
- Can we identify fraudulent claims?
- Can we use drone imagery after major catastrophic events or vehicle collision images to determine how much damage there is?
- Can we estimate the severity of a claim using internal and external, structured and unstructured data to assign it to a more experienced claim adjuster?
- Can we estimate a claim amount for reserving purposes?
The last task is the most challenging and commonly solved with traditional actuarial techniques. However, there is some promising research conducted on how data science can help.
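One of the classic actuarial reserving techniques is the chain-ladder method, which develops a triangle of cumulative losses to ultimate. A minimal sketch (triangle values are made up):

```python
# Minimal chain-ladder sketch: cumulative paid losses by accident year
# (rows) and development age (columns). Numbers are made up.

triangle = [
    [1000.0, 1500.0, 1650.0],    # accident year 1, fully developed
    [1100.0, 1700.0],            # accident year 2
    [1200.0],                    # accident year 3
]

# Age-to-age development factors, volume-weighted across accident years
n_dev = len(triangle[0])
factors = []
for j in range(n_dev - 1):
    num = sum(row[j + 1] for row in triangle if len(row) > j + 1)
    den = sum(row[j] for row in triangle if len(row) > j + 1)
    factors.append(num / den)

# Project each open accident year to ultimate losses
ultimates = []
for row in triangle:
    u = row[-1]
    for j in range(len(row) - 1, n_dev - 1):
        u *= factors[j]
    ultimates.append(u)

# Reserve = projected ultimate minus what has been paid so far
reserves = [u - row[-1] for u, row in zip(ultimates, triangle)]
```

The data science research mentioned above typically tries to improve on this by predicting at the individual-claim level instead of aggregated triangles.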
Underwriting and Pricing
This is where you’ll see interpretable modelling techniques used the most. I mentioned rate filing above, but we also like having transparent models and no surprises in how our rates are generated.
Pricing is commonly handled by actuaries or data scientists with actuarial background. In fact, you’ll see quite a bit of data science talent coming from actuaries changing their career trajectory. This isn’t surprising as actuaries have knowledge of probability theory, statistical analysis and solid SQL skills.
It’s also not uncommon for data scientists to work closely with actuaries, and I think it is absolutely necessary. Firstly, there are certain adjustments that need to be made to a dataset for pricing models without which we’ll get nonsense results. Secondly, getting to the level of domain expertise that actuaries have would take a decent amount of time and effort, so we benefit from collaborating with them.
Getting a quote can be a cumbersome process which involves filling out a multi-page questionnaire and sending over supporting documentation. Insurance carriers use external data and machine learning to ask as few questions as possible and accurately pre-fill the answers to the others to provide you with a quote as quickly as possible.
Like most other non-insurance companies, we build models to study customer churn or customer retention, and identify segments with prospective customers that are more likely to become policyholders.
Other customer experience initiatives include chatbots. It’s a popular topic, but I have yet to see a chatbot that would make me forgo calling a company and requesting to speak with a representative.
Personal carriers already provide an Electronic Logging Device that analyzes your driving behaviour to offer safe driving discounts. Commercial carriers can do the same. Telematics is an exciting area from a data science perspective. First of all, you’ll typically see large data volumes, so you’ll need to know not only how to analyze the data but also how to build pipelines for real-time consumption. Secondly, it’ll most likely arrive in a non-tabular format that you’ll have to figure out how to use. Thirdly, you can analyze this data in a variety of ways to get a better sense of driving behaviour.
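As a toy example of the kind of feature engineering telematics enables (the readings and threshold below are made up), you might flag harsh-braking events from a stream of speed samples:

```python
# Toy telematics feature: flag harsh-braking events from a stream of
# speed readings (km/h, sampled once per second). Real pipelines ingest
# this continuously; the threshold here is illustrative.

speeds = [60, 58, 40, 22, 20, 20, 35, 50]   # km/h, one-second intervals
HARSH_DECEL = 15                            # km/h lost in a single second

harsh_events = [
    i for i in range(1, len(speeds))
    if speeds[i - 1] - speeds[i] >= HARSH_DECEL
]
harsh_rate = len(harsh_events) / (len(speeds) - 1)
```

Aggregated per driver, features like `harsh_rate` can then feed a pricing or safe-driving-discount model.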
There is quite a bit of talk about how insurance will get affected by the rise of self-driving cars. I’m personally excited to see how autonomous vehicles will get insured in the future and how we will utilize large data volumes that they generate for underwriting and claims management.
I hope this post showed you what a Data Scientist can do in commercial P&C insurance. With companies going through digital transformation and investing in analytics, it’s a lot more fun to work as a data scientist today than it was 7 years ago. Aside from working on interesting projects, you’ll also learn a lot about insurance, which can come in handy in everyday life.