
Big Data Medicare Fraud Detection - Finance - Project

Big Data can help detect Medicare fraud through machine learning models. US healthcare spending has risen to $3 trillion annually, Medicare accounts for up to $800 billion of it, and fraud is estimated to cost up to 10%. The document describes building a model that predicts fraud using physician prescription data, payment amounts, and a list of excluded fraudulent providers. Key steps include data selection from the CMS and payments datasets, cleaning, feature engineering (joining on identifiers and mapping drug-related fraud), and class-weight balancing. Of the five classifiers compared, the best was a random forest with a 72% AUC. Future work includes cross-validation, hyperparameter tuning, and a real-time fraud detection pipeline.

Uploaded by

santhosh appu

Big Data Medicare Fraud Detection
ATHARVA KOUSADIKAR
WHY?
● US healthcare spending has increased by 6.7%, making it $3 trillion.
● Medicare accounts for up to $800 bn.
● Fraud impact is estimated at up to 10%.
Workflow

01 Database Selection
● CMS Prescriber Data 2017
● Payment Data 2017
● Excluded (LEIE) dataset

02 Data Pre-processing
● Data Visualization / Exploratory Data Analysis
● Data cleaning
● Feature Engineering
● Class weights Balancing

03 Data Modelling
● Logistic Regression
● Gaussian Naïve Bayes
● Random Forest Classifier
● Extra Tree Classifier
● Gradient Boosting Classifier

04 End Result
● Conclusion
● Future scope
Problem Statement
Build an innovative machine learning model that predicts fraud in the Medicare industry using anomaly analysis and geo-demographic metrics.
Fraud Patterns
1. Fraud by service providers (doctors, hospitals, pharmacies)
2. Fraud by insurance subscribers (patients or patients' employers)
3. Fraud by insurance carriers
4. Conspiracy frauds (involving all parties)


Govt. Efforts
The government has launched programs such as the Medicare Fraud Strike Force to help combat fraud, but continued efforts are needed to better mitigate its effects.
Insights
Tools Used:
1. Tableau
2. Power BI
3. Spark using Azure HDInsight

[Charts: Population by State; NPI per State; Exclusion Count; Number of Frauds by State]
Dataset Selection

01 CMS – Prescriber Data 2017
● 25M+ rows and 21 columns
● All information related to prescriptions, drugs, payments and charges by National Provider Identifier (NPI)
● All information on the physician (NPI, Name, City, Practice, etc.)

02 Payments Received by Physicians 2017
● 11M+ rows and 75 columns
● Physicians in the US are required to declare all payments received from pharmaceutical companies
● The sum of general payments
● Name of the drug associated with the payments

03 List of Excluded Individuals and Entities (LEIE) database 2017
● List of individuals and entities that are excluded from participating in federally funded healthcare programs (e.g. Medicare) due to previous healthcare fraud
● Mapped fraud labels
Data Pre-Processing
Data cleaning

● Imputing missing data
● Removing duplicates
● Removing outliers
● Factorizing the categorical data
● Removing data based on general information
● Data sampling: the data set is highly imbalanced for fraud detection (about 99% non-fraudulent cases and less than 1% fraudulent cases)
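The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration on a toy table, not the deck's actual code: the column names (`npi`, `specialty`, `total_drug_cost`), the median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for prescriber rows; column names are hypothetical.
raw = pd.DataFrame({
    "npi": [1, 1, 2, 3, 4, 5],
    "specialty": ["Cardiology", "Cardiology", "Oncology", None, "Oncology", "Cardiology"],
    "total_drug_cost": [120.0, 120.0, np.nan, 95.0, 110.0, 1_000_000.0],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Impute missing values: median for numeric, a placeholder for categorical.
clean["total_drug_cost"] = clean["total_drug_cost"].fillna(clean["total_drug_cost"].median())
clean["specialty"] = clean["specialty"].fillna("Unknown")

# 3. Remove outliers with the 1.5 * IQR rule (one of several possible rules).
q1, q3 = clean["total_drug_cost"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["total_drug_cost"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].copy()

# 4. Factorize the categorical column into integer codes.
clean["specialty_code"], specialty_labels = pd.factorize(clean["specialty"])

print(clean)
```

In a fraud setting, extreme values can themselves be a fraud signal, so any outlier rule would need to be checked against the fraud labels before rows are dropped.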
Feature Engineering

Joining datasets based on NPI, state, city, first and last n


Drug-based fraudulent cases
Merging drug fraudulent cases with prescriber data to create more features
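As a rough sketch of the joining and labelling idea, the snippet below merges toy prescriber and payment tables on NPI and marks a provider as fraudulent when its NPI appears in an exclusion list. All table contents and column names (`npi`, `payment_usd`, `is_fraud`) are hypothetical; the real CMS, Open Payments, and LEIE files use their own schemas, and the deck also joins on state, city, and name fields that are omitted here.

```python
import pandas as pd

# Hypothetical toy tables; real column names in the source files may differ.
prescribers = pd.DataFrame({
    "npi": [1001, 1002, 1003],
    "state": ["CA", "NY", "TX"],
    "total_claims": [250, 40, 900],
})
payments = pd.DataFrame({
    "npi": [1001, 1001, 1003],
    "payment_usd": [500.0, 150.0, 75.0],
})
leie = pd.DataFrame({"npi": [1003]})  # excluded providers

# Aggregate payments to one row per NPI (sum of general payments).
pay_sum = payments.groupby("npi", as_index=False)["payment_usd"].sum()

# Left join so that prescribers without any recorded payment are kept.
features = prescribers.merge(pay_sum, on="npi", how="left")
features["payment_usd"] = features["payment_usd"].fillna(0.0)

# Map the fraud label: 1 if the NPI appears in the exclusion list, else 0.
features["is_fraud"] = features["npi"].isin(leie["npi"]).astype(int)

print(features)
```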
Transforming Data and Class Balancing
● Transform skewed data to approximately conform to normality by using a log transformation
● Assign class weights according to the balancing ratio to offset the class imbalance
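Both ideas can be illustrated with plain Python. The sketch below assumes `log1p` (log of 1 + x, which tolerates zeros) as the log transform and the common "balanced" heuristic, weight = n_samples / (n_classes × class count), as the balancing ratio. The deck does not state which exact formulas were used, so both are assumptions, and the numbers are invented.

```python
import math
from collections import Counter

# Log transform: compresses a long right tail; log1p keeps zero at zero.
payments = [0.0, 10.0, 100.0, 1_000.0, 100_000.0]
log_payments = [math.log1p(x) for x in payments]

# Class weights from the balancing ratio on a 99:1 imbalanced label vector.
labels = [0] * 99 + [1]
counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)
class_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(log_payments)
print(class_weights)  # the rare class gets a much larger weight
```

With 99 negatives and 1 positive, the positive class receives a weight of 50 and the negative class a weight of about 0.5, so each class contributes equally to the training loss in total.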
Data Modelling
1. Train-Test-Split
2. Scaling data using Standard Scaler
3. Models Implemented:
• Logistic Regression
• Gaussian Naïve Bayes
• Random Forest Classifier
• ExtraTrees Classifier
• Gradient Boosting Classifier
4. Model Evaluation
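The modelling flow (split, scale, fit, evaluate by AUC) might look like the following scikit-learn sketch. It runs on synthetic imbalanced data from `make_classification`, not on the Medicare data, and the hyperparameters are illustrative assumptions, so the printed AUC says nothing about the 72% reported in the deck.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 5% positives (assumption for the demo).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling is fitted on the training data only, inside the pipeline.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0),
)
model.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class.
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
print(f"ROC AUC: {auc:.3f}")
```

Swapping `RandomForestClassifier` for `LogisticRegression`, `GaussianNB`, `ExtraTreesClassifier`, or `GradientBoostingClassifier` in the same pipeline would give a like-for-like comparison of the five models.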
Conclusion

● With the growing population of people over 65 in the USA, Medicare fraud detection is essential
● All types of fraud patterns have been covered
● Most fraud cases were committed in the Bay Area
● Of the 5 models evaluated, the best was Random Forest, with an AUC of 72%
Future Scope

• Use cross-validation when sampling the data into train and test splits.
• Tune hyperparameters to increase the overall performance of the algorithm.
• Build a real-time fraud detection pipeline using MLflow and Kafka.
• Retrain the model without stopping the prediction service, since users will keep interacting with it.
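The first two items, cross-validation and hyperparameter tuning, can be combined in one step. The sketch below uses scikit-learn's `GridSearchCV` with stratified 5-fold cross-validation on synthetic data; the parameter grid and data are assumptions for illustration only and were not part of the original project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the real features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Stratified folds preserve the fraud / non-fraud ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated AUC: {search.best_score_:.3f}")
```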
[Screenshot: Kafka and ZooKeeper servers initialized using Docker]
[Screenshot: Random Forest model hosted using MLflow]


References

● Part D Prescriber Data CY 2017. (n.d.). Retrieved June 23, 2020, from https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/PartD2017
● LEIE Downloadable Databases: Office of Inspector General: U.S. Department of Health and Human Services. (2020, June 10). Retrieved June 23, 2020, from https://oig.hhs.gov/exclusions/exclusions_list.asp
● Dataset Downloads. (n.d.). Retrieved June 23, 2020, from https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads
Thank you
