Fraud Detection - Using Graph Database




At the Gartner Data and Analytics Summit in February 2019, Gartner predicted that the application of Graph technology would grow at 100% annually through 2022. The details are available here.

In addition to social networks and recommendation systems, fraud detection is a field where the value of graph technology cannot be overstated. Today we'll learn how Graph technology is used, through a real-world business use case of fraud detection.

If you are new to Graph technology and haven't read my previous publication, Graph technology in plain English, I highly recommend reading it first. It will help you understand the fundamentals of Graph technology and serve as a catalyst for your learning process.


Fraud detection

According to research published by Crowe in July 2019, global fraud losses equate to a shocking US$5.127 trillion each year. Fraudsters commit monetary fraud almost every day.

What if such fraud could be detected at an early stage, through prior experience of common fraud patterns or known fraudsters? We'll look at how Graph can be used to identify one such link, and how tedious it can be to obtain this knowledge from a non-graph database.


Detect fraud for a finance company A2ZMoney:

A2ZMoney is a financial services company with a substantial web presence. Businesses and individuals can sign up by creating profiles and use its financial services. Once onboarded, customers can make transactions and send or receive money. The company stores customers' PII and non-PII data.

Fraud A2ZMoney wants to detect:

Tom is a new customer, and A2ZMoney wants to know how risky it would be to accept him and allow him to transact.
 
Let's use a graph to answer this. 
Given that A2ZMoney has access to all of its customers' PII and non-PII data, we can use attributes such as email addresses and residential addresses to build a graph that illustrates the relationships between customers. As we know by now, a graph is a combination of vertices and edges, where the edges are the relationships between the vertices. Below is a pictorial representation of A2ZMoney's graph:






Details of the above graph:

Please spend a minute observing the above graph, and you will see the following relationships. Below I explain 4 randomly picked relationships from this graph:


Relationships between customers:

The 4 relationships are shown below, with all the vertices colored orange and all the edges colored green.

1. Tom and Jane have the same email ID.

Tom --hasEmail--> xyz@yahoo.com <--hasEmail-- Jane


2. Jane and Sam use the same device.

Jane --useDevice--> IP 10.100.13.44 <--useDevice-- Sam


3. Sam and Jonny use the same credit card.

Sam --hasPaymentCard--> 3333-3333-3333-3333 <--hasPaymentCard-- Jonny


4. Richard and Sam live in the same house.

Richard --lives at--> 999 PQR Street, CA <--lives at-- Sam



Vertices:

All of A2ZMoney's customers (including Tom, Jane, Richard, Mike, Sam and Jonny) and their attributes, such as email address, phone number, physical address, bank account number, payment card, SSN and the device they used, are vertices of the graph.

Edges:

Every edge in this graph starts from a customer vertex and ends at an attribute vertex (such as email, phone or address). The edges show the association of each customer with their attributes.
I have assigned specific edge labels, such as hasEmail, useDevice and lives at, so that graph traversals can follow particular relationship types to get the desired insights.
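As a rough illustration of this vertex and edge structure, here is a minimal Python sketch that stores the four relationships listed above as labeled customer-to-attribute edges. It is not how a graph database stores data internally; the names EDGES and build_graph are my own for illustration, and only the four relationships described in the text are included (the full graph has more).

```python
from collections import defaultdict

# (customer, edge label, attribute) triples for the four relationships in the text.
# The real A2ZMoney graph has more vertices and edges than these.
EDGES = [
    ("Tom", "hasEmail", "xyz@yahoo.com"),
    ("Jane", "hasEmail", "xyz@yahoo.com"),
    ("Jane", "useDevice", "IP 10.100.13.44"),
    ("Sam", "useDevice", "IP 10.100.13.44"),
    ("Sam", "hasPaymentCard", "3333-3333-3333-3333"),
    ("Jonny", "hasPaymentCard", "3333-3333-3333-3333"),
    ("Richard", "lives at", "999 PQR Street, CA"),
    ("Sam", "lives at", "999 PQR Street, CA"),
]


def build_graph(edges):
    """Return adjacency maps for edges that run from customer to attribute."""
    out_edges = defaultdict(list)  # customer vertex -> [(label, attribute vertex)]
    in_edges = defaultdict(list)   # attribute vertex -> [(label, customer vertex)]
    for customer, label, attribute in edges:
        out_edges[customer].append((label, attribute))
        in_edges[attribute].append((label, customer))
    return out_edges, in_edges


out_edges, in_edges = build_graph(EDGES)

# Attribute vertices with more than one incoming edge are what link customers together.
shared = {
    attribute: sorted(customer for _, customer in users)
    for attribute, users in in_edges.items()
    if len(users) > 1
}
for attribute, customers in shared.items():
    print(attribute, "is shared by", customers)
```

Every attribute vertex with two or more incoming edges is a potential link between customers, which is exactly what the traversals later in this article follow.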

Please pay close attention to the customers Tom (with the NEW CUSTOMER sign in red) on the extreme left and Jonny (with the Fraud sign) on the extreme right of the graph, as they will be our focus for this fraud detection use case.

How does this graph answer "Would Tom be a risky customer?":

When the company looks for Tom's associations by linking him with its dataset of known individuals, the above graph clearly shows that Tom is associated with a fraudster named Jonny. This linkage is shown in the graph with yellow highlighted edges:

Tom -> Jane -> Sam -> Jonny 

This raises a red flag and hints at risk.

There are two ways in which Graph can be used to detect such risk:
 
1. Deterministic approach:

Such multi-level linkage (of Tom with Jonny) can be calculated efficiently using graph traversal.
We just need to write OLTP traversal logic with Tom as the starting point: traverse the graph to find Tom's connections to known fraudster vertices, measuring the number of hops and the weightage of each association. The weightage is high for attributes like SSN, DDA etc.

Based on the number of hops and on whether the association is weak or strong, the business can set the acceptable number of hops for approving or rejecting the customer's onboarding. Because the risk is determined exactly from these rules, the approach is called deterministic.
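The hop counting and weightage described above can be sketched in plain Python as a breadth-first search. This is only an illustration under stated assumptions, not A2ZMoney's actual logic: the function name fraud_links, the EDGE_WEIGHTS values and the max_hops threshold are all hypothetical, and the data is limited to the four relationships described earlier.

```python
from collections import deque

# Customer -> attribute edges from the four relationships described earlier.
EDGES = [
    ("Tom", "hasEmail", "xyz@yahoo.com"),
    ("Jane", "hasEmail", "xyz@yahoo.com"),
    ("Jane", "useDevice", "IP 10.100.13.44"),
    ("Sam", "useDevice", "IP 10.100.13.44"),
    ("Sam", "hasPaymentCard", "3333-3333-3333-3333"),
    ("Jonny", "hasPaymentCard", "3333-3333-3333-3333"),
    ("Richard", "lives at", "999 PQR Street, CA"),
    ("Sam", "lives at", "999 PQR Street, CA"),
]
KNOWN_FRAUDSTERS = {"Jonny"}

# Hypothetical weights: how strongly a shared attribute ties two customers together.
EDGE_WEIGHTS = {"hasEmail": 0.8, "useDevice": 0.5, "hasPaymentCard": 0.9, "lives at": 0.4}


def fraud_links(start, edges, fraudsters, max_hops):
    """Breadth-first search from `start`; one hop = two customers sharing an attribute.

    Returns {fraudster: {"hops", "path", "weights"}} for fraudsters within max_hops.
    """
    by_customer, by_attribute = {}, {}
    for customer, label, attribute in edges:
        by_customer.setdefault(customer, []).append((label, attribute))
        by_attribute.setdefault(attribute, []).append(customer)

    found = {}
    visited = {start}
    queue = deque([(start, [start], [])])
    while queue:
        customer, path, weights = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # do not expand beyond the acceptable number of hops
        for label, attribute in by_customer.get(customer, []):
            for other in by_attribute[attribute]:
                if other in visited:
                    continue
                visited.add(other)
                new_path = path + [other]
                new_weights = weights + [EDGE_WEIGHTS[label]]
                if other in fraudsters:
                    found[other] = {
                        "hops": len(new_path) - 1,
                        "path": new_path,
                        "weights": new_weights,
                    }
                queue.append((other, new_path, new_weights))
    return found


links = fraud_links("Tom", EDGES, KNOWN_FRAUDSTERS, max_hops=3)
for name, link in links.items():
    print(name, link["hops"], "hops via", " -> ".join(link["path"]), link["weights"])
```

Because the search is breadth-first, it reports the path with the fewest hops, not necessarily the one with the strongest associations. A business rule could then combine the hop count with the weakest weight on the path to approve or reject the onboarding.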

A Gremlin (one of the graph query languages) traversal on A2ZMoney's graph that finds up to 2 fraudsters, such as Jonny, within 3 hops of Tom is as follows:

g.V().hasLabel('Customer').has('name','Tom').out().in().out().in().out().in().has('status','Fraud').limit(2)

Isn't that one-liner easy compared to SQL with complex joins, or any custom application code you would have to write? That simplicity is what a graph query language gives us.

2. Predictive approach:

This approach uses Graph along with a predictive Machine Learning model. Collusion is a common fraud pattern in which multiple customers use, say, the same account number for transactions. Such data, when viewed in a graph, appears as a ring.

This approach involves traversing the graph for past known fraud ring patterns and training a data science (Machine Learning) model with such ring patterns as features. Once trained, the model is equipped to generate a high score for a similar ring pattern and help prevent similar fraud from recurring. In this case, Tom's associations would also go through the ML scoring model to predict the risk: the higher the score, the higher the risk.
 
The importance of Graph is that such rings can be detected through traversals or through prebuilt algorithms offered by Graph systems. We can design Gremlin traversals to achieve this, and there are also graph algorithms that can achieve it in an OLAP graph system.
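To make the ring idea concrete, here is a small Python sketch that turns shared attributes into feature rows that a model could be trained on. It makes a simplifying assumption: a ring is treated as any attribute vertex shared by at least min_size customers. The sample data, the hasBankAccount label, the ring_features function and the chosen features are all hypothetical, and the ML model itself is not shown.

```python
from collections import defaultdict

# Hypothetical sample data (not taken from the A2ZMoney graph):
# (customer, edge label, attribute) triples.
EDGES = [
    ("C1", "hasBankAccount", "ACC-001"),
    ("C2", "hasBankAccount", "ACC-001"),
    ("C3", "hasBankAccount", "ACC-001"),
    ("C4", "hasBankAccount", "ACC-002"),
    ("C1", "useDevice", "DEV-9"),
    ("C3", "useDevice", "DEV-9"),
]
KNOWN_FRAUDSTERS = {"C2"}


def ring_features(edges, fraudsters, min_size=3):
    """Group customers by shared attribute and describe each group as a feature row."""
    members = defaultdict(set)
    for customer, label, attribute in edges:
        members[(label, attribute)].add(customer)

    rows = []
    for (label, attribute), customers in sorted(members.items()):
        if len(customers) < min_size:
            continue  # too few customers around this attribute to call it a ring
        rows.append(
            {
                "label": label,
                "attribute": attribute,
                "ring_size": len(customers),
                "known_fraudsters": len(customers & fraudsters),
            }
        )
    return rows


features = ring_features(EDGES, KNOWN_FRAUDSTERS)
for row in features:
    print(row)
```

Rows like these, labeled with whether the ring turned out to be fraudulent, are the kind of input a scoring model could learn from. A real system would likely use richer features and a graph algorithm from the OLAP layer instead of this grouping.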

To get the best results out of the graph, it is recommended to enrich the graph data with as much data as is available from external third-party sources (such as Experian).


How is Graph DB more efficient than other databases?




This question is best answered by trying to achieve the same traversal in a non-graph database. In the relational world, the data model of A2ZMoney's customer data looks like this, with fact and dimension tables, where the ID of every attribute links to the respective dimension table of that attribute:


 
With this data model, if we try to find the connection of Tom with Jonny, who is 3 hops away, it is clear that it would need multiple self joins. Moreover, the joins would have to be on different attributes just to find out whether there is any connection with other customers at all. Dimensional Data Modeling helps a bit compared to Relational Data Modeling, but several self joins for each hop are still inevitable.
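To show what those self joins look like, here is a sketch that runs the 3-hop lookup in SQL through Python's built-in sqlite3 module. The schema is a deliberately simplified, hypothetical one (a single customer_attribute table plus a customer table), not the fact and dimension model pictured above; the table names, column names and the 'OK' status value are assumptions for illustration.

```python
import sqlite3

# Hypothetical flattened schema: one row per (customer, attribute type, attribute value).
ROWS = [
    ("Tom", "email", "xyz@yahoo.com"),
    ("Jane", "email", "xyz@yahoo.com"),
    ("Jane", "device", "IP 10.100.13.44"),
    ("Sam", "device", "IP 10.100.13.44"),
    ("Sam", "card", "3333-3333-3333-3333"),
    ("Jonny", "card", "3333-3333-3333-3333"),
    ("Richard", "address", "999 PQR Street, CA"),
    ("Sam", "address", "999 PQR Street, CA"),
]
CUSTOMERS = [("Tom", "OK"), ("Jane", "OK"), ("Sam", "OK"), ("Richard", "OK"), ("Jonny", "Fraud")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE customer_attribute (customer TEXT, attr_type TEXT, attr_value TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", CUSTOMERS)
conn.executemany("INSERT INTO customer_attribute VALUES (?, ?, ?)", ROWS)

# Each hop needs two references to the same table: one to find who shares the
# attribute, and one to fetch that person's other attributes for the next hop.
THREE_HOP_SQL = """
SELECT DISTINCT h3.customer
FROM customer_attribute AS a1
JOIN customer_attribute AS h1            -- hop 1: shares an attribute with the start
  ON h1.attr_type = a1.attr_type AND h1.attr_value = a1.attr_value
 AND h1.customer <> a1.customer
JOIN customer_attribute AS a2 ON a2.customer = h1.customer
JOIN customer_attribute AS h2            -- hop 2
  ON h2.attr_type = a2.attr_type AND h2.attr_value = a2.attr_value
 AND h2.customer <> a2.customer
JOIN customer_attribute AS a3 ON a3.customer = h2.customer
JOIN customer_attribute AS h3            -- hop 3
  ON h3.attr_type = a3.attr_type AND h3.attr_value = a3.attr_value
 AND h3.customer <> a3.customer
JOIN customer AS f ON f.name = h3.customer AND f.status = 'Fraud'
WHERE a1.customer = ?
"""

fraudsters = [row[0] for row in conn.execute(THREE_HOP_SQL, ("Tom",))]
print(fraudsters)
```

Even on this simplified schema, the same table appears six times for three hops, and the query has to be rewritten for every additional hop. With separate dimension tables per attribute, each of those joins would fan out further.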

Every join operation is expensive. This is one of several reasons (besides flexible schema evolution, rich query support etc.) that argue for a Graph system, which already stores this information as edges.

I have published techniques to optimize joins in Apache Spark, which you can find here. Still, no one can deny that joins are expensive, and Graph systems save us from them.

Besides relationships being first-class citizens, the ease of designing graph traversals and the several graph algorithms available as part of the Graph APIs make Graph systems valuable.

We now know how Graph systems can help us detect fraud, and how tedious it would be to detect the same fraud using non-graph databases. We also looked at the predictive approach, where the power of Machine Learning can be combined with Graph DB to predict fraud.

Thank you for taking the time to read this article. I hope you found it helpful.
