Graph Technology - In plain English



Graph processing systems have become a prominent buzzword in the computing world in recent years. Is this just a passing trend that is here today and will vanish tomorrow?

There is no harm in being skeptical. But with knowledge graphs recognized as a top emerging technology trend by the industry research firm Gartner, there is good reason to believe that graph technology is here to stay. So let's dive in and discover what makes graph technology special and why it's gaining momentum.

What industry experts are saying about Graph Technology:

Knowledge graphs have been recognized by Gartner among the top 5 emerging technology trends, alongside IoT and Blockchain, that will blur the lines between humans and machines. I have included a snippet from Gartner's press release below. The full official press release is available in Gartner's Newsroom.



Snippet from Gartner's Newsroom Press Release 


An important note before we proceed further. If you do a simple Google search for graphs, you will mostly get bar charts and line graphs. But this is not what we'll be talking about here. When I say graph, please do not imagine this:


[Image: a function plotted as a line graph. Source: CK-12 Foundation, "Function Rules based on Graphs"]


Instead you have to imagine this:


Why Graph systems are gaining popularity:

The world has witnessed an exponential growth in data volume in recent years. According to Forbes, we have generated more data in recent years than in the entire history of humanity. While Big Data analytics systems have helped process data of large volume, velocity and variety, the challenge remains when dealing with vast amounts of interconnected data. This is where graph systems step in and offer a unique solution.

The evolution of Graph Technology:

The history of graph theory dates back to 1736. But much like machine learning, which dates back to the 1950s and gained wide popularity only in recent years, graph technology has only recently started getting its share of recognition. Two key factors have contributed to the rise of both technologies: the advancement in computational power of modern hardware and the emergence of distributed processing paradigms. These advancements have opened the door to solving complex problems that were previously impractical.

Why is there a need for Graph Processing Systems:

What is it that we cannot easily achieve with our current options, and why do we need graph databases?
To answer this, let's consider a social network of friends like Facebook as an example. Such a network of connected friends looks like this:

Well, the above graph appears to be really complex and dense. Let me simplify this a bit, for us to imagine the relationships. Here is a simplified version of this network of friends:

Here each connection between two individuals represents a friendship relation.
Suppose we want to know all friends of friends of a person named Akshay. A relational database, when modelled appropriately, can also easily fetch this info.

But the problem arises when we are interested in friends of friends of friends of Akshay, or in even deeper levels of friendship, say 10-20 levels deep.
To achieve this, traditional data processing systems would require one join operation per level, which makes the queries complex to write and expensive to compute.
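To make the join problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the helper name `friends_at_hops` and every name other than Akshay are assumptions made up for this illustration. It simply builds the query a relational model would need, adding one self-join for every extra level of friendship.

```python
import sqlite3

# Hypothetical friendship data; every name except Akshay is invented.
FRIENDSHIPS = [
    ("Akshay", "Bina"),
    ("Bina", "Chirag"),
    ("Chirag", "Divya"),
    ("Divya", "Esha"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendship (person TEXT, friend TEXT)")
# Friendship is mutual, so store each pair in both directions.
conn.executemany(
    "INSERT INTO friendship VALUES (?, ?)",
    FRIENDSHIPS + [(b, a) for a, b in FRIENDSHIPS],
)


def friends_at_hops(conn, person, hops):
    """Build and run a query that needs one self-join per extra level."""
    joins = "".join(
        f" JOIN friendship f{i} ON f{i}.person = f{i - 1}.friend"
        for i in range(2, hops + 1)
    )
    sql = (
        f"SELECT DISTINCT f{hops}.friend FROM friendship f1{joins} "
        "WHERE f1.person = ?"
    )
    return sql, sorted(row[0] for row in conn.execute(sql, (person,)))


sql, people = friends_at_hops(conn, "Akshay", 3)
print(sql)     # the generated query contains two JOINs for three levels
print(people)  # ['Bina', 'Divya']
```

Notice that the result at three levels contains Bina again, because plain joins happily walk back over people already visited. Filtering those out, and going 10-20 levels deep, makes the query grow longer and the intermediate results grow larger with every level.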

So what do Graph Systems offer differently?

Graph systems distinguish themselves primarily in how they store data. Relationships are treated as first-class citizens in graph systems.
They inherently preserve and store the relationships between connected entities of data. This is one of their key benefits, and one that an RDBMS or other general-purpose data processing solution does not provide out of the box. Well, this is not comparing a rose with a lily, as relational databases and other data processing engines have their own set of use cases, for which they are the best candidates.

Question - So how would a Graph database make it easy to fetch friendship relations that are 10-20 levels deep?

Answer - If we model our social network in a Graph system, with people as vertices and their friendship relation as edges, this computation would boil down to a simple and quick graph traversal. 

Wondering how? 




Two key benefits here:
  1. Relationships are first-class citizens in graph storage, meaning each friendship relation is already stored as an edge in the graph. No additional joins or expensive operations are required to compute the relationships. This is the key to the fast performance of graph systems for connected data.
  2. Graph traversal using specialized graph query languages, such as Gremlin and Cypher, is much simpler and more concise than the complex queries or code needed in non-graph systems.
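The traversal idea can be sketched in a few lines of plain Python. This is only an illustration, not how any particular graph database is implemented: the adjacency-list dictionary, the function name `friends_within` and all names other than Akshay are assumptions for this example. People are vertices, each friendship is a stored edge, and a breadth-first walk follows the edges level by level.

```python
from collections import deque

# Hypothetical friendship graph stored as an adjacency list:
# each person (vertex) maps directly to their friends (edges).
FRIENDS = {
    "Akshay": ["Bina"],
    "Bina": ["Akshay", "Chirag"],
    "Chirag": ["Bina", "Divya"],
    "Divya": ["Chirag", "Esha"],
    "Esha": ["Divya"],
}


def friends_within(graph, start, max_depth):
    """Return {person: depth} for everyone reachable within max_depth hops."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if depth_of[person] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, []):
            if friend not in depth_of:  # visit each person only once
                depth_of[friend] = depth_of[person] + 1
                queue.append(friend)
    del depth_of[start]  # the starting person is not their own friend
    return depth_of


print(friends_within(FRIENDS, "Akshay", 3))
# {'Bina': 1, 'Chirag': 2, 'Divya': 3}
```

Going 10 or 20 levels deep only means changing `max_depth`. The code does not grow with the depth, and each person is visited at most once.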


Use Cases of Graph technology:

The following are four top business use cases of Graph systems:

1. Fraud Detection - Detect fraudulent entities in your system by analyzing their connections with other suspicious physical or human entities in the same system.

2. Master Data Management - Achieve a 360-degree view of customers through identity resolution of a person across the different departments of an organization.

3. Social Network Analysis - Explore associations between individuals in a social network, even at deep levels of relationship.

4. Real-time Recommendation System - Maintain a knowledge graph of customer preferences and purchase history to make accurate recommendations.

All these use cases leverage relationships to achieve the desired goal.

Graph Technology: OLTP vs. OLAP

Every piece of information extracted from a graph requires a traversal of its vertices and edges. There are two kinds of traversals supported by Graph solutions: OLTP and OLAP.

OLTP (Online Transaction Processing): This type of traversal starts from a specific, known point and explores only the part of the graph around it, such as retrieving all friends of friends of John Doe.

OLAP (Online Analytical Processing): OLAP traversals are suitable for bulk analytical operations, where the starting point is not a specific individual. Instead, the entire graph is traversed to gain insights, as in the case of finding all the individuals with ten or more friends.
Apache Spark, with GraphX and GraphFrames, handles OLAP graph processing really well. 
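The difference between the two traversal styles can be sketched in plain Python. This is a toy illustration under stated assumptions, not Spark or any graph database API: the graph, both function names and every name other than John Doe are made up, and the friend threshold is lowered from ten to two so that the tiny sample graph produces a result.

```python
# Hypothetical friendship graph: each person maps to a set of friends.
GRAPH = {
    "John Doe": {"Asha", "Ravi"},
    "Asha": {"John Doe", "Ravi", "Meera"},
    "Ravi": {"John Doe", "Asha"},
    "Meera": {"Asha"},
}


def friends_of_friends(graph, start):
    """OLTP-style: begin at one known vertex and touch only its neighbourhood."""
    direct = graph[start]
    reached = set()
    for friend in direct:
        reached |= graph[friend]
    # Assumption: direct friends and the person themselves are excluded.
    return reached - direct - {start}


def people_with_min_friends(graph, minimum):
    """OLAP-style: scan every vertex in the graph; there is no starting point."""
    return sorted(p for p, friends in graph.items() if len(friends) >= minimum)


print(friends_of_friends(GRAPH, "John Doe"))  # {'Meera'}
print(people_with_min_friends(GRAPH, 2))      # ['Asha', 'John Doe', 'Ravi']
```

The first function reads only a handful of vertices no matter how large the graph is, while the second has to look at all of them, which is why bulk analytical workloads are usually handed to a distributed engine.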


Popular Graph Databases:

The following are some of the Graph databases widely adopted by organizations:

1. DataStax Enterprise Graph
2. Neo4j
3. AnzoGraph DB
4. OrientDB
5. TigerGraph
6. Amazon Neptune
7. ArangoDB
8. JanusGraph



Graph Query Programming Languages:

The following are five key languages for querying graphs:

1. Gremlin - A graph traversal language which is part of the Apache TinkerPop graph computing framework.

2. Cypher query language - A declarative graph query language offering SQL-like access to property graphs.

3. GraphQL - Developed by Facebook, it's a data manipulation and query language for APIs. Despite its name, it is not tied to graph databases.

4. SPARQL - A query language for RDF databases, used to query data stored in RDF format. It was standardized by the World Wide Web Consortium (W3C).

5. AQL (ArangoDB Query Language) - A SQL-like language used in ArangoDB for querying both documents and graphs.


In conclusion, we've now laid the foundation to understand the world of Graph Databases, processing engines and the rich offerings of graph technology. With this knowledge in hand, we've also explored various graph solutions and query languages that are at your disposal. 



In my next publication, I will go a bit deeper into the concepts and will model a graph to address a popular use case: fraud detection.

Stay tuned, and thank you for taking the time to read this article. 
