Data load status check for a table without the need for a trigger file



When designing data pipelines, it is a common requirement to know whether the data required for processing within a particular timeframe is available. Most often this timeframe is one day: a daily incremental ETL process needs to know whether the data for the current date is even available for processing.
For instance, a process executing daily may need to check whether the data for the current date has been loaded into the source directory it reads from. If you are using Hadoop's HDFS for your data storage, this article shares a piece of code to easily verify whether the data for a desired date is present in a Hive table (which always has an underlying directory in HDFS).

Let us take a real-world scenario where data is loaded in an automated fashion into a certain Hive table, which always has an underlying directory in HDFS. This load can come from a sourcing ingestion process, an ETL process, or the output of an application. Whatever the pattern, the data lands in HDFS. One way of knowing the status of today's data load in the HDFS directory is through a trigger-file mechanism: the process loading the data puts a trigger file in place, to let dependent processes know the load completed successfully.
An easier way is the piece of code I am sharing below, which makes this status tracking simple and convenient. It checks the HDFS directory and extracts the date part of the table's file metadata. If it finds files with a matching date, it records the status, and we can take the required action as per our business need. This tracking is especially beneficial in scenarios where we create a dynamic cluster for processing input data: we can skip creating the cluster if there is not even a single file with the desired date available for processing.
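To illustrate the cluster-gating idea, here is a minimal sketch. The function name data_loaded_today is my own for illustration, and it is stubbed out here so the gating logic can be shown on its own; in a real pipeline its body would run the HDFS check described below.

```shell
#!/bin/bash

# Hypothetical wrapper around the HDFS load-status check.
# Stubbed to "loaded" (exit status 0) for illustration; the real
# version would compare the table directory's date against today.
data_loaded_today() {
    return 0
}

# Gate the expensive step (cluster creation) on the check's result
if data_loaded_today
then
    echo "Data present - proceeding to create the processing cluster"
else
    echo "No file for today's date - skipping cluster creation"
fi
```

The same pattern works for any expensive downstream step, not just cluster creation.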

Table used - I am using the table named akshay_check_load_status, which has an underlying HDFS directory of the same name.

# Today's date, in the same YYYY-MM-DD format that 'hdfs dfs -ls' prints
current_date=$(date +%Y-%m-%d)

# Column 6 of the HDFS listing is the modification date of the table's directory
table_date=$(hdfs dfs -ls /data/core/akshay_db/proj/ | grep akshay_check_load_status | awk '{timestamp= $6  "  "  $7; print timestamp}' | cut -c -10)

if [ "${current_date}" = "${table_date}" ]
then
        echo "The akshay_check_load_status table load completed for ${current_date}"
else
        echo "Waiting for the akshay_check_load_status table to load data for ${current_date}"
fi
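To see what the extraction pipeline actually produces, here is the same awk-and-cut step applied to a single sample line of 'hdfs dfs -ls' output. The sample line (permissions, owner, path, timestamp) is made up for illustration; the real script pipes live HDFS output through the same commands.

```shell
#!/bin/bash

# A made-up HDFS listing line, in the format 'hdfs dfs -ls' prints
sample_line='drwxr-xr-x   - akshay hadoop          0 2020-05-04 09:31 /data/core/akshay_db/proj/akshay_check_load_status'

# awk joins the date ($6) and time ($7) columns; cut keeps the first
# 10 characters, leaving only the YYYY-MM-DD date part
table_date=$(echo "${sample_line}" | awk '{timestamp= $6  "  "  $7; print timestamp}' | cut -c -10)

echo "${table_date}"   # prints 2020-05-04
```

The result is directly comparable to the output of date +%Y-%m-%d, which is why the script can test the two strings for equality.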

