Data load status check for a table without the need for a trigger file



When designing data pipelines, it is a common requirement to know whether the data required for processing within a particular timeframe is available. Most often this timeframe is one day: a daily incremental ETL process needs to know whether the data for the current date is even available for processing.
For instance, a process executing daily may need to check whether the data for the current date has been loaded into the source directory it reads from. If you are using Hadoop's HDFS for your data storage, this article shares a piece of code to easily verify whether the data for a desired date is present in a Hive table (which always has an underlying directory in HDFS).

Let us take a real-world scenario where data is loaded in an automated fashion into a certain Hive table, which always has an underlying directory in HDFS. This load can come from a sourcing ingestion process, an ETL process, or the output of an application. Whatever the pattern, the data lands in HDFS. One way of knowing the status of today's data load in the HDFS directory is through a trigger-file mechanism: the process loading the data puts a trigger file in place, to let dependent processes know the load completed successfully.
An easier way is the piece of code I am sharing below, which makes this status tracking simple and convenient. It checks the HDFS directory and extracts the date part of the table's file metadata. If it finds files with a matching date, it records the status, and we can take the required action as per our business need. This tracking is especially beneficial in scenarios where we create a dynamic cluster for processing input data: we can skip creating the cluster if there is not even a single file with the desired date available for processing.
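To illustrate the cluster-gating idea, here is a minimal sketch. The function name data_loaded_today is my own for illustration, and it is stubbed out here so the gating logic can be shown on its own; in a real pipeline its body would run the HDFS check described below.

```shell
#!/bin/bash

# Hypothetical wrapper around the HDFS load-status check.
# Stubbed to "loaded" (exit status 0) for illustration; the real
# version would compare the table directory's date against today.
data_loaded_today() {
    return 0
}

# Gate the expensive step (cluster creation) on the check's result
if data_loaded_today
then
    echo "Data present - proceeding to create the processing cluster"
else
    echo "No file for today's date - skipping cluster creation"
fi
```

The same pattern works for any expensive downstream step, not just cluster creation.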

Table used - I am using the table named akshay_check_load_status, which has an underlying HDFS directory of the same name.

# Today's date, in the same YYYY-MM-DD format that 'hdfs dfs -ls' prints
current_date=$(date +%Y-%m-%d)

# Column 6 of the HDFS listing is the modification date of the table's directory
table_date=$(hdfs dfs -ls /data/core/akshay_db/proj/ | grep akshay_check_load_status | awk '{timestamp= $6  "  "  $7; print timestamp}' | cut -c -10)

if [ "${current_date}" = "${table_date}" ]
then
        echo "The akshay_check_load_status table load completed for ${current_date}"
else
        echo "Waiting for the akshay_check_load_status table to load data for ${current_date}"
fi
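To see what the extraction pipeline actually produces, here is the same awk-and-cut step applied to a single sample line of 'hdfs dfs -ls' output. The sample line (permissions, owner, path, timestamp) is made up for illustration; the real script pipes live HDFS output through the same commands.

```shell
#!/bin/bash

# A made-up HDFS listing line, in the format 'hdfs dfs -ls' prints
sample_line='drwxr-xr-x   - akshay hadoop          0 2020-05-04 09:31 /data/core/akshay_db/proj/akshay_check_load_status'

# awk joins the date ($6) and time ($7) columns; cut keeps the first
# 10 characters, leaving only the YYYY-MM-DD date part
table_date=$(echo "${sample_line}" | awk '{timestamp= $6  "  "  $7; print timestamp}' | cut -c -10)

echo "${table_date}"   # prints 2020-05-04
```

The result is directly comparable to the output of date +%Y-%m-%d, which is why the script can test the two strings for equality.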

