Data load status check for a table without need of a trigger file

When designing data pipelines, a common requirement is to know whether the data needed for processing within a particular timeframe is available. Usually this timeframe is one day: a daily incremental ETL process wants to know whether the data for the current date is available before it starts. For instance, a process that runs daily needs to check whether the data for the current date has been loaded into the source directory it reads from.

If you are using Hadoop's HDFS for your data storage, in this article I'll share a piece of code to easily verify whether the data for a desired date is present in a Hive table (which always has an underlying directory in HDFS).

Let us take a real-world scenario where data is loaded automatically into a certain Hive table. This data load can come from a sourcing ingestion process, be the result of an ETL process, or be the outcome of an...
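
The check described above can be sketched in Python as follows. This is not the author's code, only a minimal illustration under stated assumptions: the table is partitioned by date, each partition is a subdirectory named `<partition_col>=YYYY-MM-DD` under the table's HDFS directory, and the `hdfs` command-line client is on the `PATH`. The names `partition_path`, `run_hdfs`, `is_data_loaded`, the default column `load_date`, and the example path are all hypothetical.

```python
import subprocess
from datetime import date
from typing import Callable, Sequence


def partition_path(table_dir: str, day: date, partition_col: str = "load_date") -> str:
    """Build the HDFS directory expected for one day's partition.

    Assumes a layout of <table_dir>/<partition_col>=YYYY-MM-DD; both the
    layout and the default column name are assumptions for this sketch.
    """
    return f"{table_dir.rstrip('/')}/{partition_col}={day.isoformat()}"


def run_hdfs(args: Sequence[str]) -> int:
    """Run `hdfs dfs <args>` and return the process exit code."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def is_data_loaded(
    table_dir: str,
    day: date,
    partition_col: str = "load_date",
    runner: Callable[[Sequence[str]], int] = run_hdfs,
) -> bool:
    """Return True if the partition directory for `day` exists in HDFS.

    `hdfs dfs -test -d <path>` exits with 0 when the path is a directory.
    The runner is injectable so the logic can be exercised without a cluster.
    """
    path = partition_path(table_dir, day, partition_col)
    return runner(["-test", "-d", path]) == 0


if __name__ == "__main__":
    # Hypothetical table location; replace with the table's real directory.
    table_dir = "/warehouse/mydb.db/events"
    if is_data_loaded(table_dir, date.today()):
        print("Data for today is present; safe to start processing.")
    else:
        print("Data for today has not arrived yet.")
```

Note that an existing directory only shows that a load has started writing to that partition, not that it has finished; a stricter check might also confirm the directory is non-empty, for example with `hdfs dfs -count`.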