Update in Hive

Can we do row-level updates in Hive? For years we have known UPDATE as an unsupported operation in Hive. But guess what: it is no longer a constraint, as UPDATE is available starting with Hive 0.14.
In this post we'll learn the details of UPDATE in Hive (a long-awaited operation on the Hadoop platform, much in demand among Big Data engineers).

We will cover the following details:
1. The prerequisites for Hive to perform an update.
2. The syntax of UPDATE.
3. The Hive configuration settings needed to perform an update.
4. A close look at what happens at the Hadoop file system level when an update is performed.
5. Limitations of the UPDATE operation in Hive.


For a Hive table to be updatable (as of the date of publishing this article):
1. It should be stored in the ORC file format.
2. It should be bucketed.
3. It should support transactions (ACID semantics enabled).

Standard Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression] 

To explain, I'll take data from a fictitious retail organization's order management system, consisting of the order number and details of the item ordered.

Schema of table:
Our table will consist of the following columns:
1. Order number
2. Item UPC (Universal Product Code of the item ordered)
3. Item name (a short description of the item ordered)
4. Item category (the category to which the item belongs)
5. Item quantity (the quantity of the item ordered)

Now let's first create this table in Hive with the prerequisites required to make it eligible for UPDATE.
DDL:
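A minimal sketch of that DDL (the original post showed it as an image; column names other than order_nbr are illustrative, inferred from the schema above):

CREATE TABLE order_details (
  order_nbr      INT,
  item_upc       STRING,
  item_name      STRING,
  item_category  STRING,
  item_qty       INT
)
CLUSTERED BY (order_nbr) INTO 128 BUCKETS   -- bucketing is mandatory for updatable tables
STORED AS ORC                               -- ORC is the only file format supporting UPDATE
TBLPROPERTIES ('transactional'='true');     -- enables ACID semantics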


I am bucketing the table by order_nbr into 128 buckets.
While inserting data into this table, ensure that you enforce bucketing by setting hive.enforce.bucketing to true (see the load sketch after the file listing below). The underlying data files (128 of them, created by 128 reducers to match the number of buckets) are as follows:
Underlying HDFS files:
$hdfs dfs -ls /user/akshay/order_details/
/user/akshay/order_details/000000_0
/user/akshay/order_details/000001_0
.
.
.
/user/akshay/order_details/000127_0
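For reference, a load that honors the bucketing could look like this (a sketch; staging_order_details is a hypothetical source table holding the raw data):

SET hive.enforce.bucketing=true;  -- route rows through 128 reducers, one per bucket
INSERT INTO TABLE order_details
SELECT order_nbr, item_upc, item_name, item_category, item_qty
FROM staging_order_details;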


Let's pick one of the records to update.

Record (the original post showed this as a screenshot; the values below are reconstructed from the orcfiledump output later in this post):

order_nbr  item_upc     item_name                  item_category  item_qty
1244       12323669092  Kingston 12 GB Flashdrive  Electronics    32
Update - Suppose the company wanted to order 150 Kingston Flashdrives instead of 32; we need to update the respective record in the table to have item_qty = 150 and not 32. We'll update it as follows.
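A statement along these lines does it (the original post showed the query as a screenshot; the WHERE predicate is reconstructed from the record above):

UPDATE order_details
SET item_qty = 150
WHERE order_nbr = 1244
  AND item_upc = '12323669092';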

Note - If you are using the Cloudera distribution of Hadoop, Hue may give you issues executing the update query. Enclose your update queries in a .hql script and execute the script through the Beeline client.
Make sure to change the transaction manager from the default DummyTxnManager to DbTxnManager; the DummyTxnManager does not let Hive support transactions. In addition to the transaction manager, make sure to enable concurrency, enforce bucketing, and set the dynamic partitioning mode to nonstrict. These are the session-level prerequisites for Hive to support UPDATE.
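In a Beeline session that translates to the following settings:

SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;  -- replaces DummyTxnManager
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;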
Now let's check the record and see whether the value got updated.
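A simple lookup does the check (again, the original post showed the query as a screenshot):

SELECT * FROM order_details WHERE order_nbr = 1244;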

Result (updated row):
1244  12323669092  Kingston 12 GB Flashdrive  Electronics  150

So we got the record with the desired updated value of 150 for item_qty.
Now the curiosity is: what exactly happened in the underlying HDFS to fetch us this new value?

$hdfs dfs -ls /user/akshay/order_details/
/user/akshay/order_details/000000_0
/user/akshay/order_details/000001_0
.
.
.
/user/akshay/order_details/000127_0
/user/akshay/order_details/delta_0000014_0000014/bucket_00000
So we see an extra delta_0000014_0000014 directory with a bucket_00000 file in it. This is the file that actually contains the new record value for the data in our original order_details table.
Reading it will clarify more details. Let's read it using the orcfiledump utility as follows:

hive --orcfiledump -d 
/user/akshay/order_details/delta_0000014_0000014/bucket_00000

Output:
16/04/27 18:07:21 INFO orc.ReaderImpl: Reading ORC rows from /user/akshay/order_details/delta_0000014_0000014/bucket_00000 with {include: null, offset: 0, length: 9223372036854775807}

{"operation":1,"originalTransaction":0,"bucket":0,"rowId":84,"currentTransaction":14,"row":{"_col1":"1244","_col2":"12323669092","_col3":"Kingston 12 GB Flashdrive","_col4":"Electronics","_col5":"150"}}

As we see, this new delta file contains the transaction detail, with the updated record carrying the new item quantity of 150 (the "_col5":"150" field). At read time, Hive merges this delta ORC file with the base files to fetch the new, updated value.
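Delta directories accumulate with every transaction, and Hive's compactor periodically merges them back into base files. Compaction can also be requested manually (a sketch; compaction normally runs automatically in the background):

ALTER TABLE order_details COMPACT 'major';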

Limitations of the UPDATE operation in Hive (as of the date of publishing this article):
1. The referenced column must be a column of the table being updated.
2. The value assigned must be an expression that Hive supports in the SELECT clause. Thus arithmetic operators, UDFs, casts, literals, etc. are supported. Subqueries are not supported.
3. Only rows that match the WHERE clause will be updated.
4. Partitioning columns cannot be updated.
5. Bucketing columns cannot be updated.
