In Hive, we can optimize a query by using STREAMTABLE hint. We can specify it in SELECT query with JOIN. During the map/reduce stage of JOIN, a table data can be streamed by using this hint.
SELECT /*+ STREAMTABLE(table1) */ table1.val, table2.val
FROM table1 JOIN table2 ON (table1.key = table2.key1)
In above query we are using table1 as a stream.
If we do not use STREAMTABLE hint then Hive will stream the right most table in the JOIN query.
In Hive, TOUCH clause in ALTER statement is used to read the metadata and write it back. This operation will modify the last accessed time of a partition in Hive.
With TOUCH statement we can also execute the POST and PRE hooks on a table partition.
This statement cannot be used for creating a table or partition if it does not exist yet.
We use OVERWRITE clause in CREATE TABLE statement to delete the existing data and write new data in a Hive table. Essentially, as the name suggests, OVERWRITE helps in overwriting existing data in a Hive table.
We can use following options to connect an application a Hive server:
- JDBC Driver: We can use JDBC Driver with embedded as well as remote access to connect to
HiveServer. This is for Java based connectivity.
- Python Client: For Python language application there is Python client that can connect to Hive server.
- Ruby Client: With Ruby client driver also we can connect to Hive server.
- Thrift Client: We can use Beeline command line shell to connect to Hive server over Thrift. For production mode, this is one of the very good options. It is a secure option for production use. Also we do not need to grant HDFS access to users for using Thrift client.
TRIM and RPAD functions are for processing String data type in Hive.
With TRIM function we can delete the spaces before and after a String. It is very useful for formatting user input in which user may have entered extra spaces. The other variations of TRIM function are LTRIM and RTRIM that remove spaces from left and right side of the string respectively.
E.g. TRIM(‘ Smith ’)
RPAD function is used to add padding (extra spaces) in a String on the right hand side. So that String reaches a specified length. LPAD function is same as RPAD but it pads on the left hand side of String.
E.g. Let say we have a String “Hello”.
We can also specify our optional padding character in RPAD and LPAD functions.
These functions are similar to the ones in SQL.
We can use following commands in Hive to recursively access sub-directories:
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Once above options are set to true, Hive will recursively access sub-directories of a directory in MapReduce.
We can convert some of the SELECT queries in Hive into single FETCH task. With this optimization, latency of SELECT query is decreased.
To use this we have to set the value of hive.fetch.task.conversion parameter. The permissible values are:
- 0: It means FETCH is disabled.
- 1: It is minimal mode. SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
- 2: It is more mode: SELECT, FILTER, LIMIT only (including virtual columns)
“more” can even take UDF expressions in the SELECT clause.
We use Optimized Row Columnar (ORC) file format to store data efficiently in Hive. It is used for performance improvement in reading, writing and processing of data.
In ORC format, we can overcome the limitations of other Hive file formats. Some of the advantages of ORC format are:
<li>There is single file as the output of each task. This reduces load on NameNode.</li>
<li>It supports date time, decimal, struct, map etc complex types.</li>
<li>It stores light-weight indexes within the file.</li>
<li>We can bound the memory used in read/write of data.</li>
<li>It stores metadata with Protocol Buffers that supports add/remove of fields.</li>
Hive is mainly used for Datawarehouse applications. Hive used Hadoop and MapReduce that put some restrictions on use cases for Hive.
Some of the main use cases for Hive are:
Analysis of static Big data
<li>Applications in which less responsive time is acceptable</li>
<li>Analysis of data that does not change rapidly</li>
Managed tables are the tables in which files, metadata and statistics etc are managed by internal Hive processes. Hive creates Managed tables by default. When we drop a managed table or partition, then all the metadata and data associated with the table is also deleted.
We use Managed tables, when we want Hive to manage the lifecycle of a table. Even for temporary tables, we use managed tables in Hive.
When we run DESCRIBE FORMATTED tableName statement, it displays whether a table is MANAGED_TABLE or EXTERNAL_TABLE.