Hive Map-Join configuration mystery

These parameters determine when Hive uses a map join instead of a common join, which ultimately affects query performance. A map join is used when one of the joined tables is small enough to fit in memory, which makes it very fast. Here is an explanation of all the parameters: hive.auto.convert.join When this parameter set …
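As a rough illustration of the kind of setting discussed here, the sketch below enables automatic map-join conversion for a session. The property names are real Hive settings, but the threshold value is only illustrative and the tables `orders` and `countries` are hypothetical, not taken from the post.

```sql
-- Let Hive convert a common join into a map join when it judges it safe
SET hive.auto.convert.join=true;

-- Size threshold in bytes below which a table counts as "small"
-- (illustrative value; tune for your cluster)
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical tables: a large fact table joined to a small lookup table.
-- With the settings above, Hive may load `countries` into memory
-- and skip the shuffle phase of a common join.
SELECT o.order_id, c.country_name
FROM orders o
JOIN countries c ON o.country_code = c.country_code;
```

Running `EXPLAIN` on the query is one way to check whether a map join was actually chosen.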

Difference between hive.tez.container.size and tez.task.resource.memory.mb

hive.tez.container.size This property specifies the Tez container size. Its value should usually be the same as, or a small multiple (1 or 2 times) of, the YARN container size yarn.scheduler.minimum-allocation-mb, and should not exceed the value of yarn.scheduler.maximum-allocation-mb. As a general rule, don’t set it higher than the memory available per processor, as you want 1 processor per container and …
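A minimal sketch of the sizing rule described above, under assumed cluster values: the YARN figures in the comments are hypothetical and should be replaced with the values from your own yarn-site.xml.

```sql
-- Assumption for illustration only:
--   yarn.scheduler.minimum-allocation-mb = 2048
--   yarn.scheduler.maximum-allocation-mb = 8192
-- The container size is given in MB. Here it is 2 x the assumed YARN minimum
-- and stays below the assumed YARN maximum.
SET hive.tez.container.size=4096;
```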

Hive dynamic partitioning

You need to modify your select. I am not sure which column of your demo staging table you want to partition on, or which column in demo corresponds to land. Whichever column it is, it should be the last column in the select. Say your demo table column name is id, so your …
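The "partition column goes last" rule can be sketched as follows. The table names `target_tbl` and `demo_staging` and the columns `col1` and `col2` are hypothetical placeholders. The sketch assumes the target is partitioned by `land` and that `id` in the staging table supplies the partition value.

```sql
-- Dynamic partitioning must be enabled; nonstrict mode allows
-- every partition column to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive matches dynamic partition columns by position, not by name,
-- so the partition value must be the last expression in the SELECT.
INSERT OVERWRITE TABLE target_tbl PARTITION (land)
SELECT col1,
       col2,
       id AS land   -- the alias is cosmetic; only the position matters
FROM demo_staging;
```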

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I made the modifications below and was able to start the Hive shell without any errors: 1. ~/.bashrc Add the environment variables below at the end of the file: sudo gedit ~/.bashrc 2. hive-site.xml You have to create this file (hive-site.xml) in the conf directory of Hive and add the details below 3. You also …

fs.hdfs.impl.disable.cache caused SparkSQL very slow

This question is related to another one: Hive/Hadoop intermittent failure: Unable to move source to destination. We found that we could avoid the problem of “Unable to move source … Filesystem closed” by setting fs.hdfs.impl.disable.cache to true. However, we also observed that the SparkSQL queries became very slow; queries that used to finish …

Hive: how to show all partitions of a table?

I have a table with 1000+ partitions. The “show partitions” command lists only a small number of them. How can I show all partitions? Update: I found that the “show partitions” command lists exactly 500 partitions, and “select … where …” processes only those 500 partitions!

Difference between INNER JOIN and LEFT SEMI JOIN

An INNER JOIN can return data from the columns of both tables, and can duplicate values of records on either side that have more than one match. A LEFT SEMI JOIN can only return columns from the left-hand table, and yields one of each record from the left-hand table where there is one or more matches in the right-hand table …
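The difference is easiest to see with the two queries side by side. The tables `customers(id, name)` and `orders(customer_id, amount)` are hypothetical and used only for illustration.

```sql
-- INNER JOIN: one output row per matching pair, so a customer with
-- three orders appears three times. Columns from both tables are available.
SELECT c.id, c.name, o.amount
FROM customers c
JOIN orders o ON c.id = o.customer_id;

-- LEFT SEMI JOIN: each customer with at least one order appears once.
-- Only left-hand columns may appear in the SELECT list.
SELECT c.id, c.name
FROM customers c
LEFT SEMI JOIN orders o ON c.id = o.customer_id;
```

The semi join behaves like a `WHERE EXISTS` filter on the left-hand table, which is why it never multiplies rows.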

Hive’s unix_timestamp and from_unixtime functions

From the language manual: Convert time string with given pattern to Unix time stamp (in seconds). The result of this function is in seconds. Your result changes with the milliseconds portion of the date, but the unix functions only support seconds. For example: SELECT unix_timestamp('10-Jun-15 10.00.00 AM', 'dd-MMM-yy hh.mm.ss a'); returns 1433930400. SELECT from_unixtime(1433930400, 'dd-MMM-yy hh.mm.ss …