What is the keyword Context in the Hadoop programming world?

What exactly is the keyword Context in the Hadoop MapReduce world, in new-API terms? It's used extensively to write output pairs out of Map and Reduce tasks, but I am not sure whether it can be used anywhere else, or what exactly happens whenever I use it. Is it an Iterator under a different name? What is …

How to write a subquery and use the “IN” clause in Hive

According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select: “Hive does not support IN, EXISTS or subqueries in the WHERE clause.” Note that this statement describes older releases; Hive 0.13 and later support some IN and EXISTS subqueries in the WHERE clause. You might want to look at https://issues.apache.org/jira/browse/HIVE-801 and https://issues.apache.org/jira/browse/HIVE-1799.

Difference between hive.tez.container.size and tez.task.resource.memory.mb

hive.tez.container.size: This property specifies the Tez container size. Its value should usually be the same as, or a small multiple (1 or 2 times) of, the YARN container size yarn.scheduler.minimum-allocation-mb, and it should not exceed yarn.scheduler.maximum-allocation-mb. As a general rule, don't set it higher than the memory available per processor, as you want 1 processor per container and …
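The sizing rule above can be sketched as a small check. This is only an illustration, not part of Hive or YARN: the function name `check_tez_container_size` and its warning strings are hypothetical, and all values are assumed to be in MB.

```python
def check_tez_container_size(container_mb, yarn_min_mb, yarn_max_mb):
    """Return warnings for a proposed hive.tez.container.size (all values in MB).

    Hypothetical helper: it only encodes the rule of thumb from the text.
    """
    warnings = []
    if container_mb % yarn_min_mb != 0:
        # Should be the same as, or a small multiple of, the YARN minimum.
        warnings.append("not a multiple of yarn.scheduler.minimum-allocation-mb")
    elif container_mb // yarn_min_mb > 2:
        # The text suggests 1 or 2 times the YARN minimum.
        warnings.append("more than 2x yarn.scheduler.minimum-allocation-mb")
    if container_mb > yarn_max_mb:
        warnings.append("exceeds yarn.scheduler.maximum-allocation-mb")
    return warnings
```

For example, with a YARN minimum of 2048 and a maximum of 8192, a proposed size of 4096 produces no warnings, while 16384 produces two.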

First Hadoop project error: “Input path does not exist”

You need to upload your input files to the HDFS file system first: create a directory named /user/DEVUSER/In in HDFS, then copy all *.txt files from the current directory to the cluster (HDFS). You seem to have skipped the chapter “Upload data” in the tutorial. Follow it and your problem should be solved.
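The original commands are not shown in this excerpt. As a rough sketch, assuming the standard `hadoop fs -mkdir` and `hadoop fs -put` shell commands (with `-p`, which assumes Hadoop 2 or later), the two steps could be assembled like this. The helper name `hdfs_upload_commands` is hypothetical; it only builds the command lines and does not run them.

```python
def hdfs_upload_commands(hdfs_dir, local_files):
    """Build the two command lines for uploading local files to HDFS.

    Hypothetical helper: returns argument lists suitable for subprocess.run,
    but does not execute anything itself.
    """
    return [
        # Step 1: create the target directory in HDFS (-p assumes Hadoop 2+).
        ["hadoop", "fs", "-mkdir", "-p", hdfs_dir],
        # Step 2: copy the local files into that directory.
        ["hadoop", "fs", "-put", *local_files, hdfs_dir],
    ]
```

Passing `"/user/DEVUSER/In"` and the result of `glob.glob("*.txt")` would mirror the steps described above; each returned list can then be handed to `subprocess.run(cmd, check=True)` on a machine where the `hadoop` client is installed.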

Hive dynamic partitioning

You need to modify your select. I am not sure which column of your demo staging table you want to partition on, or which column in demo corresponds to land. Whichever column it is, it must appear as the last column in the select. Say your demo table's column name is id, so your …
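To illustrate the "partition column goes last" rule, here is a minimal sketch that assembles such a statement as a string. The helper name, the table names `demo_staging` and `demo`, the column `name`, and the choice of INSERT OVERWRITE are all assumptions for illustration, since the full answer is truncated here.

```python
def dynamic_partition_insert(target, source, columns, partition_col, partition_source_col):
    """Build a Hive INSERT whose SELECT ends with the partition source column.

    Hypothetical helper: table and column names are supplied by the caller.
    """
    # Keep the non-partition columns in their given order, then append the
    # column that feeds the partition so that it is last in the SELECT.
    ordered = [c for c in columns if c != partition_source_col]
    ordered.append(partition_source_col)
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_col}) "
        f"SELECT {', '.join(ordered)} FROM {source}"
    )
```

Calling it with columns `["id", "name"]`, partition column `land`, and source column `id` yields a SELECT of `name, id`, so `id` lands in the `land` partition. Dynamic partitioning typically also requires `hive.exec.dynamic.partition=true` and, when no static partition is given, `hive.exec.dynamic.partition.mode=nonstrict`.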

fs.hdfs.impl.disable.cache made SparkSQL very slow

This question is related to another one: “Hive/Hadoop intermittent failure: Unable to move source to destination”. We found that we could avoid the problem of “Unable to move source … Filesystem closed” by setting fs.hdfs.impl.disable.cache to true. However, we also observed that the SparkSQL queries became very slow: queries that used to finish …

Hive: how to show all partitions of a table?

I have a table with 1000+ partitions. The “show partitions” command lists only a small number of them. How can I show all partitions? Update: I found that “show partitions” lists exactly 500 partitions, and “select … where …” processes only those 500 partitions!

Getting “ERROR: Can’t get master address from ZooKeeper; znode data == null” when using the HBase shell

If you just want to run standalone HBase without getting into ZooKeeper management, remove all the property blocks from hbase-site.xml except the one named hbase.rootdir. Then run /bin/start-hbase.sh. HBase comes with its own ZooKeeper, which is started when you run /bin/start-hbase.sh, and that will suffice if you are trying to get around …
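The "remove every property block except hbase.rootdir" step can be sketched with the Python standard library. This is only an illustration: the function name `keep_only_rootdir` is hypothetical, and it assumes hbase-site.xml follows the usual `<configuration>`/`<property>`/`<name>` layout.

```python
import xml.etree.ElementTree as ET

def keep_only_rootdir(site_xml):
    """Return the hbase-site.xml text with only the hbase.rootdir property kept.

    Hypothetical helper: expects <property> elements directly under the root.
    """
    root = ET.fromstring(site_xml)
    # Iterate over a copy of the list, since we remove elements while looping.
    for prop in list(root.findall("property")):
        name = prop.findtext("name", default="").strip()
        if name != "hbase.rootdir":
            root.remove(prop)
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree drops XML comments and the XML declaration when it parses and re-serializes, so back up the original file before overwriting it with this output.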