EzDevInfo.com

Hive interview questions

Top frequently asked Hive interview questions

How to get/generate the create statement for an existing hive table?

Assuming you already have a table in Hive, is there a quick way, as in other databases, to get the "CREATE" statement for that table?
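
A minimal sketch of one way to do this, assuming a reasonably recent Hive release that supports the SHOW CREATE TABLE statement; the table name my_table is a placeholder:

-- Prints a CREATE TABLE statement that would recreate the table,
-- including columns, SerDe, storage format and location.
SHOW CREATE TABLE my_table;

On older versions, DESCRIBE FORMATTED my_table lists much the same column and storage details, though not as a ready-to-run statement.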


Source: (StackOverflow)

How to select current date in Hive SQL

How do we get the current system date in Hive? In MySQL we have select now(); can anyone please help me get the equivalent query? I am very new to Hive. Is there proper documentation for Hive that gives detailed information about the pseudo-columns and built-in functions?
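
A minimal sketch, hedged on the Hive version: current_date and current_timestamp are available in newer releases (1.2 and later), while the unix_timestamp()/from_unixtime() pair works on older ones:

-- Newer Hive releases:
SELECT current_date;
SELECT current_timestamp;

-- Older releases: format the current Unix epoch as a date string.
-- (On versions without FROM-less SELECT, run it against any existing table with LIMIT 1.)
SELECT from_unixtime(unix_timestamp(), 'yyyy-MM-dd');

For the built-in function reference, the Hive wiki's "LanguageManual UDF" page lists the date and time functions.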


Source: (StackOverflow)


Hive getting top n records in group by query

I have the following table in Hive:

user-id, user-name, user-address,clicks,impressions,page-id,page-name

I need to find the top 5 users [user-id, user-name, user-address] by clicks for each page [page-id, page-name].

I understand that we need to first group by [page-id, page-name], and within each group order by [clicks, impressions] descending, and then emit only the top 5 users [user-id, user-name, user-address] for each page, but I am finding it difficult to construct the query.

How can we do this using a Hive UDF?
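
A minimal sketch using Hive's windowing functions (available since Hive 0.11) rather than a custom UDF; the underscore column names and the table name page_clicks are assumptions, since hyphens are not valid in Hive identifiers:

-- Rank users within each page by clicks (impressions as a tie-breaker),
-- then keep only the first five per page.
SELECT page_id, page_name, user_id, user_name, user_address, clicks
FROM (
  SELECT page_id, page_name, user_id, user_name, user_address, clicks,
         row_number() OVER (PARTITION BY page_id, page_name
                            ORDER BY clicks DESC, impressions DESC) AS rnk
  FROM page_clicks
) t
WHERE rnk <= 5;

On versions before 0.11, the same result typically requires a custom UDF/UDAF or a reduce-side script, which is presumably what the question is alluding to.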


Source: (StackOverflow)

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive ?

From my understanding, HBase avoids using MapReduce and has column-oriented storage on top of HDFS. Hive is an SQL-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.


Source: (StackOverflow)

How does Hive compare to HBase?

I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.


Source: (StackOverflow)

How do I output the results of a HiveQL query to CSV?

We would like to put the results of a Hive query into a CSV file. I thought the command should look like this:

insert overwrite directory '/home/output.csv' select books from table;

When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way?

Thanks!
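
A minimal sketch of one approach, hedged on the Hive version (ROW FORMAT on INSERT ... DIRECTORY is supported from Hive 0.11); the directory path and the table name my_table are placeholders:

-- Without LOCAL, the path is treated as an HDFS directory, which is one common
-- reason the output never shows up on the local filesystem.
INSERT OVERWRITE LOCAL DIRECTORY '/home/output_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT books FROM my_table;

Hive writes one or more files (e.g. 000000_0) into that directory rather than a single named .csv file, so the pieces may still need to be concatenated afterwards.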


Source: (StackOverflow)

Difference between Pig and Hive? Why have both? [closed]

My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS (PDF link).

I understand that-

  • Pig's language, Pig Latin, is a shift away from the SQL-like declarative style of programming (it suits the way programmers think), while Hive's query language closely resembles SQL.

  • Pig sits on top of Hadoop and in principle can also sit on top of Dryad. I might be wrong, but Hive is closely coupled to Hadoop.

  • Both Pig Latin and Hive commands compile to Map and Reduce jobs.

My question: what is the goal of having both when one (say, Pig) could serve the purpose? Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?


Source: (StackOverflow)

Hive and Hadoop version

How can I find out which Hive version I am using from the command prompt? Below are the details:

I am using PuTTY to connect to the Hive table and access records in the tables. So what I did was: I opened PuTTY, typed leo-ingesting.vip.name.com in the host name field, and clicked Open. Then I entered my username and password, followed by a few commands to get to Hive SQL. Below is what I did:

$ bash
bash-3.00$ hive
Hive history file=/tmp/rkost/hive_job_log_rkost_201207010451_1212680168.txt
hive> set mapred.job.queue.name=hdmi-technology;
hive> select * from table LIMIT 1;

So is there any way, from the command prompt, to find out which Hive version I am using, and the Hadoop version too?
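
A minimal sketch, assuming the hive and hadoop binaries are on the PATH (on some older distributions hive --version is not available, in which case the packaging tool or the installed jar names reveal the version):

# Ask each tool to report its own version.
hive --version
hadoop version

Both commands can be run from the same PuTTY session shown above, before launching the hive shell.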


Source: (StackOverflow)

Can OLAP be done in BigTable?

In the past I used to build web analytics using OLAP cubes running on MySQL. Now, an OLAP cube the way I used it is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. pagename, useragent, ip, etc.) and a bunch of values (e.g. how many pageviews, how many visitors, etc.).

The queries that you run on a table like this are usually of the form (meta-SQL):

SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' AND pagename='Homepage' AND browser!='googlebot'
GROUP BY hour

So you get the totals for each hour of the selected day with the mentioned filters. One snag was that these cubes usually meant a full table scan (for various reasons), and this put a practical limit on the size (in MiB) you could make these things.

I'm currently learning the ins and outs of Hadoop and the like.

Running the above query as a MapReduce job on a BigTable looks easy enough: simply make 'hour' the key, filter in the map, and reduce by summing the values.

Can you run a query like the one I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface, where the user gets their answer ASAP) instead of batch mode?

If not, what is the appropriate technology for doing something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?


Source: (StackOverflow)

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with the metastore.

The test should fire up a Hive server, load some data into a table, run some non-trivial query on that table, and check the results.

I've wired up a Spring context according to the Spring reference. However, the job fails on the MapReduce phase, complaining that no Hadoop binary exists:

java.io.IOException: Cannot run program "/usr/bin/hadoop" (in directory "/Users/yoni/opower/workspace/intellij_project_root"): error=2, No such file or directory

The problem is that the Hive server is running in-memory but relies upon a local installation of Hadoop in order to run. For my project to be self-contained, I need the Hive services to be embedded, including the HDFS and MapReduce clusters. I've tried starting up a Hive server using the same Spring method and pointing it at MiniDFSCluster and MiniMRCluster, similar to the pattern used in the Hive QTestUtil source and in HBaseTestUtility. However, I've not been able to get that to work.

After three days of trying to wrangle Hive integration testing, I thought I'd ask the community:

  1. How do you recommend I integration test Hive jobs?
  2. Do you have a working JUnit example for integration testing Hive jobs using in-memory HDFS, MR, and Hive instances?

Additional resources I've looked at:

Edit: I am fully aware that working against a Hadoop cluster - whether local or remote - makes it possible to run integration tests against a full-stack Hive instance. The problem, as stated, is that this is not a viable solution for effectively testing Hive workflows.


Source: (StackOverflow)

COLLECT_SET() in Hive, keep duplicates?

Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that have the same key into an array, with duplicates.

For example:

hash_id | num_of_cats
=====================
ad3jkfk            4
ad3jkfk            4
ad3jkfk            2
fkjh43f            1
fkjh43f            8
fkjh43f            8
rjkhd93            7
rjkhd93            4
rjkhd93            7

should return:

hash_agg | cats_aggregate
===========================
ad3jkfk   Array<int>(4,4,2)
fkjh43f   Array<int>(1,8,8)
rjkhd93   Array<int>(7,4,7)
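
A minimal sketch, hedged on the Hive version: collect_list() (added in Hive 0.13.0) keeps duplicates, unlike collect_set(); the table name cat_counts is a placeholder:

-- collect_list keeps every value per key, duplicates included.
SELECT hash_id, collect_list(num_of_cats) AS cats_aggregate
FROM cat_counts
GROUP BY hash_id;

On releases older than 0.13, a custom UDAF (or a third-party one such as Brickhouse's collect) is the usual workaround.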

Source: (StackOverflow)

Hive cluster by vs order by vs sort by

As far as I understand:

  • sort by only sorts within each reducer

  • order by orders things globally, but shoves everything into one reducer

  • cluster by intelligently distributes rows to reducers by the key hash and does a sort by

So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducers, but what about adjacent keys?

The only documentation I can find on this is here, and from the example it seems like it orders them globally. But from the definition, I feel like it doesn't always do that.
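
A minimal sketch of the four variants, assuming a hypothetical table page_clicks(page_id INT, clicks INT); the comments reflect standard Hive semantics:

-- ORDER BY: one reducer, total (global) order over the whole result.
SELECT page_id, clicks FROM page_clicks ORDER BY clicks;

-- SORT BY: each reducer sorts its own slice; no global order across reducers.
SELECT page_id, clicks FROM page_clicks SORT BY clicks;

-- DISTRIBUTE BY + SORT BY: all rows for a page_id go to the same reducer
-- and are sorted there.
SELECT page_id, clicks FROM page_clicks DISTRIBUTE BY page_id SORT BY page_id, clicks;

-- CLUSTER BY: shorthand for DISTRIBUTE BY x SORT BY x on the same column(s).
SELECT page_id, clicks FROM page_clicks CLUSTER BY page_id;

Because the output of CLUSTER BY is still split across reducers, each output file is sorted but their concatenation is not, so it does not by itself guarantee a global order.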


Source: (StackOverflow)

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

While trying to make a copy of a partitioned table using the commands in the hive console:

CREATE TABLE copy_table_name LIKE table_name;
INSERT OVERWRITE TABLE copy_table_name PARTITION(day) SELECT * FROM table_name;

I initially got some semantic analysis errors and had to set:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

I'm not sure what the above properties do, though.
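
A minimal sketch of what those two settings enable, using copy_table_name, table_name and the day partition column from the question; col1/col2 and the partition value are placeholders:

-- Static partitioning: the partition value is written into the statement,
-- so no special settings are needed.
INSERT OVERWRITE TABLE copy_table_name PARTITION (day='2012-06-25')
SELECT col1, col2 FROM table_name WHERE day='2012-06-25';

-- Dynamic partitioning: Hive derives the partition value(s) from the trailing
-- column(s) of the SELECT. hive.exec.dynamic.partition turns the feature on,
-- and mode=nonstrict allows every partition to be dynamic (strict mode insists
-- on at least one static partition).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE copy_table_name PARTITION (day)
SELECT col1, col2, day FROM table_name;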

Full output from the hive console:

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201206191101_4557, Tracking URL = http://jobtracker:50030/jobdetails.jsp?jobid=job_201206191101_4557
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=master:8021 -kill job_201206191101_4557
2012-06-25 09:53:05,826 Stage-1 map = 0%,  reduce = 0%
2012-06-25 09:53:53,044 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201206191101_4557 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

Source: (StackOverflow)

Querying on multiple Hive stores using Apache Spark

I have a Spark application which successfully connects to Hive and queries Hive tables using the Spark engine.

To build this, I just added hive-site.xml to the application's classpath, and Spark reads the hive-site.xml to connect to its metastore. This method was suggested on Spark's mailing list.

So far so good. Now I want to connect to two Hive stores, and I don't think adding another hive-site.xml to my classpath will help. I referred to quite a few articles and Spark mailing list threads, but could not find anyone doing this.

Can someone suggest how I can achieve this?

Thanks.

Docs referred:


Source: (StackOverflow)

Difference between Hive internal tables and external tables?

Can anyone tell me the difference between Hive's external tables and internal tables? I know the difference comes into play when dropping a table, but I don't understand what it means that both the data and the metadata are deleted for internal tables while only the metadata is deleted for external tables. Can anyone explain it to me in terms of what actually happens on the nodes, please?
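
A minimal sketch of the two table types; the table names and the /data/logs path are placeholders:

-- Managed (internal) table: Hive owns the data, which lives under its
-- warehouse directory on HDFS. DROP TABLE removes both the metastore entry
-- and the underlying HDFS files.
CREATE TABLE logs_managed (id INT, msg STRING);

-- External table: Hive only records metadata pointing at an existing HDFS
-- location. DROP TABLE removes the metastore entry but leaves the files
-- under /data/logs untouched on the datanodes.
CREATE EXTERNAL TABLE logs_external (id INT, msg STRING)
LOCATION '/data/logs';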


Source: (StackOverflow)