Top frequently asked Hadoop interview questions

Large scale data processing Hbase vs Cassandra [closed]

I have nearly settled on Cassandra after researching large-scale data storage solutions, but it's generally said that HBase is a better solution for large-scale data processing and analysis.

Both are key/value stores, and both can run a Hadoop layer (Cassandra only recently), so what makes Hadoop a better candidate when processing and analysis are required on large data?

I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

but I'm still looking for concrete advantages of HBase.

I am more convinced by Cassandra because of its simplicity when adding nodes, its seamless replication, and its lack of a single point of failure. It also supports secondary indexes, which is a good plus.

Source: (StackOverflow)

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using Hadoop, HBase, or Hive?

From my understanding, HBase avoids using MapReduce and provides column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.

Source: (StackOverflow)


What is the difference between Apache Spark and Apache Flink?

What are the differences between Apache Spark and Apache Flink?

Will Apache Flink replace Hadoop?

Source: (StackOverflow)

Difference between Pig and Hive? Why have both? [closed]

My background: I am four weeks old in the Hadoop world. I have dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM, and I have read Google's papers on MapReduce and GFS (PDF link).

I understand that:

  • Pig's language, Pig Latin, is a shift away from the SQL-like declarative style of programming (it suits the way programmers think), whereas Hive's query language closely resembles SQL.

  • Pig sits on top of Hadoop and in principle can also sit on top of Dryad. I might be wrong, but Hive is closely coupled to Hadoop.

  • Both Pig Latin and Hive commands compile to Map and Reduce jobs.

My question: what is the goal of having both when one (say Pig) could serve the purpose? Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?

Source: (StackOverflow)

Hadoop examples? [closed]

I'm examining Hadoop as a possible tool with which to do some web log analysis. I want to analyze several kinds of statistics in one run. Each line of my log files has all sorts of potentially useful information that I'd like to aggregate. I'd like to get all sorts of data out of the logs in a single Hadoop run, but the example Hadoop programs I see online all seem to do exactly one thing. This may be because every single example Hadoop program I can find just does word counts. Can I use Hadoop to solve two or more problems at once?

Are there other Hadoop examples, or Hadoop tutorials out there, that don't solve the word count problem?
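To make the "several statistics in one run" idea concrete, here is a minimal local sketch in plain Python rather than Hadoop itself. It assumes a made-up log format of `<ip> <path> <status> <bytes>`, and the statistic names (`hits_by_path`, `hits_by_status`, `bytes_by_ip`) are hypothetical. The point is only that a mapper may emit more than one key per input line, so tagging each key with the statistic it belongs to lets one pass feed several aggregates.

```python
from collections import defaultdict


def map_line(line):
    """Emit several (key, value) pairs per log line, one per statistic."""
    # Assumed line format (not from the question): "<ip> <path> <status> <bytes>"
    ip, path, status, size = line.split()
    yield ("hits_by_path", path), 1
    yield ("hits_by_status", status), 1
    yield ("bytes_by_ip", ip), int(size)


def reduce_values(key, values):
    """All three statistics here are sums, so one reducer covers them."""
    return key, sum(values)


def run_job(lines):
    """Tiny local stand-in for map, shuffle (group by key), and reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_values(k, v) for k, v in groups.items())


LOG = [
    "10.0.0.1 /index.html 200 512",
    "10.0.0.2 /index.html 200 512",
    "10.0.0.1 /missing 404 128",
]
RESULT = run_job(LOG)
```

In this sketch `RESULT` maps each tagged key, such as `("bytes_by_ip", "10.0.0.1")`, to its total. Statistics that are not simple sums would need the reducer to branch on the tag in the key.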

Source: (StackOverflow)

Is there a .NET equivalent to Apache Hadoop?

So, I've been looking at Hadoop with keen interest, and to be honest I'm fascinated; things don't get much cooler.

My only minor issue is I'm a C# developer and it's in Java.

It's not so much that I don't understand Java as that I'm looking for Hadoop.net or NHadoop or a .NET project that embraces the Google MapReduce approach. Does anyone know of one?

Source: (StackOverflow)

Out of memory error in Hadoop

I tried installing Hadoop by following the document at http://hadoop.apache.org/common/docs/stable/single_node_setup.html and then ran this command:

bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' 

I get the following exception:

java.lang.OutOfMemoryError: Java heap space

Please suggest a solution so that I can try out the example. The entire exception is listed below. I am new to Hadoop, so I might have done something dumb. Any suggestion will be highly appreciated.

anuj@anuj-VPCEA13EN:~/hadoop$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
11/12/11 17:38:22 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/11 17:38:22 INFO mapred.FileInputFormat: Total input paths to process : 7
11/12/11 17:38:22 INFO mapred.JobClient: Running job: job_local_0001
11/12/11 17:38:22 INFO util.ProcessTree: setsid exited with exit code 0
11/12/11 17:38:22 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@e49dcd
11/12/11 17:38:22 INFO mapred.MapTask: numReduceTasks: 1
11/12/11 17:38:22 INFO mapred.MapTask: io.sort.mb = 100
11/12/11 17:38:22 WARN mapred.LocalJobRunner: job_local_0001
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
11/12/11 17:38:23 INFO mapred.JobClient:  map 0% reduce 0%
11/12/11 17:38:23 INFO mapred.JobClient: Job complete: job_local_0001
11/12/11 17:38:23 INFO mapred.JobClient: Counters: 0
11/12/11 17:38:23 INFO mapred.JobClient: Job Failed: NA
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1257)
    at org.apache.hadoop.examples.Grep.run(Grep.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.Grep.main(Grep.java:93)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Source: (StackOverflow)

Life without JOINs... understanding, and common practices

Lots of "BAW"s (big ass-websites) use data storage and retrieval techniques that rely on huge tables with indexes, together with queries that won't or can't use JOINs (BigTable, HQL, etc.), to deal with scalability and database sharding. How does that work when you have lots and lots of data that is very related?

I can only speculate that much of this joining has to be done on the application side of things, but doesn't that start to get expensive? What if you have to make several queries to several different tables to get information to compile? Isn't hitting the database that many times starting to get more expensive than just using joins in the first place? I guess it depends on how much data you've got?
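To make that speculation concrete, here is a rough sketch of the two patterns I have in mind, written in plain Python with dictionaries standing in for the data stores. Every name here (`USERS`, `POSTS`, `multi_get`) is hypothetical and does not belong to any real client library. The first function joins on the application side with one batched lookup instead of one query per row; the second copies the related field at write time so that no join is needed at read time.

```python
# Hypothetical in-memory "tables" standing in for two key/value stores.
USERS = {
    1: {"name": "alice"},
    2: {"name": "bob"},
}
POSTS = [
    {"id": 10, "author_id": 1, "title": "Sharding notes"},
    {"id": 11, "author_id": 2, "title": "Life without JOINs"},
    {"id": 12, "author_id": 1, "title": "Denormalize everything"},
]


def multi_get(table, keys):
    """One batched lookup: a single round trip for many keys."""
    return {k: table[k] for k in keys if k in table}


def posts_with_authors(posts, users):
    """Application-side join: two queries in total, not one per post."""
    author_ids = {p["author_id"] for p in posts}
    authors = multi_get(users, author_ids)
    return [
        {"title": p["title"], "author": authors[p["author_id"]]["name"]}
        for p in posts
        if p["author_id"] in authors
    ]


def denormalize(posts, users):
    """Write-time alternative: copy the author name into each post."""
    return [dict(p, author_name=users[p["author_id"]]["name"]) for p in posts]
```

The trade-off in this sketch is that the batched version still costs an extra round trip on every read, while the denormalized version has to rewrite posts whenever a user's name changes.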

And for commonly available ORMs, how do they tend to deal with the inability to use joins? Is there support for this in ORMs that are in heavy usage today? Or do most projects that have to approach this level of data tend to roll their own anyways?

This is not applicable to any current project of mine, but it has been in my head for several months, and I can only speculate as to what the "best practices" are. I've never had a need to address this in any of my projects because they've never reached a scale where it is required. Hopefully this question helps other people as well.

As someone said below, ORMs "don't work" without joins. Are there other data access layers that are already available to developers working with data on this level?

EDIT: For some clarification, Vinko Vrsalovic said:

"I believe snicker is wants to talk about NO-SQL, where transactional data is denormalized and used in Hadoop or BigTable or Cassandra schemes."

This is indeed what I'm talking about.

Bonus points for those who catch the xkcd reference.

Source: (StackOverflow)

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with the metastore.

The test should fire up a Hive server, load some data into a table, run some non-trivial query on that table, and check the results.

I've wired up a Spring context according to the Spring reference. However, the job fails on the MapReduce phase, complaining that no Hadoop binary exists:

java.io.IOException: Cannot run program "/usr/bin/hadoop" (in directory "/Users/yoni/opower/workspace/intellij_project_root"): error=2, No such file or directory

The problem is that the Hive server is running in-memory, but relies upon a local installation of Hive in order to run. For my project to be self-contained, I need the Hive services to be embedded, including the HDFS and MapReduce clusters. I've tried starting up a Hive server using the same Spring method and pointing it at MiniDFSCluster and MiniMRCluster, similar to the pattern used in the Hive QTestUtil source and in HBaseTestUtility. However, I've not been able to get that to work.

After three days of trying to wrangle Hive integration testing, I thought I'd ask the community:

  1. How do you recommend I integration test Hive jobs?
  2. Do you have a working JUnit example for integration testing Hive jobs using in-memory HDFS, MR, and Hive instances?

Edit: I am fully aware that working against a Hadoop cluster - whether local or remote - makes it possible to run integration tests against a full-stack Hive instance. The problem, as stated, is that this is not a viable solution for effectively testing Hive workflows.

Source: (StackOverflow)

Difference between HBase and Hadoop/HDFS

This is kind of a naive question, but I am new to the NoSQL paradigm and don't know much about it. I would appreciate it if somebody could help me clearly understand the difference between HBase and Hadoop, or give me some pointers that might help me understand the difference.

So far I have done some research, and according to my understanding Hadoop provides a framework for working with raw chunks of data (files) in HDFS, while HBase is a database engine on top of Hadoop that works with structured data instead of raw data chunks. HBase provides a logical layer over HDFS, just as SQL does. Is that correct?

Please feel free to correct me.


Source: (StackOverflow)

Name node is in safe mode. Not able to leave

root# bin/hadoop fs -mkdir t
mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/root/t. Name node is in safe mode.

I am not able to create anything in HDFS.

I ran:

root# bin/hadoop fs -safemode leave

But it shows:

safemode: Unknown command

What is the problem?


Source: (StackOverflow)

How does Hive compare to HBase?

I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.

Source: (StackOverflow)

Hadoop on OSX "Unable to load realm info from SCDynamicStore"

I am getting this error on startup of Hadoop on OSX 10.7:

Unable to load realm info from SCDynamicStore
put: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/travis/input/conf. Name node is in safe mode.

It doesn't appear to be causing any issues with the functionality of Hadoop.

Source: (StackOverflow)

How does Hadoop process records split across block boundaries?

According to Hadoop: The Definitive Guide:

The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program—lines are not missed or broken, for example—but it’s worth knowing about, as it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant.

Suppose a record line is split across two blocks (b1 and b2). The mapper processing the first block (b1) will notice that the last line doesn't have an EOL separator and fetch the remainder of the line from the next block of data (b2).

How does the mapper processing the second block (b2) determine that the first record is incomplete and should process starting from the second record in the block (b2)?
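One convention that would make this work, sketched below in plain Python, is that every split except the first always throws away whatever precedes its first newline, while every split keeps reading whole lines as long as the line starts at or before the split's end offset. This is only my sketch of the idea under those two assumed rules, not Hadoop's actual reader code, and `read_split` and `read_all` are hypothetical names.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the lines owned by the split [start, end) of data."""
    pos = start
    if start != 0:
        # Not the first split: skip up to and including the first newline.
        # Whatever precedes it belongs to the previous split's reader.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    # Keep reading while a line *starts* at or before end, even if the
    # line itself finishes inside the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


def read_all(data: bytes, block_size: int):
    """Read every split in order, the way one mapper per block would."""
    out = []
    for start in range(0, len(data), block_size):
        end = min(start + block_size, len(data))
        out.extend(read_split(data, start, end))
    return out


DATA = b"alpha\nbravo\ncharlie\ndelta\n"
```

With 10-byte blocks, the reader for the first block picks up `bravo` even though it ends in the second block, and the reader for the second block begins at `charlie`. Under these rules the second mapper never has to detect that its first record is incomplete; it simply never owns the text before its first newline.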

Source: (StackOverflow)

How does the MapReduce sort algorithm work?

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.

To me sorting simply involves determining the relative position of an element in relationship to all other elements. So sorting involves comparing "everything" with "everything". Your average sorting algorithm (quick, bubble, ...) simply does this in a smart way.

In my mind splitting the dataset into many pieces means you can sort a single piece and then you still have to integrate these pieces into the 'complete' fully sorted dataset. Given the terabyte dataset distributed over thousands of systems I expect this to be a huge task.

So how is this really done? How does this MapReduce sorting algorithm work?
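As far as I can tell, the usual way around the "integrate the pieces" step is to partition by key range before sorting, so that the pieces come out already in order relative to each other. The sketch below simulates that locally in plain Python under that assumption: sample the keys, pick boundary keys from the sample, route every key to the bucket for its range, sort each bucket independently, and concatenate. The function names and the `num_reducers` and `sample_size` parameters are hypothetical, and this is not the actual TeraSort code.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sorted sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Route a key to the reducer whose key range contains it."""
    return bisect.bisect_right(split_points, key)


def distributed_sort(records, num_reducers=4, sample_size=100, seed=0):
    """Local simulation: range-partition, sort each bucket, concatenate."""
    if not records:
        return []
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in records:  # the "map" side decides where each key goes
        buckets[partition(key, split_points)].append(key)
    # Each "reducer" sorts only its own bucket, independently of the rest.
    sorted_buckets = [sorted(bucket) for bucket in buckets]
    # Bucket i holds only keys that are <= every key in bucket i + 1,
    # so the final step is concatenation, not a merge.
    return [key for bucket in sorted_buckets for key in bucket]
```

The sample only affects how evenly the buckets are filled; the output is fully sorted whichever boundary keys are chosen, because the partition function is monotonic in the key.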

Thanks for helping me understand.

Source: (StackOverflow)