EzDevInfo.com

dfs interview questions

Top dfs frequently asked interview questions

Hadoop java.io.IOException: Mkdirs failed to create /some/path

When I try to run my job, I get the following exception:

Exception in thread "main" java.io.IOException: Mkdirs failed to create /some/path
    at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:106)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:150)

Where /some/path is hadoop.tmp.dir. However, when I issue the dfs -ls command on /some/path I can see that it exists and the dataset file is present (it was copied there before launching the job). The path is also correctly defined in the Hadoop configs. Any suggestions would be appreciated. I am using Hadoop 0.21.
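For context, RunJar.ensureDirectory works on the local filesystem with java.io.File (it prepares a local working directory for unpacking the job jar), so a dfs -ls check only confirms the path exists in HDFS, not on the local disk of the machine submitting the job. A rough sketch of that kind of local check, not the actual Hadoop source:

import java.io.File;
import java.io.IOException;

public class EnsureDirectorySketch {
    // Roughly what RunJar.ensureDirectory does before unpacking the job jar:
    // create the directory on the *local* disk and fail if it cannot.
    static void ensureDirectory(File dir) throws IOException {
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Mkdirs failed to create " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        // hadoop.tmp.dir needs to point at a local path writable by the user running the job
        ensureDirectory(new File("/some/path"));
    }
}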


Source: (StackOverflow)

Lustre, Gluster, or MogileFS for video storage, encoding and streaming? [closed]

So many options and so little time to test them all... I wonder whether anyone has experience with distributed file systems for video streaming and storage/encoding.

I have a lot of huge video files (50GB to 250GB) that I need to store somewhere, be able to encode to mp4, and stream from several Adobe FMS servers. The only way to handle all this is with a distributed file system, but the question is: which one?

My research so far tells me:

  • Lustre: a mature, proven solution used by a lot of big companies; reportedly works best with files >10GB; it is a kernel driver.
  • Gluster: newer and less mature; FUSE-based, which means it is easy to install but possibly slower due to FUSE overhead. Better at handling a large number of smaller files (~1GB).
  • MogileFS: seems to be intended only for small files (~MB); apparently uses HTTP for access; a FUSE binding may come in the future.

So far Lustre seems to be the winner, but I would like to hear real experiences with the particular application I have.

Hadoop, Red Hat GFS, Coda and Windows DFS also sound like options, so any experiences are welcome. If someone has benchmarks, please share.

After some real experience this is what I have learned:

  • Lustre:
    • Performance: Amazingly fast! I can confirm that Lustre can serve a lot of streams and that encoding speed is not affected by accessing files via Lustre.
    • POSIX compatibility: Very good! No need to modify applications to use Lustre.
    • Replication, load balancing and failover: Very bad! For replication, load balancing and failover we need to rely on other software such as virtual IPs and DRBD.
    • Installation: The worst! Impossible for mere mortals to install. It requires a very specific combination of kernel, Lustre patches and tweaks to get working, and current Lustre patches usually target old kernels that are incompatible with new hardware/software.
  • MogileFS:
    • Performance: Good for small files but not usable for medium to large files. This is mostly due to HTTP overhead, since all files are sent/received via HTTP requests that encode all data in base64, adding a 33% overhead to each file.
    • POSIX compatibility is non-existent. All applications need to be modified to use MogileFS, which renders it useless for streaming/encoding, since most streaming servers and encoding tools do not understand the MogileFS protocol.
    • Replication and failover come out of the box, and load balancing can be implemented in the application by accessing more than one tracker at a time.
    • Installation is relatively easy, and ready-to-use packages exist in most distributions. The only difficulty I found was setting up the database master-slave to eliminate the single point of failure.
  • Gluster:
    • Performance: Very bad for streaming. I cannot reach more than a few Mbps on a 10Gbps network. Client and server CPU usage skyrockets on heavy writes. For encoding it works, because the CPU is saturated before the network and I/O are.
    • POSIX: Almost compatible. The tools I use can access Gluster mounts like normal folders on disk, but in some edge cases things start causing problems. Check the Gluster mailing lists and you will see there are a lot of problems.
    • Replication, failover and load balancing: The best! If they actually worked. Gluster is very new and has a lot of bugs and performance problems.
    • Installation is very easy. The management command line is amazing, and setting up replicated, striped and distributed volumes among several servers could not be easier.

Final conclusion:

Unfortunately the conclusion is "No single silver bullet".

Currently we keep our media files on Gluster 3.2 in a replicated volume for storage and transcoding. As long as you don't have a lot of servers and you avoid geo-replication and striped volumes, things work OK.

When we are going to stream the media files, we copy them to a Lustre volume that is replicated to a second Lustre volume via DRBD. The Wowza server then reads the media files from the Lustre volumes.

And finally, we use MogileFS to serve the thumbnails on our web application servers.


Source: (StackOverflow)


How can I get an active UNC path in DFS programmatically

Given a DFS path, how would I know, programmatically, which underlying path it is currently active on?

For example, I have two server shares, "\\Server1\Folder\" and "\\Server2\Folder\", with DFS turned on, so they can be accessed as "\\DFS_Server\Folder\". How would I know which path "\\DFS_Server\Folder\" is currently active on, whether it is "\\Server1\Folder\" or "\\Server2\Folder\"?


Source: (StackOverflow)

Adjacency matrix neighbors

I have a matrix of 0s and 1s. I can start from any cell. I want to know the minimum number of steps (up, down, left, right) required to cover all possible 1s. I can start from a 0 or a 1.

Example:

0 1 0 
1 1 1
0 1 0

Starting from (2,2), within 1 step I can reach all the 1s. I relate this to an adjacency matrix of an unweighted, undirected graph. Essentially I need to find the farthest neighbour when I can start from ANY point. I could simply have used BFS/DFS and kept a counter if I could only start from the vertices; however, this poses a problem.
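For what it's worth, here is a brute-force sketch of that idea: run a BFS from every candidate start cell and keep the start whose farthest 1 is closest. It assumes moving across 0 cells is allowed; the names are illustrative only.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

public class GridEccentricity {
    static final int[][] DIRS = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};

    // BFS over the whole grid from (sr, sc); returns the largest number of steps
    // needed to reach any 1 from that start.
    static int farthestOne(int[][] grid, int sr, int sc) {
        int rows = grid.length, cols = grid[0].length;
        int[][] dist = new int[rows][cols];
        for (int[] row : dist) Arrays.fill(row, -1);
        Queue<int[]> queue = new ArrayDeque<>();
        dist[sr][sc] = 0;
        queue.add(new int[]{sr, sc});
        while (!queue.isEmpty()) {
            int[] cur = queue.poll();
            for (int[] d : DIRS) {
                int r = cur[0] + d[0], c = cur[1] + d[1];
                if (r >= 0 && r < rows && c >= 0 && c < cols && dist[r][c] == -1) {
                    dist[r][c] = dist[cur[0]][cur[1]] + 1;
                    queue.add(new int[]{r, c});
                }
            }
        }
        int worst = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                if (grid[r][c] == 1) worst = Math.max(worst, dist[r][c]);
        return worst;
    }

    public static void main(String[] args) {
        int[][] grid = {{0, 1, 0}, {1, 1, 1}, {0, 1, 0}};
        int best = Integer.MAX_VALUE;
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[0].length; c++)
                best = Math.min(best, farthestOne(grid, r, c));
        System.out.println(best); // prints 1: the centre cell reaches every 1 in one step
    }
}

Each BFS is O(rows*cols), so trying every start is O((rows*cols)^2), which is fine for small grids but obviously not optimal.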


Source: (StackOverflow)

Using Flex to play video from a Windows share

I need to play a video file from a Windows share inside a corporate network. A share is used because it replicates to other corporate sites, so every user can download the video from its local storage (we use DFS for this).

The video needs to be played on our web portal, so I want to use Flex for this task.

The question is: how do I open a Windows share from Flex?

If you can suggest another solution, that would also be great.

Thanks!


Source: (StackOverflow)

Which procedure can we use for maze exploration: BFS or DFS?

I know we can use DFS for maze exploration, but I think we can also use BFS. I'm a little bit confused here because most of the books and articles I've read use DFS for this problem. What I think is that the best-case time complexity of DFS will be better than that of BFS, but the average- and worst-case time complexities will be the same for both BFS and DFS, and that's why we prefer DFS over BFS. Am I right, or do I have a misconception?
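For comparison, both traversals visit each reachable cell once, so both are O(V + E) on a maze grid in the worst case; the structural difference is only the frontier order (stack vs. queue). A rough sketch under an assumed char-grid maze representation ('.' open, '#' wall):

import java.util.ArrayDeque;
import java.util.Deque;

public class MazeExploration {
    static final int[][] DIRS = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};

    // Explore every open cell ('.') reachable from (sr, sc).
    // useStack = true  -> DFS (LIFO frontier); useStack = false -> BFS (FIFO frontier).
    static int explore(char[][] maze, int sr, int sc, boolean useStack) {
        boolean[][] seen = new boolean[maze.length][maze[0].length];
        Deque<int[]> frontier = new ArrayDeque<>();
        frontier.push(new int[]{sr, sc});
        int cells = 0;
        while (!frontier.isEmpty()) {
            int[] cur = useStack ? frontier.pop() : frontier.pollLast();
            int r = cur[0], c = cur[1];
            if (seen[r][c]) continue;
            seen[r][c] = true;
            cells++;
            for (int[] d : DIRS) {
                int nr = r + d[0], nc = c + d[1];
                if (nr >= 0 && nr < maze.length && nc >= 0 && nc < maze[0].length
                        && maze[nr][nc] == '.' && !seen[nr][nc]) {
                    frontier.push(new int[]{nr, nc});
                }
            }
        }
        return cells; // identical count for both traversals
    }

    public static void main(String[] args) {
        char[][] maze = {
            "..#".toCharArray(),
            ".#.".toCharArray(),
            "...".toCharArray()
        };
        System.out.println(explore(maze, 0, 0, true));  // DFS visits 7 cells
        System.out.println(explore(maze, 0, 0, false)); // BFS visits the same 7 cells
    }
}

In an unweighted maze BFS additionally yields shortest paths, which is usually the deciding factor when a shortest route (rather than mere exploration) is needed.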


Source: (StackOverflow)

hadoop/hdfs/name is in an inconsistent state: storage directory (hadoop/hdfs/data/) does not exist or is not accessible

I have tried all the different solutions provided on Stack Overflow on this topic, but none of them helped, so I am asking again with the specific log and details.

Any help is appreciated

I have one master node and 5 slave nodes in my Hadoop cluster. The ubuntu user and ubuntu group own the ~/hadoop folder. Both the ~/hadoop/hdfs/data and ~/hadoop/hdfs/name folders exist,

and the permissions for both folders are set to 755.

I successfully formatted the namenode before running the start-all.sh script.

THE SCRIPT FAILS TO LAUNCH THE "NAMENODE"

These are the processes running on the master node and on one of the slave nodes:

ubuntu@master:~/hadoop/bin$ jps

7067 TaskTracker
6914 JobTracker
7237 Jps
6834 SecondaryNameNode
6682 DataNode

ubuntu@slave5:~/hadoop/bin$ jps

31438 TaskTracker
31581 Jps
31307 DataNode

Below is the log from the namenode log file.

..........
..........
.........

2014-12-03 12:25:45,460 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered.
2014-12-03 12:25:45,461 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered.
2014-12-03 12:25:45,532 INFO org.apache.hadoop.hdfs.util.GSet: Computing capacity for map BlocksMap
2014-12-03 12:25:45,532 INFO org.apache.hadoop.hdfs.util.GSet: VM type       = 64-bit
2014-12-03 12:25:45,532 INFO org.apache.hadoop.hdfs.util.GSet: 2.0% max memory = 1013645312
2014-12-03 12:25:45,532 INFO org.apache.hadoop.hdfs.util.GSet: capacity      = 2^21 = 2097152 entries
2014-12-03 12:25:45,532 INFO org.apache.hadoop.hdfs.util.GSet: recommended=2097152, actual=2097152
2014-12-03 12:25:45,588 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=ubuntu
2014-12-03 12:25:45,588 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2014-12-03 12:25:45,588 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2014-12-03 12:25:45,622 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2014-12-03 12:25:45,623 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2014-12-03 12:25:45,716 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
2014-12-03 12:25:45,777 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
2014-12-03 12:25:45,777 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times 
2014-12-03 12:25:45,785 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /home/ubuntu/hadoop/file:/home/ubuntu/hadoop/hdfs/name does not exist
2014-12-03 12:25:45,787 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/ubuntu/hadoop/file:/home/ubuntu/hadoop/hdfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:304)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:104)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:427)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:395)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:299)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:569)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1479)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1488)
2014-12-03 12:25:45,801 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/ubuntu/hadoop/file:/home/ubuntu/hadoop/hdfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:304)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:104)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:427)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:395)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:299)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:569)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1479)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1488)

Source: (StackOverflow)

How do I setup WebDeploy on Windows 2003 / IIS6?

WebDeploy is an alternative to WebDAV, FTP, and FrontPage extensions. It also acts as an alternative to DFS for replicating websites. I found instructions for configuring Windows 2008, but I'm unclear how to set up 2003, especially when multiple sites/IP addresses are present.


Source: (StackOverflow)

JAVA Graph/DFS implementation

I have a small dilemma I would like to be advised on:

I'm implementing a (directed) graph and I want to make it extra generic, that is, Graph<T> where T is the data in the node (vertex). Adding a vertex to the graph will be add(T t). The graph will wrap T in a vertex that holds T inside.

Next I would like to run DFS on the graph. Now here comes my dilemma: should I keep the "visited" mark in the vertex (as a member) or create some map while running the DFS (a map of vertex -> status)?

Keeping it in the vertex is less generic (the vertex shouldn't be familiar with the DFS algorithm and implementation), but creating a map (vertex -> status) is very space-consuming.
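For reference, a minimal sketch of the second option, keeping the traversal bookkeeping outside the vertex (a Set of vertices is usually enough instead of a full vertex -> status map; all names below are hypothetical, not from your code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A generic directed graph; the vertices know nothing about DFS.
class Graph<T> {
    private final Map<T, List<T>> adjacency = new HashMap<>();

    public void add(T t) {
        adjacency.putIfAbsent(t, new ArrayList<>());
    }

    public void addEdge(T from, T to) {
        add(from);
        add(to);
        adjacency.get(from).add(to);
    }

    // The "visited" state lives only for the duration of one traversal.
    public List<T> dfs(T start) {
        List<T> order = new ArrayList<>();
        visit(start, new HashSet<>(), order);
        return order;
    }

    private void visit(T current, Set<T> visited, List<T> order) {
        if (!visited.add(current)) {
            return; // already seen in this traversal
        }
        order.add(current);
        for (T next : adjacency.getOrDefault(current, List.of())) {
            visit(next, visited, order);
        }
    }
}

The extra memory is O(V) per traversal and is released when the DFS returns, which is usually an acceptable trade-off for keeping algorithm state out of the vertex class.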

What do you think?

Thanks a lot!


Source: (StackOverflow)

Unable to connect to Azure Blob storage from local Hadoop

While trying to connect local Hadoop (version 2.7.1) to Azure Blob storage (i.e. using the blob storage as HDFS), it throws an exception.

Here I have successfully formed the local cluster by setting the property

<property>
    <name>fs.default.name</name>
    <value>wasb://account@storage.blob.core.windows.net</value>
</property>

and I added the corresponding key value for the blob storage in core-site.xml.

While listing files or performing other HDFS operations against the blob storage, I get the following exception:

 ls: No FileSystem for scheme: wasb

Can anyone please guide me in resolving the above issue?
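For what it's worth, "No FileSystem for scheme: wasb" usually indicates that the hadoop-azure module (and its azure-storage dependency) is not on Hadoop's classpath, so the wasb scheme never gets registered. A minimal probe is sketched below; the property names and class name are assumptions from memory and should be verified against your Hadoop 2.7.1 distribution.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Storage account key; replace the account name and key with your own values.
        conf.set("fs.azure.account.key.storage.blob.core.windows.net", "<access-key>");
        // Usually unnecessary once hadoop-azure is on the classpath, but can be set explicitly:
        // conf.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");

        FileSystem fs = FileSystem.get(
                new URI("wasb://account@storage.blob.core.windows.net"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}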


Source: (StackOverflow)

Understanding Skiena's algorithm to detect cycles in a graph

I'm having trouble understanding everything in Skiena's algorithm to detect cycles in a graph, whether directed or undirected.

void process_edge(int v, int y) {
    // if parent[v] == y it means we're visiting the same edge since this is undirected?
    // y must be discovered otherwise this is the first I'm seeing it?
    if (discovered[y] && parent[v] != y) {
        printf("\nfound a back edge (%d %d) \n", v, y);
    }
}

void DFSRecursive(graph *g, int v) {
    discovered[v] = true;

    edge *e = g->edges[v];
    while (e != NULL) {
        int y = e->y;
        if (discovered[y] == false) {
            parent[y] = v;
            DFSRecursive(g, y);
        } else if (!processed[y] || g->directed) { // can't be processed and yet has an unvisited kid!?
            process_edge(v, y);
        }
        e = e->next;
    }

    processed[v] = true;
}
  1. Why are we checking !processed[y]? If we see a neighbor y and it has already been discovered (the first if condition), how could y already have been processed, given that v is a neighbor and we only just discovered it?
  2. I was confused by the check parent[v] != y, but I guess it makes sense in the undirected case: if we have a graph with just two nodes, both nodes have each other in their adjacency lists, so this is not a cycle. I'm not clear, though, on why it makes sense in the directed case, because 1->2 and 2->1 is considered a cycle, right?
  3. I don't have a problem with the third condition, discovered[y], in the process_edge method, because if y were undiscovered, it would mean this is the first time we are seeing it.

Source: (StackOverflow)

Hadoop File Splits : CompositeInputFormat : Inner Join

I am using CompositeInputFormat to provide input to a hadoop job.

The number of splits generated is the total number of files given as input to CompositeInputFormat (for joining).

The job completely ignores the block size and max split size (while taking input from CompositeInputFormat). This results in long-running map tasks and makes the system slow, since the input files are larger than the block size.

Is anyone aware of a way to manage the number of splits for CompositeInputFormat?


Source: (StackOverflow)

Hadoop commands

I have Hadoop installed at this location:

/usr/local/hadoop$

Now I want to list the files in DFS. The command I used is:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls

This gave me the files in DFS:

Found 3 items
drwxr-xr-x   - hduser supergroup          0 2014-03-20 03:53 /user/hduser/gutenberg
drwxr-xr-x   - hduser supergroup          0 2014-03-24 22:34 /user/hduser/mytext-output
-rw-r--r--   1 hduser supergroup        126 2014-03-24 22:30 /user/hduser/text.txt

The next time, I tried the same command in a different way:

hduser@ubuntu:/usr/local/hadoop$ hadoop dfs -ls

It also gave me the same result.

Could someone please explain why both work despite the ls command being executed from different folders? I hope you understand my question. Just explain the difference between these two:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
hduser@ubuntu:/usr/local/hadoop$ hadoop dfs -ls

Source: (StackOverflow)

Useful C# class/methods for DFS and BFS program

I have an XML file of nodes with their connections, something like:

<Graph>
  <Node Name = "A">
    <ConnectsTo>B</ConnectsTo>
    <ConnectsTo>H</ConnectsTo>
  </Node>
  <Node Name="B"></Node>
  <Node Name="C">
    <ConnectsTo>E</ConnectsTo>
  </Node>
  <Node Name="D">
    <ConnectsTo>C</ConnectsTo>
  </Node>
  <Node Name="E"></Node>
  <Node Name="F">
    <ConnectsTo>D</ConnectsTo>
    <ConnectsTo>G</ConnectsTo>
  </Node>
  <Node Name="G">
    <ConnectsTo>E</ConnectsTo>
    <ConnectsTo>I</ConnectsTo>
  </Node>
  <Node name="H">
    <ConnectsTo>C</ConnectsTo>
    <ConnectsTo>J</ConnectsTo>
    <ConnectsTo>G</ConnectsTo>
  </Node>
  <Node name="I">
    <ConnectsTo>E</ConnectsTo>
  </Node>
  <Node name="J">
    <ConnectsTo>A</ConnectsTo>
  </Node>
</Graph>

Now I will map those nodes using either BFS or DFS and print how the nodes are being mapped/retrieved.

Sample Prompt :

Choose (1)DFS (2)BFS : 1
Choose Starting Vertex : A

Result : 

A B
A H J
A H C E
A H G E
A H G I E

Am I on the right track in first rearranging the nodes into a hierarchy? What classes would be useful for this (the rearranging and further processing)? A subclass of Graph? LinkedList?
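As a rough, language-agnostic sketch of the usual shape (shown in Java here; C#'s Dictionary, Queue<T>, Stack<T> and HashSet<T> are direct counterparts of the Map, Deque and Set used below), the graph becomes an adjacency list keyed by node name, and a visited set drives either traversal. This only covers the traversal skeleton, not the path printing in your sample output.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GraphTraversal {
    // Adjacency list built from the <Node>/<ConnectsTo> entries.
    private final Map<String, List<String>> edges = new HashMap<>();

    void connect(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // depthFirst = true pops from the head of the deque (stack/DFS);
    // depthFirst = false polls from the tail (queue/BFS).
    List<String> traverse(String start, boolean depthFirst) {
        List<String> visitOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.push(start);
        while (!frontier.isEmpty()) {
            String node = depthFirst ? frontier.pop() : frontier.pollLast();
            if (!seen.add(node)) {
                continue; // already visited via another path
            }
            visitOrder.add(node);
            for (String next : edges.getOrDefault(node, List.of())) {
                frontier.push(next);
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        GraphTraversal g = new GraphTraversal();
        g.connect("A", "B"); g.connect("A", "H");
        g.connect("C", "E"); g.connect("D", "C");
        g.connect("F", "D"); g.connect("F", "G");
        g.connect("G", "E"); g.connect("G", "I");
        g.connect("H", "C"); g.connect("H", "J"); g.connect("H", "G");
        g.connect("I", "E"); g.connect("J", "A");
        System.out.println(g.traverse("A", true));  // DFS order from A
        System.out.println(g.traverse("A", false)); // BFS order from A
    }
}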


Source: (StackOverflow)

Number of crucial nodes in an Undirected graph

I have a graph G containing N nodes and E edges. Each edge is undirected. The goal is to find the number of crucial nodes.

A node that makes the graph disconnected when it is removed is called a crucial node. The goal is to find the number of such nodes in the graph.

A solution would be:

For each node in the graph, remove it, pick a node from the remaining graph, and perform DFS; if we can reach every remaining node, the removed node is not a crucial node.

This solution is O(N*E), or O(N^3) in the worst case.
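For concreteness, here is a sketch of exactly that brute-force approach (remove one node at a time, DFS over the rest, and check reachability), assuming the original graph is connected; names are illustrative.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class CrucialNodes {
    // adj.get(v) lists the neighbours of v; the graph is undirected and assumed connected.
    static int countCrucialNodes(List<List<Integer>> adj) {
        int n = adj.size();
        int crucial = 0;
        for (int removed = 0; removed < n; removed++) {
            if (!stillConnected(adj, removed)) {
                crucial++;
            }
        }
        return crucial;
    }

    // Iterative DFS over all nodes except 'removed'; false if some node becomes unreachable.
    static boolean stillConnected(List<List<Integer>> adj, int removed) {
        int n = adj.size();
        if (n <= 2) {
            return true; // removing a node from a 1- or 2-node graph never disconnects the rest
        }
        boolean[] seen = new boolean[n];
        int start = (removed == 0) ? 1 : 0;
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        seen[start] = true;
        int visited = 1;
        while (!stack.isEmpty()) {
            int v = stack.pop();
            for (int w : adj.get(v)) {
                if (w != removed && !seen[w]) {
                    seen[w] = true;
                    visited++;
                    stack.push(w);
                }
            }
        }
        return visited == n - 1; // everything except the removed node was reached
    }
}

The linear-time alternative for this exact problem is the classic articulation-point (cut vertex) DFS that tracks discovery times and low-link values, which runs in O(N + E).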

Is there an O(N^2) or O(E) solution? N^3 is a bit too slow.


Source: (StackOverflow)