Interview :: Hadoop
The distributed cache is a facility provided by the MapReduce framework for caching files (text files, archives, etc.) that a job needs at execution time. The framework copies the necessary files to a slave node before any task of the job runs on that node.
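As a rough illustration from the task's side, here is a minimal sketch. It assumes a Hadoop Streaming job whose side file was shipped with the -files generic option, so that the file is already present in the task's working directory when the task starts. The file name stopwords.txt and both helper names are hypothetical.

```python
def load_lookup(path="stopwords.txt"):
    """Read a side file that the framework has already copied to this node.

    The default file name is an assumption made for this sketch.
    """
    with open(path, encoding="utf-8") as handle:
        # One entry per line; blank lines are ignored.
        return {line.strip() for line in handle if line.strip()}


def filter_words(lines, stopwords):
    """Yield every word from the input lines that is not in the cached set."""
    for line in lines:
        for word in line.split():
            if word.lower() not in stopwords:
                yield word
```

The point of the sketch is that the task reads a local copy: no task has to fetch the file from HDFS itself.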
hadoop job -list
hadoop job -kill jobID
The JobTracker is the service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on a Hadoop cluster, and it runs in its own JVM process.
Functions of the JobTracker in Hadoop:
- When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
- It locates TaskTracker nodes with available slots at or near the data.
- It assigns the work to the chosen TaskTracker nodes.
- The TaskTracker nodes notify the JobTracker when a task fails, and the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark the task as one to avoid.
Each TaskTracker periodically sends heartbeat messages to the JobTracker to confirm that it is alive. These messages also report the number of available slots, so the JobTracker knows where it can schedule tasks.
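The heartbeat and slot bookkeeping described above can be modelled with a toy sketch. This is not Hadoop code: the class name, method names, and the timeout parameter are all hypothetical, and the logic only illustrates the idea of preferring a live, data-local TaskTracker that has a free slot.

```python
class JobTrackerSketch:
    """Toy model: remembers the latest heartbeat from each TaskTracker."""

    def __init__(self, timeout_seconds):
        # A tracker that has been silent for longer than this is treated as dead.
        self.timeout_seconds = timeout_seconds
        self.trackers = {}  # name -> (last_seen, free_slots)

    def heartbeat(self, name, free_slots, now):
        """Record that a tracker is alive and how many slots it has free."""
        self.trackers[name] = (now, free_slots)

    def pick_tracker(self, now, preferred=()):
        """Return a live tracker with a free slot, preferring data-local ones."""
        live = [
            name
            for name, (seen, free) in self.trackers.items()
            if now - seen <= self.timeout_seconds and free > 0
        ]
        for name in preferred:  # nodes that hold the input data
            if name in live:
                return name
        return live[0] if live else None

    def assign(self, name):
        """Consume one slot on the chosen tracker until its next heartbeat."""
        seen, free = self.trackers[name]
        self.trackers[name] = (seen, free - 1)
```

For example, after heartbeats from three trackers, pick_tracker(now, preferred=[...]) returns a preferred node only if it is both alive and has a free slot; otherwise it falls back to any other live tracker.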
No. There are several ways to run non-Java code. Hadoop Streaming allows any shell command to be used as a map or reduce function.
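As an illustration of the streaming approach, here is a minimal word-count sketch in Python. It assumes the usual streaming convention of tab-separated key/value lines on standard input and output, and that the reducer receives its input sorted by key. The function names are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of each run of identical keys.

    Relies on the input being sorted by key, as it is after the shuffle.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(role, stream_in, stream_out):
    """Drive one side of the job; role is 'map' or 'reduce'."""
    step = map_lines if role == "map" else reduce_lines
    for out in step(stream_in):
        stream_out.write(out + "\n")
```

A real script would end by calling run(sys.argv[1], sys.stdin, sys.stdout) and would be passed to the streaming jar through its -mapper and -reducer options.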
HBase is the data storage component used by Hadoop.
To write a custom partitioner in Hadoop, follow these steps:
- Create a new class that extends the Partitioner class.
- Override the getPartition() method in that class.
- Add the custom partitioner to the job with the setPartitionerClass() method, or set it in the job's configuration.
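The contract behind getPartition() can be sketched outside Hadoop. A real partitioner is a Java class extending Partitioner; the Python below only models the logic, which is: given a key, a value, and the number of partitions, return an index from 0 up to (but not including) that number, and always return the same index for the same key. Both function names and the first-letter rule are hypothetical.

```python
import zlib


def hash_partition(key, value, num_partitions):
    """Spread keys by a stable hash of the key (the value is ignored)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def first_letter_partition(key, value, num_partitions):
    """Custom rule: keys starting with a-m go to partition 0.

    All other keys are hashed over the remaining partitions.
    """
    if num_partitions == 1:
        return 0
    first = key[:1]
    if first.isalpha() and first.lower() <= "m":
        return 0
    return 1 + hash_partition(key, value, num_partitions - 1)
```

A rule like this is useful when related keys must reach the same reducer. The price is possible skew, since partition 0 may receive far more data than the others.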