Interview :: Hadoop
When a Hadoop job runs, the framework divides the input files into logical chunks and assigns each chunk to a mapper for processing. Each of these chunks is called an InputSplit.
In TextInputFormat, each line of the text file is a record. The key is the byte offset of the line within the file and the value is the content of the line itself, i.e. Key: LongWritable, Value: Text.
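As a minimal sketch (the class name and output types here are illustrative, not part of the original notes), a mapper fed by TextInputFormat receives a LongWritable byte offset and a Text line, and can also inspect the InputSplit it was assigned:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: with TextInputFormat the framework calls map() once per
// line, passing the line's byte offset as the key and the line text as the value.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Each mapper instance processes exactly one InputSplit.
        System.out.println("Processing split: " + context.getInputSplit());
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit the line and its length in bytes.
        context.write(line, new IntWritable(line.getLength()));
    }
}
```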
In Hadoop, SequenceFileInputFormat is the input format used to read sequence files. A sequence file is a binary key/value file format (optionally compressed) that is commonly used to pass data from the output of one MapReduce job to the input of another MapReduce job.
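A rough sketch of chaining two jobs through sequence files (the class name, intermediate path, and omitted mapper/reducer setup are assumptions for illustration): the first job writes its output with SequenceFileOutputFormat, and the second job reads it back with SequenceFileInputFormat.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobsSketch {
    public static void main(String[] args) throws IOException {
        // Stage 1 writes compact binary key/value pairs instead of plain text.
        Job first = Job.getInstance();
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(first, new Path("/tmp/intermediate"));

        // Stage 2 reads those pairs back directly, with no re-parsing of text.
        Job second = Job.getInstance();
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(second, new Path("/tmp/intermediate"));
        // Mapper/reducer classes and waitForCompletion(true) calls omitted.
    }
}
```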
With the default HDFS block size of 64MB, Hadoop makes 5 input splits for files of 64KB, 65MB and 127MB:
- One split for the 64KB file (smaller than one block),
- Two splits for the 65MB file (64MB + 1MB), and
- Two splits for the 127MB file (64MB + 63MB)
An InputSplit is assigned a chunk of work but does not know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
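To make the relationship concrete, here is a minimal sketch of a custom InputFormat (the class name is hypothetical) whose createRecordReader() hands back the stock LineRecordReader, which then turns each split into (byte offset, line) records:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: the InputFormat decides which RecordReader to use;
// the RecordReader does the actual reading of the split's bytes.
public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // LineRecordReader converts the raw split into (byte offset, line) pairs.
        return new LineRecordReader();
    }
}
```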
JobTracker is the service within Hadoop that schedules MapReduce jobs and farms their tasks out to nodes in the cluster.
WebDAV is a set of extensions to HTTP that supports editing and uploading files. On most operating systems, WebDAV shares can be mounted as filesystems, so by exposing HDFS over WebDAV it is possible to access HDFS as a standard filesystem.
Sqoop is a tool used to transfer data between a Relational Database Management System (RDBMS) and Hadoop HDFS. Using Sqoop, you can import data from an RDBMS such as MySQL or Oracle into HDFS, as well as export data from HDFS back to an RDBMS.
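As a hedged illustration of both directions (the connection string, credentials, table names, and paths below are placeholders, not values from these notes), typical Sqoop invocations look like this:

```
# Import a MySQL table into HDFS (placeholder host, database, and credentials).
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser --password dbpass \
  --table orders \
  --target-dir /user/hadoop/orders

# Export HDFS files back into an RDBMS table.
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser --password dbpass \
  --table orders_summary \
  --export-dir /user/hadoop/orders_summary
```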
These are the main tasks of the JobTracker:
- To accept jobs from the client (a minimal client-side submission sketch follows this list).
- To communicate with the NameNode to determine the location of the data.
- To locate TaskTracker nodes with available slots.
- To submit the work to the chosen TaskTracker nodes and monitor the progress of each task.
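Below is a minimal sketch, under assumed names and command-line paths, of the client side of that flow: the driver builds a Job and submits it to the cluster, where (on classic MRv1) the JobTracker schedules its tasks onto TaskTrackers.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "identity pass-through");

        // No mapper/reducer is set, so Hadoop's identity Mapper and Reducer run;
        // a real job would plug in its own classes here.
        job.setJarByClass(SubmitJobSketch.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input and output paths are taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job to the cluster and polls its
        // progress until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```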
TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from the JobTracker.