How will you replace HDFS data volume before shutting down a DataNode?

In HDFS, the DataNode supports hot-swappable drives. With a swappable drive we can add or replace HDFS data volumes while the DataNode is still running.

The procedure for replacing a hot-swappable drive is as follows:

<li>First we format and mount the new drive.</li>

<li>We update the DataNode configuration (the dfs.datanode.data.dir property) to include the new data volume directory.</li>

<li>We run the "hdfs dfsadmin -reconfig datanode HOST:PORT start" command to start the reconfiguration process.</li>

<li>Once the reconfiguration is complete, we unmount the old data volume.</li>

<li>After unmounting, we can physically remove the old disk.</li>
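The steps above can be sketched as commands. The device, mount point, and DataNode address are illustrative; the dfsadmin syntax is the HDFS hot-swap reconfiguration command:

```shell
# 1. Format and mount the new drive (device and mount point are examples)
mkfs.ext4 /dev/sdd1
mount /dev/sdd1 /data/new-disk

# 2. Add the new directory to dfs.datanode.data.dir in hdfs-site.xml,
#    then ask the running DataNode to reload its data-volume config
#    (HOST:PORT is the DataNode's IPC address):
hdfs dfsadmin -reconfig datanode HOST:PORT start

# 3. Poll until the reconfiguration task reports completion:
hdfs dfsadmin -reconfig datanode HOST:PORT status

# 4. Remove the old directory from dfs.datanode.data.dir, rerun the
#    reconfig, then unmount the old volume and pull the disk:
umount /data/old-disk
```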

What are the important configuration files in Hadoop?

There are two important configuration files in a Hadoop cluster:

<li><strong>Default Configuration</strong>: There are core-default.xml, hdfs-default.xml and mapred-default.xml files in which the default configuration for a Hadoop cluster is specified. These are read-only files.</li>

<li><strong>Custom Configuration</strong>: We have site-specific custom files like core-site.xml, hdfs-site.xml and mapred-site.xml in which we can specify the site-specific configuration.</li>

All the jobs in Hadoop, and the HDFS implementation itself, use the parameters defined in the above-mentioned files. By customizing the site-specific files we can tune these processes according to our use case.
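For example, a site-specific override in core-site.xml might look like the following fragment (the host name and port are illustrative):

```xml
<!-- core-site.xml: site-specific overrides; values here take
     precedence over the read-only defaults in core-default.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```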

In the Hadoop API, there is a Configuration class that loads these files and provides their values at run time to different jobs.

How will you monitor memory used in a Hadoop cluster?

In Hadoop (MRv1), the TaskTracker is the daemon that consumes the most memory while performing tasks. We can configure the TaskTracker to monitor the memory usage of the tasks it creates. It can use this monitoring to find badly behaving tasks, so that these tasks do not bring the machine down with excess memory consumption.

With memory monitoring we can also limit the maximum memory used by a task, and even limit the total memory usage per node, so that all the tasks executing together on a node do not consume more memory than a set limit.

Some of the parameters for setting memory monitoring in Hadoop are as follows:

  • mapred.cluster.reduce.memory.mb: The size of virtual memory of a single reduce slot in the Map-Reduce cluster (the map counterpart is mapred.cluster.map.memory.mb).
  • mapred.job.reduce.memory.mb: The default limit of virtual memory set on each reduce task of a job.
  • mapred.cluster.max.reduce.memory.mb: The maximum limit of virtual memory that a job may request per reduce task.
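As a sketch, these limits would be set in mapred-site.xml; the values below are examples only, not recommendations:

```xml
<!-- mapred-site.xml: illustrative MRv1 memory limits (example values) -->
<configuration>
  <property>
    <name>mapred.cluster.reduce.memory.mb</name>
    <value>2048</value>  <!-- virtual memory of one reduce slot -->
  </property>
  <property>
    <name>mapred.job.reduce.memory.mb</name>
    <value>2048</value>  <!-- default per-reduce-task limit -->
  </property>
  <property>
    <name>mapred.cluster.max.reduce.memory.mb</name>
    <value>4096</value>  <!-- upper bound a job may request -->
  </property>
</configuration>
```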

Why do we need Serialization in Hadoop map reduce methods?

In Hadoop, there are multiple DataNodes that hold data. During the processing of map and reduce methods, data may be transferred from one node to another. Hadoop uses serialization to convert the data from its object structure to a binary format.

With serialization, data is converted to a binary format for transfer, and with de-serialization it is reliably converted back to its object form.
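Hadoop's Writable types follow a write(DataOutput)/readFields(DataInput) contract for this. A minimal standalone sketch of that round trip, using only the JDK (IntPair is a hypothetical example type, not a Hadoop class):

```java
import java.io.*;

// Minimal sketch of Hadoop-style serialization: an object writes its
// fields to a binary stream and reads them back, mirroring the
// Writable contract. "IntPair" is illustrative, not a Hadoop class.
public class IntPair {
    int first;
    int second;

    public IntPair() {}
    public IntPair(int first, int second) { this.first = first; this.second = second; }

    // Serialize: object structure -> binary format
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // De-serialize: binary format -> object structure
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        IntPair original = new IntPair(3, 7);

        // In-memory byte buffer stands in for the network transfer.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Reconstruct the object on the "other side".
        IntPair copy = new IntPair();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.first + "," + copy.second); // 3,7
    }
}
```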

What is the use of Distributed Cache in Hadoop?

Hadoop provides a utility called Distributed Cache to improve the performance of jobs by caching the files used by applications.

An application can specify which file it wants to cache by using JobConf configuration.

The Hadoop framework copies these files to the nodes on which a task is to be executed. This is done before the start of execution of the task.

DistributedCache supports distribution of simple read-only text files as well as archives like jars and zips.
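A sketch of registering cached files from the job driver, using the newer mapreduce API (requires the Hadoop client libraries on the classpath; the HDFS paths are illustrative — the older mapred API did the same via DistributedCache.addCacheFile(uri, jobConf)):

```java
// Sketch: registering read-only files with the Distributed Cache.
// Paths below are illustrative.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-example");

        // This file is copied to each task node before the task starts.
        job.addCacheFile(new URI("/apps/lookup/stopwords.txt"));

        // Archives (jars, zips) are registered separately and are
        // unpacked on the task node.
        job.addCacheArchive(new URI("/apps/lookup/dictionaries.zip"));
    }
}
```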

How will you synchronize the changes made to a file in Distributed Cache in Hadoop?

It is a trick question. In Distributed Cache, it is not allowed to make any changes to a file. This is a mechanism to cache read-only data across multiple nodes.

Therefore, it is not possible to update a cached file or run any synchronization in Distributed Cache.

What is Safemode in HDFS?

Safemode is the read-only mode of the NameNode in a cluster. During startup, the NameNode is in Safemode and does not allow writes to the file system. At this time, it collects block reports and statistics from all the DataNodes. Once it has enough information about the blocks, it leaves Safemode.

The main reason for Safemode is to avoid the situation where the NameNode starts replicating data to DataNodes before collecting all the information from them. It may erroneously assume that a block is under-replicated, whereas the real issue is that the NameNode does not yet know the whereabouts of all the replicas of that block. Therefore, in Safemode, the NameNode first collects information about how many replicas exist in the cluster, and only then creates replicas wherever the count is below the replication policy.
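Safemode can also be inspected and controlled manually with the dfsadmin tool (these commands assume HDFS admin access to a running cluster):

```shell
# Report whether the NameNode is currently in Safemode:
hdfs dfsadmin -safemode get

# Force the NameNode into / out of Safemode:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave

# Block until the NameNode has left Safemode on its own:
hdfs dfsadmin -safemode wait
```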

How will you create a custom Partitioner in a Hadoop job?

The partition phase runs between the map and reduce phases, and it is optional. We can create a custom partitioner by extending the org.apache.hadoop.mapreduce.Partitioner class in Hadoop. In this class, we have to override the getPartition(KEY key, VALUE value, int numPartitions) method.

This method takes three inputs. Here, numPartitions is the same as the number of reducers in our job. We pass the key and value to get the partition number to which this key/value record will be assigned. There is a reducer corresponding to each partition, and that reducer then handles summarizing the data for its partition.

Once the custom Partitioner class is ready, we set it on the Hadoop job using the Job.setPartitionerClass() method.
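A minimal sketch of a custom partitioner and its registration (assumes Hadoop's mapreduce client libraries; the class name is illustrative, and the hashing scheme simply mirrors the default HashPartitioner):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: routes each key to one of
// numPartitions reducers.
public class CustomPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit, then take modulo so the result is a valid
        // partition index in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the job driver:
//   job.setPartitionerClass(CustomPartitioner.class);
```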


What are the differences between RDBMS and HBase data model?

The main differences between RDBMS and HBase data model are as follows:

  • Schema: In RDBMS, tables have a fixed schema of columns and types. In HBase, there is no fixed schema: only column families are defined up front, and columns can vary from row to row.
  • Normalization: The RDBMS data model is normalized as per the rules of relational database normalization. The HBase data model does not require any normalization.
  • Partition: In RDBMS we can choose to do partitioning. In HBase, partitioning happens automatically.

What is a Checkpoint node in HDFS?

A Checkpoint node in HDFS periodically fetches the fsimage and edits files from the NameNode and merges them. The merged result is called a checkpoint. Once a checkpoint is created, the Checkpoint node uploads it to the NameNode.

The Secondary NameNode also creates checkpoints in a similar way; the Checkpoint node was introduced in later Hadoop versions as its more cleanly designed replacement.

The main benefit of a Checkpoint node shows up when the NameNode fails or restarts. A NameNode does not merge its edits into the fsimage automatically at runtime, so on a long-running cluster the edits file grows huge, and a restarting NameNode takes much longer because it must first replay those edits. The Checkpoint node performs the merge of edits into the fsimage periodically and uploads the result to the NameNode, which keeps the edits file small and saves time during a NameNode restart.
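Operationally, this looks roughly as follows (commands assume a configured HDFS installation on the checkpoint machine):

```shell
# Start a Checkpoint node:
hdfs namenode -checkpoint

# Checkpoint frequency is controlled in hdfs-site.xml by:
#   dfs.namenode.checkpoint.period  - seconds between checkpoints
#   dfs.namenode.checkpoint.txns    - number of uncheckpointed
#                                     transactions that force one
```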