In a Big Data system, the size of data is huge. So it does not make sense to move data across the network. In such a scenario, Hadoop tries to move computation closer to data.
So the Data remains local to the location wherever it was stored. But the computation tasks will be moved to data nodes that hold the data locally.
Hadoop follows following rules for Data Locality optimization:
- Hadoop first tries to schedule the task on node that has an HDFS file on a local disk.
- If it cannot be done, then Hadoop will try to schedule the task on a node on the same rack as the node that has data.
- If this also cannot be done, Hadoop will schedule the task on the node with same data on a different rack.
The above method works well, when we work with the default replication factor of 3 in Hadoop.