Apache Hadoop architecture in HDInsight

Apache Hadoop comprises two core components: Apache HDFS (Hadoop Distributed File System) for storage and Apache YARN (Yet Another Resource Negotiator) for processing. With storage and processing capabilities in place, a cluster can run MapReduce programs to carry out the desired data processing.

Note

HDFS is not typically deployed within the HDInsight cluster to provide storage. Instead, Hadoop components use an HDFS-compatible interface layer, and the actual storage is provided by either Azure Storage or Azure Data Lake Storage. For Hadoop, MapReduce jobs running on the HDInsight cluster behave as if HDFS were present, so no changes are required to support their storage needs. In HDInsight, Hadoop's storage is swapped out, but YARN processing remains a core component. For more information, see Introduction to Azure HDInsight.
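
As a minimal sketch of this transparency, the usual HDFS shell commands keep working against the cluster's default Azure-backed storage. The sketch assumes an SSH session on a cluster head node with the hdfs command-line tool on the path; the storage account and container names are placeholders.

```python
# Minimal sketch: run from an SSH session on an HDInsight head node.
# Assumes the hdfs CLI is on PATH; the container and account names below
# are placeholders for the cluster's default storage.
import subprocess

# The same HDFS shell commands work against the HDFS-compatible layer over
# Azure Storage / Data Lake Storage, so MapReduce inputs and outputs are
# addressed exactly as they would be with a local HDFS.
for path in ["/", "/example/data",
             "wasbs://mycontainer@myaccount.blob.core.windows.net/"]:
    print(f"### hdfs dfs -ls {path}")
    subprocess.run(["hdfs", "dfs", "-ls", path], check=False)
```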

This article introduces YARN and explains how it coordinates the execution of applications in HDInsight.

Apache Hadoop YARN Basics

YARN controls and orchestrates data processing in Hadoop. YARN has two core services that run as processes on nodes in the cluster:

  • ResourceManager
  • NodeManager

The ResourceManager service grants applications (such as MapReduce jobs) access to the cluster's compute resources. The resources are provided in the form of containers, with each container consisting of an allocation of CPU cores and memory. If you took all the resources available in a cluster and distributed the cores and memory into blocks, each block of resources would represent a container. Each node in the cluster has capacity for a certain number of containers, so the cluster as a whole has a fixed limit on the number of available containers. The allocation of resources within a container is configurable.
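
As a rough illustration of this resource pool, the following sketch reads the cluster's total and available memory and cores from the ResourceManager REST API. The cluster name and credentials are placeholders, and the /yarnui/... gateway path is an assumption about how the HDInsight gateway proxies the ResourceManager web services.

```python
# Minimal sketch: read total and available memory/vCores from the
# ResourceManager REST API. CLUSTERNAME and PASSWORD are placeholders.
import requests

CLUSTER = "CLUSTERNAME"            # placeholder HDInsight cluster name
AUTH = ("admin", "PASSWORD")       # cluster login (HTTP) credentials

url = f"https://{CLUSTER}.azurehdinsight.net/yarnui/ws/v1/cluster/metrics"
metrics = requests.get(url, auth=AUTH).json()["clusterMetrics"]

# Containers are carved out of this pool of memory and cores.
print("Total memory (MB):    ", metrics["totalMB"])
print("Available memory (MB):", metrics["availableMB"])
print("Total vCores:         ", metrics["totalVirtualCores"])
print("Available vCores:     ", metrics["availableVirtualCores"])
print("Containers allocated: ", metrics["containersAllocated"])
```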

When a MapReduce application runs in a cluster, the ResourceManager service provides the application with the containers to run in. The ResourceManager service monitors the status of running applications and the available cluster capacity, and it tracks applications as they complete and release their resources.

In addition, the ResourceManager service runs a web server process that provides a web user interface for monitoring the status of applications.
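
The same information that the web user interface shows can also be read programmatically. The following sketch lists applications and their states through the ResourceManager REST API; the cluster name and credentials are placeholders, and the gateway path is again an assumption.

```python
# Minimal sketch: list applications and their states via the ResourceManager
# REST API (the same data the web UI displays).
import requests

CLUSTER = "CLUSTERNAME"            # placeholder cluster name
AUTH = ("admin", "PASSWORD")       # placeholder credentials

url = f"https://{CLUSTER}.azurehdinsight.net/yarnui/ws/v1/cluster/apps"
apps = requests.get(url, auth=AUTH).json()["apps"]

# When no applications exist, the API returns "apps": null.
for app in (apps or {}).get("app", []):
    print(f'{app["id"]}  {app["name"]}  state={app["state"]}  '
          f'final={app["finalStatus"]}  progress={app["progress"]:.0f}%')
```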

When a user submits a MapReduce application to run in the cluster, it is submitted to the ResourceManager service. The ResourceManager service in turn assigns a container on an available NodeManager node; the actual application execution takes place on the NodeManager nodes. The first assigned container runs a special application called the ApplicationMaster. The ApplicationMaster application is responsible for acquiring the resources, in the form of additional containers, that are required to run the submitted application. The ApplicationMaster application examines the phases of the application (for example, the map and reduce phases) and takes into account how much data needs to be processed. The ApplicationMaster application then requests (negotiates) the necessary resources from the ResourceManager service on behalf of the application. The ResourceManager service, in turn, grants the ApplicationMaster application access to resources on NodeManager instances in the cluster, which the application uses to run.
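
To see where the ApplicationMaster itself landed, you can inspect an application's attempts, which report the container and NodeManager node hosting the ApplicationMaster. The application ID, cluster name, and credentials below are placeholders, and the gateway path is an assumption.

```python
# Minimal sketch: show the attempt that hosts the ApplicationMaster for a
# given application, i.e. the first container assigned to it.
import requests

CLUSTER = "CLUSTERNAME"                     # placeholder cluster name
AUTH = ("admin", "PASSWORD")                # placeholder credentials
APP_ID = "application_1234567890123_0001"   # placeholder application ID

url = (f"https://{CLUSTER}.azurehdinsight.net/yarnui/ws/v1/cluster/"
       f"apps/{APP_ID}/appattempts")
attempts = requests.get(url, auth=AUTH).json()["appAttempts"]["appAttempt"]

for attempt in attempts:
    # containerId identifies the container running the ApplicationMaster;
    # nodeId is the NodeManager node that hosts it.
    print(attempt["id"], attempt["containerId"], attempt["nodeId"])
```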

The NodeManager instances perform the tasks that make up the application and report their progress and status back to the ApplicationMaster application. The ApplicationMaster application in turn reports the status of the application to the ResourceManager service. The ResourceManager service returns the results to the client.
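
A client can follow this status flow by polling the application's state and progress until it reaches a terminal state. The following is a minimal sketch with placeholder cluster name, credentials, and application ID, again assuming the gateway proxy path.

```python
# Minimal sketch: poll one application's state and progress until it finishes,
# mirroring the status flow NodeManager -> ApplicationMaster -> ResourceManager -> client.
import time
import requests

CLUSTER = "CLUSTERNAME"                     # placeholder cluster name
AUTH = ("admin", "PASSWORD")                # placeholder credentials
APP_ID = "application_1234567890123_0001"   # placeholder application ID

url = f"https://{CLUSTER}.azurehdinsight.net/yarnui/ws/v1/cluster/apps/{APP_ID}"
while True:
    app = requests.get(url, auth=AUTH).json()["app"]
    print(f'state={app["state"]}  progress={app["progress"]:.0f}%')
    if app["state"] in ("FINISHED", "FAILED", "KILLED"):
        print("final status:", app["finalStatus"])
        break
    time.sleep(10)
```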

YARN in HDInsight

YARN is provided by all HDInsight cluster types. For high availability, the ResourceManager service runs as a primary and a secondary instance on the first and second head nodes in the cluster, respectively. Only a single instance of the ResourceManager service is active at a time. The NodeManager instances run on the available worker nodes in the cluster.
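
One way to see which instance is currently active is to ask each head node's ResourceManager for its high-availability state. The sketch below assumes it runs from inside the cluster, uses placeholder head-node host names, and relies on the default ResourceManager web port 8088; a standby ResourceManager typically redirects web requests to the active one, so redirects are disabled here.

```python
# Minimal sketch: query each head node's ResourceManager for its HA state.
import requests

HEAD_NODES = ["hn0-cluster.internal.cloudapp.net",   # placeholder host names
              "hn1-cluster.internal.cloudapp.net"]

for host in HEAD_NODES:
    url = f"http://{host}:8088/ws/v1/cluster/info"
    resp = requests.get(url, allow_redirects=False, timeout=10)
    if resp.status_code == 200:
        print(host, "->", resp.json()["clusterInfo"]["haState"])
    else:
        # A redirect here usually means this instance is on standby and is
        # forwarding requests to the active ResourceManager.
        print(host, "-> standby (redirected to active)")
```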

Soft delete

For information on how to restore a file from your storage account, see:

  • Azure Storage
  • Azure Data Lake Storage Gen1: Restore-AzDataLakeStoreDeletedItem
  • Azure Data Lake Storage Gen2

Emptying the trash

The trash-related property under HDFS > Advanced core-site (core-site (advanced)) should be left at its default value, because no data should be stored on the local file system. This value has no effect on remote storage accounts (WASB, ADLS Gen1, ABFS).

Next Steps