hadoop

Hadoop is a framework for distributed storage and computing.

  • distributed storage: Hadoop uses HDFS to store large amounts of data across a cluster.
  • distributed computing: Hadoop uses the MapReduce framework to run fast, parallel data analysis (querying & writing) over the data in HDFS.
  • resource management & job scheduling: Hadoop uses YARN to manage and allocate cluster resources (memory, CPU, etc.) and to schedule and monitor job execution.
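
As a quick orientation, here is a minimal sketch of all three parts on a single-node setup: HDFS stores the files, and a bundled MapReduce example runs over them as a YARN job. It assumes a running cluster with $HADOOP_HOME set; the paths are illustrative.

```bash
# storage: create a directory in HDFS and upload some files
$ hdfs dfs -mkdir -p /user/$(whoami)/input
$ hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/$(whoami)/input
# computing: run the bundled wordcount example as a YARN job
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/$(whoami)/input /user/$(whoami)/output
# read the result
$ hdfs dfs -cat /user/$(whoami)/output/part-r-00000
```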

Architecture

cluster architecture


request processing


Fault Tolerance

Use rack awareness so that replicas are placed on different racks; a whole-rack failure then cannot destroy every copy of a block.

Each datanode sends heartbeats and block reports to the namenode, so when a datanode fails, the namenode detects the missing heartbeats and re-replicates the affected blocks back to the target replication factor (3 by default).
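
Replication can be observed and tuned from the command line; a small sketch (the file path is illustrative):

```bash
# list each block of a file together with its replica locations
$ hdfs fsck /data/file.txt -files -blocks -locations
# change the replication factor of a path; -w waits until it is satisfied
$ hdfs dfs -setrep -w 3 /data/file.txt
```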


High Availability

High availability is about the percentage of uptime of the system. Fault tolerance, on the other hand, is mainly about tolerating data loss and unrecoverable damage. For example, from the fault-tolerance perspective a namenode failure can be handled by a reboot, while from the high-availability perspective there must be a working replacement quickly.

Name Node Failure

When a namenode fails, we want to switch quickly to a standby namenode that holds all the same information. How?

A namenode keeps the file-system namespace (the fsImage) in memory; in addition, it writes an edit-log entry to disk for every change. A namenode failure loses the in-memory fsImage, but we can reproduce the fsImage by replaying the edit logs.
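
Both artifacts can be inspected offline with tools that ship with Hadoop; a sketch, assuming files in the namenode's metadata directory (the exact file names vary by transaction id):

```bash
# dump an fsImage to XML with the Offline Image Viewer
$ hdfs oiv -i fsimage_0000000000000000042 -o fsimage.xml -p XML
# dump an edit-log segment to XML with the Offline Edits Viewer
$ hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml
```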


A common solution is to store the edit logs in a Quorum Journal Manager (QJM), from which the standby namenode reads to keep its own fsImage up to date. In addition, a failover controller runs next to each namenode and coordinates through ZooKeeper: ZooKeeper holds a lock that both namenodes request. When the active namenode fails, it loses the lock, and the standby namenode acquires it and takes over.
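
Once HA is configured, the failover can be checked and even triggered by hand; a sketch, where nn1 and nn2 are namenode IDs assumed to be defined in hdfs-site.xml:

```bash
# check which namenode is active and which is standby
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2
# manually fail over from nn1 to nn2 (normally the failover controller does this)
$ hdfs haadmin -failover nn1 nn2
```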


Name Node Reboot

What if you just want to reboot the namenode? Since the fsImage lives in memory, it is gone at once, and rebuilding it from the edit logs alone takes a long time.

The main issue here is that the fsImage only exists in memory, so to reboot quickly we need a recent copy of the fsImage on disk. The secondary namenode exists for this: it periodically merges the old on-disk fsImage with the edit logs, replaces the old fsImage on disk, and then truncates the logs.

A secondary namenode is not strictly necessary; in an HA setup, this checkpointing can be done by the standby namenode instead.
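
A checkpoint can also be forced manually, which is handy right before a planned reboot; a sketch (it requires safe mode, so writes are briefly rejected):

```bash
$ hdfs dfsadmin -safemode enter   # block namespace changes
$ hdfs dfsadmin -saveNamespace    # write the current fsImage to disk and reset the edit logs
$ hdfs dfsadmin -safemode leave   # resume normal operation
```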


install hadoop on mac

See the default ports used by Hadoop services (3.1.0).

When configuring password-free login via SSH, it may only work when the key is generated into the default id_rsa / id_dsa file; other user-defined key file names won't be picked up.
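
For reference, the usual setup that does work, generating into the default id_rsa file:

```bash
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa        # empty passphrase, default file name
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost                                   # should log in without a password
```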

  • access hdfs

HDFS default ports changed in Hadoop 3; see the HDFS issue, or check the dfs.namenode.http-address property in hdfs-default.xml for the current setting.

Namenode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
Secondary NN ports: 50091 -> 9869, 50090 -> 9868
Datanode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
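
Rather than memorizing the mapping, you can query the effective value from the running configuration; a sketch:

```bash
# print the effective namenode web UI address (e.g. 0.0.0.0:9870)
$ hdfs getconf -confKey dfs.namenode.http-address
```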

When running the example job, it seems the jar only scans plain files, so make sure there are no sub-directories in the input directory.

  • access yarn

Access the ResourceManager at localhost:8088, or check the property yarn.resourcemanager.webapp.address in yarn-default.xml for the current configuration.
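
The ResourceManager also exposes a REST API on the same port, which makes a quick sanity check that YARN is up; a sketch:

```bash
# basic cluster info as JSON from the ResourceManager REST API
$ curl http://localhost:8088/ws/v1/cluster/info
```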

start hadoop locally

MapReduce is a framework for writing applications that process vast amounts of data in parallel on large clusters.

A MapReduce job usually splits the input data into independent chunks, which map tasks process in parallel. The framework then sorts the map outputs and passes them to reduce tasks, which combine them into the final output.
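
The split -> map -> sort -> reduce shape can be mimicked with a plain Unix pipeline, which is a useful mental model (word count, purely illustrative, not Hadoop itself):

```bash
# "map": emit one word per line; "shuffle/sort": group identical words;
# "reduce": count each group
$ cat input.txt | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head
```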

```bash
# initialize the namenode
$ hdfs namenode -format
# start namenode and datanode daemons (access the namenode at localhost:9870)
$ start-dfs.sh
# start ResourceManager & NodeManager daemons (access yarn at localhost:8088)
$ start-yarn.sh

# stop namenode and datanode daemons
$ stop-dfs.sh
# stop ResourceManager & NodeManager daemons
$ stop-yarn.sh
```

Note:

The hdfs namenode -format command must be executed every time the computer restarts, and HDFS is then initialized again from scratch; we need another way to avoid this (a likely fix is sketched below).
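
A likely cause, worth checking: hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}, and macOS clears /tmp on reboot, which wipes the namenode's metadata. Pointing it at a persistent directory should remove the need to re-format; a sketch (the path is illustrative, and the property lives in core-site.xml):

```bash
$ mkdir -p ~/hadoop-data
# then add to etc/hadoop/core-site.xml:
#   <property>
#     <name>hadoop.tmp.dir</name>
#     <value>/Users/yourname/hadoop-data</value>
#   </property>
# format once more after changing the directory; it should then survive reboots
$ hdfs namenode -format
```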

There are other commands to start these daemons:

```bash
# deprecated; prefer start-dfs.sh / start-yarn.sh above
$ start-all.sh

# used on a specific node (e.g. when a new node is added into the cluster, execute on that node)
$ hadoop-daemon.sh start datanode    # or namenode
$ yarn-daemon.sh start resourcemanager
```
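
In Hadoop 3 these per-daemon scripts are deprecated in favor of the --daemon option; a sketch of the equivalents:

```bash
$ hdfs --daemon start datanode         # replaces hadoop-daemon.sh start datanode
$ hdfs --daemon start namenode
$ yarn --daemon start resourcemanager  # replaces yarn-daemon.sh start resourcemanager
```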