Mind Map by Kien Dang Ngoc


Introduction to Hadoop

Resource summary

1 definition
1.1 Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (default 64MB or 128MB) and distributes the blocks amongst the nodes in the cluster.
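The block splitting described above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic, not Hadoop's actual implementation; the function and constant names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB)
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs for the blocks a file is split into."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks (128 MB + 128 MB + 44 MB),
# and each block is stored REPLICATION times across the cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block is only as large as the remaining data; HDFS does not pad it out to the full block size.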
2 Modules
2.1 HDFS
2.1.1 definition: the filesystem that Hadoop uses to store data on the nodes of the cluster
2.1.2 structure
    NameNode
    - a cluster has exactly one NameNode
    - file content is split into blocks (128 MB by default), and each block is replicated on multiple DataNodes (3 by default)
    - files and directories are represented by inodes, which record attributes such as permissions, modification and access times, and namespace and disk-space quotas; the NameNode maintains all of these
    DataNodes
    - each block replica is stored as two files on the DataNode: one holding the data itself (its length is the actual length of the block) and one holding the block's metadata (checksums and generation stamp)
    - at startup, each DataNode performs a handshake with the NameNode to verify the namespace ID, which is stored persistently on all nodes of the cluster
    - after the handshake, the DataNode registers with the NameNode under a unique storage ID; the storage ID is assigned at first registration and never changes afterwards
    - a DataNode identifies its block replicas by block ID and reports them to the NameNode in a block report, sent immediately after registration and then every hour; block reports let the NameNode track where each block's replicas are located in the cluster
    - every 3 seconds, each DataNode sends a heartbeat to the NameNode; if no heartbeat arrives for 10 minutes, the NameNode assumes the node is dead and its block replicas unavailable
    - heartbeats carry information about total storage capacity, storage in use, and the number of data transfers in progress
    - the NameNode also uses heartbeat replies to send instructions to DataNodes: replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the node
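The heartbeat timeout behaviour above can be sketched as a toy model. This is an illustration of the liveness rule only (3-second heartbeats, 10-minute timeout); the class and method names are hypothetical, not HDFS APIs.

```python
HEARTBEAT_INTERVAL = 3        # seconds between DataNode heartbeats
DEAD_NODE_TIMEOUT = 10 * 60   # no heartbeat for 10 minutes => node presumed dead

class NameNodeView:
    """Toy model of how a NameNode could track DataNode liveness."""
    def __init__(self):
        self.last_heartbeat = {}  # storage_id -> time of last heartbeat (seconds)

    def receive_heartbeat(self, storage_id, now):
        self.last_heartbeat[storage_id] = now

    def dead_nodes(self, now):
        """Nodes silent for longer than the timeout are considered dead."""
        return [sid for sid, t in self.last_heartbeat.items()
                if now - t > DEAD_NODE_TIMEOUT]

nn = NameNodeView()
nn.receive_heartbeat("dn-1", now=0)
nn.receive_heartbeat("dn-2", now=590)
# At t=601, dn-1 has been silent for more than 600 s and is presumed dead,
# so the NameNode would treat its block replicas as unavailable.
print(nn.dead_nodes(now=601))  # -> ['dn-1']
```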
2.2 Map reduce
2.2.1 definition: a framework for processing large amounts of structured or unstructured data in parallel across a cluster
2.2.2 tasks: Map takes a list of input elements and breaks them into tuples (key/value pairs); Reduce takes the Map output as its input and combines the data tuples into a smaller set of tuples
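The two tasks can be illustrated with the classic word count, written here in plain Python rather than against the Hadoop API, to show the data flow only: Map emits (key, value) tuples, Reduce combines tuples sharing a key into a smaller set.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: break each input line into (word, 1) key/value tuples."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine tuples with the same key into one tuple per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reduce_phase(map_phase(["big data big clusters", "big data"]))
print(result)  # -> {'big': 3, 'data': 2, 'clusters': 1}
```

In real Hadoop the framework also shuffles and sorts the Map output by key between the two phases, so each Reduce call sees all values for one key.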
2.2.3 trackers: the JobTracker schedules a job's component tasks on the slaves, monitors them, and re-executes failed tasks; the TaskTrackers execute the tasks
2.3 YARN
2.3.1 also called MRv2
2.3.2 splits resource management and job scheduling/monitoring into separate daemons
    - one global ResourceManager and one ApplicationMaster per application; an application is either a single job or a directed acyclic graph (DAG) of jobs
    - the ApplicationMaster is a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks; it is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring progress
    - the NodeManager is the per-node slave; together with the ResourceManager it forms the data-computation framework; it is responsible for containers, monitoring their resource usage, and reporting that usage to the ResourceManager/Scheduler
    - the ResourceManager is the ultimate authority that arbitrates resources in the system; it has two main components:
      Scheduler
      - allocates resources to running applications; it is a pure scheduler: it does no monitoring or tracking, and offers no guarantee of restarting tasks that fail, whether due to application or hardware failure
      - scheduling is based on an abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network
      ApplicationsManager
      - accepts job submissions
      - negotiates the first container for executing the application-specific ApplicationMaster
      - provides the service for restarting the ApplicationMaster container on failure
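The Scheduler's "abstract notion of a resource Container" can be sketched as a capacity check: a container request is granted only if the node still has the memory and cores it asks for. This is a minimal sketch of the idea, not YARN's scheduler; all names here are hypothetical.

```python
class Container:
    """A requested bundle of resources (memory, vcores), per the Container abstraction."""
    def __init__(self, memory_mb: int, vcores: int):
        self.memory_mb = memory_mb
        self.vcores = vcores

class NodeResources:
    """Free capacity on one node, as a NodeManager might report it."""
    def __init__(self, memory_mb: int, vcores: int):
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores

    def try_allocate(self, c: Container) -> bool:
        """Grant the container only if the node can satisfy the whole request."""
        if c.memory_mb <= self.free_memory_mb and c.vcores <= self.free_vcores:
            self.free_memory_mb -= c.memory_mb
            self.free_vcores -= c.vcores
            return True
        return False

node = NodeResources(memory_mb=8192, vcores=4)
ok1 = node.try_allocate(Container(4096, 2))  # granted: fits in free capacity
ok2 = node.try_allocate(Container(6144, 2))  # denied: only 4096 MB left
```

The point of the abstraction is that the Scheduler reasons only about these resource bundles; it never looks at what the application inside the container is doing.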
2.4 Hadoop Common packages
2.4.1 provide filesystem- and OS-level abstractions, the MapReduce engine (MR1 and MR2), and HDFS
2.4.2 provide the JAR files and scripts needed to start Hadoop
3 Supporters
3.1 Pig
3.1.1 Pig allows you to write complex MapReduce transformations using a simple scripting language
3.1.2 Pig is a high-level scripting language used with Apache Hadoop
3.1.3 the language is called Pig Latin, which abstracts Java MapReduce into a form similar to SQL
3.1.4 users can extend Pig Latin by writing their own functions in Java, Python, Ruby, or other scripting languages
3.1.5 runs in 2 modes: local — runs on a single machine, with all files installed and run using localhost and the local file system; mapreduce — the default, which requires access to a Hadoop cluster
3.2 Hive
3.2.1 provides access to data stored in Hadoop through SQL-like queries that run on top of MapReduce
3.3 Cloudera