Hadoop

Mind Map by Kien Dang Ngoc, updated more than 1 year ago
Created by Kien Dang Ngoc over 5 years ago

Description

Introduction to Hadoop

Resource summary

Hadoop
1 definition
1.1 Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default, depending on version) and distributes the blocks among the nodes in the cluster.
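The block-splitting rule above can be sketched in a few lines of plain Python (this is an illustration of the arithmetic, not Hadoop code):

```python
# Sketch: how a file is carved into fixed-size blocks, using the
# 128 MB default block size mentioned above.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file yields two full 128 MB blocks plus one 44 MB block;
# each of these blocks would then be replicated across DataNodes.
layout = split_into_blocks(300 * 1024 * 1024)
```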
2 Modules
2.1 HDFS
2.1.1 definition
2.1.1.1 the filesystem that Hadoop uses to store data on the cluster nodes
2.1.2 structure
2.1.2.1 Name node
2.1.2.1.1 A cluster has only one NameNode
2.1.2.1.2 File content is split into blocks (128 MB by default). Each block is replicated on multiple DataNodes (default replication factor: 3)
2.1.2.1.3 Files and directories are stored as inodes. Inodes record attributes such as
2.1.2.1.3.1 permissions
2.1.2.1.3.2 modification times
2.1.2.1.3.3 access times
2.1.2.1.3.4 namespace quotas
2.1.2.1.3.5 disk space quotas
2.1.2.1.4 the NameNode maintains all of this metadata
2.1.2.2 data nodes
2.1.2.2.1 each block replica is represented by 2 files on a DataNode
2.1.2.2.1.1 one file contains the block's metadata (checksums and the generation stamp)
2.1.2.2.1.2 the other file contains the block data itself
2.1.2.2.1.2.1 its size equals the actual length of the block
2.1.2.2.2 at startup, each DataNode performs a handshake with the NameNode to verify the namespace ID
2.1.2.2.2.1 the namespace ID is stored persistently on every DataNode in the cluster
2.1.2.2.2.2 after the handshake, a DataNode registering for the first time receives a unique storage ID from the NameNode; this ID never changes afterwards
2.1.2.2.3 DataNodes identify block replicas by block ID and report them to the NameNode in a block report. A block report is sent immediately after the DataNode connects to the NameNode, and then once every hour. Block reports let the NameNode know where each block is located in the cluster
2.1.2.2.4 every 3 seconds, each DataNode sends a heartbeat to the NameNode. If no heartbeat arrives for 10 minutes, the NameNode considers the DataNode dead and its block replicas unavailable.
2.1.2.2.4.1 carry information about
2.1.2.2.4.1.1 total storage capacity
2.1.2.2.4.1.2 storage in use
2.1.2.2.4.1.3 number of data transfers in progress
2.1.2.2.4.2 the NameNode also uses heartbeat replies to send instructions to DataNodes
2.1.2.2.4.2.1 replicate blocks to other nodes
2.1.2.2.4.2.2 remove local block replicas
2.1.2.2.4.2.3 re-register and send an immediate block report
2.1.2.2.4.2.4 shut down the node
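The liveness rule above (heartbeats every 3 seconds, a node declared dead after 10 minutes of silence) can be sketched as follows. The class and method names are illustrative, not Hadoop's API:

```python
# Sketch of the NameNode's dead-node detection described above.
HEARTBEAT_INTERVAL = 3          # seconds between DataNode heartbeats
DEAD_NODE_TIMEOUT = 10 * 60     # seconds of silence before a node is dead

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> time of last heartbeat

    def record_heartbeat(self, node_id: str, now: float):
        self.last_heartbeat[node_id] = now

    def dead_nodes(self, now: float):
        """Nodes whose last heartbeat is older than the timeout."""
        return [n for n, t in self.last_heartbeat.items()
                if now - t > DEAD_NODE_TIMEOUT]

monitor = NameNodeMonitor()
monitor.record_heartbeat("dn1", now=0.0)
monitor.record_heartbeat("dn2", now=590.0)
# At t=601s, dn1 has been silent for more than 600s and is declared dead;
# dn2 heartbeated 11s ago and is still alive.
```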
2.2 Map reduce
2.2.1 definition
2.2.1.1 a framework for processing large amounts of structured or unstructured data in parallel across a cluster
2.2.2 tasks
2.2.2.1 Map
2.2.2.1.1 takes the input elements and breaks them into tuples (key/value pairs).
2.2.2.2 Reduce
2.2.2.2.1 takes the Map output as input and combines the data tuples into a smaller set of tuples
2.2.3 trackers
2.2.3.1 Job tracker
2.2.3.1.1 schedules the jobs' component tasks on the slaves, monitors them, and re-executes failed tasks
2.2.3.2 task trackers
2.2.3.2.1 execute tasks.
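The Map and Reduce steps above can be illustrated with the classic word-count example. This is plain Python showing the data flow, not the Hadoop MapReduce API:

```python
# Map emits (word, 1) tuples; Reduce combines tuples that share a key
# into a smaller set of tuples, here the per-word totals.
from collections import defaultdict

def map_phase(lines):
    """Map: break the input into (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_phase(map_phase(["big data", "big clusters"]))
# result == {"big": 2, "data": 1, "clusters": 1}
```

In real Hadoop the framework shuffles the map output between phases so that all pairs with the same key reach the same reducer.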
2.3 YARN
2.3.1 also called MapReduce 2.0 (MRv2)
2.3.2 splits resource management and job scheduling/monitoring into separate daemons
2.3.2.1 1 global ResourceManager
2.3.2.2 an ApplicationMaster per application. An application is either
2.3.2.2.1 A single job
2.3.2.2.2 a directed acyclic graph (DAG) of jobs
2.3.2.2.3 the ApplicationMaster is a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor tasks
2.3.2.2.4 it is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress
2.3.2.3 the ResourceManager and the per-node slave, the NodeManager, form the data-computation framework
2.3.2.3.1 the ResourceManager is the ultimate authority that arbitrates resources among all applications in the system
2.3.2.3.2 the NodeManager is responsible for containers, monitoring their resource usage and reporting it to the ResourceManager/Scheduler
2.3.2.4 the ResourceManager has 2 main components
2.3.2.4.1 Scheduler
2.3.2.4.1.1 allocates resources to the running applications
2.3.2.4.1.2 performs no monitoring or tracking of application status
2.3.2.4.1.3 offers no guarantees about restarting failed tasks (whether the failure is due to the application or to hardware)
2.3.2.4.1.3.1 schedules based on the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, network, etc.
2.3.2.4.2 ApplicationsManager
2.3.2.4.2.1 accepting job submissions
2.3.2.4.2.2 negotiating the first container for executing the application-specific ApplicationMaster
2.3.2.4.2.3 provides the service for restarting the ApplicationMaster container on failure
2.4 Hadoop Common packages
2.4.1 provide the filesystem- and OS-level abstractions used by the MapReduce engine (MR1 and MR2) and HDFS
2.4.2 provide the JAR files and scripts needed to start Hadoop
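The Scheduler's "resource Container" notion above can be sketched as a bundle of memory and CPU that is granted only while the node has capacity left. The class and field names here are illustrative, not YARN's API:

```python
# Sketch: a Container bundles resource dimensions (memory, vcores),
# and a node grants it only if enough free capacity remains.
from dataclasses import dataclass

@dataclass
class Container:
    memory_mb: int
    vcores: int

class NodeCapacity:
    def __init__(self, memory_mb: int, vcores: int):
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores

    def try_allocate(self, c: Container) -> bool:
        """Grant the container if the node still has room for it."""
        if c.memory_mb <= self.free_memory_mb and c.vcores <= self.free_vcores:
            self.free_memory_mb -= c.memory_mb
            self.free_vcores -= c.vcores
            return True
        return False

node = NodeCapacity(memory_mb=8192, vcores=4)
granted = node.try_allocate(Container(memory_mb=4096, vcores=2))  # fits
denied = node.try_allocate(Container(memory_mb=8192, vcores=2))   # too big now
```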
3 Supporters
3.1 Pig
3.1.1 Pig allows you to write complex MapReduce transformations using a simple scripting language
3.1.2 Pig is a high-level scripting platform used with Apache Hadoop
3.1.3 The language is called Pig Latin, which abstracts Java MapReduce into a form similar to SQL
3.1.4 users can extend Pig Latin by writing their own functions in Java, Python, Ruby or other scripting languages
3.1.5 runs in 2 modes
3.1.5.1 local
3.1.5.1.1 access to a single machine; all files are installed and run using the local host and local file system
3.1.5.2 mapreduce
3.1.5.2.1 the default mode; requires access to a Hadoop cluster
3.2 Hive
3.2.1 provides SQL-like query access to data on top of MapReduce
3.3 Cloudera (a commercial Hadoop distribution)
