Hadoop

Description

Introduction to Hadoop
Mind Map by Kien Dang Ngoc, updated more than 1 year ago
Created by Kien Dang Ngoc over 9 years ago

Resource summary

Hadoop
  1. definition
    1. Apache Hadoop is an open-source software framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default, depending on the version) and distributes the blocks among the nodes in the cluster.
  2. Modules
    1. HDFS
      1. definition
        1. the filesystem that Hadoop uses to store data on the cluster nodes
      2. structure
        1. NameNode
          1. A cluster has only one NameNode (in the classic architecture, without federation)
          2. File content is split into blocks (128 MB). Each block is replicated on multiple DataNodes (3 by default)
          3. Files and directories are stored as inodes. Inodes record attributes like
            1. permissions
            2. modification and access times
            3. namespace and disk space quotas
          4. The NameNode maintains all of these.
        2. DataNodes
          1. Each block replica is stored as 2 files on the DataNode
            1. One file holds the block's metadata: checksums and the generation stamp
            2. The other file holds the block data itself
              1. Its size is the actual length of the block, not the nominal block size
          2. At startup, each DataNode performs a handshake with the NameNode to verify the namespace ID
            1. The namespace ID is stored persistently on every DataNode in the cluster
          3. After the handshake, a DataNode registers with the NameNode. On first registration it is assigned a unique storage ID, which never changes afterwards
          4. DataNodes identify blocks by block ID and send these IDs to the NameNode in block reports. The first report is sent immediately after the DataNode registers, and then one is sent every hour. Block reports tell the NameNode where block replicas are located in the cluster
          5. Every 3 seconds, each DataNode sends a heartbeat to the NameNode. If no heartbeat arrives for 10 minutes, the NameNode considers the DataNode dead and its block replicas unavailable.
            1. Heartbeats carry information about
              1. total storage capacity
              2. storage in use
              3. number of data transfers in progress
            2. The NameNode also uses heartbeat replies to send instructions to DataNodes
              1. replicate blocks to other nodes
              2. remove local block replicas
              3. re-register and send an immediate block report
              4. shut down the node
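
The block and heartbeat figures in the HDFS notes can be illustrated with a small Python sketch. This is not Hadoop code: the function names `split_into_blocks`, `total_replicas` and `is_datanode_dead` are hypothetical, and the constants are simply the values quoted in the outline (128 MB blocks, 3 replicas, a 10-minute timeout).

```python
BLOCK_SIZE = 128 * 1024 * 1024   # nominal block size from the outline (128 MB)
REPLICATION = 3                  # default replication factor from the outline
DEAD_TIMEOUT_S = 10 * 60         # no heartbeat for 10 minutes => considered dead


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block for a file of file_size bytes.

    Only the last block may be shorter than block_size; it is not padded.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def total_replicas(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Number of block replicas stored across the cluster for one file."""
    return len(split_into_blocks(file_size, block_size)) * replication


def is_datanode_dead(last_heartbeat_s, now_s, timeout_s=DEAD_TIMEOUT_S):
    """Toy liveness rule: dead once timeout_s has passed without a heartbeat."""
    return now_s - last_heartbeat_s >= timeout_s


if __name__ == "__main__":
    mb = 1024 * 1024
    print([b // mb for b in split_into_blocks(300 * mb)])  # block lengths in MB
    print(total_replicas(300 * mb))
    print(is_datanode_dead(last_heartbeat_s=0, now_s=599))
```

Under these assumptions a 300 MB file occupies blocks of 128, 128 and 44 MB, which gives 9 replicas in total, and a DataNode last heard from 599 seconds ago is still treated as alive.
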
    2. MapReduce
      1. definition
        1. a framework for processing large amounts of structured or unstructured data in parallel across a cluster
      2. tasks
        1. Map
          1. takes a set of input elements and breaks them into tuples (key/value pairs)
        2. Reduce
          1. takes the output of Map as input and combines those tuples into a smaller set of tuples
      3. trackers
        1. JobTracker
          1. schedules each job's component tasks on the slaves, monitors them and re-executes failed tasks
        2. TaskTrackers
          1. execute the tasks as directed by the JobTracker
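
The Map and Reduce tasks described in the MapReduce notes can be sketched as a single-process word count in plain Python. This only mimics the data flow; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: break every input line into (word, 1) key/value tuples."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key, as a framework would between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into one (key, total) entry."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three stages in order on an iterable of text lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop stores big files"]))
```

In a real cluster the map and reduce functions would run as many parallel tasks on different nodes; here they run sequentially so that the shape of the intermediate tuples is easy to inspect.
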
    3. YARN
      1. also called MRv2
      2. splits the two JobTracker functions, resource management and job scheduling/monitoring, into separate daemons (background processes)
        1. one global ResourceManager
        2. one ApplicationMaster per application. An application can be
          1. a single job
          2. a directed acyclic graph (DAG) of jobs
      3. ApplicationMaster
        1. is a framework-specific library, tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor tasks
        2. is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress
      4. NodeManager
        1. the per-node slave; together with the ResourceManager it forms the data-computation framework
        2. responsible for containers, monitoring their resource usage and reporting it to the ResourceManager/Scheduler
      5. ResourceManager
        1. the ultimate authority that arbitrates resources among the applications in the system
        2. has 2 main components
          1. Scheduler
            1. allocates resources to running applications; it is a pure scheduler
            2. no monitoring or tracking of application status
            3. no guarantee of restarting failed tasks (whether the application or the hardware failed)
            4. schedules based on the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk and network
          2. ApplicationsManager
            1. accepts job submissions
            2. negotiates the first container for executing the application-specific ApplicationMaster
            3. provides the service for restarting the ApplicationMaster container on failure
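
The idea of a scheduler that hands out resource containers, and does nothing else, can be shown with a toy model in Python. Everything here is hypothetical: `Node`, `Container` and `ToyScheduler` are illustration names, not YARN classes, and the first-fit placement rule is an assumption made for brevity rather than how any real YARN scheduler decides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Free capacity reported by one NodeManager (hypothetical model)."""
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class Container:
    """A granted bundle of resources on one node."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class ToyScheduler:
    """Allocates containers only; it does no monitoring and no restarts."""
    nodes: list = field(default_factory=list)

    def allocate(self, memory_mb, vcores):
        """Grant a container on the first node with enough free resources.

        Returns None when no node can satisfy the request.
        """
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return Container(node.name, memory_mb, vcores)
        return None


if __name__ == "__main__":
    scheduler = ToyScheduler([Node("node1", 4096, 2), Node("node2", 8192, 4)])
    # The first container would host the ApplicationMaster itself.
    am_container = scheduler.allocate(1024, 1)
    # The ApplicationMaster would then request containers for its tasks.
    task_containers = [scheduler.allocate(4096, 2) for _ in range(3)]
    print(am_container)
    print(task_containers)
```

With the capacities above, the third task request returns `None` because neither node has 4096 MB and 2 vcores left. Deciding what to do about that (wait, retry, fail the job) is left to the caller, which mirrors the split between the Scheduler and the per-application ApplicationMaster.
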
    4. Hadoop Common packages
      1. provide the filesystem and OS-level abstractions used by the other modules (the MapReduce engine, MR1 or MR2, and HDFS)
      2. provide the JAR files and scripts needed to start Hadoop
  3. Supporters
    1. Pig
      1. Pig lets you write complex MapReduce transformations using a simple scripting language
      2. Pig is a high-level scripting platform used with Apache Hadoop
      3. The language is called Pig Latin; it abstracts Java MapReduce programming into a notation similar to SQL
      4. users can extend Pig Latin by writing their own functions in Java, Python, Ruby or other scripting languages
      5. runs in 2 modes
        1. local
          1. needs access to a single machine only; all files are installed and run using the local host and file system
        2. mapreduce
          1. the default; requires access to a Hadoop cluster
    2. Hive
      1. SQL-like query access to data, running on top of MapReduce
    3. Cloudera