Summary of the Resource
Hadoop
- definition
- Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks among the nodes in the cluster.
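The block-splitting arithmetic above can be illustrated with a small sketch. This is not Hadoop code; the function names are hypothetical, and the block size and replication factor are the defaults named in these notes:

```python
import math

def hdfs_block_count(file_size_bytes, block_size=128 * 1024 * 1024):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / block_size))

def raw_storage_used(file_size_bytes, replication=3):
    """Total bytes consumed across the cluster, counting every replica."""
    return file_size_bytes * replication

# A 1 GB file with the 128 MB default block size:
one_gb = 1024 * 1024 * 1024
print(hdfs_block_count(one_gb))            # 8 blocks
print(raw_storage_used(one_gb) // one_gb)  # 3 GB of raw storage
```

Note that replication multiplies storage cost, not block count: the same 8 blocks each exist on 3 DataNodes.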
- Modules
- HDFS
- definition
- the filesystem that Hadoop uses to store data on the cluster nodes
- structure
- NameNode
- a cluster has only one NameNode
- file content is split into blocks (128 MB by default); each block is replicated on multiple DataNodes (default is 3)
- files and directories are stored as inodes. Inodes record attributes like
- permissions
- modification and access times
- namespace and disk space quotas
- the NameNode maintains all of these
- DataNodes
- each block replica is represented by 2 files on the DataNode
- one file contains the block's data itself
- the other contains the block's metadata: checksums and the generation stamp
- the size of the data file equals the actual length of the block
- at startup, DataNodes and the NameNode perform a handshake to check the namespace ID
- the namespace ID is stored persistently on all DataNodes in the cluster
- after the handshake, DataNodes register with the NameNode; on first registration each DataNode receives a unique storage ID, which never changes afterwards
- DataNodes manage blocks by block ID and send these IDs to the NameNode in a block report. A block report is sent immediately after a DataNode connects to the NameNode, and then every hour. Block reports let the NameNode locate where blocks reside in the cluster
- every 3 seconds, DataNodes send a heartbeat to the NameNode. If there is no heartbeat for 10 minutes, the NameNode assumes the node is dead and all its block replicas unavailable
- heartbeats carry information about
- total storage capacity
- storage in use
- number of transactions
- the NameNode also uses heartbeat replies to send instructions to DataNodes
- replicate blocks to other nodes
- remove local block replicas
- re-register and send an immediate block report
- shut down the node
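The liveness rule in these notes (a heartbeat every 3 seconds, node presumed dead after 10 minutes without one) can be sketched as follows. A toy illustration of the timeout logic, not NameNode code; all names are hypothetical:

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(seconds=3)   # DataNodes report every 3 s
DEAD_NODE_TIMEOUT = timedelta(minutes=10)   # NameNode gives up after 10 min

def is_dead(last_heartbeat: datetime, now: datetime) -> bool:
    """Liveness rule from the notes: a DataNode with no heartbeat for
    10 minutes is assumed dead and its block replicas unavailable."""
    return now - last_heartbeat > DEAD_NODE_TIMEOUT

now = datetime(2024, 1, 1, 12, 0, 0)
print(is_dead(now - timedelta(minutes=2), now))   # False: still alive
print(is_dead(now - timedelta(minutes=11), now))  # True: presumed dead
```

Once a node is presumed dead, the NameNode re-replicates its blocks elsewhere to restore the replication factor.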
- MapReduce
- definition
- a framework for processing large amounts of structured or unstructured data in parallel across clusters
- tasks
- Map
- lists all elements and breaks them into tuples (key/value pairs)
- Reduce
- takes the Map output as input and combines data tuples into a smaller set of tuples
- trackers
- Job tracker
- schedules the jobs' component tasks on the slaves, monitors them and re-executes failed tasks
- task trackers
- execute the tasks
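The Map and Reduce tasks above can be sketched with the classic word count, written as plain Python rather than Hadoop's Java API (a conceptual sketch only; `map_phase` and `reduce_phase` are hypothetical names):

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    """Map: break each input line into (key, value) tuples."""
    return [[(word, 1) for word in line.split()] for line in lines]

def reduce_phase(pairs):
    """Reduce: combine tuples sharing a key into a smaller set of tuples."""
    counts = defaultdict(int)
    for key, value in chain.from_iterable(pairs):
        counts[key] += value
    return dict(counts)

result = reduce_phase(map_phase(["big data", "big clusters"]))
print(result)  # {'big': 2, 'data': 1, 'clusters': 1}
```

In real Hadoop the map outputs are shuffled and sorted by key across the cluster before the reducers run; here that step collapses into the in-memory dictionary.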
- YARN
- also called MRv2
- splits resource management and job scheduling/monitoring into separate daemons (long-running processes)
- 1 global ResourceManager
- an ApplicationMaster per application. An application can be
- a single job
- a directed acyclic graph (DAG) of jobs
- the ApplicationMaster is a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks
- its responsibility is negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress
- the NodeManager is the per-node slave; together with the ResourceManager it forms the data-computation framework
- the ResourceManager is the ultimate authority that arbitrates resources in the system
- the NodeManager is responsible for containers, monitoring their resource usage and reporting it to the ResourceManager/Scheduler
- the ResourceManager has 2 main components
- Scheduler
- allocates resources to the running applications; a pure scheduler
- performs no monitoring or tracking of application status
- offers no guarantee about restarting failed tasks (whether the application or the hardware failed)
- schedules based on the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, network, etc.
- ApplicationsManager
- accepts job submissions
- negotiates the first container for executing the application-specific ApplicationMaster
- provides the service for restarting the ApplicationMaster container on failure
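The Scheduler's abstract resource container (memory, CPU, disk, network, ...) can be illustrated with a toy fit check. This is a sketch of the idea only, not YARN's scheduler; the types and names are hypothetical, and real YARN considers queues, capacities and locality as well:

```python
from dataclasses import dataclass

@dataclass
class Resource:
    memory_mb: int
    vcores: int

def fits(request: Resource, available: Resource) -> bool:
    """Toy container check: a request is granted only if every
    resource dimension fits within the node's remaining headroom."""
    return (request.memory_mb <= available.memory_mb
            and request.vcores <= available.vcores)

node = Resource(memory_mb=8192, vcores=4)
print(fits(Resource(2048, 1), node))  # True
print(fits(Resource(2048, 8), node))  # False: not enough vcores
```

The point of the abstraction is that the Scheduler reasons about these multi-dimensional containers, not about individual map or reduce slots as in MR1.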
- Hadoop Common packages
- provide filesystem- and OS-level abstractions, the MapReduce engine (MR1 and MR2) and HDFS
- provide the JARs and scripts needed to start Hadoop
- Supporters
- Pig
- Pig allows you to write complex MapReduce transformations using a simple scripting language
- Pig is a high-level scripting language used with Apache Hadoop
- the language is called Pig Latin, which abstracts Java MapReduce into a form similar to SQL
- users can extend Pig Latin by writing their own functions in Java, Python, Ruby or other scripting languages
- runs in 2 modes
- local
- runs on a single machine; all files are installed and run using the local host and local file system
- mapreduce
- the default; requires access to a Hadoop cluster
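The user-defined-function extension point mentioned above can be sketched on the Python side. Only the function itself is plain, testable Python; the Pig Latin registration shown in the comment is illustrative and depends on the Pig runtime:

```python
# Sketch of the Python side of a Pig UDF. In a real deployment this file
# would be registered from Pig Latin, along the lines of:
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   B = FOREACH A GENERATE myfuncs.normalize(name);
# (registration syntax illustrative; check the Pig docs for your version)

def normalize(value):
    """Trim whitespace and lowercase a text field. Returns None for null
    input, since Pig passes nulls through to UDFs."""
    if value is None:
        return None
    return value.strip().lower()

print(normalize("  Hadoop  "))  # hadoop
```

Keeping the UDF a pure function of its input makes it easy to unit-test outside the cluster before registering it in a Pig script.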
- Hive
- provides access to data on top of MapReduce using SQL-like queries
- Cloudera