
Fundamental Big Data / Big Data I (carolina c): test on Module 7: Fundamental Big Data Engineering, created by Carolina Colorado on 27/03/2017.


Module 7: Fundamental Big Data Engineering

Question 1 of 149

Data engineering

Select one or more of the following possible answers:

  • is the field of developing, testing, deploying and maintaining data processing solutions via collecting, parsing, transforming, joining, processing and managing data

  • makes data available for data scientists to develop models and data products

  • two main activities that comprise data _____________ include storage and processing of data

  • is tasked with making data amenable to various types of data analyses, including model development (data mining and other business-process-specific algorithms) and reporting

Explanation

Question 2 of 149

Big Data engineering

Select one or more of the following possible answers:

  • two main activities that comprise ____________ include storage and processing of data, which is typically structured

  • within the realm of __________________, involves developing highly distributed, scalable, fault-tolerant data processing solutions to process large amounts of data in order to garner insights

  • comprises data processing in support of the Big Data analysis lifecycle

  • makes data available for data scientists to develop models and data products

  • They are required to have knowledge of various data storage and processing technology alternatives for acquiring, storing and processing data that is often semi-structured and unstructured in nature

Explanation

Question 3 of 149

Many of the challenges faced in Big Data engineering are related to managing the three primary Vs of Big Data:
Volume

Select one of the following possible answers:

  • processing large amounts of structured, unstructured, and semi-structured data arriving at a fast pace, including extraction of relevant data from semi-structured and unstructured datasets

  • Internet-scale datasets and associated batch and realtime data processing

  • collection and aggregation of data from multiple sources with disparate schemas or without any schema

Explanation

Question 4 of 149

Many of the challenges faced in Big Data engineering are related to managing the three primary Vs of Big Data:
Velocity

Select one of the following possible answers:

  • Internet-scale datasets and associated batch and realtime data processing

  • collection and aggregation of data from multiple sources with disparate schemas or without any schema

  • processing large amounts of structured, unstructured, and semi-structured data arriving at a fast pace, including extraction of relevant data from semi-structured and unstructured datasets

Explanation

Question 5 of 149

Many of the challenges faced in Big Data engineering are related to managing the three primary Vs of Big Data:
Variety

Select one of the following possible answers:

  • processing large amounts of structured, unstructured, and semi-structured data arriving at a fast pace, including extraction of relevant data from semi-structured and unstructured datasets

  • Internet-scale datasets and associated batch and realtime data processing

  • collection and aggregation of data from multiple sources with disparate schemas or without any schema

Explanation

Question 6 of 149

Big Data engineering challenges:

Select one or more of the following possible answers:

  • comprises data processing in support of the Big Data analysis lifecycle

  • importing/exporting large amounts of data from and to traditional storage technologies, including OLTP (CRM, ERP, SCM systems) and OLAP systems (data warehouses)

  • field of developing, testing, deploying and maintaining data processing solutions via collecting, parsing, transforming, joining, processing and managing data

  • validating and cleansing data in realtime or batch mode and creating efficient data models

  • establishing an optimal data storage and processing environment based on the type of data and its processing requirements

  • developing efficient data processing algorithms that run over clusters of computers

  • developing Big Data pipelines and Big Data applications that may include meaningful data visualizations

Explanation

Question 7 of 149

The Big Data Engineer certification provides in-depth knowledge of the concepts and characteristics of

Select one or more of the following possible answers:

  • storage devices

  • analytics engines

  • processing engines

  • resource managers

Explanation

Question 8 of 149

Data can be stored using either __________ or __________ devices

Select one or more of the following possible answers:

  • disk-based

  • realtime

  • batch

  • memory-based

Explanation

Question 9 of 149

Generally, data needs to be stored to disk before it can be processed. However, this only applies to ____________ mode

Select one of the following possible answers:

  • disk-based

  • batch processing

  • memory-based

  • realtime

Explanation

Question 10 of 149

In ________________ mode, data is first processed in memory and then stored to disk

Select one of the following possible answers:

  • batch processing mode

  • realtime processing

  • disk-based

  • memory-based

Explanation
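Questions 9 and 10 contrast the two processing modes; the toy Python sketch below (purely illustrative, all names hypothetical) makes the ordering explicit: batch mode stores data to disk before processing it, while realtime mode processes each event in memory and only then stores the result.

```python
# Hypothetical sketch contrasting the two processing modes: in batch mode
# data is persisted first and processed later; in realtime mode each event
# is processed in memory as it arrives and only then written to disk.

disk = []      # stands in for durable storage

def batch_ingest(events):
    disk.extend(events)               # batch mode: store first...

def batch_process():
    return sum(disk)                  # ...process later, reading from disk

def realtime_process(event):
    result = event * 2                # realtime mode: process in memory first
    disk.append(result)               # ...then store the result
    return result

batch_ingest([1, 2, 3])
print(batch_process())                # 6
print(realtime_process(10))           # 20
```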

Question 11 of 149

The acquired data is generally not in a format or structure that can be directly processed. Therefore, as a result of the data wrangling activity (data cleansing, data filtering, data preparation), it needs to be stored again

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 12 of 149

Data does not need to be stored once it gets processed, whether as a result of analytics or for archival purposes

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 13 of 149

Storage is generally required

Select one or more of the following possible answers:

  • when data gets processed as a result of an ETL activity, or output is generated as a result of an analytical operation

  • when datasets are acquired, or when data gets generated inside the enterprise boundary

  • in realtime processing mode, where data is first processed in memory and then stored to the disk

  • when data is manipulated to make it amenable to data analysis

Explanation

Question 14 of 149

__________ is the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called ____

Select one of the following possible answers:

  • CAP, theorem

  • replication, partitions

  • ACID, unique

  • Sharding, shards

  • BASE, atoms

Explanation
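To make the sharding questions concrete, here is a minimal Python sketch (hypothetical, not the API of any particular database product) of hash-based horizontal partitioning: every key is deterministically routed to one shard, so the shards collectively represent the complete dataset.

```python
# Minimal sketch of hash-based sharding (hypothetical, illustrative only).
# Each record is routed to exactly one shard by hashing its key, so the
# shards collectively hold the complete dataset.

class ShardedStore:
    def __init__(self, num_shards):
        # One dict per shard; in a real system each would live on its own node.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Deterministic routing: the same key always maps to the same shard.
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedStore(num_shards=3)
store.put("sensor_123", {"temp": 21.5})
print(store.get("sensor_123"))   # {'temp': 21.5}
```

Note how a failure of one shard's node would affect only the keys routed to it, which is the partial fault tolerance described in Question 19 below.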

Question 15 of 149

The ____ are distributed across multiple nodes, where a node is a server or a machine

Select one of the following possible answers:

  • master-slave

  • shards

  • map

Explanation

Question 16 of 149

Shard

Select one or more of the following possible answers:

  • is stored on a separate node, and each node is responsible for only the data stored on it

  • all _______ collectively represent the complete database

  • shares the same schema

  • stores multiple copies of a database

Explanation

Question 17 of 149

Sharding

Select one or more of the following possible answers:

  • ____________ distributes a processing load across multiple nodes to achieve horizontal scalability

  • ___________ may or may not be transparent to the client

  • supports horizontal scaling as a method for increasing resource capacity. This is accomplished by adding similar or higher capacity resources alongside the existing resources

  • Once saved, the data is replicated over to multiple slave nodes

  • Since each node is responsible for a part of the whole dataset, read/write times are greatly improved

Explanation

Question 18 of 149

Sharding: depending on the query, data may need to be fetched from both shards

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 19 of 149

A benefit of ________ is that it provides partial tolerance towards failures. In case of a node failure, only the data stored on that node is affected

Select one of the following possible answers:

  • replication

  • sharding

  • partitioning

Explanation

Question 20 of 149

Sharding: with regards to data partitioning, query patterns need to be taken into account so that shards themselves do not become performance bottlenecks

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 21 of 149

For example, queries requiring data from multiple shards will impose performance penalties. Data locality, or keeping commonly accessed data collocated on a single shard, helps to counter such performance issues

Select one of the following possible answers:

  • sharding

  • replication

  • ACID

Explanation

Question 22 of 149

___________ stores multiple copies of a dataset, known as _________, on multiple nodes. This provides for scalability, availability, and fault tolerance

Select one of the following possible answers:

  • replication, replicas

  • sharding, shards

  • master, slaves

Explanation

Question 23 of 149

Replication: methods of implementation

Select one or more of the following possible answers:

  • shard

  • master-slave

  • peer-to-peer

  • partition

Explanation

Question 24 of 149

Nodes are arranged in a _____________ configuration where all data is inserted and updated via a _____ node

Select one of the following possible answers:

  • peer-to-peer, peer

  • master-slave, master

  • sharding, shard

Explanation

Question 25 of 149

Once saved, the data is replicated over to multiple _____ nodes

Select one of the following possible answers:

  • sharding, shard

  • peer-to-peer, peer

  • master-slave, slave

Explanation

Question 26 of 149

All external write requests (insert, update, delete) occur on the master node, whereas data can be read from any slave node

Select one or more of the following possible answers:

  • Replication

  • peer-to-peer

  • master-slave

Explanation
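The master-slave arrangement from Questions 24 to 26 can be sketched in a few lines of Python (a hypothetical toy, not any product's API): all writes go through the master, which propagates them to the slaves, and reads can be served by any slave.

```python
# Hypothetical sketch of master-slave replication: writes are coordinated
# by the master and copied to every slave; reads may hit any slave.
import random

class Node:
    def __init__(self):
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, num_slaves):
        self.master = Node()
        self.slaves = [Node() for _ in range(num_slaves)]

    def write(self, key, value):
        self.master.data[key] = value          # all writes go to the master
        for slave in self.slaves:              # then replicate to each slave
            slave.data[key] = value

    def read(self, key):
        return random.choice(self.slaves).data.get(key)  # read from any slave

cluster = MasterSlaveCluster(num_slaves=2)
cluster.write("account_1", 100)
print(cluster.read("account_1"))   # 100
```

In this toy version replication is synchronous, so the read inconsistency described in Question 28 cannot occur; real systems usually replicate asynchronously, which is exactly where a slave can briefly serve stale data.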

Question 27 of 149

Replication: master-slave

Select one or more of the following possible answers:

  • ideal for read-intensive loads rather than write-intensive loads, as growing read demands can be fulfilled simply via horizontal scaling, which adds additional slave nodes

  • Writes are consistent, as all writes are coordinated by the master node. This means that write performance suffers as the amount of writes increases

  • Since each node is responsible for a part of the whole dataset, read/write times are greatly improved

  • in case the master node fails, reads are still possible via any of the slave nodes

  • a slave node can be configured as a backup node for the master node

Explanation

Question 28 of 149

Replication: master-slave

Select one or more of the following possible answers:

  • In case of master node failure, writes are not supported until a master node is reestablished. It is either resurrected from a backup of the master node, or a new master node is chosen from the slave nodes.

  • For example, queries requiring data from multiple shards will impose performance penalties. Data locality, or keeping commonly accessed data collocated on a single shard, helps to counter such performance issues

  • Read inconsistency can be an issue if a slave node is read prior to the update being propagated over to it from the master node

  • To ensure read consistency, a voting system can be implemented where a read is declared consistent if the majority of the slave nodes contain the same version of the record. Implementation of such a voting system requires a reliable and fast communication mechanism between the slave nodes

Explanation

Question 29 of 149

All nodes operate at the same level.

Select one of the following possible answers:

  • Replication: peer-to-peer

  • Sharding

  • Replication: master-slave

Explanation

Question 30 of 149

Replication: each node, known as a __, is equally capable of handling reads and writes

Select one of the following possible answers:

  • shard

  • peer

  • slave

Explanation

Question 31 of 149

Replication: peer-to-peer

Select one or more of the following possible answers:

  • read inconsistency can be an issue if a node is read prior to the update being propagated over to it

  • Each write is copied to all peers

  • prone to write inconsistencies that occur as a result of a simultaneous update of the same data across multiple ___

  • This can be addressed by implementing pessimistic or optimistic concurrency

Explanation

Question 32 of 149

Replication: peer-to-peer: pessimistic concurrency

Select one of the following possible answers:

  • is a proactive approach that uses locking to ensure that only one update succeeds at a time. However, this affects availability, as the database remains unavailable until all locks are released

  • is a reactive approach that does not use locking. Instead, it allows the inconsistency to happen first and restores consistency after the fact

Explanation
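The contrast between pessimistic and optimistic concurrency can be made concrete with a small sketch (hypothetical code, names invented for illustration): the optimistic variant takes no locks, but every update must present the version it read, and the write is rejected if another peer updated the record in the meantime.

```python
# Hypothetical sketch of optimistic concurrency control: no locks are taken;
# a write succeeds only if the record's version has not changed since it was
# read, otherwise the caller must re-read and retry.

class VersionedStore:
    def __init__(self):
        self.data = {}     # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self.read(key)
        if current != expected_version:
            return False                       # conflict: someone wrote first
        self.data[key] = (value, current + 1)  # accept and bump the version
        return True

store = VersionedStore()
value, version = store.read("item")
print(store.write("item", "a", version))   # True: first writer wins
print(store.write("item", "b", version))   # False: stale version, must retry
```

A pessimistic implementation would instead acquire a lock around the read-modify-write cycle, trading availability for the guarantee that only one update proceeds at a time.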

Question 33 of 149

Replication: peer-to-peer

Select one or more of the following possible answers:

  • Reads can be inconsistent during the time period when some of the ____ have been updated while the others are still being updated. However, reads eventually become consistent when the updates have been copied over to all ____

  • Ideal for read-intensive loads rather than write-intensive loads, as growing read demands can be fulfilled simply via horizontal scaling, which adds additional ____ nodes

  • To ensure read consistency, a voting system can be implemented where a read is declared consistent if the majority of the _ contain the same version of the record

  • Implementation of such a voting system requires a reliable and fast communication mechanism between the ____

Explanation

Question 34 of 149

To improve on the limited fault tolerance offered by sharding, while additionally benefiting from the increased availability and scalability of replication, both sharding and replication can be combined

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 35 of 149

Multiple shards become slaves of a single master, whereas the master itself is a shard

Select one of the following possible answers:

  • Master-slave

  • Master-slave Replication

  • Peer-to-peer

  • Peer-to-peer Replication

Explanation

Question 36 of 149

Master-slave replication

Select one or more of the following possible answers:

  • Although more than one master is possible, a single slave-shard can only be managed by a single master-shard

  • Replicas of a shard are kept on multiple slave nodes to provide scalability and fault tolerance for read operations

  • Write consistency is maintained by the master-shard

  • however, this means that fault tolerance (with regards to write operations) is affected if the master-shard becomes non-operational or a network outage occurs

  • Each shard gets replicated to multiple peers, and each peer is only responsible for a subset of the data rather than the complete dataset

Explanation

Question 37 of 149

Each shard gets replicated to multiple peers, and each peer is only responsible for a subset of the data rather than the complete dataset

Select one of the following possible answers:

  • master-slave replication

  • peer-to-peer replication

  • Replication

Explanation

Question 38 of 149

Peer-to-peer replication

Select one or more of the following possible answers:

  • Collectively, this helps to achieve increased scalability and fault tolerance

  • However, this means that fault tolerance (with regards to write operations) is affected if the master-shard becomes non-operational or a network outage occurs

  • As there is no master involved, there is no single point of failure with regards to both read and write operations

Explanation

Question 39 of 149

The consistency, availability and partition tolerance theorem states that a distributed system, particularly a database, running on a cluster can only provide two of the following

Select one of the following possible answers:

  • CAP theorem

  • BASE

  • ACID

Explanation

Question 40 of 149

CAP theorem

Select one or more of the following possible answers:

  • Consistency

  • partition tolerance

  • Availability

  • volume

  • velocity

Explanation

Question 41 of 149

A read from any node results in the same data across multiple nodes.

Select one of the following possible answers:

  • consistency

  • availability

  • partition tolerance

Explanation

Question 42 of 149

A read/write request will always be acknowledged in the form of a success or a failure

Select one of the following possible answers:

  • consistency

  • availability

  • partition tolerance

Explanation

Question 43 of 149

The database system can tolerate communication outages which split the cluster into multiple silos, and can still service read/write requests.

Select one of the following possible answers:

  • consistency

  • partition tolerance

  • availability

Explanation

Question 44 of 149

CAP theorem

Select one or more of the following possible answers:

  • If consistency (C) and availability (A) are required, available nodes need to communicate to ensure consistency (C). Therefore, partition tolerance (P) is not possible.

  • If consistency (C) and partition tolerance (P) are required, nodes cannot remain available (A), as the nodes will become unavailable while achieving consistency (C), which cannot happen while supporting partition tolerance (P)

  • If availability (A) and partition tolerance (P) are required, then consistency (C) is not possible because of the data communication requirement between the nodes. So, the database can remain available (A), but with inconsistent results

Explanation

Question 45 of 149

CAP theorem

Select one or more of the following possible answers:

  • In a distributed database, scalability and fault tolerance can be improved through additional nodes, although this challenges consistency (C), which can also cause availability (A) to suffer due to the latency caused by increased communication between nodes.

  • Distributed database systems cannot be 100% partition tolerant (P). Although communication outages are rare and temporary, partition tolerance (P) must always be supported. Thus, CAP is more about being either CP or AP

  • On the other hand, RDBMSs are CA, as they are generally available (A) while being consistent (C) at the same time. RDBMSs generally run on a single node, thus partition tolerance (P) is not a large consideration

Explanation

Question 46 of 149

This theorem is also known as Brewer's theorem, after the name of its proposer

Select one of the following possible answers:

  • BASE

  • CAP

  • ACID

Explanation

Question 47 of 149

ACID

Select one or more of the following possible answers:

  • Availability

  • Atomicity

  • Consistency

  • Isolation

  • Durability

Explanation

Question 48 of 149

All operations will always succeed or fail completely. In other words, there are no partial transactions

Select one of the following possible answers:

  • Isolation

  • Atomicity

  • Consistency

  • Durability

Explanation

Question 49 of 149

The database will only allow valid data, and the database will always be in a consistent state after an operation. Any write followed by an immediate read is guaranteed to be consistent across multiple clients

Select one of the following possible answers:

  • Atomicity

  • Consistency

  • Isolation

  • Durability

Explanation

Question 50 of 149

The results of a transaction are not visible to other operations until it is complete

Select one of the following possible answers:

  • Atomicity

  • Consistency

  • Isolation

  • Durability

Explanation

Question 51 of 149

The results of an operation are permanent; that is, once a transaction has been committed it cannot be rolled back. This is irrespective of any system failure

Select one of the following possible answers:

  • Atomicity

  • Consistency

  • Isolation

  • Durability

Explanation
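Atomicity is easy to observe with a real ACID-compliant engine. The snippet below uses Python's built-in sqlite3 module (the accounts table and values are invented for the example): the two UPDATE statements form one transaction and either commit together or, on error, roll back together, so no partial transfer is ever visible.

```python
# Demonstrating atomicity with sqlite3, an ACID-compliant embedded RDBMS.
# Both UPDATEs below form a single transaction: they succeed or fail as a unit.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure neither update is visible: no partial transaction

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 50), ('bob', 50)]
```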

Question 52 of 149

is a database design principle based on the CAP theorem and followed by database systems that leverage distributed technology.

Select one of the following possible answers:

  • BASE

  • ACID

  • CAP

Explanation

Question 53 of 149

BASE

Select one or more of the following possible answers:

  • Basically available

  • Consistency

  • Soft state

  • Eventual consistency

Explanation

Question 54 of 149

The database will always acknowledge the client's request, either in the form of the requested data or a success/failure notification.

Select one of the following possible answers:

  • Soft state

  • Basically available

  • Eventual consistency

Explanation

Question 55 of 149

The database may not be in a consistent state when data is read, thus the results may change if the same data is requested again. This is because the data could be updated for consistency, even though no user has written to the data between the two reads. This property is closely related to eventual consistency

Select one of the following possible answers:

  • Basically available

  • Soft state

  • Eventual consistency

Explanation

Question 56 of 149

Reads by different clients, right after a write, may not return consistent results. The database only attains consistency once the changes have been propagated to all nodes. So while the database is in the process of attaining the state of eventual consistency, it will be in a soft state.

Select one of the following possible answers:

  • basically available

  • eventual consistency

  • Soft state

Explanation

Question 57 of 149

__________ emphasizes availability over immediate consistency, in contrast to _________, which ensures immediate consistency at the expense of availability due to record locking

Select one of the following possible answers:

  • BASE, ACID

  • ACID, BASE

  • CAP, ACID

Explanation

Question 58 of 149

BASE: this soft approach towards consistency allows BASE-compliant databases to serve multiple clients without any latency, albeit serving inconsistent results

Select one of the following possible answers:

  • Eventual consistency

  • soft state

  • basically available

Explanation

Question 59 of 149

BASE: however, BASE-compliant databases are generally not used for transactional systems where lack of consistency can become a concern

Select one of the following:

  • TRUE
  • FALSE

Explanation
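A toy model of the BASE behavior covered in Questions 52 to 59 (hypothetical code, invented names): a write is acknowledged immediately on one replica, and the other replicas only see it once a background propagation step runs, so reads in between return stale, "soft state" results.

```python
# Hypothetical sketch of eventual consistency: a write lands on one replica
# and is acknowledged at once; the other replicas catch up only when pending
# updates are propagated, so reads in between can be stale (soft state).

class EventuallyConsistentStore:
    def __init__(self, num_replicas):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []                     # updates not yet propagated

    def write(self, key, value):
        self.replicas[0][key] = value         # acknowledged immediately
        self.pending.append((key, value))

    def read(self, replica, key):
        return self.replicas[replica].get(key)

    def propagate(self):
        # Background anti-entropy step: copy pending updates everywhere.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(num_replicas=3)
store.write("x", 1)
print(store.read(2, "x"))   # None -> stale read while in the soft state
store.propagate()
print(store.read(2, "x"))   # 1    -> eventually consistent
```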

Question 60 of 149

Data storage device: Big Data consists of datasets that cannot be stored using traditional storage solutions, mainly because of the large volume of data they contain.

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 61 of 149

Big Data storage device: velocity is a factor that makes traditional database solutions inappropriate for Big Data storage, mainly because of their centralized design, which offers little or no scalability

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 62 of 149

Big Data storage device: the variety characteristic of Big Data datasets requires special attention, as it is estimated that 80% of Big Data datasets are unstructured data, and traditional storage solutions do not support storing semi-structured and unstructured data in a scalable and efficient manner

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 63 of 149

The following storage device characteristics need to be considered:

Select one or more of the following possible answers:

  • Scalability

  • Soft consistency

  • Redundancy & availability

  • Fast access

  • Long-term storage

  • Schema-less storage

  • Inexpensive storage

Explanation

Question 64 of 149

Scalability

Select one or more of the following possible answers:

  • Big Data datasets come in huge volumes at a fast pace and are generally acquired from multiple sources. This results in large amounts of data within a short period of time

  • With the potential of finding new insights about the way businesses work, enterprises are retaining more and more data, both generated inside the enterprise and obtained from outside sources, as part of their data acquisition activities.

  • Traditionally, historic data that is no longer required is generally archived in offline storage. However, this makes the historic data unavailable for instant analysis

Explanation

Question 65 of 149

Scalability

Select one or more of the following possible answers:

  • It may not be possible to re-acquire data due to expensive acquisition costs, for example when a dataset is purchased from a data provider.

  • Traditionally, historic data that is no longer required is generally archived in offline storage. However, this makes the historic data unavailable for instant analysis

  • it may not be possible to re-generate data if the events that generated the data initially were one-off events, for example a smart meter reading for a certain point in time

  • In all cases, we need a scalable storage solution with enough capacity for current and future data capture requirements

Explanation

Question 66 of 149

Scalability

Select one or more of the following possible answers:

  • Apart from storing the raw data, additional storage is required to store the data created as a result of the data wrangling activity

  • A storage device needs to make efficient use of the underlying disk resources to minimize storage waste.

  • The data storage requirement, in the case of joining multiple datasets, will increase in size due to the need to keep both the original datasets and the joined dataset

  • Storage is further required in order to persist the analytic results from a data analysis activity

  • in order to address the increasing demands for data storage, a storage device can either scale up or scale out

Explanation

Question 67 of 149

Scalability: scale up (vertical scaling)

Select one or more of the following possible answers:

  • is a strategy for increasing resource capacity by replacing an existing low-capacity device with a higher-capacity device.

  • does not cause disruption and system downtime

  • For example, to double the capacity of a storage device, data from a 500 GB disk can be copied over to a 1 TB disk.

  • will cause disruption and system downtime

Explanation

Question 68 of 149

Scalability: scale out (horizontal scaling)

Select one or more of the following possible answers:

  • is a strategy for increasing resource capacity by replacing an existing low-capacity device with a higher-capacity device.

  • is a strategy for increasing resource capacity by adding similar or higher capacity commodity devices alongside the existing device

  • For example, to double the storage capacity, a 500 GB disk can be added alongside an existing 500 GB disk

  • Does not cause disruption and system downtime

Explanation

Question 69 of 149

Considering the differences between scaling up and scaling out, a storage device should be able to scale out, as this is a lower-cost strategy that enables increasing storage capacity without incurring system downtime and interruption.

Sharding can help create scalable storage, as shards can be stored on the nodes added via scaling out.

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 70 of 149

Redundancy & availability

Select one or more of the following possible answers:

  • Big Data datasets, both in raw and processed form, are a business asset and require attention with regards to storage. Multiple business functions may need to glean value out of these datasets, sometimes simultaneously

  • Big Data datasets come in huge volumes at a fast pace and are generally acquired from multiple sources. This results in large amounts of data within a short period of time.

  • This dependence on Big Data datasets across an enterprise requires a reliable storage device that is fault-tolerant and highly available

  • A Big Data solution environment is generally composed of clusters that are built using commodity servers connected via a high-bandwidth network. With more nodes and network connections, the chances of a node becoming unavailable increase, either due to hardware breakdown, such as disk failure, or due to network failure.

Explanation

Question 71 of 149

Redundancy & availability

Select one or more of the following possible answers:

  • As a result, storage redundancy is required to ensure uninterrupted access to data in the event of a storage device failure, thereby providing high availability and fault tolerance

  • Sharding can help create scalable storage, as shards can be stored on the nodes added via scaling out

  • to provide such redundancy, a storage device implements automatic sharding and replication with a configuration of either sharding and master-slave replication, or sharding and peer-to-peer replication

Explanation

Question 72 of 149

Big Data analytics generally involve both

Select one or more of the following possible answers:

  • realtime processing

  • batch processing

  • in-memory

  • on-disk

Explanation

Question 73 of 149

_____________ requires fast read/write capabilities that are usually implemented via in-memory storage solutions

Select one of the following possible answers:

  • Realtime data analysis

  • batch processing

Explanation

Question 74 of 149

Fast access

Select one or more of the following possible answers:

  • Big Data analytics generally involve both realtime as well as batch processing

  • Realtime data analysis requires fast read/write capabilities that are usually implemented via in-memory storage solutions.

  • Batch processing requires stream-based data access with high throughput, implemented via traditional disk-based storage devices or newer solid state drives that provide better performance at the expense of higher cost

  • The basic tenet of Big Data is to extract value from large amounts of data

  • At the same time, data arriving at high velocity needs a storage device that supports fast writes with minimal overhead, for example, schema validation at read time instead of write time

Explanation

Question 75 of 149

Long-term storage

Select one or more of the following possible answers:

  • The basic tenet of Big Data is to extract value from large amounts of data

  • Realtime data analysis requires fast read/write capabilities that are usually implemented via in-memory storage solutions

  • This requires enterprises to retain data for longer periods of time (to create larger datasets) so that future data analyses are more insightful due to having more historic data available

  • This characteristic warrants the need for a storage device with increased storage capacity that is reliable and can be brought online without incurring too much time delay

Explanation

Question 76 of 149

Long-term storage

Select one or more of the following possible answers:

  • Traditionally, historic data that is no longer required is generally archived in offline storage. However, this makes the historic data unavailable for instant analysis.

  • Big Data analytics require historic data to be available online for discovering hidden patterns leading to valuable insights

  • data arriving at high velocity needs a storage device that supports fast writes with minimal overhead, for example, schema validation at read time instead of write time

  • As a result, historic data is kept online by adding more storage capacity via scaling out

Explanation

Question 77 of 149

Schema-less storage

Select one or more of the following possible answers:

  • As a result, historic data is kept online by adding more storage capacity via scaling out

  • Big Data datasets come in multiple formats with limited or no schema, such as semi-structured and unstructured data.

  • Having no prior knowledge about the structure of the data does not guarantee that the data will conform to the same structure in the future, as is the nature of unstructured data

  • This requires the storage device to support __________ data persistence, along with the added ability to make schema changes on the fly without breaking existing applications or incurring downtime

Explanation

Question 78 of 149

Inexpensive storage

Select one or more of the following possible answers:

  • With all the characteristics required to store voluminous data, provide replication, and support long-term storage, the cost of storage devices can become a concern.

  • Having no prior knowledge about the structure of the data does not guarantee that the data will conform to the same structure in the future, as is the nature of unstructured data

  • A storage device needs to make efficient use of the underlying disk resources to minimize storage waste.

  • Use of proprietary storage devices generally requires replacement of existing nodes with more expensive ones in order to scale up, eventually hitting a limit

  • A storage device needs to make use of commodity hardware that can scale out, so that costs can be kept down as enterprises amass more and more data

Explanation

Question 79 of 149

On-disk storage device

Select one or more of the following possible answers:

  • distributed file system

  • database

  • file system

  • NoSQL

Explanation

Question 80 of 149

On-disk storage device: distributed file system

Select one or more of the following possible answers:

  • A storage device that is implemented with a __________ provides simple, fast-access data storage that is capable of storing large datasets that are non-relational in nature, such as semi-structured and unstructured data

  • To handle large volumes of data arriving at a fast pace, relational databases generally need to scale.

  • Although based on straightforward file locking mechanisms for concurrency control, it provides fast read/write capability, which addresses the velocity characteristic of Big Data.

  • is a good fit when data must be accessed in streaming mode with no random reads and writes

Explanation

Question 81 of 149

On-disk storage device: distributed file system

Select one or more of the following possible answers:

  • is not ideal for datasets comprising a large number of small files, as this creates excessive disk-seek activity, slowing down the overall data access

  • There is also more overhead involved in processing multiple smaller files, as dedicated processes are generally spawned by the processing engine at runtime for processing each file before the results are synchronized from across the cluster

  • good for handling transactional workloads involving small amounts of data with random read/write

  • works best with fewer but larger files accessed in a sequential manner

  • Multiple smaller files are generally combined into a single file to enable optimum storage and processing

  • Does not provide any file searching capability out of the box

Explanation

Question 82 of 149

On-disk storage device: distributed file system

Select one or more of the following possible answers:

  • Can be employed in a clustered environment.

  • A ____ storage device is suitable when large datasets of raw data need to be stored, or when archiving of datasets is required

  • provides an inexpensive storage option for storing large amounts of data over a long period of time while it remains online

  • Good for handling transactional workloads involving small amounts of data with random read/write

  • More disks can simply be added without needing to offload the data to offline data storage, such as tape drives

  • like any file system, is agnostic to the data being stored and therefore supports schema-less data storage

  • provides out-of-the-box redundancy and high availability by copying data to multiple locations via replication

Explanation

Question 83 of 149

On-disk storage device: database

Select one or more of the following possible answers:

  • Relational databases or relational database management systems (RDBMSs)

  • File systems

  • Non-relational databases or Not-only-SQL (NoSQL)

  • NewSQL

Explanation

Question 84 of 149

RDBMSs (relational database management systems)

Select one or more of the following possible answers:

  • Are good for handling transactional workloads involving small amounts of data with random read/write

  • Schema-less data model: data can exist in its raw form

  • Are ACID-compliant, and in order to honor this compliance, they are generally restricted to a single node

  • The redundancy and fault tolerance provided by sharding and replication in a clustered environment are not inherently supported.

Explanation

Question 85 of 149

RDBMSs (relational database management systems)

Select one or more of the following possible answers:

  • To handle large volumes of data arriving at a fast pace, relational databases generally need to scale

  • Employ vertical scaling, not horizontal scaling, which is a more costly and disruptive scaling strategy. This makes _______ less than ideal for long-term storage of data

  • More disks can simply be added without needing to offload the data to offline data storage, such as tape drives

  • Note that some relational databases, for example IBM DB2 pureScale, Sybase ASE Cluster Edition, Oracle Real Application Clusters (RAC) and Microsoft Parallel Data Warehouse (PDW), are capable of being run on clusters. However, these database clusters still use shared storage that can act as a single point of failure

Explanation

Question 86 of 149

RDBMSs

Select one or more of the following possible answers:

  • Need to be manually sharded, mostly using application logic.

  • this means that the client (application logic) needs to know which shard to query in order to get the required data

  • Reads across multiple nodes may not be consistent immediately after a write. However, all nodes will eventually be in a consistent state

  • This further complicates the data processing when data from multiple shards is required

Explanation

Question 87 of 149

RDBMSs

Select one or more of the following possible answers:

  • generally require data to adhere to a schema. As a result, storage of semi-structured and unstructured data whose schema is not known or keeps changing is not supported

  • Refers to technologies used to develop next-generation non-relational databases that are highly scalable and fault-tolerant

  • The schema conformance is validated at the time of data insert/update, which introduces overhead and leads to latency

  • A less than ideal choice for storing high-velocity data that needs a highly available database storage device with fast data write capability

  • Not useful as a storage device in a Big Data solution environment

Explanation

Question 88 of 149

On-disk storage device: database. ________________ refers to technologies used to develop next-generation non-relational databases that are highly scalable and fault-tolerant

Select one of the following possible answers:

  • RDBMS

  • NoSQL

  • NewSQL

Explanation

Question 89 of 149

On-disk storage device: NoSQL characteristics

Select one or more of the following possible answers:

  • Schema-less data model: data can exist in its raw form

  • Scale out rather than scale up: more nodes can be added, rather than replacing the existing one with a better, higher-performance node

  • Highly available: built on cluster-based technologies providing fault tolerance out of the box

  • Lower operational costs: built on open source platforms with no licensing costs, and can be deployed on commodity hardware.

  • Eventual consistency: reads across multiple nodes may not be consistent immediately after a write. However, all nodes will eventually be in a consistent state

  • BASE, not ACID: BASE compliance requires a database to maintain high availability in the event of network/node failure, while not requiring the database to be in a consistent state whenever an update occurs. The database can be in a soft/inconsistent state until it eventually attains consistency.

Explanation

Question 90 of 149

On-disk storage device: NoSQL characteristics

Select one or more of the following possible answers:

  • API-driven data access: data access is generally supported via API-based queries, including RESTful APIs, whereas some implementations may also provide SQL-like query capability

  • Auto sharding and replication: to support horizontal scaling and provide high availability, a NoSQL storage device automatically employs sharding and replication techniques, where the dataset is partitioned horizontally and then copied over to multiple nodes

  • Integrated caching: removes the need for a third-party distributed caching layer, such as Memcached

  • Distributed query support: NoSQL storage devices maintain consistent query behavior across multiple shards

  • Polyglot persistence: the use of NoSQL storage devices does not mandate retiring traditional RDBMSs. In fact, both can be used at the same time, thereby supporting polyglot persistence (an approach of persisting data using different types of storage technologies). This is good for developing systems requiring structured as well as semi/unstructured data.

  • Aggregate-focused: unlike relational databases, which are most effective with fully normalized data, NoSQL storage devices store de-normalized aggregated data (an entity containing merged, often nested, data for an object), thereby eliminating the need for joins and mapping between application objects and the data stored in the database. Note that graph database storage devices are not aggregate-focused

Explanation

Question 91 of 149

On-disk storage device: NoSQL characteristics: Volume

Select one or more of the following possible answers:

  • The storage requirement of ever-increasing data volumes commands the use of databases that are highly scalable, while keeping costs down for the business to remain competitive.

  • The fast influx of data requires databases with fast-access data write capability

  • NoSQL storage devices fulfill this requirement by providing scale-out capability while using inexpensive commodity servers

  • Furthermore, there may not be licensing costs involved, as NoSQL databases generally follow the open source development model

Explanation

Question 92 of 149

On-disk storage device: NoSQL characteristics: Velocity

Select one or more of the following possible answers:

  • NoSQL storage devices fulfill this requirement by providing scale-out capability while using inexpensive commodity servers

  • The fast influx of data requires databases with fast-access data write capability

  • NoSQL storage devices enable fast writes by using the schema-on-read rather than the schema-on-write principle

  • Being highly available, NoSQL storage devices can ensure that write latency does not occur because of node/network failure

Explanation

Question 93 of 149

On-disk storage device: NoSQL characteristics: Variety

Select one or more of the following possible answers:

  • A storage device needs to handle different types of data formats, including documents, emails, images and videos, as well as incomplete data

  • NoSQL storage devices enable fast writes by using the schema-on-read rather than the schema-on-write principle

  • NoSQL storage devices can store these different forms of semi-structured and unstructured data formats

  • At the same time, NoSQL storage devices are able to store schema-less data and incomplete data, with the added ability of making schema changes as the data model of the datasets evolves. In other words, NoSQL databases support schema evolution

Explanation

Question 94 of 149

On-disk storage device: NoSQL types

Select one or more of the following possible answers:

  • Key-value

  • RDBMS

  • Column-family

  • Document

  • Graph

Explanation

Question 95 of 149

NoSQL: Key-value

Select one or more of the following possible answers:

  • Acts like a hash table

  • The table is a list of values; each value is identified by a key

  • indexes that speed up searches are generally supported

  • The value is opaque to the database and is essentially stored as a BLOB

  • The value stored can be any aggregate, ranging from sensor data to videos

  • Value look-up can only be performed via the keys, as the database is oblivious to the details of the stored aggregate

Explanation

Question 96 of 149

NoSQL: Key-value

Select one or more of the following possible answers:

  • Partial updates are not possible. An update is either a delete or an insert operation

  • A select operation can retrieve a part of the aggregate value

  • _______ storage devices generally do not maintain any indexes, therefore writes are quite fast

  • ______ Based on a simple storage model, these storage devices are highly scalable

  • The key is usually appended with the type of the value being saved, for easy retrieval. For example, 123_sensor1

Explanation

Question 97 of 149

NoSQL: Key-value

Select one or more of the following possible answers:

  • some implementations support compressing values to reduce the storage footprint. However, this introduces latency at read time, as the data needs to be decompressed first

  • most key-value storage devices provide collections or buckets (like tables) into which key-value pairs can be organized

  • can be encoded using either a text-based encoding scheme, such as XML or JSON, or a binary scheme, such as BSON

Explanation

Question 98 of 149

NoSQL: Key-value is appropriate when

Select one or more of the following possible answers:

  • unstructured data storage is required

  • high-performance reads/writes are required

  • the value is fully identifiable via the key alone

  • searches need to be performed on different fields of the document

  • the value is a standalone entity that is not dependent on other values

  • values generally have a comparatively simple structure or are binary

  • query patterns are simple, involving insert, select and delete operations only

  • stored values are manipulated at the application layer

Explanation

Question 99 of 149

NoSQL: Key-value is inappropriate when

Select one or more of the following possible answers:

  • applications require searching or filtering data using attributes of the stored value

  • the value is fully identifiable via the key alone

  • relationships exist between different key-value entries

  • a group of key-value entries needs to be updated in a single transaction

  • multiple keys require manipulation in a single operation

  • schema consistency across different values is required

  • an update to individual attributes of the value is required

  • examples: Riak, Redis and Amazon DynamoDB

Explanation
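In code, a key-value store behaves like a hash table whose values are opaque blobs. The hypothetical sketch below (not the API of Riak, Redis or DynamoDB) shows why look-ups work only through the key, and why keys often carry a type hint such as 123_sensor1.

```python
# Hypothetical sketch of a key-value store: the value is an opaque blob
# (here, encoded JSON), so the store can only look records up by key.
import json

bucket = {}                                   # a "bucket" of key-value pairs

def put(key, value):
    bucket[key] = json.dumps(value).encode()  # value stored as an opaque blob

def get(key):
    blob = bucket.get(key)
    return json.loads(blob) if blob else None

put("123_sensor1", {"temp": 21.5, "ts": "2017-03-27T10:00:00"})
print(get("123_sensor1"))   # the full value, retrieved via the key alone

# A query such as "all readings where temp > 20" is not possible here:
# filtering by value attributes must happen at the application layer.
```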

Question 100 of 149

NoSQL: Document

Select one or more of the following possible answers:

  • The table is a list of values where each value is identified by a key

  • Also stores data as key-value pairs. However, unlike key-value storage devices, the stored value is a document that can have a complex nested structure, such as an invoice

  • Can be encoded using either a text-based encoding scheme, such as XML or JSON, or using a binary encoding scheme, such as BSON (binary JSON)

  • Like key-value storage devices, most document storage devices provide collections or buckets (like tables) into which key-value pairs can be organized

Explanation

Question 101 of 149

NoSQL: Document, differences compared with key-value

Select one or more of the following possible answers:

  • document storage devices are value-aware

  • The stored value is self-describing: the schema can be inferred from the structure of the value

  • A select operation can reference a field inside the aggregate value

  • A select operation can retrieve a part of the aggregate value

  • partial updates are supported; a subset of the aggregate can be updated

  • indexes that speed up searches are generally supported

Explanation

Question 102 of 149

NoSQL: Document

Select one or more of the following possible answers:

  • Each document can have a different schema

  • The value is opaque to the database and is essentially stored as a BLOB

  • It is possible to store different types of documents, or documents of the same type that have fewer or more fields with respect to each other

  • Additional fields can be added to a document after the initial insert, thereby manifesting flexible schema support

  • It should be noted that document storage devices are not limited to storing data that occurs in the form of actual documents, such as an XML file, but can also be used to store any aggregate that consists of a collection of fields having a flat or a nested schema

Explanation

Question 103 of 149

NoSQL: Document is appropriate when:

Select one or more of the following possible answers:

  • storing semi-structured document-oriented data comprising flat or nested schemas

  • schema evolution is a requirement, as the structure of the document is either unknown or is likely to change

  • applications require a partial update of the aggregate stored as a document

  • searches need to be performed on different fields of the document

  • unstructured data storage is required

  • storing domain objects, such as customers, in object form

  • query patterns involve insert, select, update, and delete operations

Explanation

Question 104 of 149

NoSQL: Document is inappropriate when

Select one or more of the following possible answers:

  • Multiple documents need to be updated as part of a single transaction

  • storing domain objects, such as customers, in object form

  • Performing operations that need joins between multiple documents, or storing data that is normalized

  • schema enforcement for achieving consistent query design is required, as the document structure may change between successive query runs, which will require restructuring the query

  • the stored value is not self-describing

  • binary data needs to be stored

Explanation

Question 105 of 149

NoSQL: Document examples

Select one of the following possible answers:

  • Riak, Redis and Amazon DynamoDB

  • MongoDB, CouchDB and Terrastore

Explanation
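The difference from key-value storage shows clearly in code: because a document store is value-aware, it can filter on fields inside the stored aggregate and apply partial updates. A minimal hypothetical sketch (not the API of MongoDB or any listed product):

```python
# Hypothetical sketch of a document store: values are self-describing
# documents, so the store can search on inner fields and update them in place.

collection = {}   # a "collection" of documents keyed by document id

def insert(doc_id, document):
    collection[doc_id] = document

def find(field, value):
    # Value-aware search: inspect a field inside each stored document.
    return [d for d in collection.values() if d.get(field) == value]

def update_field(doc_id, field, value):
    collection[doc_id][field] = value         # partial update of the aggregate

insert("inv-1", {"customer": "acme", "total": 120, "lines": [{"sku": "a", "qty": 2}]})
insert("inv-2", {"customer": "acme", "total": 80})   # flexible schema: no "lines"

print(find("customer", "acme"))               # both invoices match
update_field("inv-2", "total", 95)            # only one field is rewritten
```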

Question 106 of 149

NoSQL: Column-family

Select one or more of the following possible answers:

  • Column-family storage devices store data much like a traditional RDBMS, but group related columns together in a row, resulting in column-families

  • The table is a list of values where each value is identified by a key

  • Each column can be a collection of related columns itself, referred to as a super-column

  • Each super-column can contain an arbitrary number of related columns that are generally retrieved or updated as a single unit

  • Each row consists of multiple column-families

  • Each row can have a different set of columns, thereby manifesting flexible schema support

  • Each row is identified by a row key

Explanation

Question 107 of 149

NoSQL: Column-family

Select one or more of the following possible answers:

  • provides fast data access with random read/write capability

  • Stores different column-families in separate physical files, which greatly helps in speeding up queries, as only the required column-families are searched

  • Partial updates are not possible. An update is either a delete or an insert operation

  • provides support for selectively compressing ______

  • Leaving searchable ______ uncompressed can make queries faster, because the target column does not need to be decompressed for lookup

  • Most implementations support data versioning, after which the configured columns are automatically removed

Explanation

Question 108 of 149

NoSQL: Column-family is appropriate when

Select one or more of the following possible answers:

  • Realtime random read/write capability is needed and the data being stored has some defined structure

  • Data represents a tabular structure, each row consists of a large number of columns, and nested groups of interrelated data exist

  • Support for schema evolution is required, as column-families can be added or removed without any system downtime

  • storing domain objects, such as customers, in object form

  • Certain fields are mostly accessed together, and searches need to be performed using field values

  • efficient use of storage is required when the data consists of sparsely populated rows (no column, no space)

  • query patterns involve insert, select, update, and delete operations

Explanation

Question 109 of 149

NoSQL: Column-family is inappropriate when

Select one or more of the following possible answers:

  • Multiple documents need to be updated as part of a single transaction

  • relational data access is required, for example joins

  • ACID transactional support is required

  • binary data needs to be stored

  • SQL-compliant queries need to be executed

  • query patterns are likely to change frequently, initiating a corresponding restructuring of how column-families are arranged, for example, during proof-of-concept development

Explanation

Question 110 of 149

Cassandra, HBase and Amazon SimpleDB

Select one or more of the following possible answers:

  • MongoDB, CouchDB and Terrastore

  • NoSQL: Column-family

  • Riak, Redis and Amazon DynamoDB

Explanation
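A column-family layout can be sketched as rows keyed by a row key, where each row maps column-family names to groups of related columns; sparsely populated rows simply omit the columns they do not have (no column, no space). A hypothetical illustration, not the data model of Cassandra or HBase specifically:

```python
# Hypothetical sketch of a column-family layout: each row key maps to
# column-families, each grouping related columns; missing columns simply
# do not exist, so sparse rows take no extra space.

table = {
    "cust-1": {
        "profile": {"name": "Alice", "age": 34},        # one column-family
        "contact": {"email": "alice@example.com"},      # another
    },
    "cust-2": {
        "profile": {"name": "Bob"},   # different column set: flexible schema
    },
}

def read_family(row_key, family):
    # Only the requested column-family is touched, mirroring how these
    # stores keep families in separate physical files to speed up queries.
    return table.get(row_key, {}).get(family, {})

print(read_family("cust-1", "contact"))   # {'email': 'alice@example.com'}
print(read_family("cust-2", "contact"))   # {} -> no such columns for this row
```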

Pregunta 111 de 149

1

NoSQL: Graph

Selecciona una o más de las siguientes respuestas posibles:

  • Persist inter-connected entities

  • The value is opaque to the database and is essentially stored as a BLOB

  • Unlike other NoSQL storage devices, where the emphasis is on the structure of the entities, ____________storage devices place emphasis on storing the linkages between entities

  • Entities are stored as nodes, also called vertices, while the linkages between entities are stored as edges. In RDBMS parlance, each node can be thought of a single row while the edge denotes a join

  • Nodes can have more than one type of link between them through multiple edges.

  • Each node can have attribute data as key-value pairs, such as a customer node with ID, name and age attributes

Explicación

Pregunta 112 de 149

1

NoSQL: Graph

Selecciona una o más de las siguientes respuestas posibles:

  • Each edge can have its own attribute data as key-value pairs, wich can be used to further filter query results

  • Multiple edges asre similar to defining interconnected nodes based on node attributes and/or edge attributes, commonly referred to as node traversal

  • Queries generally involve finding interconnected nodes based on node attributes and/or edge attributes, commonly referred to as node traversal

  • Can be a collection of related columns itself referred to as a super-column

  • Edges can be unidirectional or bidirectional, setting the node traversal direction

  • Generally, graph storage devices provide consistency via ACID compliance

Explicación

Pregunta 113 de 149

1

NoSQL:Graph

Selecciona una o más de las siguientes respuestas posibles:

  • The value stored can be any aggregate, ranging from sensor data to videos

  • The degree of usefulness of a graph storage device depends on the number and types of edges dened betwwen the nodes. The higher the number and more diverse the edges are the more diverse types of queries it can handle

  • As a result, it is important to comprehensively capture the types of relations tha exist betqeen the nodes

  • Generally allow adding new types of nodes without making changes to the database, it also enables defining additional links between nodes as new types of relationships or nodes apperar.

Explicación

Question 114 of 149

1

NoSQL: Graph is appropriate when

Select one or more of the following possible answers:

  • interconnected entities need to be stored

  • querying entities based on the type of relationship with each other rather than the attributes of the entities

  • finding groups of interconnected entities

  • finding distances between entities in terms of the node traversal distance

  • mining data with a view toward finding patterns

  • storing semi-structured document-oriented data comprising flat or nested schema

Explanation

Question 115 of 149

1

NoSQL: Graph is inappropriate when

Select one or more of the following possible answers:

  • updates are required to a large number of node attributes or edge attributes, as this involves searching for nodes or edges, which is a costly operation compared to performing node traversals

  • binary storage is required, or queries based on selection of node/edge attributes dominate node traversal queries

  • entities have a large number of attributes or nested data; it is best to store lightweight entities in a graph storage device while storing the rest of the attribute data in a separate non-graph NoSQL storage device

  • ACID transactional support is required

Explanation

Question 116 of 149

1

NoSQL: Graph examples

Select one of the following possible answers:

  • Neo4j, InfiniteGraph and OrientDB

  • Riak, Redis and Amazon DynamoDB

  • Cassandra, HBase and Amazon SimpleDB

Explanation

Question 117 of 149

1

NewSQL

Select one or more of the following possible answers:

  • NoSQL storage devices are highly scalable, available, fault-tolerant, and very fast for read/write operations; however, they do not provide the same transaction and consistency support as exhibited by ACID-compliant RDBMSs

  • following the BASE model, NoSQL storage devices provide eventual consistency rather than immediate consistency, and could therefore be in an inconsistent state while reaching the state of consistency; as a result, they cannot be used for implementing large-scale transactional systems

  • ____________ storage devices combine the ACID properties of RDBMSs with the scalability and fault tolerance offered by NoSQL storage devices

  • each column can be a collection of related columns itself, referred to as a super-column

  • ____________ databases generally support SQL-compliant syntax for data definition and data manipulation operations, and they often use a relational data model for data storage (see the transaction sketch after this question)

  • ____________ can be used for developing OLTP systems with very high volumes of transactions, for example a bank; they can also be used for realtime analytics, for example operational analytics, as some implementations are memory based

  • as compared to a NoSQL storage device, a ____________ storage device provides an easier transition from a traditional RDBMS to a highly scalable database due to its support for SQL

Explanation
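A minimal sketch of the ACID-over-SQL behavior attributed to NewSQL above, using Python's built-in sqlite3 module purely as an illustrative stand-in: a real NewSQL store (for example VoltDB or NuoDB) adds horizontal scalability on top of the same transactional semantics shown here. The table and values are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('a', 100.0), ('b', 50.0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'b'")
        # If anything above raised, neither UPDATE would persist (atomicity).
except sqlite3.Error:
    pass  # the transaction was rolled back as a single unit

print(dict(conn.execute("SELECT id, balance FROM accounts")))

The point of the sketch is the transaction block: both updates commit together or not at all, which is the guarantee the BASE-style NoSQL stores described above give up.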

Question 118 of 149

1

NewSQL examples

Select one of the following possible answers:

  • Neo4j, InfiniteGraph and OrientDB

  • Riak, Redis and Amazon DynamoDB

  • VoltDB, FoundationDB, NuoDB and InnoDB

Explanation

Question 119 of 149

1

Distributed/Parallel Data Processing

Select one or more of the following possible answers:

  • Big Data datasets are generally stored utilizing distributed technologies (a distributed file system or a NoSQL database)

  • the very nature of a distributed storage device requires a processing engine that can process data without needing to transfer the data from storage to a computing resource, as with distributed data processing

  • schemas may change over time in an effort to accommodate changing business requirements, or simply because of an application software upgrade

  • in support of maximizing the value characteristic of Big Data, it is imperative to employ a processing model based on the divide-and-conquer principle, as with parallel data processing (see the sketch after this question)

Explanation
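A minimal sketch of the divide-and-conquer principle referenced above: a dataset is carved into sub-datasets that are processed in parallel, and the partial results are then combined. Python's multiprocessing module stands in for the cluster; in a real Big Data platform the workers would be cluster nodes that already hold their own portion of the data.

from multiprocessing import Pool

def process_split(split):
    # per-split work; here, a trivial aggregation over one sub-dataset
    return sum(split)

if __name__ == "__main__":
    dataset = list(range(1_000_000))
    n_workers = 4
    # divide: carve the dataset into one split per worker
    size = len(dataset) // n_workers
    splits = [dataset[i * size:(i + 1) * size] for i in range(n_workers)]
    # conquer: process splits in parallel, then combine partial results
    with Pool(n_workers) as pool:
        partials = pool.map(process_split, splits)
    print(sum(partials))  # equals sum(dataset)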

Question 120 of 149

1

Schema-less Data Processing

Select one or more of the following possible answers:

  • often, it may be feasible to process data offline in batches, such as with overnight report generation

  • Big Data datasets come in multiple formats (variety characteristic) and may not conform to any schema, especially unstructured data

  • ____________ may change over time in an effort to accommodate changing business requirements, or simply because of an application software upgrade

  • the lack of adherence to any particular data model requires flexible processing of Big Data datasets so that they can be processed in raw form without the need to be stored in a particular data model

Explanation

Question 121 of 149

1

Multi-workload support

Select one or more of the following possible answers:

  • Big Data datasets may arrive thick (volume characteristic) and/or fast (velocity characteristic)

  • often, it may be feasible to process data offline in batches, such as with overnight report generation

  • in other cases, the results may be required in realtime, such as with GPS signal processing workloads (transactional and batch)

  • in order to achieve maximum value from Big Data datasets, the underlying processing platform may need to support both transactional and batch workloads

  • a single processing engine may fulfill this requirement, or multiple processing engines may need to be used

  • similarly, more data sources with unknown schemas may need to be added in the future

Explanation

Question 122 of 149

1

Multi-workload support

Select one or more of the following possible answers:

  • although having the ability to process data both in realtime (transactional) and batch mode is ideal for extracting maximum value out of Big Data datasets, support for both modes may not be required or may not be feasible

  • the processing of Big Data datasets requires a highly scalable processing engine that can be linearly scaled

  • generally, assembling a batch-processing Big Data solution environment is simpler and cheaper when compared to a realtime-processing Big Data solution

  • consequently, adding support for multi-workload processing should be business driven, involving careful cost-benefit analysis

Explanation

Question 123 of 149

1

Linear Scalability

Select one or more of the following possible answers:

  • owing to the volume and velocity characteristics of Big Data, the processing demand can grow quite sharply with increasing volumes of data arriving at a fast pace

  • supporting a distributed processing environment with parallel processing capabilities requires a processing engine that can provide a steady throughput as the data volumes grow

  • the processing of Big Data datasets requires a highly scalable processing engine that can be linearly scaled

  • Big Data datasets may arrive thick (volume characteristic) and/or fast (velocity characteristic)

  • in the context of processing, linear scalability means that one receives a proportional increase in performance with the addition of more processing nodes (see the toy model after this question)

  • realtime business intelligence and analytics can leverage such a linearly scalable processing environment to deliver quicker responses involving complex operations on an entire dataset

  • _____________ is generally achieved by employing a scaling-out strategy, as it provides a simple, non-disruptive, and cost-effective method for increasing processing capacity

Explanation
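A toy model of the linear-scalability claim above: under ideal scale-out, throughput grows in proportion to the number of processing nodes, so the time for a fixed workload shrinks as workload / nodes. The record counts and per-node rates below are illustrative assumptions, not measurements.

def ideal_processing_time(records, records_per_sec_per_node, nodes):
    """Time for a fixed workload when throughput scales linearly with nodes."""
    return records / (records_per_sec_per_node * nodes)

for nodes in (1, 2, 4, 8):
    t = ideal_processing_time(records=80_000_000,
                              records_per_sec_per_node=50_000,
                              nodes=nodes)
    print(f"{nodes} node(s): {t:.0f} s")  # halves each time nodes double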

Question 124 of 149

1

Redundancy & Fault Tolerance

Select one or more of the following possible answers:

  • a highly distributed data processing environment with parallel data processing capabilities generally involves a complicated architecture

  • with a horizontally scaled processing architecture, which by design involves a large number of nodes and networking components, the chances of partial system failure increase

  • similarly, more data sources with unknown schemas may need to be added in the future

  • as a system failure in the middle of a long-running distributed task would be detrimental to achieving the analytic goals, a processing engine needs to provide fault tolerance so that a partial failure does not render the entire system unavailable and data processing does not need to be started from scratch

  • generally, fault tolerance is provided through redundancy

  • with redundant processing resources, the system can still be available in the event of a partial system failure

  • horizontal scaling proves quite useful in this case, as redundancy can generally be increased by simply adding more processing resources

Explanation

Question 125 of 149

1

Low Cost

Select one or more of the following possible answers:

  • at the start of a Big Data initiative, the cost of a highly scalable distributed data processing environment involving a few processing nodes and networking equipment may not be that high

  • over time, as the data volume grows and the types and frequency of analytics being run increase, the requirement for an increased number of processing resources can translate into soaring IT costs involving both software and hardware

  • with redundant processing resources, the system can still be available in the event of a partial system failure

  • this could prove counterproductive to the very reason the Big Data initiative was undertaken (often to help the business deliver increased value, drive down costs, find new sources of revenue, or establish new service offerings)

  • using open source software deployed over commodity hardware helps to keep costs down

  • another aspect of keeping costs down is the ability of the processing engine to take advantage of the cloud

  • the on-demand and elastic nature of the cloud helps avoid any up-front capital investment, coupled with faster setup of the data processing solution environment

Explanation

Question 126 of 149

1

Big Data processing

Select one or more of the following possible answers:

  • ____________ large amounts of data is not a new phenomenon, and different large-scale data processing architectures exist

  • ______________________ requires a distributed environment that is capable of processing data in parallel, a characteristic supported by the cluster architecture

  • are highly scalable, supporting horizontal scaling with linear performance gains

Explanation

Question 127 of 149

1

Big Data Processing: Cluster

Select one or more of the following possible answers:

  • is a group of nodes connected together via a network to process tasks in parallel

  • __________ is a centrally managed network of nodes (computers) where each node is responsible for a sub-task of a large problem

  • enables distributed data processing

  • _____ comprises low-cost commodity nodes that collectively provide increased processing capacity with inherent redundancy and fault tolerance, as it consists of physically separate nodes

  • the majority of Big Data processing occurs

  • are highly scalable, supporting horizontal scaling with linear performance gains

  • provide an ideal deployment environment for a processing engine, as large datasets can be divided into smaller datasets and then processed in parallel in a distributed manner

  • can be utilized both by a realtime processing engine and a batch processing engine, such as Spark and MapReduce respectively

Explanation

Question 128 of 149

1

Big Data Processing: Batch Mode

Select one or more of the following possible answers:

  • data is processed offline in batches, where the response time could vary from minutes to hours

  • data is first persisted to disk and only then processed

  • strategic BI, predictive/prescriptive analytics and ETL operations are generally _____________________

  • data is processed in-memory as it is captured, before being persisted to disk

  • ______________ involves processing a range of large datasets, either on their own or joined together, essentially addressing the volume and variety characteristics of Big Data datasets

  • the majority of Big Data processing occurs in ______________________________

  • ______________ is relatively simple, easy to set up, and low cost in comparison to realtime mode

Explanation

Question 129 of 149

1

Big Data Processing: Realtime Mode

Select one or more of the following possible answers:

  • in _____________________, data is processed in-memory as it is captured, before being persisted to disk

  • response time generally ranges from a few seconds to under a minute

  • strategic BI, predictive/prescriptive analytics and ETL operations are generally batch-oriented

  • ______________ addresses the velocity characteristic of Big Data datasets

  • _____________________ is also called event or stream processing, as the data either arrives continuously (stream) or at intervals (event)

  • the individual event/stream datum is generally small in size, but its continuous nature results in very large datasets

  • another related term, interactive mode, falls within the category of realtime; interactive mode generally

  • operational BI/analytics are generally

Explanation

Question 130 of 149

1

MapReduce processing engine

Select one or more of the following possible answers:

  • is a widely used implementation of the batch processing engine mechanism

  • it is a highly scalable and reliable processing engine based on the principle of divide-and-conquer

  • it provides built-in fault tolerance and redundancy

  • it divides a bigger problem into a set of smaller problems that are easier and quicker to solve

  • it has its roots both in distributed computing as well as in parallel computing

  • operational BI/analytics are generally conducted in realtime mode

  • _____________ is a batch-oriented processing engine used to process large datasets using parallel processing deployed over clusters of commodity hardware

Explanation

Question 131 of 149

1

MapReduce

Select one or more of the following possible answers:

  • with redundant processing resources, the system can still be available in the event of a partial system failure

  • _____________ does not require the input data to conform to any particular data model; therefore, it can be used to process schema-less datasets

  • a dataset is broken down into multiple smaller parts, and operations are performed on each part independently and in parallel

  • the results from all operations are then summarized to arrive at the answer

  • the ____________ processing engine generally supports batch workloads only

Explanation

Question 132 of 149

1

MapReduce

Select one or more of the following possible answers:

  • in _______, the data processing algorithm is instead moved to the nodes that store the data

  • the data processing algorithm executes in parallel on these nodes, thereby eliminating the need to first move the data

  • this not only saves network bandwidth but also results in a large reduction in processing time for large datasets, as processing smaller chunks of data in parallel is much faster (a minimal end-to-end sketch follows this question)

  • reads across multiple nodes may not be consistent immediately after a write; however, all nodes will eventually be in a consistent state

Explanation
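The last few questions describe MapReduce end to end: split the input, run user-defined map logic, shuffle and sort by key, and reduce. Below is a minimal single-process sketch of that data flow, using word count as the user-defined logic; a real engine such as Hadoop MapReduce runs the mappers and reducers in parallel on the nodes that already hold the data, which this sketch only mirrors.

from collections import defaultdict

def mapper(key, value):
    # key: ordinal position of the record; value: the record (a line of text)
    for word in value.split():
        yield word, 1  # emit one key-value pair per word

def reducer(key, values):
    yield key, sum(values)  # summarize all values for one key

def map_reduce(records):
    # map: run the mapper once per input key-value pair
    intermediate = []
    for key, value in enumerate(records):
        intermediate.extend(mapper(key, value))
    # shuffle & sort: group all values by key, with keys in sorted order
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # reduce: one reducer call per key group
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

print(map_reduce(["sky is blue", "grass is green", "the sky is high"]))
# [('blue', 1), ('grass', 1), ('green', 1), ('high', 1), ('is', 3), ...]

The stage-by-stage questions that follow (map, combine, partition, shuffle & sort, reduce) each refine one step of this pipeline.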

Question 133 of 149

1

MapReduce concepts

Select one or more of the following possible answers:

  • low cost

  • map: map task

  • combine (optional): map task

  • partition: map task

  • shuffle & sort: reduce task

  • reduce: reduce task

  • linear scalability

Explanation

Question 134 of 149

1

MapReduce: Map

Select one or more of the following possible answers:

  • the first stage of MapReduce, during which the dataset file is divided into multiple smaller splits

  • each split is parsed into its constituent records as key-value pairs

  • the processing of Big Data datasets requires a highly scalable processing engine that can be linearly scaled

  • the key is usually the ordinal position of the record and the value is the actual record, for example (234, sky is blue)

  • the parsed key-value pairs for each split are then sent to a map function (mapper), with one mapper function per split; the map function executes user-defined logic

Explanation

Question 135 of 149

1

MapReduce: Map

Select one or more of the following possible answers:

  • each split generally contains multiple key-value pairs, and the mapper is run once for each key-value pair in the split

  • the mapper processes each key-value pair as per the user-defined logic and further generates a key-value pair as its output

  • the output key can either be the same as the input key, a substring value from the input value, or another serializable user-defined object

  • similarly, the output value can either be the same as the input value, a substring value from the input value, or another serializable user-defined object (see the mapper sketch after this question)

  • generally, the output of the map function is handled directly by the reduce function; however, map tasks and reduce tasks are mostly run over different nodes

Explanation
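A sketch of the map stage described above: each record in a split arrives as a (key, value) pair whose key is the record's ordinal position, and the mapper's user-defined logic emits new key-value pairs. The CSV record layout here is an assumption invented for illustration.

def mapper(key, value):
    """Input:  (234, "2016-07-01,US,refund") -- ordinal key, CSV record.
    Output: zero or more derived key-value pairs."""
    date, country, event = value.split(",")
    # output key is a substring of the input value; output value is derived
    yield country, (date, event)

split = [(0, "2016-07-01,US,refund"), (1, "2016-07-02,DE,purchase")]
for k, v in split:                      # the mapper runs once per record
    for out_key, out_value in mapper(k, v):
        print(out_key, out_value)       # e.g. US ('2016-07-01', 'refund')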

Question 136 of 149

1

MapReduce: Combine

Select one or more of the following possible answers:

  • generally, the output of the map function is handled directly by the reduce function; however, map tasks and reduce tasks are mostly run over different nodes

  • this requires moving data between mappers and reducers, which can consume a lot of valuable bandwidth and directly contributes to processing latency

  • with larger datasets, the time taken to move the data between the map and reduce stages can exceed the actual processing undertaken by the map and reduce tasks

  • for this reason, the MapReduce engine provides an optional ______________________ function that summarizes a mapper's output before it gets processed by the reducer

  • the first stage of MapReduce is known as _________, during which the dataset file is divided into multiple smaller splits

Explanation

Question 137 of 149

1

MapReduce: Combine

Select one or more of the following possible answers:

  • processes each key-value pair as per the user-defined logic and further generates a key-value pair as its output

  • a _________________ is essentially a reducer function that groups a mapper's output locally, on the same node as the mapper

  • a reducer function can be used as a combiner function, or a custom user-defined function can be used

  • the MapReduce engine combines all values for a given key from the mapper output, creating multiple key-value pairs as input to the combiner, where the key is not repeated and the value exists as a list of all corresponding values for that key

  • the ___________ stage is only an optimization stage; therefore, it may not even be called by the MapReduce engine (see the sketch after this question)

Explanation
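A sketch of the optional combine stage described above: a reducer-like function runs on the mapper's own node and locally summarizes the mapper output before it crosses the network, cutting bandwidth. Word count's reducer can double as its combiner because addition is associative; the sample pairs are invented for illustration.

from collections import defaultdict

def combiner(key, values):
    yield key, sum(values)  # local partial sum, same logic as the reducer

# mapper output on ONE node, before any network transfer:
mapper_output = [("sky", 1), ("is", 1), ("blue", 1), ("is", 1), ("is", 1)]

# the engine groups values per key before handing them to the combiner
grouped = defaultdict(list)
for key, value in mapper_output:
    grouped[key].append(value)

combined = [pair for key in grouped for pair in combiner(key, grouped[key])]
print(combined)  # [('sky', 1), ('is', 3), ('blue', 1)] -- 5 pairs shrink to 3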

Question 138 of 149

1

MapReduce: Partition

Select one or more of the following possible answers:

  • during this stage, if more than one reducer is involved, a partitioner divides the output from the mapper or combiner (if specified and called by the MapReduce engine) into partitions between reducer instances

  • the number of partitions equals the number of reducers

  • is only an optimization stage; therefore, it may not even be called by the MapReduce engine

  • although each partition contains multiple key-value pairs, all records for a particular key are within the same partition

  • the MapReduce engine guarantees a random and fair distribution between reducers while making sure that all of the same keys across multiple mappers end up with the same reducer instance (see the hash-partition sketch after this question)

  • depending on the nature of the job, certain reducers can sometimes receive a large number of key-value pairs compared to others; as a result of this uneven workload, some reducers will finish earlier than others

  • overall, this is less efficient and leads to longer job execution times than if the work were evenly split across reducers

  • this can be rectified by customizing the partitioning logic in order to guarantee a fair distribution of key-value pairs

  • this is the last stage of the map task

Explanation
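A sketch of the hash-style partitioning described above: a hash of the key modulo the number of reducers decides which reducer instance receives each key-value pair, so all records for a given key land in one partition. Python's built-in hash() stands in for the engine's hash function (which reducer a key lands on may differ between runs, but within a run the same key always maps to the same partition).

def partition(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 3
combined_output = [("sky", 1), ("is", 3), ("blue", 1), ("grass", 1)]

partitions = {i: [] for i in range(num_reducers)}
for key, value in combined_output:
    partitions[partition(key, num_reducers)].append((key, value))

for i, records in partitions.items():
    print(f"reducer {i}: {records}")  # a given key never spans two partitions

A skewed key distribution (many pairs hashing to one reducer) is exactly the uneven-workload problem the question mentions, which a custom partition function can rebalance.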

Question 139 of 149

1

MapReduce: Shuffle & Sort

Select one or more of the following possible answers:

  • during the first stage of the reduce task, output from all partitioners is copied across the network to the nodes running the reduce task; this is known as _____________

  • is the final stage of the reduce task

  • the list-based key-value output from each partitioner can contain the same key multiple times

  • next, the MapReduce engine automatically groups and sorts the key-value pairs according to the keys, so that the output contains a sorted list of all input keys (and their values) with the same keys appearing together

  • to view the full output from the MapReduce job, all the file parts must be combined

  • the way keys are grouped and sorted can be customized

  • the MapReduce engine then merges each group of keys together before the shuffle and sort output is processed by the reducer function (see the sketch after this question)

  • this merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values

Explanation
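A sketch of the shuffle & sort stage described above: key-value pairs copied from all partitioners are sorted by key, grouped so repeated keys sit together, and each group is merged into a single (key, list-of-values) pair for the reducer. itertools.groupby does the grouping once the pairs are sorted; the sample pairs are invented for illustration.

from itertools import groupby
from operator import itemgetter

# pairs copied across the network from several partitioners
copied = [("is", 3), ("sky", 1), ("is", 2), ("blue", 1), ("sky", 1)]

# sort by key so identical keys are adjacent, then group and merge
copied.sort(key=itemgetter(0))
merged = [(key, [v for _, v in group])
          for key, group in groupby(copied, key=itemgetter(0))]
print(merged)  # [('blue', [1]), ('is', [3, 2]), ('sky', [1, 1])]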

Question 140 of 149

1

MapReduce: Reduce

Select one or more of the following possible answers:

  • the way keys are grouped and sorted can be customized

  • is the final stage of the reduce task

  • depending on the user-defined logic specified in the reduce function (reducer), the reducer will either further summarize its input or will emit the output without making any changes

  • thus, for each key-value pair that a reducer receives, the list of values stored in the value part of the pair is processed and another key-value pair is written out

  • the output key can either be the same as the input key, a substring value from the input value, or another serializable user-defined object

Explanation

Question 141 of 149

1

MapReduce: Reduce

Select one or more of the following possible answers:

  • the output value can either be the same as the input value, a substring value from the input value, or another serializable user-defined object

  • __________________ just like the mapper, for an input key-value pair, a reducer may not produce any output key-value pair (filtering) or can generate multiple key-value pairs (demultiplexing); see the sketch after this question

  • is essentially a reducer function that groups a mapper's output locally, on the same node as the mapper

  • the output of the reducer, that is the key-value pairs, is then written out as a separate file, one file per reducer

  • to view the full output from the MapReduce job, all the file parts must be combined

  • the number of reducers can be customized

Explanation
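A sketch of the reduce stage described above: each reducer call receives one key with the merged list of its values and may summarize (one output pair) or filter (no output). The rare-word filter and sample input are assumptions invented for illustration.

def reducer(key, values):
    total = sum(values)
    if total < 2:
        return          # filtering: emit nothing for rare words
    yield key, total    # summarizing: one output pair for this key

# shuffle & sort output: one (key, list-of-values) pair per group
shuffled = [("blue", [1]), ("is", [3, 2]), ("sky", [1, 1])]

output = [pair for key, values in shuffled for pair in reducer(key, values)]
print(output)  # [('is', 5), ('sky', 2)] -- written out as this reducer's file part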

Question 142 of 149

1

MapReduce: Reduce. It is also possible to have a MapReduce job without a reducer, for example when performing filtering

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 143 of 149

1

Some NoSQL storage devices provide MapReduce support for batch processing out of the box

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 144 of 149

1

By making use of a clustered deployment of NoSQL storage devices, the MapReduce processing engine distributes the query processing across multiple nodes.

This is generally achieved via the provision of map and reduce constructs that the user then needs to implement via the respective APIs of the NoSQL storage devices

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 145 of 149

1

MapReduce works on the principle of divide-and-conquer. However, it is important to understand the semantics of this principle in the context of MapReduce. The divide-and-conquer principle can generally be achieved using one of the following approaches: task parallelism and data parallelism

Select one of the following:

  • TRUE
  • FALSE

Explanation

Question 146 of 149

1

(MapReduce) Task parallelism

Select one or more of the following possible answers:

  • different sub-datasets are spread across multiple nodes and are processed using the same algorithm

  • refers to the parallelization of data processing by dividing a task into sub-tasks and running each sub-task on a separate processor, generally on a separate node in a cluster

  • each sub-task generally executes a different algorithm, with its own copy of the same data or different data as its input, in parallel

  • generally, the output from multiple sub-tasks is joined together to obtain the final set of results

Explanation

Question 147 of 149

1

(MapReduce) Data parallelism

Select one or more of the following possible answers:

  • refers to the parallelization of data processing by dividing a dataset into multiple sub-datasets and processing each sub-dataset in parallel

  • each sub-task generally executes a different algorithm, with its own copy of the same data or different data as its input, in parallel

  • different sub-datasets are spread across multiple nodes and are processed using the same algorithm (both approaches are contrasted in the sketch after this question)

  • generally, the output from each processed sub-dataset is joined together to obtain the final set of results

Explanation
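A sketch contrasting the two approaches above: task parallelism runs different algorithms (sub-tasks) at the same time, each over its own copy of the data, while data parallelism runs the same algorithm over different sub-datasets, which is the approach MapReduce takes. The statistics computed here are trivial placeholders.

from multiprocessing import Pool

def mean(xs):   return sum(xs) / len(xs)   # sub-task 1 (its own algorithm)
def spread(xs): return max(xs) - min(xs)   # sub-task 2 (a different algorithm)

def total(xs):  return sum(xs)             # one algorithm, applied to every split

if __name__ == "__main__":
    data = list(range(100))

    # task parallelism: different algorithms, same data, in parallel
    with Pool(2) as pool:
        r_mean = pool.apply_async(mean, (data,))
        r_spread = pool.apply_async(spread, (data,))
        print(r_mean.get(), r_spread.get())

    # data parallelism: same algorithm, different sub-datasets, in parallel
    splits = [data[0:50], data[50:100]]
    with Pool(2) as pool:
        partials = pool.map(total, splits)
    print(sum(partials))  # partial results joined to obtain the final result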

Question 148 of 149

1

MapReduce Algorithm Design

Select one or more of the following possible answers:

  • within Big Data environments, the same task generally needs to be performed repeatedly on a data unit, such as a record, where the complete dataset is distributed across multiple locations due to its large size

  • MapReduce addresses this requirement by employing the data parallelism approach, where the data is divided into splits

  • different sub-datasets are spread across multiple nodes and are processed using the same algorithm

  • each split is then processed by its own instance of the map function, which contains the same processing logic as the other map functions

  • the majority of traditional algorithmic development follows a sequential approach, where operations on data are performed one after the other in such a way that each subsequent operation is dependent on its preceding operation; here, operations are instead divided among the map and reduce functions

  • map and reduce tasks are independent and, in turn, run isolated from each other

  • each instance of a map or reduce function runs independently of other instances

Explanation

Question 149 of 149

1

MapReduce Algorithm Design

Select one or more of the following possible answers:

  • the logic within the reduce function is dependent on the output of the map function, in particular on which keys were emitted from the map function, as the reduce function receives a unique key with a consolidated list of all of its values

  • relatively simplistic algorithmic logic, such that the required result can be obtained by applying the same logic to different portions of a dataset in parallel and then aggregating the results in some manner

  • requires a highly scalable processing engine that can be linearly scaled

  • availability of the dataset in a distributed manner, partitioned across a cluster so that multiple map functions can process different subsets of the dataset in parallel

  • understanding of the data structure within the dataset so that a meaningful data unit (a single record) can be chosen

  • dividing algorithmic logic into map and reduce functions so that the logic in the map function is not dependent on the complete dataset, as only data within a single split is available

  • emitting the correct key from the map function along with all the required data as the value, because the reduce function's logic can only process those values that were emitted as part of the key-value pairs from the map function

  • emitting the correct key from the reduce function along with the required data as the value, because the output from each reduce function becomes the final output of the MapReduce algorithm

Explanation