Module 7: Fundamental Big Data Engineering

Question

Data engineering

Answer 1

is the field of developing, testing, deploying and maintaining data processing solutions via collecting, parsing, transforming, joining, processing and managing data

Answer 2

make data available for data scientists to develop models and data products

Answer 3

two main activities that comprise data _____________ include storage and processing of data

Answer 4

is tasked with making data amenable to various types of data analyses, including model development (data mining and other business-process specific algorithms) and reporting

Answer 5

two main activities that comprise ____________ include storage and processing of data, which is typically structured.

Answer 6

within the realm of __________________ involves developing highly distributed, scalable, fault-tolerant data processing solutions to process large amounts of data in order to garner insights

Answer 7

comprises data processing in support of the Big Data analysis lifecycle.

Answer 8

make data available for data scientists to develop models and data products

Answer 9

They are required to have knowledge of various data storage and processing technology alternatives for acquiring, storing and processing data that is often semi-structured and unstructured in nature

Answer 10

processing large amounts of structured, unstructured, and semi-structured data arriving at a fast pace, including extraction of relevant data from semi-structured and unstructured datasets

Answer 11

Internet-scale datasets and associated batch and realtime data processing

Answer 12

collection and aggregation of data from multiple sources with disparate schemas or without any schema

Answer 13

Internet-scale datasets and associated batch and realtime data processing

Answer 14

collection and aggregation of data from multiple sources with disparate schemas or without any schema

Answer 15

processing large amounts of structured, unstructured, and semi-structured data arriving at a fast pace, including extraction of relevant data from semi-structured and unstructured datasets

Answer 16

processing large amounts of structured, unstructured, and semi-structured data arriving at a fast pace, including extraction of relevant data from semi-structured and unstructured datasets

Answer 17

Internet-scale datasets and associated batch and realtime data processing

Answer 18

collection and aggregation of data from multiple sources with disparate schemas or without any schema

Answer 19

comprises data processing in support of the Big Data analysis lifecycle

Answer 20

importing/exporting large amounts of data from/to traditional storage technologies, including OLTP (CRM, ERP, SCM systems) and OLAP systems (data warehouse)

Answer 21

field of developing, testing, deploying and maintaining data processing solutions via collecting, parsing, transforming, joining, processing and managing data

Answer 22

validating and cleansing data in realtime or batch mode and creating efficient data models

Answer 23

establishing an optimal data storage and processing environment based on the type of data and its processing requirements

Answer 24

developing efficient data processing algorithms that run over clusters of computers

Answer 25

developing Big Data pipelines and Big Data applications that may include meaningful data visualizations

Answer 26

storage devices

Answer 27

analytics engine

Answer 28

processing engine

Answer 29

resource manager

Answer 30

disk-based

Answer 31

memory-based

Answer 32

disk-based

Answer 33

batch processing

Answer 34

memory-based

Answer 35

batch processing mode

Answer 36

realtime processing

Answer 37

disk-based

Answer 38

memory-based

Answer 39

When data gets processed as a result of an ETL activity or output is generated as a result of an analytical operation

Answer 40

When datasets are acquired or when data gets generated inside the enterprise boundary

Answer 41

In realtime processing mode, data is first processed in memory and then stored to the disk

Answer 42

When data is manipulated for making it amenable to data analysis

Answer 43

CAP, theorem

Answer 44

replication, partition

Answer 45

ACID, unique

Answer 46

Sharding, shard

Answer 47

BASE, atom

Answer 48

master-slave

Answer 49

is stored on a separate node and each node is responsible for only the data stored on it

Answer 50

all _______ collectively represent the complete database

Answer 51

shares the same schema

Answer 52

stores multiple copies of a database

Answer 53

____________ distributes a processing load across multiple nodes to achieve horizontal scalability

Answer 54

___________ may or may not be transparent to the client

Answer 55

supports horizontal scaling as a method for increasing resource capacity. This is accomplished by adding similar or higher capacity resources alongside the existing resources

Answer 56

Once saved, the data is replicated over to multiple slave nodes

Answer 57

Since each node is responsible for a part of the whole dataset, read/write times are greatly improved

Answer 58

replication

Answer 59

partitioning

Answer 60

replication

Answer 61

replication, replicas

Answer 62

sharding, shard

Answer 63

master, slave

Answer 64

master-slave

Answer 65

peer-to-peer

Answer 66

peer-to-peer, peer

Answer 67

master-slave, master

Answer 68

sharding, shard

Answer 69

sharding, shard

Answer 70

peer-to-peer, peer

Answer 71

master-slave, slave

Answer 72

Replication

Answer 73

peer-to-peer

Answer 74

master-slave

Answer 75

ideal for read-intensive loads rather than write-intensive loads, as growing read demands can be fulfilled simply via horizontal scaling, which adds additional slave nodes

Answer 76

Writes are consistent as all writes are coordinated by the master node. This means that write performance suffers as the amount of writes increases

Answer 77

Since each node is responsible for a part of a whole dataset, read/write times are greatly improved

Answer 78

in case the master node fails, reads are still possible via any of the slave nodes

Answer 79

a slave node can be configured as a backup node for the master node

Answer 80

In case of master node failure, writes are not supported until a master node is reestablished. It is either resurrected from a backup of the master node, or a new master node is chosen from the slave nodes.

Answer 81

For example, queries requiring data from multiple shards will impose performance penalties. Data locality, or keeping commonly accessed data collocated on a single shard, helps to counter such performance issues
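
The data locality point can be illustrated with a minimal sketch. It assumes a simple hash-based routing rule; the names `NUM_SHARDS`, `shard_for` and `shards_touched` are hypothetical and not taken from any particular storage device. When records are keyed by the attribute they are usually queried by, one query contacts one shard.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size, chosen for illustration


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index with a stable hash (assumed routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shards_touched(keys, num_shards=NUM_SHARDS):
    """Return the set of shards a query over these keys has to contact."""
    return {shard_for(key, num_shards) for key in keys}


# Orders sharded by customer ID: all of one customer's orders are collocated,
# so a per-customer query contacts exactly one shard.
by_customer = shards_touched(["customer42", "customer42", "customer42"])
print(len(by_customer))  # 1
```

Sharding the same orders by order ID instead would typically spread one customer's orders over several shards, which is the multi-shard penalty described above.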

Answer 82

Read inconsistency can be an issue if a slave node is read prior to the update being propagated over to it from the master node
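
A minimal in-memory sketch of this stale read, assuming propagation happens as a separate, delayed step. The class `MasterSlaveStore` and its methods are hypothetical names used only for illustration.

```python
class MasterSlaveStore:
    """Toy master-slave replication: writes go to the master, reads to slaves."""

    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self.pending = []  # writes accepted by the master but not yet propagated

    def write(self, key, value):
        # All writes are coordinated by the master node.
        self.master[key] = value
        self.pending.append((key, value))

    def read(self, key, slave_index=0):
        # Reads are served by a slave node, which may lag behind the master.
        return self.slaves[slave_index].get(key)

    def propagate(self):
        # Copy every pending write to every slave, in write order.
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()


store = MasterSlaveStore()
store.write("price", 10)
stale = store.read("price")   # slave has not received the update yet
store.propagate()
fresh = store.read("price")   # slave now matches the master
```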

Answer 83

To ensure read consistency, a voting system can be implemented where a read is declared consistent if the majority of the slave nodes contain the same version of the record. Implementation of such a voting system requires a reliable and fast communication mechanism between the slave nodes
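
The voting idea can be sketched as a strict-majority check over the versions returned by the slave nodes. The function name `majority_read` is hypothetical, and a real implementation would also have to gather the values over the network.

```python
from collections import Counter


def majority_read(replica_values):
    """Return the value held by a strict majority of replicas, else None."""
    if not replica_values:
        return None
    value, count = Counter(replica_values).most_common(1)[0]
    return value if count > len(replica_values) / 2 else None


consistent = majority_read(["v2", "v2", "v1"])   # two of three nodes agree
undecided = majority_read(["v1", "v2"])          # no majority
```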

Answer 84

Replication: peer-to-peer

Answer 85

Replication: master-slave

Answer 86

read inconsistency can be an issue if a principal node is read prior to the update being propagated over to it

Answer 87

Each write is copied to all peers

Answer 88

prone to write inconsistencies that occur as a result of a simultaneous update of the same data across multiple ___

Answer 89

This can be addressed by implementing pessimistic or optimistic concurrency.

Answer 90

is a proactive approach that uses locking to ensure that only one update succeeds at a time. However, this affects availability as the database remains unavailable until all locks are released
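
A minimal sketch of the locking approach, assuming a single record guarded by one lock. The class `LockedRecord` is a hypothetical name. A writer that finds the lock taken is refused, which is the availability cost mentioned above.

```python
import threading


class LockedRecord:
    """A record whose updates are serialized by a lock."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def update(self, new_value):
        # Non-blocking acquire: if another writer holds the lock, refuse the update.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self.value = new_value
            return True
        finally:
            self._lock.release()


record = LockedRecord(1)
accepted = record.update(2)  # lock is free, so the update succeeds
```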

Answer 91

is a reactive approach that does not use locking. Instead, it allows the inconsistency to happen first and restores consistency after the fact
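
One possible way to restore consistency after the fact is a last-write-wins merge. That strategy, the timestamped `(timestamp, value)` layout and the function name `reconcile` are assumptions made for illustration; other resolution strategies exist.

```python
def reconcile(peers):
    """Make every peer converge on the newest (timestamp, value) pair per key."""
    merged = {}
    for peer in peers:
        for key, (timestamp, value) in peer.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    for peer in peers:
        peer.clear()
        peer.update(merged)


# Two peers accepted conflicting updates without any locking.
peer_a = {"stock": (1, 5)}
peer_b = {"stock": (2, 7)}
reconcile([peer_a, peer_b])  # both peers now hold the later update
```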

Answer 92

Reads can be inconsistent during the time period when some of the ____ have been updated while the others are being updated. However, reads eventually become consistent when the updates have been copied over to all ____

Answer 93

Ideal for read-intensive loads rather than write-intensive loads, as growing read demands can be fulfilled simply via horizontal scaling, which adds additional ____ nodes

Answer 94

To ensure read consistency, a voting system can be implemented where a read is declared consistent if the majority of the _ contain the same version of the record

Answer 95

Implementation of such a voting system requires a reliable and fast communication mechanism between the ____

Answer 96

Master-slave

Answer 97

Master-slave Replication

Answer 98

Peer-to-peer

Answer 99

Peer-to-peer Replication

Answer 100

Although more than one master is possible, a single slave-shard can only be managed by a single master-shard

Answer 101

Replicas of a shard are kept on multiple slave nodes to provide scalability and fault tolerance for read operations

Answer 102

Write consistency is maintained by the master-shard

Answer 103

However, this means that fault tolerance (with regards to write operations) is affected if the master-shard becomes non-operational or a network outage occurs

Answer 104

Each shard gets replicated to multiple peers, and each peer is only responsible for a subset of the data rather than the complete dataset

Answer 105

master-slave / replication

Answer 106

peer-to-peer replication

Answer 107

Replication

Answer 108

Collectively, this helps to achieve increased scalability and fault tolerance

Answer 109

However, this means that fault tolerance (with regards to write operations) is affected if the master-shard becomes non-operational or a network outage occurs

Answer 110

As there is no master involved, there is no single point of failure with regards to both read and write operations

Answer 111

CAP theorem

Answer 112

Consistency

Answer 113

partition tolerance

Answer 114

Availability

Answer 115

consistency

Answer 116

availability

Answer 117

partition tolerance

Answer 118

consistency

Answer 119

availability

Answer 120

partition tolerance

Answer 121

consistency

Answer 122

partition tolerance

Answer 123

availability

Answer 124

If consistency (C) and availability (A) are required, available nodes need to communicate to ensure consistency (C). Therefore, partition tolerance (P) is not possible.

Answer 125

If consistency (C) and partition tolerance (P) are required, nodes cannot remain available (A) as the nodes will become unavailable while achieving consistency (C), which cannot happen while supporting partition tolerance (P)

Answer 126

If availability (A) and partition tolerance (P) are required then consistency (C) is not possible because of the data communication requirement between the nodes. So, the database can remain available (A) but with inconsistent results

Answer 127

In a distributed database, scalability and fault tolerance can be improved through additional nodes, although this challenges consistency (C), which can also cause availability (A) to suffer due to the latency caused by increased communication between nodes.

Answer 128

Distributed database systems cannot be 100% partition tolerant (P). Although communication outages are rare and temporary, partition tolerance (P) must always be supported. Thus, CAP is more about being either CP or AP

Answer 129

On the other hand, RDBMSs are CA as they are generally available (A) while being consistent (C) at the same time. RDBMSs generally run on a single node, thus partition tolerance (P) is not a large consideration

Answer 130

Availability

Answer 131

Consistency

Answer 132

Durability

Answer 133

Consistency

Answer 134

Durability

Answer 135

Consistency

Answer 136

Durability

Answer 137

Consistency

Answer 138

Durability

Answer 139

Consistency

Answer 140

Durability

Answer 141

Basically available

Answer 142

Consistency

Answer 143

Soft state

Answer 144

Eventual consistency

Answer 145

Soft state

Answer 146

Basically available

Answer 147

Eventual consistency

Answer 148

Basically available

Answer 149

Soft state

Answer 150

Eventual consistency

Answer 151

basically available

Answer 152

eventual consistency

Answer 153

Soft state

Answer 154

BASE, ACID

Answer 155

ACID, BASE

Answer 156

Eventual consistency

Answer 157

soft state

Answer 158

basically available

Answer 159

scalability

Answer 160

soft consistency

Answer 161

Redundancy & availability

Answer 162

Fast access

Answer 163

Long-term storage

Answer 164

Schema-less storage

Answer 165

Inexpensive Storage

Answer 166

big data datasets come in huge volumes at a fast pace and are generally acquired from multiple sources. This results in large amounts of data within a short period of time

Answer 167

With the potential of finding new insights about the way businesses work, enterprises are retaining more and more data, both generated inside the enterprise and obtained from outside sources, as part of their data acquisition activities.

Answer 168

Traditionally, historic data that is no longer required is generally archived in offline storage. However, this makes the historic data unavailable for instant analysis

Answer 169

It may not be possible to re-acquire data due to expensive acquisition costs, for example when a dataset is purchased from a data provider.

Answer 170

Traditionally, historic data that is no longer required is generally archived in offline storage. However, this makes the historic data unavailable for instant analysis

Answer 171

it may not be possible to re-generate data if the events that generated the data initially were one-off events, for example a smart meter reading for a certain point in time

Answer 172

In all cases, we need a scalable storage solution with enough capacity for current and future data capture requirements

Answer 173

Apart from storing the raw data, additional storage is required to store the data created as a result of the data wrangling activity

Answer 174

A storage device needs to make efficient use of the underlying disk resources to minimize storage waste.

Answer 175

The data storage requirement, in the case of joining multiple datasets, will increase in size due to the need to keep both the original datasets and the joined dataset

Answer 176

Storage is further required in order to persist the analytic results from a data analysis activity

Answer 177

in order to address the increasing demands for data storage, a storage device can either scale up or scale out

Answer 178

is a strategy for increasing resource capacity by replacing an existing low capacity device with a higher capacity device.

Answer 179

does not cause disruption and system downtime

Answer 180

For example, to double the capacity of a storage device, data from a 500 GB disk can be copied over to a 1 TB disk.

Answer 181

will cause disruption and system downtime

Answer 182

is a strategy for increasing resource capacity by replacing an existing low capacity device with a higher capacity device.

Answer 183

is a strategy for increasing resource capacity by adding similar or higher capacity commodity devices alongside the existing device

Answer 184

For example, to double the storage capacity, a 500GB disk can be added alongside an existing 500GB disk

Answer 185

Does not cause disruption and system downtime

Answer 186

Big Data datasets, both in raw and processed form, are a business asset and require attention with regards to storage. Multiple business functions may need to glean value out of these datasets, sometimes simultaneously

Answer 187

Big data datasets come in huge volumes at a fast pace and are generally acquired from multiple sources. This results in large amounts of data within a short period of time.

Answer 188

This dependence on Big Data datasets across an enterprise requires a reliable storage device that is fault-tolerant and is highly available

Answer 189

A Big Data solution environment is generally composed of clusters that are built using commodity servers connected via a high bandwidth network. With more nodes and network connections, the chances of a node becoming unavailable increase, either due to hardware breakdown such as disk failure, or due to network failure.

Answer 190

As a result, storage redundancy is required to ensure uninterrupted access to data in the event of a storage device failure, thereby providing high availability and fault tolerance

Answer 191

Sharding can help create scalable storage as shards can be stored on the nodes added via scaling out

Answer 192

to provide such redundancy, a storage device implements automatic sharding and replication with a configuration of either sharding and master-slave replication or sharding and peer-to-peer replication

Answer 193

realtime processing

Answer 194

batch processing

Answer 195

Realtime data analysis

Answer 196

batch processing

Answer 197

Big Data analytics generally involve both realtime as well as batch processing

Answer 198

Realtime data analysis requires fast read/write capabilities that are usually implemented via in-memory storage solutions.

Answer 199

Batch processing requires stream-based data access with high throughput, implemented via traditional disk-based storage devices or newer solid state drives that provide better performance at the expense of higher cost

Answer 200

The basic tenet of Big Data is to extract value from large amounts of data

Answer 201

At the same time, data arriving at high velocity needs a storage device that supports fast writes with minimal overhead, for example, schema validation at read time instead of write time

Answer 202

The basic tenet of Big Data is to extract value from large amounts of data

Answer 203

Realtime data analysis requires fast read/write capabilities that are usually implemented via in-memory storage solutions

Answer 204

This requires enterprises to retain data for longer periods of time (to create larger datasets) so that future data analyses are more insightful due to having more historic data available

Answer 205

This characteristic warrants the need for a storage device with increased storage capacity that is reliable and can be brought online without incurring too much time delay

Answer 206

Traditionally, historic data that is no longer required is generally archived in offline storage. However, this makes the historic data unavailable for instant analysis.

Answer 207

big data analytics require historic data to be available online for discovering hidden patterns leading to valuable insights

Answer 208

data arriving at high velocity needs a storage device that supports fast writes with minimal overhead, for example, schema validation at read time instead of write time

Answer 209

As a result, historic data is kept online by adding more storage capacity via scaling out

Answer 210

As a result, historic data is kept online by adding more storage capacity via scaling out

Answer 211

big data datasets come in multiple formats with limited or no schema, such as semi-structured and unstructured data.

Answer 212

without any prior knowledge about the structure of the data does not guarantee that the data would conform to the same structure in the future, as is the nature of unstructured data

Answer 213

This requires the storage device to support __________ data persistence along with the added ability to make schema changes on the fly without breaking existing applications or incurring downtime

Answer 214

With all the characteristics required to store voluminous data, provide replication, and support long-term storage, the cost of storage devices can become a concern.

Answer 215

without any prior knowledge about the structure of the data does not guarantee that the data would conform to the same structure in the future, as is the nature of unstructured data

Answer 216

A storage device needs to make efficient use of the underlying disk resources to minimize storage waste.

Answer 217

Use of proprietary storage devices generally requires replacement of existing nodes with more expensive ones in order to scale up, eventually hitting a limit

Answer 218

A storage device needs to make use of commodity hardware that can scale out so that costs can be kept down as enterprises amass more and more data

Answer 219

distributed file system

Answer 220

file system

Answer 221

A storage device that is implemented with a __________ provides simple, fast access data storage that is capable of storing large datasets that are non-relational in nature, such as semi-structured and unstructured data

Answer 222

To handle large volumes of data arriving at a fast pace, relational databases generally need to scale.

Answer 223

Although based on straightforward file locking mechanisms for concurrency control, it provides fast read/write capability, which addresses the velocity characteristic of Big Data.

Answer 224

are a good fit when data must be accessed in streaming mode with no random reads and writes.

Answer 225

is not ideal for datasets comprising a large number of small files as this creates excessive disk-seek activity, slowing down the overall data access

Answer 226

There is also more overhead involved in processing multiple smaller files, as dedicated processes are generally spawned by the processing engine at runtime for processing each file before the results are synchronized from across the cluster

Answer 227

good for handling transactional workloads involving small amounts of data with random read/write

Answer 228

work best with fewer but larger files accessed in a sequential manner

Answer 229

Multiple smaller files are generally combined into a single file to enable optimum storage and processing
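
As a rough sketch of that consolidation step, the hypothetical helper below appends several small files into one larger file. It uses plain byte concatenation for illustration only; real deployments often rely on container formats that keep record boundaries.

```python
from pathlib import Path


def combine_files(sources, target):
    """Append each source file to one target file; return the bytes written."""
    total = 0
    with open(target, "wb") as out:
        for source in sources:
            data = Path(source).read_bytes()
            out.write(data)
            total += len(data)
    return total
```

Calling `combine_files(["a.log", "b.log"], "combined.log")` (file names are placeholders) would leave one sequentially readable file in place of two small ones.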

Answer 230

Do not provide any file searching capability out of the box

Answer 231

Can be employed in a clustered environment.

Answer 232

____ storage device is suitable when large datasets of raw data are to be stored or when archiving of datasets is required

Answer 233

provides an inexpensive storage option for storing large amounts of data over a long period of time that remains online

Answer 234

Good for handling transactional workloads involving small amounts of data with random read/write

Answer 235

More disks can simply be added without needing to offload the data to offline data storage, such as tape drives

Answer 236

like any file system, are agnostic to the data being stored and therefore support schema-less data storage

Answer 237

provides out-of-the-box redundancy and high availability by copying data to multiple locations via replication

Answer 238

Relational databases or relational database management systems (RDBMSs)

Answer 239

File systems

Answer 240

non-relational databases or Not only SQL (NoSQL)

Answer 241

Are good for handling transactional workloads involving small amounts of data with random read/write

Answer 242

Schema-less data model - Data can exist in its raw form

Answer 243

Are ACID, and in order to honor this compliance, they are generally restricted to a single node

Answer 244

The redundancy and fault tolerance provided by sharding and replication in a clustered environment are not inherently supported.

Answer 245

To handle large volumes of data arriving at a fast pace relational databases generally need to scale

Answer 246

Employ vertical scaling, not horizontal scaling, which is a more costly and disruptive scaling strategy. This makes _______ less than ideal for long-term storage of data

Answer 247

More disks can simply be added without needing to offload the data to offline data storage, such as tape drives

Answer 248

Note that some relational databases, for example IBM DB2 pureScale, Sybase ASE Cluster Edition, Oracle Real Application Clusters (RAC) and Microsoft Parallel Data Warehouse (PDW), are capable of being run on clusters. However, these database clusters still use shared storage that can act as a single point of failure

Answer 249

Need to be manually sharded, mostly using application logic.

Answer 250

this means that the client (application logic) needs to know which shard to query in order to get the required data

Answer 251

Reads across multiple nodes may not be consistent immediately after a write. However, all nodes will eventually be in a consistent state

Answer 252

This further complicates the data processing when data from multiple shards is required

Answer 253

generally require data to adhere to a schema. As a result, storage of semi/unstructured data whose schema is not known or keeps changing is not supported

Answer 254

Refers to technologies used to develop next generation non-relational databases that are highly scalable and fault-tolerant

Answer 255

The schema conformance is validated at the time of data insert/update, which introduces overhead and leads to latency

Answer 256

A less ideal choice for storing high velocity data that needs a highly available database storage device with fast data write capability

Answer 257

Is not useful as a storage device in a Big Data solution environment

Answer 258

Schema-less data model: Data can exist in its raw form

Answer 259

Scale out rather than scale up: More nodes can be added rather than replacing the existing one with a better, higher-performance node

Answer 260

Highly available: Built on cluster-based technologies providing fault tolerance out of the box

Answer 261

Lower operational costs: Built on open source platforms with no licensing costs, and can be deployed on commodity hardware.

Answer 262

Eventual consistency: Reads across multiple nodes may not be consistent immediately after a write. However, all nodes will eventually be in a consistent state

Answer 263

BASE, not ACID: BASE compliance requires a database to maintain high availability in the event of network/node failure, while not requiring the database to be in a consistent state whenever an update occurs. The database can be in a soft/inconsistent state until it eventually attains consistency.

Answer 264

API driven data access - Data access is generally supported via API based queries, including RESTful APIs, whereas some implementations may also provide SQL-like query capability

Answer 265

Auto sharding and replication - To support horizontal scaling and provide high availability, a NoSQL storage device automatically employs sharding and replication techniques where the dataset is partitioned horizontally and then copied over to multiple nodes

Answer 266

Integrated caching - Removes the need for a third-party distributed caching layer, such as Memcached

Answer 267

Distributed query support - NoSQL storage devices maintain consistent query behavior across multiple shards

Answer 268

Polyglot persistence - The use of NoSQL storage devices does not mandate retiring traditional RDBMSs. In fact, both can be used at the same time, thereby supporting polyglot persistence (an approach of persisting data using different types of storage technologies). This is good for developing systems requiring structured as well as semi/unstructured data.

Answer 269

Aggregate-focused - Unlike relational databases that are most effective with fully normalized data, NoSQL storage devices store de-normalized, aggregated data (an entity containing merged, often nested, data for an object), thereby eliminating the need for joins and mapping between application objects and the data stored in the database. Note that graph database storage devices are not aggregate-focused

Answer 270

The storage requirement of ever increasing data volumes commands the use of databases that are highly scalable while keeping costs down for the business to remain competitive.

Answer 271

The fast influx of data requires databases with fast access data write capability

Answer 272

NoSQL storage devices fulfill this requirement by providing scaling out capability while using inexpensive commodity servers

Answer 273

Furthermore, there may not be licensing costs involved as NoSQL databases generally follow the Open Source development model

Answer 274

NoSQL storage devices fulfill this requirement by providing scaling out capability while using inexpensive commodity servers

Answer 275

The fast influx of data requires databases with fast access data write capability

Answer 276

NoSQL storage devices enable fast writes by using the schema-on-read rather than the schema-on-write principle
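
A minimal sketch of schema-on-read, assuming JSON text records and a hypothetical smart meter schema (`meter_id`, `kwh`). The names `raw_store`, `write` and `read_readings` are illustrative. Writes append without any checks, and the schema is applied only when the data is read.

```python
import json

raw_store = []  # append-only list standing in for the storage device


def write(record_text):
    """Store the raw text as-is; no validation happens at write time."""
    raw_store.append(record_text)


def read_readings(required=("meter_id", "kwh")):
    """Apply the schema while reading; skip records that do not conform."""
    for text in raw_store:
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed text is ignored at read time
        if isinstance(record, dict) and all(field in record for field in required):
            yield record


write('{"meter_id": "m1", "kwh": 3.2}')
write('{"meter_id": "m2"}')   # incomplete, still accepted at write time
write("not json at all")      # malformed, still accepted at write time
valid = list(read_readings())
```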

Answer 277

Being highly available, NoSQL storage devices can ensure that write latency does not occur because of node/network failure

Answer 278

A storage device needs to handle different types of data formats, including documents, emails, images, videos and incomplete data

Answer 279

NoSQL storage devices enable fast writes by using the schema-on-read rather than the schema-on-write principle

Answer 280

NoSQL storage devices can store these different forms of semi-structured and unstructured data formats

Answer 281

At the same time, NoSQL storage devices are able to store schema-less data and incomplete data with the added ability of making schema changes as the data model of the datasets evolves. In other words, NoSQL databases support schema evolution

Answer 282

column-family

Answer 283

Act like hash tables

Answer 284

Is a list of values where each value is identified by a key

Answer 285

indexes that speed up searches are generally supported

Answer 286

The value is opaque to the database and is essentially stored as a BLOB

Answer 287

The value stored can be any aggregate, ranging from sensor data to videos

Answer 288

Value look-up can only be performed via the keys as the database is oblivious to the details of the stored aggregate

Answer 289

Partial updates are not possible. An update is either a delete or an insert operation
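
A toy key-value store makes this concrete. The class `KeyValueStore` is a hypothetical sketch, not a real product API. Because the value is opaque bytes to the store, changing one field means the client reads the whole value, modifies it, and writes the whole value back.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque bytes, looked up by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # insert, or replace the entire value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("123_sensor1", json.dumps({"temp": 21, "unit": "C"}).encode("utf-8"))

# "Updating" one field happens in the application layer, not in the store.
reading = json.loads(store.get("123_sensor1"))
reading["temp"] = 22
store.put("123_sensor1", json.dumps(reading).encode("utf-8"))
```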

Answer 290

A select operation can retrieve a part of the aggregate value

Answer 291

_______ storage devices generally do not maintain any indexes, therefore writes are quite fast

Answer 292

Based on a simple storage model, ______ storage devices are highly scalable

Answer 293

The key is usually appended with the type of the value being saved for easy retrieval, for example 123_sensor1

Answer 294

some implementations support compressing values for reducing the storage footprint. However, this introduces latency at read time, as the data needs to be decompressed first
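
The trade-off can be sketched with Python's standard `zlib` module. The payload is made up, and the choice of zlib is an assumption for illustration, since the compression scheme depends on the implementation.

```python
import zlib

value = b"sensor-reading," * 1000   # made-up, highly repetitive payload

stored = zlib.compress(value)       # smaller footprint on the storage device
restored = zlib.decompress(stored)  # extra step every read has to pay for

saved_bytes = len(value) - len(stored)
```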

Answer 295

most key-value storage devices provide collections or buckets (like tables) into which key-value pairs can be organized

Answer 296

can be encoded using either a text-based encoding scheme, such as XML or JSON, or a binary encoding scheme, such as BSON (binary JSON)

Answer 297

Unstructured data storage is required

Answer 298

high performance read/writes are required

Answer 299

the value is fully identifiable via the key alone

Answer 300

searches need to be performed on different fields of the document

Answer 301

value is a standalone entity that is not dependent on other values

Answer 302

values generally have a comparatively simple structure or are binary

Answer 303

query patterns are simple, involving insert, select and delete operations only

Answer 304

stored values are manipulated at the application layer

Answer 305

applications require searching or filtering data using attributes of the stored value

Answer 306

the value is fully identifiable via the key alone

Answer 307

relationships exist between different key-value entries

Answer 308

the values of a group of keys need to be updated in a single transaction

Answer 309

multiple keys require manipulation in a single operation

Answer 310

schema consistency across different values is required

Answer 311

updates to individual attributes of the value are required

Answer 312

examples: Riak, Redis and Amazon DynamoDB

Answer 313

The table is a list of values where each value is identified by a key

Answer 314

Also store data as key-value pairs. However, unlike key-value storage devices, the stored value is a document that can have a complex nested structure, such as an invoice

Answer 315

Can be encoded using either a text-based encoding scheme, such as XML or JSON, or using a binary encoding scheme, such as BSON (binary JSON)

Answer 316

Like key-value storage devices, most document storage devices provide collections or buckets (like tables) into which key-value pairs can be organized

Answer 317

document storage devices are value-aware

Answer 318

The stored value is self-describing: the schema can be inferred from the structure of the value

Answer 319

A select operation can reference a field inside the aggregate value

Answer 320

A select operation can retrieve a part of the aggregate value

Answer 321

Partial updates are supported; a subset of the aggregate can be updated
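Because the storage device is aware of the value, it can select on a field, return part of the aggregate and update a subset of it. The sketch below illustrates these three operations in memory; `DocumentStore` and its methods are hypothetical names, not the interface of MongoDB or any other product.

```python
import copy

class DocumentStore:
    """Toy document store that can look inside the stored value."""

    def __init__(self):
        self._docs = {}  # key -> document (a nested dict)

    def insert(self, key, document):
        self._docs[key] = copy.deepcopy(document)

    def find(self, field, value):
        """Select the keys of documents whose top-level field matches."""
        return [k for k, d in self._docs.items() if d.get(field) == value]

    def project(self, key, field):
        """Retrieve only one part of the aggregate."""
        return self._docs[key][field]

    def update_field(self, key, field, value):
        """Partial update: change or add one field, leaving the rest as is."""
        self._docs[key][field] = value


invoices = DocumentStore()
invoices.insert("inv-1", {"customer": "Ana", "total": 40,
                          "lines": [{"sku": "A1", "qty": 2}]})
invoices.insert("inv-2", {"customer": "Luis", "total": 15})
invoices.update_field("inv-2", "status", "paid")
```

The two invoices have different fields, and `status` is added after the initial insert, which is the flexible schema behaviour described in the answers.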

Answer 322

indexes that speed up searches are generally supported

Answer 323

Each document can have a different schema

Answer 324

The value is opaque to the database and is essentially stored as a BLOB

Answer 325

It is possible to store different types of documents or documents of the same type that have fewer or more fields with respect to each other

Answer 326

Additional fields can be added to a document after the initial insert, thereby manifesting flexible schema support

Answer 327

It should be noted that document storage devices are not limited to storing data that occurs in the form of actual documents, such as an XML file, but can also be used to store any aggregate that consists of a collection of fields having a flat or a nested schema

Answer 328

Storing semi-structured document-oriented data comprising flat or nested schema

Answer 329

schema evolution is a requirement as the structure of the document is either unknown or is likely to change

Answer 330

applications require a partial update of the aggregate stored as a document

Answer 331

searches need to be performed on different fields of the document

Answer 332

unstructured data store is required

Answer 333

storing domain objects, such as customers, in object form

Answer 334

query patterns involve insert, select, update, and delete operations

Answer 335

Multiple documents need to be updated as part of a single transaction

Answer 336

storing domain objects, such as customers, in object form

Answer 337

Performing operations that need joins between multiple documents or storing data that is normalized

Answer 338

schema enforcement for achieving consistent query design is required, as the document structure may change between successive query runs, which will require restructuring the query

Answer 339

stored value is not self-describing

Answer 340

binary data needs to be stored

Answer 341

Riak, Redis and Amazon DynamoDB

Answer 342

MongoDB, CouchDB and Terrastore

Answer 343

Column-family storage devices store data much like a traditional RDBMS but group related columns together in a row, resulting in column-families

Answer 344

The table is a list of values where each value is identified by a key

Answer 345

Each column can be a collection of related columns itself, referred to as a super-column

Answer 346

Each super-column can contain an arbitrary number of related columns that are generally retrieved or updated as a single unit

Answer 347

Each row consists of multiple column-families

Answer 348

Each row can have a different set of columns, thereby manifesting flexible schema support

Answer 349

Each row is identified by a row key
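The row key, column-family and column layout can be pictured as nested mappings. This is a simplified in-memory sketch; the `put` and `get_family` helpers and the sample student data are assumptions made for illustration and do not reflect how Cassandra or HBase lay data out on disk.

```python
from collections import defaultdict

# row key -> column-family name -> {column name: value}
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    """Write one column inside a column-family of the given row."""
    table[row_key][family][column] = value

def get_family(row_key, family):
    """Read a single column-family without touching the others."""
    return dict(table[row_key][family])

put("student-1", "personal", "name", "Ana")
put("student-1", "personal", "city", "Bogota")
put("student-1", "grades", "math", 90)
put("student-2", "personal", "name", "Luis")  # no "grades" family, no "city" column
```

Each row carries only the columns it actually has, so sparsely populated rows take no space for the columns they lack.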

Answer 350

provide fast data access with random read/write capability

Answer 351

Store different column-families in separate physical files, which greatly helps in speeding up queries as only the required column-families are searched

Answer 352

Partial updates are not possible. An update is either a delete or an insert operation

Answer 353

provide support for selectively compressing

Answer 354

Leaving searchable ______ uncompressed can make queries faster because the target column does not need to be decompressed for lookup

Answer 355

Most implementations support data versioning, while some support an expiry time after which the configured columns are automatically removed

Answer 356

Realtime random read/write capability is needed and data being stored has some defined structure

Answer 357

Data represents a tabular structure, each row consists of a large number of columns and nested groups of interrelated data exist

Answer 358

Support for schema evolution is required as column families can be added or removed without any system downtime

Answer 359

storing domain objects, such as customers, in object form

Answer 360

Certain fields are mostly accessed together, and searches need to be performed using field values

Answer 361

efficient use of storage is required when the data consists of sparsely populated rows (no column, no space)

Answer 362

query patterns involve insert, select, update, and delete operations

Answer 363

Multiple documents need to be updated as part of a single transaction

Answer 364

relational data access is required, for example joins

Answer 365

ACID transactional support is required

Answer 366

binary data needs to be stored

Answer 367

SQL-compliant queries need to be executed

Answer 368

query patterns are likely to change frequently, which would initiate a corresponding restructuring of how column-families are arranged, for example during proof-of-concept development

Answer 369

MongoDB, CouchDB and Terrastore

Answer 370

NoSQL: Column-Family

Answer 371

Riak, Redis and Amazon DynamoDB

Answer 372

Persist inter-connected entities

Answer 373

The value is opaque to the database and is essentially stored as a BLOB

Answer 374

Unlike other NoSQL storage devices, where the emphasis is on the structure of the entities, ____________storage devices place emphasis on storing the linkages between entities

Answer 375

Entities are stored as nodes, also called vertices, while the linkages between entities are stored as edges. In RDBMS parlance, each node can be thought of as a single row while the edge denotes a join

Answer 376

Nodes can have more than one type of link between them through multiple edges.

Answer 377

Each node can have attribute data as key-value pairs, such as a customer node with ID, name and age attributes

Answer 378

Each edge can have its own attribute data as key-value pairs, which can be used to further filter query results

Answer 379

Multiple edges are similar to defining interconnected nodes based on node attributes and/or edge attributes, commonly referred to as node traversal

Answer 380

Queries generally involve finding interconnected nodes based on node attributes and/or edge attributes, commonly referred to as node traversal

Answer 381

Can be a collection of related columns itself referred to as a super-column

Answer 382

Edges can be unidirectional or bidirectional, setting the node traversal direction

Answer 383

Generally, graph storage devices provide consistency via ACID compliance

Answer 384

The value stored can be any aggregate, ranging from sensor data to videos

Answer 385

The degree of usefulness of a graph storage device depends on the number and types of edges defined between the nodes. The greater the number and diversity of the edges, the more diverse the types of queries it can handle

Answer 386

As a result, it is important to comprehensively capture the types of relations that exist between the nodes

Answer 387

Generally allow adding new types of nodes without making changes to the database. They also enable defining additional links between nodes as new types of relationships or nodes appear.

Answer 388

interconnected entities need to be stored

Answer 389

querying entities based on the type of relationship with each other rather than the attributes of the entities

Answer 390

finding groups of interconnected entities

Answer 391

finding distances between entities in terms of the node traversal distance
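Node traversal distance can be sketched with a breadth-first search over nodes and edges that both carry key-value attributes. The sample people, the edge types and the function names `neighbours` and `distance` are hypothetical, and edges are treated as bidirectional here for simplicity; this does not represent the query language of any graph database.

```python
from collections import deque

# Nodes and edges both carry attribute data as key-value pairs.
nodes = {1: {"name": "Ana"}, 2: {"name": "Luis"},
         3: {"name": "Eva"}, 4: {"name": "Tom"}}
edges = [(1, 2, {"type": "friend"}),
         (2, 3, {"type": "friend"}),
         (1, 4, {"type": "colleague"})]

def neighbours(node, edge_type=None):
    """Nodes linked to `node`, optionally filtered by an edge attribute."""
    found = []
    for src, dst, attrs in edges:
        if edge_type is not None and attrs["type"] != edge_type:
            continue
        if src == node:
            found.append(dst)
        elif dst == node:
            found.append(src)
    return found

def distance(start, goal, edge_type=None):
    """Number of hops between two nodes, or None if they are not connected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for nxt in neighbours(node, edge_type):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None
```

For example, `distance(1, 3)` follows two friend edges, while `distance(1, 4, "friend")` finds no path because the only link between those nodes is a colleague edge.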

Answer 392

mining data with a view toward finding patterns

Answer 393

storing semi-structured document-oriented data comprising flat or nested schema

Answer 394

updates are required to a large number of node attributes or edge attributes, as this involves searching for nodes or edges, which is a costly operation compared to performing node traversals

Answer 395

binary storage is required or else queries based on selection of node/edge attributes dominate node traversal queries

Answer 396

Entities have a large number of attributes or nested data; it is best to store lightweight entities in a graph storage device while storing the rest of the attribute data in a separate non-graph NoSQL storage device

Answer 397

ACID transactional support is required

Answer 398

Neo4j, InfiniteGraph and OrientDB

Answer 399

Riak, Redis and Amazon DynamoDB

Answer 400

include Cassandra, HBase and Amazon SimpleDB

Answer 401

NoSQL storage devices are highly scalable, available, fault-tolerant, and very fast for read/write operations. However, they do not provide the same transaction and consistency support as exhibited by ACID-compliant RDBMSs

Answer 402

Following the BASE model, NoSQL storage devices provide eventual consistency rather than immediate consistency, and could therefore be in an inconsistent state while reaching the state of consistency. As a result, they cannot be used for implementing large-scale transactional systems

Answer 403

____ storage devices combine the ACID properties of RDBMS with the scalability and fault tolerance offered by NoSQL storage devices

Answer 404

Each super-column can be a collection of related columns itself referred to as a super-column

Answer 405

______ databases generally support SQL compliant syntax for data definition and data manipulation operations, and they often use a relational data model for data storage

Answer 406

_____________ can be used for developing OLTP systems with very high volumes of transactions, for example a bank. They can also be used for realtime analytics, for example operational analytics, as some implementations are memory based

Answer 407

As compared to a NoSQL storage device, _________ storage device provides an easier transition from a traditional RDBMS to a highly scalable database due to its support for SQL

Answer 408

Neo4j, InfiniteGraph and OrientDB

Answer 409

Riak, Redis and Amazon DynamoDB

Answer 410

VoltDB, FoundationDB, NuoDB and InnoDB

Answer 411

Big Data datasets are generally stored utilizing distributed technologies (distributed file system or NoSQL database)

Answer 412

The very nature of a distributed storage device requires a processing engine that can process data without needing to transfer the data from storage to a computing resource as with distributed data processing

Answer 413

Schemas may change over time in an effort to accommodate changing business requirements, or simply because of an application software upgrade

Answer 414

In support of maximizing the value characteristics of Big Data, it is imperative to employ a processing model based on the divide-and-conquer principle, as with parallel data processing

Answer 415

often, it may be feasible to process data offline in batches, such as with overnight report generation

Answer 416

Big Data datasets come in multiple formats (variety characteristic) and may not conform to any schema, especially unstructured data

Answer 417

____________ may change over time in an effort to accommodate changing business requirements, or simply because of an application software upgrade.

Answer 418

The lack of adherence to any particular data model requires flexible processing of Big Data datasets so that they can be processed in raw form without the need to be stored in a particular data model

Answer 419

Big data datasets may arrive thick (volume characteristic) and/or fast (velocity characteristic)

Answer 420

Often, it may be feasible to process data offline in batches, such as with overnight report generation

Answer 421

In other cases, the results may be required in realtime, such as with GPS signal processing workloads (transactional and batch)

Answer 422

In order to achieve maximum value from Big Data datasets the underlying processing platform may need to support both transactional and batch workloads

Answer 423

A single processing engine may fulfill this requirement, or multiple processing engines may need to be used

Answer 424

Similarly, more data sources with unknown schemas may need to be added in the future

Answer 425

Although having the ability to process data both in realtime (transactional) and batch mode is ideal for extracting maximum value out of Big Data datasets, support for both modes may not be required or may not be feasible

Answer 426

the processing of Big data datasets requires a highly scalable processing engine that can be linearly scaled

Answer 427

Generally, assembling a batch processing Big Data solution environment is simpler and cheaper when compared to a realtime processing Big Data solution

Answer 428

Consequently, adding support for multi-workload processing should be business driven involving careful cost-benefit analysis

Answer 429

Owing to the volume and velocity characteristics of Big Data, the processing demand can grow quite sharply with increasing volumes of data arriving at a fast pace

Answer 430

supporting a distributed processing environment with parallel processing capabilities requires a processing engine that can provide a steady throughput as the data volumes grow

Answer 431

The processing of Big Data datasets requires a highly scalable processing engine that can be linearly scaled

Answer 432

Big Data datasets may arrive thick (volume characteristics) and/or fast (velocity characteristics)

Answer 433

In the context of processing, linear scalability means that one receives a proportional increase in performance with the addition of more processing nodes

Answer 434

Realtime business intelligence and analytics can leverage such a linearly scalable processing environment to deliver quicker responses involving complex operations on an entire dataset

Answer 435

_____________ is generally achieved by employing a scaling out strategy as it provides a simple, non-disruptive, and cost effective method for increasing processing capacity

Answer 436

A highly distributed data processing environment with parallel data processing capabilities generally involves complicated architecture

Answer 437

With a horizontally scaled processing architecture, which by design involves a large number of nodes and networking components, the chances of partial system failure increase

Answer 438

Similarly, more data sources with unknown schemas may need to be added in the future

Answer 439

As a system failure in the middle of a long-running distributed task would be detrimental to achieving the analytic goals, a processing engine needs to provide fault tolerance so that a partial failure does not render the entire system unavailable and data processing does not need to be started from scratch

Answer 440

Generally, fault tolerance is provided through redundancy

Answer 441

With redundant processing resources, the system can still be available in the event of a partial system failure

Answer 442

Horizontal scaling proves quite useful in this case, as redundancy can generally be increased by simply adding more processing resources

Answer 443

At the start of a Big Data initiative, the cost for a highly scalable distributed data processing environment involving a few processing nodes and networking equipment may not be that high.

Answer 444

Over time, as the data volume grows and the types and frequency of analytics being run increase, the requirement for an increased number of processing resources can translate into soaring IT costs involving both software and hardware

Answer 445

With redundant processing resources, the system can still be available in the event of partial system failure.

Answer 446

This could prove counterproductive to the very reason the Big Data initiative was undertaken (often to help the business deliver increased value, drive down costs, find new sources of revenue, or establish new service offerings)

Answer 447

Using open source software deployed over commodity hardware helps to keep costs down

Answer 448

Another aspect of keeping costs down is the ability of the processing engine to take advantage of the cloud

Answer 449

The on-demand and elastic nature of the cloud helps avoid any up-front capital investment, coupled with faster setup of the data processing solution environment

Answer 450

____________ large amounts of data is not a new phenomenon, and different large scale data processing architectures exist

Answer 451

______________________ requires a distributed environment that is capable of processing data in parallel, a characteristic supported by the cluster architecture

Answer 452

are highly scalable, supporting horizontal scaling with linear performance gains

Answer 453

is a group of nodes connected together via a network to process tasks in parallel

Answer 454

__________ is a centrally managed network of nodes (computers) where each node is responsible for a sub-task of a large problem

Answer 455

enable distributed data processing

Answer 456

_____comprises low-cost commodity nodes that collectively provide increased processing capacity with inherent redundancy and fault tolerance, as it consists of physically separate nodes

Answer 457

The majority of Big Data processing occurs

Answer 458

are highly scalable, supporting horizontal scaling with linear performance gains

Answer 459

Provide an ideal deployment environment for a processing engine, as large datasets can be divided into smaller datasets and then processed in parallel in a distributed manner

Answer 460

Can be utilized both by a realtime processing engine and a batch processing engine, such as Spark and MapReduce respectively

Answer 461

data is processed offline in batches where the response time could vary from minutes to hours

Answer 462

Data is first persisted to the disk and only then processed

Answer 463

Strategic BI, predictive/prescriptive analytics and ETL operations are generally _____________________

Answer 464

data is processed in-memory as it is captured before being persisted to the disk

Answer 465

______________ involves processing a range of large datasets, either on their own or joined together, essentially addressing the volume and variety characteristics of Big Data datasets

Answer 466

the majority of Big Data processing occurs in ______________________________

Answer 467

______________ is relatively simple, easy to set up, and low cost in comparison to realtime mode

Answer 468

_____________________ data is processed in-memory as it is captured before being persisted to the disk

Answer 469

Response time generally ranges from a few seconds to under a minute

Answer 470

strategic BI, predictive/prescriptive analytics and ETL operations are generally batch-oriented

Answer 471

______________ addresses the velocity characteristic of Big Data datasets

Answer 472

_____________________ is also called event or stream processing, as the data either arrives continuously (stream) or at intervals (event)

Answer 473

The individual event/stream datum is generally small in size, but its continuous nature results in very large datasets

Answer 474

Another related term, interactive mode, falls within the category of realtime, interactive mode generally

Answer 475

Operational BI/analytics are generally

Answer 476

is a widely used implementation of the batch processing engine mechanism

Answer 477

It is a highly scalable and reliable processing engine based on the principle of divide-and-conquer

Answer 478

It provides built-in fault tolerance and redundancy

Answer 479

it divides a bigger problem into a set of smaller problems that are easier and quicker to solve

Answer 480

it has its roots both in distributed computing as well as in parallel computing

Answer 481

Operational BI/analytics are generally conducted in realtime mode

Answer 482

_____________ is a batch-oriented processing engine used to process large datasets using parallel processing deployed over clusters of commodity hardware

Answer 483

with redundant processing resources, the system can still be available in the event of a partial system failure

Answer 484

_____________ does not require the input data to conform to any particular data model. Therefore, it can be used to process schema-less datasets

Answer 485

A dataset is broken down into multiple smaller parts and operations are performed on each part independently and in parallel

Answer 486

The results from all operations are then summarized to arrive at the answer

Answer 487

____________ processing engine generally supports batch workloads only

Answer 488

_______, the data processing algorithm is instead moved to the nodes that store the data

Answer 489

The data processing algorithm executes in parallel on these nodes, thereby eliminating the need to first move the data

Answer 490

This not only saves network bandwidth but also results in a large reduction in processing time for large datasets, as processing smaller chunks of data in parallel is much faster

Answer 491

Reads across multiple nodes may not be consistent immediately after a write. However, all nodes will eventually be in a consistent state

Answer 492

map : map task

Answer 493

combine (optional) : map task

Answer 494

partition : map task

Answer 495

shuffle & sort : Reduce task

Answer 496

reduce : Reduce task

Answer 497

linear scalability

Answer 498

The first stage of MapReduce, during which the dataset file is divided into multiple smaller splits

Answer 499

Each split is parsed into its constituent records as a key-value pair

Answer 500

The processing of Big Data datasets requires a highly scalable processing engine that can be linearly scaled

Answer 501

The key is usually the ordinal position of the record and the value is the actual record. For example (234, sky is blue)
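Dividing a dataset into splits and parsing each split into key-value records might look like the sketch below. The function name `split_records`, the split size and the sample lines are assumptions made for illustration; a real MapReduce engine performs this step itself according to its own input format.

```python
def split_records(lines, split_size):
    """Yield splits; each split is a list of (ordinal position, record) pairs."""
    records = list(enumerate(lines))  # key = ordinal position, value = record
    for start in range(0, len(records), split_size):
        yield records[start:start + split_size]

lines = ["sky is blue", "grass is green", "sky is high"]
splits = list(split_records(lines, 2))
```

With a split size of two, the three records end up in two splits, each of which could be handed to its own mapper.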

Answer 502

The parsed key-value pairs for each split are then sent to a map function (mapper), with one mapper function per split. The map function executes user-defined logic

Answer 503

Each split generally contains multiple key-value pairs and the mapper is run once for each key-value pair in the split

Answer 504

The mapper processes each key-value pair as per the user-defined logic and further generates a key-value pair as its output

Answer 505

The output key can either be the same as the input key or a substring value from the input value, or another serializable user-defined object

Answer 506

similarly, the output value can either be the same as the input value or a substring value from the input value, or another serializable user-defined object
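A word-count mapper is a common way to picture this stage. In the sketch below the user-defined logic emits each word as the output key (a substring of the input value) with a count of 1 as the output value; `mapper` and `run_map` are hypothetical names, and the loop stands in for what the engine would do on each split.

```python
def mapper(key, value):
    """User-defined map logic: emit (word, 1) for every word in the record."""
    for word in value.split():
        yield (word, 1)

def run_map(split):
    """Run the mapper once for each key-value pair in the split."""
    output = []
    for key, value in split:
        output.extend(mapper(key, value))
    return output

map_output = run_map([(234, "sky is blue"), (235, "sky is high")])
```

Here the input key (the ordinal position) is ignored and one input pair produces several output pairs, which is allowed because the output key and value are chosen by the user-defined logic.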

Answer 507

Generally, the output of the map function is handled directly by the reduce function. However, map tasks and reduce tasks are mostly run over different nodes

Answer 508

Generally, the output of the map function is handled directly by the reduce function. However, map tasks and reduce tasks are mostly run over different nodes

Answer 509

Requires moving data between mappers and reducers that can consume a lot of valuable bandwidth, and directly contributes to processing latency

Answer 510

With larger datasets, the time taken to move the data between map and reduce stages can exceed the actual processing undertaken by the map and reduce tasks

Answer 511

For this reason, the MapReduce engine provides an optional ______________________ function that summarizes a mapper's output before it gets processed by the reducer

Answer 512

The first stage of MapReduce is known as _________, during which the dataset file is divided into multiple smaller splits

Answer 513

processes each key-value pair as per the user-defined logic and further generates a key-value pair as its output

Answer 514

A _________________ is essentially a reducer function that groups a mapper's output locally, on the same node as the mapper

Answer 515

A reducer function can be used as a combiner function, or a custom user-defined function can be used

Answer 516

The MapReduce engine combines all values for a given key from the mapper output, creating multiple key-value pairs as input to the combiner where the key is not repeated and the value exists as a list of all corresponding values for that key
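The grouping of a mapper's output and the local summarization done by a combiner can be sketched as follows. The helper names `group_by_key` and `combiner` and the word-count data are assumptions for illustration only.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values for each key so that no key is repeated."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def combiner(key, values):
    """Summarize one mapper's values locally, here by summing the counts."""
    return (key, sum(values))

mapper_output = [("sky", 1), ("is", 1), ("blue", 1), ("sky", 1), ("is", 1)]
grouped = group_by_key(mapper_output)
combined = sorted(combiner(k, vs) for k, vs in grouped.items())
```

Five pairs leave the mapper but only three leave the combiner, so less data has to cross the network on its way to the reducers.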

Answer 517

The ___________ stage is only an optimization stage, therefore it may not even be called by the MapReduce engine.

Answer 518

During this stage, if more than one reducer is involved, a partitioner divides the output from the mapper or combiner (if specified and called by the MapReduce engine) into partitions between reducer instances

Answer 519

The number of partitions equals the number of reducers

Answer 520

Is only an optimization stage, therefore it may not even be called by the MapReduce engine

Answer 521

Although each partition contains multiple key-value pairs, all records for a particular key are within the same partition

Answer 522

The MapReduce engine guarantees a random and fair distribution between reducers while making sure that all of the same keys across multiple mappers end up with the same reducer instance
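One way to picture how every record for a key reaches the same reducer is a hash-based partitioner. The sketch below is an assumption about how such a partitioner could be written, not the engine's actual implementation; it uses `zlib.crc32` simply because it gives the same result on every run, and the function names are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index; the same key always gets the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_output(pairs, num_reducers):
    """Divide mapper or combiner output into one partition per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions

parts = partition_output([("sky", 2), ("is", 2), ("sky", 1), ("blue", 1)], 3)
```

The number of partitions equals the number of reducers, and both `sky` pairs land in the same partition even though they may come from different mappers.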

Answer 523

depending on the nature of the job, certain reducers can sometimes receive a large number of key-value pairs compared to others. As a result of this uneven workload, some reducers will finish earlier than others

Answer 524

Overall, this is less efficient and leads to longer job execution times than if the work was evenly split across reducers

Answer 525

this can be rectified by customizing the partitioning logic in order to guarantee a fair distribution of key-value pairs

Answer 526

This is the last stage of the map task

Answer 527

During the first stage of the reduce task, output from all partitioners is copied across the network to the nodes running the reduce task. This is known as _____________

Answer 528

Is the final stage of the reduce task

Answer 529

The list-based key-value output from each partitioner can contain the same key multiple times

Answer 530

Next, the MapReduce engine automatically groups and sorts the key-value pairs according to the keys so that the output contains a sorted list of all input keys (and their values) with the same keys appearing together

Answer 531

To view the full output from the MapReduce job, all the file parts must be combined

Answer 532

The way keys are grouped and sorted can be customized

Answer 533

The MapReduce engine then merges each group of keys together before the shuffle and sort output is processed by the reducer function

Answer 534

This merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values
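The sort, group and merge steps can be sketched with the standard library. The function name `shuffle_and_sort` and the sample partition outputs are hypothetical; a real engine also copies the data across the network, which is not modelled here.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(partition_outputs):
    """Gather pairs from all partitioners, sort by key, merge each key group."""
    pairs = [pair for output in partition_outputs for pair in output]
    pairs.sort(key=itemgetter(0))  # identical keys now sit next to each other
    return [(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

merged = shuffle_and_sort([[("sky", 2), ("blue", 1)],
                           [("sky", 1), ("is", 3)]])
```

The key `sky` arrives twice, once from each partitioner output, and leaves as a single pair whose value is the list of both counts.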

Answer 535

The way keys are grouped and sorted can be customized

Answer 536

Is the final stage of the reduce task

Answer 537

Depending on the user-defined logic specified in the reduce function (reducer), the reducer will either further summarize its input or will emit the output without making any changes

Answer 538

Thus, for each key-value pair that a reducer receives, the list of values stored in the value part of the pair is processed and another key-value pair is written out
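Continuing the word-count idea, a reducer that summarizes the list of values for each key could look like the sketch below. `reducer` and `run_reduce` are hypothetical names, and the input is assumed to be the merged output of a shuffle and sort step.

```python
def reducer(key, values):
    """User-defined reduce logic: total the list of counts for one key."""
    yield (key, sum(values))

def run_reduce(merged_pairs):
    """Call the reducer once per key-value pair and collect what it emits."""
    output = []
    for key, values in merged_pairs:
        output.extend(reducer(key, values))
    return output

reduce_output = run_reduce([("blue", [1]), ("is", [3]), ("sky", [2, 1])])
```

Because the reducer is a generator, the same shape also covers a reducer that emits nothing for a key or several pairs for one key.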

Answer 539

The output key can either be the same as the input key or a substring value from the input value, or another serializable user-defined object

Answer 540

The output value can either be the same as the input value or a substring value from the input value, or another serializable user-defined object

Answer 541

__________________Just like the mapper, for the input key-value pair, a reducer may not produce any output key-value pair (filtering) or can generate multiple key-value pairs (demultiplexing)

Answer 542

Is essentially a reducer function that groups a mapper's output locally, on the same node as the mapper

Answer 543

The output of the reducer, that is, the key-value pairs, is then written out as a separate file, one file per reducer

Answer 544

To view the full output from the MapReduce job, all the file parts must be combined

Answer 545

the number of reducers can be customized

Answer 546

Different sub-datasets are spread across multiple nodes and are processed using the same algorithm

Answer 547

Refers to the parallelization of data processing by dividing a task into sub-tasks and running each sub-task on a separate processor, generally on a separate node in a cluster

Answer 548

Each sub-task generally executes a different algorithm, with its own copy of the same data or different data as its input, in parallel

Answer 549

Generally the output from multiple sub-tasks is joined together to obtain the final set of results
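Task parallelism can be sketched by running two different algorithms over the same data at the same time and then joining their outputs. The example uses threads on one machine purely for illustration; the function names and sample data are assumptions, and a cluster would place each sub-task on a separate node.

```python
from concurrent.futures import ThreadPoolExecutor

data = [4, 8, 15, 16, 23, 42]

def total(values):
    """Sub-task 1: one algorithm."""
    return sum(values)

def largest(values):
    """Sub-task 2: a different algorithm over the same data."""
    return max(values)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {"total": pool.submit(total, data),
               "largest": pool.submit(largest, data)}
    # Join the outputs of the sub-tasks into the final set of results.
    results = {name: future.result() for name, future in futures.items()}
```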

Answer 550

Refers to the parallelization of data processing by dividing a dataset into multiple sub-datasets and processing each sub-dataset in parallel

Answer 551

Each sub-task generally executes a different algorithm, with its own copy of the same data or different data as its input, in parallel

Answer 552

Different sub-datasets are spread across multiple nodes and are processed using the same algorithm

Answer 553

Generally the output from each processed sub-dataset is joined together to obtain the final set of results
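Data parallelism can be sketched by applying one algorithm to several sub-datasets at once and joining the partial results. Threads on a single machine stand in for cluster nodes here; `count_words` and the sample sub-datasets are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(lines):
    """The same algorithm is applied to every sub-dataset."""
    return sum(len(line.split()) for line in lines)

sub_datasets = [["sky is blue", "grass is green"],
                ["sky is high"]]

with ThreadPoolExecutor() as pool:
    # Each sub-dataset is processed in parallel with identical logic.
    partial_counts = list(pool.map(count_words, sub_datasets))

# Join the outputs of the processed sub-datasets into the final result.
total_words = sum(partial_counts)
```

This is the approach MapReduce takes: the data is divided, and each part is processed by its own instance of the same function.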

Answer 554

Within Big Data environments, the same task generally needs to be performed repeatedly on a data unit, such as a record, where the complete dataset is distributed across multiple locations due to its large size

Answer 555

MapReduce addresses this requirement by employing the data parallelism approach, where the data is divided into splits

Answer 556

Different sub-datasets are spread across multiple nodes and are processed using the same algorithm

Answer 557

Each split is then processed by its own instance of the map function, which contains the same processing logic as the other map functions

Answer 558

The majority of traditional algorithmic development follows a sequential approach where operations on data are performed one after the other in such a way that each subsequent operation is dependent on its preceding operation. In MapReduce, operations are divided among the map and reduce functions

Answer 559

Map and reduce tasks are independent and, in turn, run isolated from each other

Answer 560

Each instance of a map or reduce function runs independently of other instances

Answer 561

The logic within the reduce function is dependent on the output of the map function, in particular, which keys were emitted from the map function as the reduce function receives a unique key with a consolidated list of all of its values

Answer 562

Relatively simplistic algorithmic logic, such that the required result can be obtained by applying the same logic to different portions of a dataset in parallel, and then aggregating the results in some manner

Answer 563

requires a highly scalable processing engine that can be linearly scaled

Answer 564

Availability of the dataset in a distributed manner, partitioned across a cluster so that multiple map functions can process different subsets of a dataset in parallel

Answer 565

understanding of the data structure within the dataset so that a meaningful data unit (a single record) can be chosen

Answer 566

Dividing algorithmic logic into map and reduce functions so that the logic in the map function is not dependent on the complete dataset, as only data within a single split is available

Answer 567

Emitting the correct key from the map function along with all the required data as value, because the reduce function's logic can only process those values that were emitted as part of the key-value pairs from the map function

Answer 568

Emitting the correct key from the reduce function along with the required data as value, because the output from each reduce function becomes the final output of the MapReduce algorithm

	Created by Carolina Colorado about 8 years ago
