Module 11: Advanced Big Data Architecture

Question

Operational Data Store (ODS)

Answer 1

As an EDW contains large amounts of data, it is of particular interest when designing an architecture for a Big Data platform. It not only serves as a data source but also as the default interface through which various BI and analysis activities are carried out.

Answer 2

Although a single EDW can house multiple ODSs, because their primary role is to facilitate near-realtime reporting, their use is optional.

Answer 3

On the other hand, Big Data consists mostly of unstructured data. Unless analyzed, the data may not have any value. Big Data analysis requires data to be stored in its raw form without being modeled first. Once the data is collected, the exploratory phase separates signal (valuable data) from noise.

Answer 4

EDWs contain high value data that has gone through rigorous validation and quality control checks

Answer 5

staging area

Answer 6

operational data store (ODS)

Answer 7

analytical database

Answer 8

Although a single EDW can house multiple ODSs, because their primary role is to facilitate near-realtime reporting, their use is optional.

Answer 9

It may not be possible to extract data from all systems at the same time because of various technical or business-related issues. A storage buffer is therefore required to hold data extracted from different systems at varying times and with differing frequencies

Answer 10

It is generally an insert/read-only database utilizing either shared-nothing MPP architecture or shared-everything architecture. Data is fed from the data warehouse into the analytical database at regular intervals

Answer 11

It usually includes an ETL process that ferries data from source systems into a temporary storage area. This process also contains data cleansing, validation and model transformation operations

Answer 12

generally contains recent data. However, the degree of “data freshness” depends upon the reporting requirements. As a result, the range of data stored may span from hours to months

Answer 13

a relational database that acts as the single version of truth for the enterprise by storing standardized data from across the enterprise in a denormalized form that is fit for reporting and data analysis

Answer 14

stores data related to various business entities, such as products or customers. Unlike an OLTP system, data is either inserted or retrieved but not updated in a data warehouse

Answer 15

the queries are generally more complex, involving multiple tables spanning a longer range of data.

Answer 16

Although the historical data can go back up to several years, the freshness of the current data depends on an enterprise’s reporting and analysis requirements

Answer 17

Some basic level of data model transformation and denormalization may also be performed in support of efficient reporting

Answer 18

provides a particular view on the data held in the data warehouse. Although it makes data analysis and reporting easier and faster because the stored data is highly customized according to the specific requirements, it does result in data redundancy.

Answer 19

contains large amounts of data, it is of particular interest when designing an architecture for a Big Data platform

Answer 20

It is generally an insert/read-only database utilizing either shared-nothing MPP architecture or shared-everything architecture

Answer 21

Data is highly standardized because it has gone through data cleansing, validation, quality and de-duplication processes, further suggesting that the data is of high value

Answer 22

Some basic level of data model transformation and denormalization may also be performed in support of efficient reporting

Answer 23

These are generally expensive and may come bundled with the required hardware and software in the form of an appliance

Answer 24

Contain high value data that has gone through rigorous validation and quality control checks

Answer 25

On the other hand, Big Data datasets must be stored in their raw unstructured forms, and their values are unknown

Answer 26

Big Data requires a repository that acts as a sink for a variety of data sources where data is stored as is

Answer 27

Stores data related to various business entities, such as products or customers

Answer 28

Big Data requires a distributed and highly scalable storage and processing architecture with scale-out support

Answer 29

Most implementations of the Big Data appliance enable realtime and near-realtime analytics without the need for integrating multiple disparate technologies

Answer 30

A batch processing engine, such as MapReduce, can be used to convert semi- and unstructured data into meaningful structured data
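
As a rough illustration of this conversion, the sketch below imitates the map, shuffle and reduce steps in plain Python rather than on a real MapReduce cluster. The log format, the sample lines and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical semi-structured input: free-form web log lines.
RAW_LINES = [
    "2024-01-01 GET /home 200",
    "2024-01-01 GET /cart 500",
    "malformed line",
    "2024-01-02 GET /home 200",
]


def map_phase(line):
    """Emit (key, value) pairs; lines that do not parse emit nothing."""
    parts = line.split()
    if len(parts) != 4:
        return []
    _date, _method, path, _status = parts
    return [(path, 1)]


def shuffle(pairs):
    """Group values by key, as the engine would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse each group into one structured record."""
    return {"path": key, "hits": sum(values)}


def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return [reduce_phase(key, values) for key, values in sorted(grouped.items())]


STRUCTURED = run_job(RAW_LINES)
```

The output is a list of uniform records that could be loaded into a relational table, which is the sense in which the raw lines become structured data.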

Answer 31

The next-generation data warehouse consists of heterogeneous technologies providing support for structured as well as semi- and unstructured data storage and analysis

Answer 32

The introduction of the Big Data platform in this configuration is comparatively less disruptive because the Big Data platform is essentially an add-on module for processing semi- and unstructured data

Answer 33

Provides a highly scalable data storage and processing environment

Answer 34

BI tools and other analytical applications are unable to make use of the Big Data platform directly

Answer 35

The implementation and maintenance of the interconnect can become complex if it incorporates complicated data processing, such as translation between different data types

Answer 36

relational and non-relational storage

Answer 37

configuration, management and application development environments

Answer 38

an interconnect (between data storage and processing resources)

Answer 39

is analogous to the parallel approach and is also known as the logical data warehouse

Answer 40

It requires complex initial configuration, which usually results in consultation costs

Answer 41

It is generally implemented as Data-as-a-Service (DaaS) by applying service-orientation principles.

Answer 42

This approach makes non-relational data (Big Data datasets) more accessible through the use of standardized interfaces

Answer 43

Is generally implemented through complex software that can be expensive to acquire

Answer 44

Can be utilized as a technology-enabler for Big Data under such circumstances

Answer 45

Ingested data is stored in a distributed file system

Answer 46

A single dataset may be of interest to multiple clients developed using different technologies that require data to be available in a specific format

Answer 47

Specialized form of distributed computing that introduces utilization models for remotely provisioning scalable and measured IT resources

Answer 48

Processing and storage technologies that use cluster-based processing and storage resources

Answer 49

The on-demand and elastic nature provides the ability for a much quicker setup of a Big Data platform

Answer 50

Has the potential to provide the basic components for a Big Data solution environment, including data, storage and processing resources

Answer 51

Whether processing data in batch or realtime mode, the pay-per-use model can be fully utilized to build a cluster whose size can be regulated based on the volume and velocity characteristics of Big Data

Answer 52

Infrastructure-as-a-Service (IaaS)

Answer 53

Platform-as-a-Service (PaaS)

Answer 54

Software-as-a-Service (SaaS)

Answer 55

Component-as-a-Service (CaaS)

Answer 56

Heterogeneous Cloud

Answer 57

Private Cloud

Answer 58

Managed Cloud

Answer 59

Hybrid Cloud

Answer 60

Is ideal for enterprises that initially built up Big Data analytics in-house but now want to scale out.

Answer 61

Can be used when input datasets are already stored in the cloud

Answer 62

Is generally less secure but more scalable due to larger pooling of storage and processing resources

Answer 63

It is also ideal when datasets reside within an enterprise’s firewall.

Answer 64

It is also ideal when workloads vary

Answer 65

Is generally less secure but more scalable due to larger pooling of storage and processing resources

Answer 66

It is also ideal when datasets reside within an enterprise’s firewall

Answer 67

Can help develop low latency data analysis capabilities

Answer 68

It is also ideal when workloads vary

Answer 69

Can be used when input datasets are already stored in the cloud

Answer 70

Is a suitable choice when starting a Big Data project

Answer 71

is a suitable choice when using a combination of sensitive data and public datasets

Answer 72

data privacy

Answer 73

regulatory compliance

Answer 74

network connectivity

Answer 75

data virtualization

Answer 76

Cloud-based Big Data Analysis

Answer 77

Cloud-based Big Data Visualization

Answer 78

Cloud-based Big Data Storage

Answer 79

Cloud-based Big Data Processing

Answer 80

This pattern can also be employed when the data sources, such as the CRM system, reside in the same cloud (faster data transfer) or a proof-of-concept is being developed

Answer 81

This ability to store raw data spanning over longer periods of time increases the overall potential of finding valuable insights

Answer 82

Represents a solution environment comprised of inexpensive NoSQL storage

Answer 83

Is associated with the storage device (distributed file system/NoSQL) and data transfer engine mechanisms

Answer 84

This generally involves the use of NoSQL databases such that the downstream applications can communicate directly with these databases using RESTful APIs

Answer 85

The underlying idea is to be able to ingest large amounts of raw data and pre-process it in order to make it suitable for traditional enterprise systems

Answer 86

Keeping multiple copies of the same dataset in different formats is not only inefficient but also adds operational and storage overheads

Answer 87

The involved operations can include data cleansing, validation, model transformation and format transformation, as well as the joining of disparate datasets

Answer 88

Poly Source

Answer 89

Large-Scale Batch Processing

Answer 90

High Volume Tabular Storage

Answer 91

Large-Scale Graph Processing

Answer 92

Ingesting large amounts of data in order to calculate certain statistics or execute a machine learning algorithm and then feed the results to enterprise systems

Answer 93

This generally involves the use of NoSQL databases such that the downstream applications can communicate directly with these databases using RESTful APIs

Answer 94

The underlying idea is to be able to ingest large amounts of raw data and pre-process it in order to make it suitable for traditional enterprise systems

Answer 95

A dedicated storage layer helps store, pre-process and further integrate data with structured data without impacting the current storage infrastructure

Answer 96

High Volume Tabular Storage

Answer 97

Large-Scale Graph Processing

Answer 98

Canonical Data Format

Answer 99

Data Size Reduction

Answer 100

Warrants the use of a memory-based storage device with random read and write capability.

Answer 101

Keeping multiple copies of the same dataset in different formats is not only inefficient but also adds operational and storage overheads

Answer 102

A separate connector is used to connect to a particular query engine or the storage device

Answer 103

The ingested data is stored to the distributed file system, where it is enriched via batch processing and then stored on a NoSQL database

Answer 104

Ingested data is stored to the distributed file system, where it is enriched via batch processing and then stored on a NoSQL database

Answer 105

Exporting the data in the form of a file, importing it into a database and then connecting the analytics tool to the database is not a viable option

Answer 106

Is associated with the serialization engine, data transfer engine, storage device and processing engine mechanisms

Answer 107

The use of disk-based storage devices can severely impact the processing time of data

Answer 108

Greatly helps in speeding up data analysis and reduces dependence on IT personnel for data analysis tasks

Answer 109

Incurs increased cost because memory-based storage devices are expensive when compared with disk-based storage devices

Answer 110

Keeping multiple copies of the same dataset in different formats is not only inefficient but also adds operational and storage overheads

Answer 111

Is generally employed by enterprises that have just embarked on a Big Data journey

Answer 112

The results are fed directly to various downstream applications, such as an e-commerce application

Answer 113

Is generally employed by enterprises that have just embarked on a Big Data journey

Answer 114

Represents a standalone solution environment

Answer 115

Offloads existing databases from having to perform complex and long-running data transformation jobs on large datasets

Answer 116

Poly Storage

Answer 117

Poly Source

Answer 118

Confidential Data Storage

Answer 119

In the case of a clustering algorithm applied to a customer dataset for finding customer cohorts

Answer 120

Is generally opted for by enterprises that want to move towards predictive and prescriptive analytics by creating richer statistical and machine learning models

Answer 121

Can be applied in such a case to ensure that even if malicious users get access to sensitive data, they are unable to read and make use of it

Answer 122

This approach provides a better alternative in terms of uploading data to the cloud as well as data security and privacy issues

Answer 123

A dedicated storage layer helps store, pre-process and further integrate data with structured data without impacting the current storage infrastructure

Answer 124

It involves traversing through a large number of nodes (entities) via their defined edges (links).
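
A minimal sketch of such a traversal, assuming a small in-memory graph held as an adjacency list. A large-scale graph processing engine would distribute this work across a cluster. The graph contents and the function name are hypothetical.

```python
from collections import deque

# Hypothetical graph: each node (entity) maps to the nodes it links to.
GRAPH = {
    "ann": ["bob", "cid"],
    "bob": ["dee"],
    "cid": ["dee"],
    "dee": [],
}


def reachable(graph, start):
    """Breadth-first traversal returning nodes in the order they are visited."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


VISITED = reachable(GRAPH, "ann")
```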

Answer 125

This approach provides a better alternative in terms of uploading data to the cloud as well as data security and privacy issues

Answer 126

Storing and analyzing very large amounts of structured, unstructured and semi-structured Big Data datasets

Answer 127

The analytical operations performed in support of BI, data mining and creating statistical and machine learning models do not affect the performance

Answer 128

This configuration is generally opted for by enterprises that want to move towards predictive and prescriptive analytics by creating richer statistical and machine learning models

Answer 129

Capable of ingesting and storing large amounts of semi-structured and unstructured data to develop high-fidelity statistical and machine learning models for performing predictive and prescriptive analytics

Answer 130

Although analogous to the use of a cloud, this approach provides a better alternative in terms of uploading data to the cloud as well as data security and privacy issues

Answer 131

Random Access Storage

Answer 132

Automated Dataset Execution

Answer 133

File-based Sink

Answer 134

Big Data Processing Environment

Answer 135

The ingested data is stored to the distributed file system, where it is enriched via batch processing and then stored on a NoSQL database for performing analytical queries

Answer 136

Their current storage infrastructure does not allow them to store semi-structured and unstructured data

Answer 137

A solution environment where the sole purpose of using the Big Data platform is to offload processing of large amounts of structured data

Answer 138

This approach provides a better alternative in terms of uploading data to the cloud as well as data security and privacy issues

Answer 139

Canonical Data Format

Answer 140

Relational Sink

Answer 141

Automatic Data Replication and Reconstruction

Answer 142

Automatic Data Sharding

Answer 143

Requires exporting data via a relational data transfer engine to the data warehouse

Answer 144

Can be applied in such a case to ensure that even if malicious users get access to sensitive data

Answer 145

Is a solution environment comprised of inexpensive storage used to store large amounts of data from both internal and external data sources in an online fashion ready for consumption by any enterprise system

Answer 146

Enable the processing of datasets, which requires the use of a batch processing engine

Answer 147

Storing and analyzing very large amounts of structured, unstructured and semi-structured Big Data datasets

Answer 148

A solution environment comprised of inexpensive storage used to store large amounts of data from both internal and external data sources in an online fashion ready for consumption by any enterprise system

Answer 149

Large data volumes are available and the data itself has not lost its value because it is kept unprocessed in its raw form

Answer 150

The sole purpose of using the Big Data platform is to offload processing of large amounts of structured data

Answer 151

Automated Dataset Execution

Answer 152

Streaming Access Storage

Answer 153

Random Access Storage

Answer 154

Canonical Data Format

Answer 155

This configuration is generally opted for by enterprises that want to move towards predictive and prescriptive analytics by creating richer statistical and machine learning models

Answer 156

Large data volumes are available and the data itself has not lost its value because it is kept unprocessed in its raw form

Answer 157

Storing and analyzing very large amounts of structured, unstructured and semi-structured Big Data datasets

Answer 158

Data from structured sources and from unstructured sources can first be stored on a distributed file system

Answer 159

Automatic Data Sharding

Answer 160

Canonical Data Format

Answer 161

Random Access Storage

Answer 162

Confidential Data Storage

Answer 163

a solution environment comprised of inexpensive NoSQL storage that is utilized as ___________ where large amounts of transactional data from operational systems across the enterprise are collected for operational BI and reporting

Answer 164

Data from structured sources and from unstructured sources can first be stored on a distributed file system

Answer 165

Large data volumes are available and the data itself has not lost its value because it is kept unprocessed in its raw form

Answer 166

Larger amounts of data that spread over longer time periods can be stored, thereby providing the opportunity to enrich operational BI

Answer 167

High Volume Tabular Storage

Answer 168

Relational Sink

Answer 169

Indirect Data Access

Answer 170

Automated Dataset Execution

Answer 171

The data can be imported into fit-for-purpose NoSQL databases, where it can be easily accessed in support of BI, reporting and other analytical use cases

Answer 172

Enable access to pre-processed data or analysis results stored in a Big Data solution environment via existing BI tools

Answer 173

A solution environment comprised of inexpensive NoSQL storage

Answer 174

Enable the processing of such datasets, which requires the use of a batch processing engine

Answer 175

The data can be imported into fit-for-purpose NoSQL databases, where it can be easily accessed in support of BI, reporting and other analytical use cases

Answer 176

A solution environment capable of processing streams of data in realtime or near-realtime, such as performing analytics on machine-generated or social media data

Answer 177

The streaming data can be stored in disk-based storage, such as the distributed file system, for further analysis

Answer 178

Enable the processing of such datasets, which requires the use of a batch processing engine

Answer 179

Large-Scale Batch Processing

Answer 180

Streaming Source

Answer 181

Automatic Data Replication and Reconstruction

Answer 182

Data Size Reduction

Answer 183

Enable the immediate export of results

Answer 184

Scenarios where the data needs processing as it arrives to obtain immediate results

Answer 185

A solution environment capable of processing streams of data in realtime or near-realtime

Answer 186

Enable access to pre-processed data or analysis results stored in a Big Data solution environment via existing BI tools

Answer 187

Storing high-volume and high-variety data in order to perform various analytics in isolation from other enterprise systems

Answer 188

Data needs processing as it arrives to obtain immediate results

Answer 189

Provide integration with the enterprise identity and access management systems (IAMs)

Answer 190

Enable the immediate export of results

Answer 191

Centralized Dataset Governance

Answer 192

Fan-in Ingress

Answer 193

Centralized Dataset Management

Answer 194

Streaming Egress Pattern

Answer 195

Provides a means for performing a range of data governance tasks from a central location

Answer 196

Provide integration with the enterprise identity and access management systems (IAMs)

Answer 197

Maintain data lineage and details about operations performed on the data across multiple processing stages

Answer 198

Enable policy-based access to resources within the Big Data platform via a central interface

Answer 199

Provides a means for performing a range of data governance tasks from a central location

Answer 200

Enable policy-based access to resources within the Big Data platform via a central interface

Answer 201

Can be used to provide integration with the enterprise identity and access management systems (IAMs)

Answer 202

Is associated with the processing engine, storage device, query engine and productivity portal mechanisms

Answer 203

A security engine is used to enable single sign-on (SSO) functionality that generally works on the basis of trusting the IAM system for user authentication via the use of tokens

Answer 204

Provides a means for performing a range of data governance tasks from a central location

Answer 205

In order to have maximum confidence in the processing results, there needs to be a way to retrace the processing steps that were taken

Answer 206

Data merging may be required because the data is too fine-grained or arrives out of order, whether due to network latency or to factors beyond the control of the enterprise

Answer 207

Data merging may be required because the data is too fine-grained or arrives out of order, whether due to network latency or to factors beyond the control of the enterprise

Answer 208

Can be applied to maintain data lineage and details about operations performed on the data across multiple processing stages

Answer 209

Intermediate output from each stage is persisted temporarily to a storage device until the final result is computed and validated
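
One way to picture this is the sketch below, which assumes each stage output fits in a JSON file on local disk. A real platform would more likely persist to a distributed file system. The stage names, file naming scheme and helper functions are hypothetical.

```python
import json
import os
import tempfile


def run_pipeline(data, stages, workdir):
    """Run stages in order, reusing any stage output that is already persisted."""
    executed = []
    for index, (name, func) in enumerate(stages):
        path = os.path.join(workdir, f"{index}_{name}.json")
        if os.path.exists(path):
            # A previous run got this far: load the saved output instead.
            with open(path) as handle:
                data = json.load(handle)
        else:
            data = func(data)
            with open(path, "w") as handle:
                json.dump(data, handle)
            executed.append(name)
    return data, executed


def cleanup(workdir):
    """Delete intermediate output once the final result has been validated."""
    for filename in os.listdir(workdir):
        os.remove(os.path.join(workdir, filename))


WORKDIR = tempfile.mkdtemp()
STAGES = [
    ("clean", lambda rows: [r for r in rows if r is not None]),
    ("double", lambda rows: [2 * r for r in rows]),
]
FIRST = run_pipeline([1, None, 3], STAGES, WORKDIR)   # runs both stages
SECOND = run_pipeline([1, None, 3], STAGES, WORKDIR)  # reuses persisted output
```

If a later stage fails or produces a wrong result, only that stage needs to be rerun after its file is removed, because the earlier outputs are still on disk.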

Answer 210

If the final results are incorrect, the entire series of steps needs to be executed from scratch even if the results halfway were correct

Answer 211

Intermediate output from each stage is persisted temporarily to a storage device until the final result is computed and validated

Answer 212

Can be applied to maintain data lineage and details about operations performed on the data across multiple processing stages

Answer 213

In order to have maximum confidence in the processing results, there needs to be a way to retrace the processing steps that were taken

Answer 214

Data needs to be simultaneously processed using different sub-systems

Answer 215

The application of this design pattern requires the automated addition of metadata, based on a machine-readable standardized structure, during each stage of data processing
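
The automated addition of metadata could look roughly like the following sketch, in which a wrapper appends one lineage entry per processing stage. The entry fields (`stage`, `rows_in`, `rows_out`, `at`) are an assumed structure for illustration, not a published standard.

```python
from datetime import datetime, timezone


def with_lineage(stage_name, func):
    """Wrap a stage so that every run appends a machine-readable lineage entry."""
    def wrapped(dataset):
        records = func(dataset["records"])
        entry = {
            "stage": stage_name,
            "rows_in": len(dataset["records"]),
            "rows_out": len(records),
            "at": datetime.now(timezone.utc).isoformat(),
        }
        # Return a new dataset so the earlier lineage list is never mutated.
        return {"records": records, "lineage": dataset["lineage"] + [entry]}
    return wrapped


drop_nulls = with_lineage("drop_nulls", lambda rows: [r for r in rows if r is not None])
to_cents = with_lineage("to_cents", lambda rows: [round(r * 100) for r in rows])

RAW = {"records": [1.5, None, 2.25], "lineage": []}
RESULT = to_cents(drop_nulls(RAW))
```

Reading `RESULT["lineage"]` from first entry to last retraces the processing steps that produced the final records.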

Answer 216

Provides scalability in the context of being able to add more data consumers via a simple configuration

Answer 217

Is applied when data needs to be simultaneously processed using different sub-systems

Answer 218

Can be applied to implement logic that merges data originating from multiple sources and generally applies to situations where data is acquired in realtime
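
A minimal sketch of such merge logic, assuming each source delivers events already sorted by timestamp and that an event is a `(timestamp, source, value)` tuple. The sensor data and the function name are hypothetical.

```python
import heapq

# Hypothetical per-source event streams, each sorted by timestamp.
SENSOR_A = [(1, "a", 20.5), (4, "a", 20.7)]
SENSOR_B = [(2, "b", 19.0), (3, "b", 19.2)]


def fan_in(*streams):
    """Merge sorted per-source streams into one stream ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))


MERGED = fan_in(SENSOR_A, SENSOR_B)
```

`heapq.merge` is lazy, so dropping the `list(...)` call would allow the same logic to run over streams that are still arriving. Events that arrive out of order within a single source would need buffering first, which this sketch does not attempt.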

Answer 219

Intermediate output from each stage is persisted temporarily to a storage device until the final result is computed and validated

Answer 220

Is applied when data needs to be simultaneously processed using different sub-systems

Answer 221

Maintain data lineage and details about operations performed on the data across multiple processing stages

Answer 222

Data is copied from the source location, stored in the queue and then forwarded to the interested subscribers
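
The copy, queue and forward steps can be sketched with an in-memory stand-in for a publish-subscribe queue. The class and method names are hypothetical, and a production system would use a dedicated messaging product instead.

```python
from collections import deque


class FanOutQueue:
    """Minimal in-memory stand-in for a publish-subscribe queue."""

    def __init__(self):
        self._queue = deque()
        self._subscribers = []

    def subscribe(self, callback):
        """Register a consumer; adding one requires no change to the source."""
        self._subscribers.append(callback)

    def publish(self, record):
        # Store a copy so the source can keep changing its own record.
        self._queue.append(dict(record))

    def dispatch(self):
        """Forward each queued record to every subscriber; return deliveries made."""
        sent = 0
        while self._queue:
            record = self._queue.popleft()
            for callback in self._subscribers:
                callback(dict(record))
                sent += 1
        return sent


BATCH_INBOX, REALTIME_INBOX = [], []
QUEUE = FanOutQueue()
QUEUE.subscribe(BATCH_INBOX.append)
QUEUE.subscribe(REALTIME_INBOX.append)
QUEUE.publish({"id": 1})
QUEUE.publish({"id": 2})
DELIVERIES = QUEUE.dispatch()
```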

Answer 223

Online Data Repository

Answer 224

Unstructured Data Store

Answer 225

Big Data Warehouse

Answer 226

Operational Data Store

Answer 227

Analytical Sandbox

Answer 228

Unstructured Data Store

Answer 229

Data Transformation

Answer 230

Big Data Warehouse

Answer 231

Batch Data Processing

Answer 232

Operational Data Store

Answer 233

Analytical Sandbox

Answer 234

Online Data Repository

Answer 235

Big Data Warehouse

Answer 236

Online Data Repository

Answer 237

Application Enhancement

Answer 238

Realtime Data Processing

Answer 239

Application Enhancement

Answer 240

Online Data Repository

Answer 241

Batch Data Processing

Answer 242

Realtime Data Processing

Answer 243

Realtime Data Processing

Answer 244

Batch Data Processing

Answer 245

Data Transformation

Answer 246

Analytical Sandbox

Answer 247

Data Transformation

Answer 248

Application Enhancement

Answer 249

Big Data Warehouse

Answer 250

Online Data Repository

Answer 251

Big Data Warehouse

Answer 252

Batch Data Processing

Answer 253

Operational Data Store

Answer 254

Analytical Sandbox

Answer 255

Online Data Repository

Answer 256

Operational Data Store

Answer 257

Big Data Warehouse

Answer 258

Unstructured Data Store

Answer 259

The sole purpose of using this kind of platform is to offload processing of large amounts of structured data

Answer 260

Type of Big Data solution architecture that is comprised of multiple layers and forms the basis for developing highly scalable, available, eventually consistent, fault tolerant and low latency realtime Big Data solutions

Answer 261

Uses a combination of both realtime and batch components that operate in parallel to process data without any delay

Answer 262

Additional processing is generally required to put the data in the correct structure

Answer 263

Indexed View

Answer 264

Normalization

Answer 265

Denormalization

Answer 266

Polyglot Persistence

Answer 267

Polyglot Persistence

Answer 268

Denormalization

Answer 269

Polyglot persistence

Answer 270

Recomputation Algorithm

Answer 271

Recomputation Algorithm

Answer 272

Incremental/Approximate Algorithm

Answer 273

Recomputation Algorithm

Answer 274

Incremental/Approximate Algorithm

Answer 275

Incremental/Approximate Algorithm

Answer 276

Replication

Answer 277

recomputation algorithm

Answer 278

Replication

Answer 279

Recomputation Algorithm

Answer 280

incremental algorithm

Answer 281

Replication

Answer 282

Denormalization

Answer 283

Normalization

Answer 284

Replication

Answer 285

Denormalization

Answer 286

This not only helps process voluminous data faster but also helps cater to infrequent or ad-hoc data processing requests that require above-average storage and processing resources

Answer 287

Data architectures are becoming difficult to design and maintain due to the ever-increasing volume, velocity and variety of data.

Answer 288

Efficient data storage and efficient querying have incompatible requirements that require following different strategies

Answer 289

Data is either stored in a disk-based NoSQL or a memory-based storage device, which can be a NoSQL or some other cluster-based storage technology, that enables low latency data access to perform realtime or near-realtime analytics

Answer 290

Processes raw data by employing both realtime and batch data processing techniques in parallel

Answer 291

Maintain data lineage and details about operations performed on the data across multiple processing stages

Answer 292

The results generated by realtime processing are based on incremental algorithms that may not be consistent/accurate

Answer 293

Batch data processing eliminates the complexity of maintaining data consistency across nodes by storing only immutable data

Answer 294

Processing of raw data

Answer 295

Storage of raw data

Answer 296

Ad hoc reporting

Answer 297

Calculation of views

Answer 298

Uses incremental algorithms and processes comparatively smaller amounts of data to provide low latency results

Answer 299

Consists of a storage device (distributed file system), batch processing engine and a workflow engine

Answer 300

Uses a recomputation algorithm to provide consistent accurate views and further provides fault-tolerance when compared with an incremental algorithm
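
The difference between the two algorithm types can be seen in a small sketch that maintains an average. The function and class names are hypothetical, and the running-mean update is only one example of an incremental technique.

```python
def recompute_average(all_values):
    """Recomputation: derive the view from the complete dataset every time."""
    return sum(all_values) / len(all_values)


class IncrementalAverage:
    """Incremental: update the view from each new value without a full scan."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


VALUES = [10, 20, 30, 40]
BATCH_RESULT = recompute_average(VALUES)

RUNNING = IncrementalAverage()
for value in VALUES:
    SPEED_RESULT = RUNNING.update(value)
```

The incremental version keeps mutable state, so a bad or lost update affects every later result. The recomputation version can always be rerun over the stored raw data, which is where its fault tolerance comes from.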

Answer 301

Comprises an enhanced version of the query engine with logic that can automatically and intelligently combine serving and speed views based on the query criteria

Answer 302

Although raw data is stored, for achieving consistency, some structure needs to be applied to the data before storage

Answer 303

The storage device used in this layer only needs to support batch write (no random write) with random read capabilities

Answer 304

As the layer follows the mutable storage model and the processing results are generated more frequently, the storage device that stores the views needs to support random writes with random reads

Answer 305

For keeping the complexity to a minimum and providing faster reads, normally a simple key-value NoSQL database is used

Answer 306

The use of an append-only and streaming data storage device keeps complexity to a minimum

Answer 307

The views created by the batch layer are not amenable to random querying, as these are generally stored in the distributed file system

Answer 308

A memory-based storage device for the storage of raw data and a memory or disk-based NoSQL storage device for the storage of views is generally used

Answer 309

Event data is captured using the event data transfer engine and is processed in memory via the realtime processing engine to create indexed views that are generally stored inside a NoSQL database

Answer 310

For easier integration, the speed and serving views should be constructed in a modular manner

Answer 311

Merging the results from views residing in the speed and serving layers for successfully executing a query

Answer 312

Once the latest batch view is available via the serving layer, the corresponding results in the realtime views can be ignored or flushed
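
A simplified sketch of how a query might combine the two views and how realtime results might be flushed. It assumes both views are plain page-to-hit-count dictionaries and that each new batch view covers everything the realtime view held. All names and numbers are hypothetical.

```python
# Hypothetical views: page -> hit count.
BATCH_VIEW = {"/home": 100, "/cart": 40}  # serving layer: data up to the last batch run
REALTIME_VIEW = {"/home": 7, "/faq": 2}   # speed layer: data received since that run


def query(page, batch_view, realtime_view):
    """Answer a query by merging the batch result with the realtime result."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


def on_batch_view_refreshed(new_batch_view, realtime_view):
    """Flush realtime results that the refreshed batch view now covers."""
    realtime_view.clear()
    return new_batch_view


BEFORE = query("/home", BATCH_VIEW, REALTIME_VIEW)
NEW_BATCH_VIEW = on_batch_view_refreshed(
    {"/home": 107, "/cart": 40, "/faq": 2}, REALTIME_VIEW
)
AFTER = query("/home", NEW_BATCH_VIEW, REALTIME_VIEW)
```

The query answer stays the same across the refresh, but after the refresh it comes entirely from the recomputed batch view.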

Answer 313

Is a high latency layer such that there is a time lag before the latest version of the views, based on fresher data, is available

Answer 314

Raw data is fed simultaneously to the batch and speed layers, generally using the same event data transfer engine

Answer 315

The batch layer can be further used for deep analytics, as it contains complete datasets

Answer 316

The limitations of the SCV principle are also relaxed

Answer 317

Although the speed layer is responsible for processing the entire set of fresh data while the corresponding batch view is not ready, it does not process the entire set as a single job because doing so adds to the latency and results in excessive resource usage

Answer 318

Algorithms for the speed layer can be complex or might need some time to understand, as they use incremental or approximation (probability)-based techniques that the batch equivalent may not be using

Answer 319

The complexity of the architecture is restricted to the speed layer, as that is where the incremental algorithms and read/write database are used

Answer 320

The immutable nature of the batch layer helps re-process data as a result of a data processing logic change that may occur due to new business requirements or a bug fix

Answer 321

Realtime data processing capability is required with consistent results

Answer 322

Realtime data processing capability is required with consistent results

Answer 323

Fault-tolerance and accuracy need to be added to the existing realtime system

Answer 324

Loss of data is not acceptable

Answer 325

Polyglot persistence by employing fit-for-purpose storage devices at each layer

Answer 326

Configuring the batch layer to process data in small batches reduces load on the speed layer

Answer 327

Raw data is fed simultaneously to the batch and speed layers, generally using the same event data transfer engine, and each layer can be implemented via a different set of technologies

Answer 328

Complexity is greatly increased, as two separate layers need building and maintaining while ensuring that each provides the same functionality

Answer 329

Requires schema adherence in the batch layer, which adds complexity, adds another step before data can actually be persisted and requires prior knowledge about the structure of the incoming data

Answer 330

Employing the same processing engine for both the speed and batch layer, such as Spark, helps keep system complexity to a minimum

Answer 331

The key-value storage model employed in the serving layer may not be sufficient for all types of query requirements

Answer 332

The immutable nature of the batch layer helps re-process data as a result of a data processing logic change that may occur due to new business requirements or a bug fix

Answer 333

A balance is required based on the processing requirements, as the throughput obtained from employing small batches may be less than from larger batches and will further require frequent updates to the serving layer
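
The trade-off can be illustrated with a toy cost model, sketched below. It assumes every batch job pays a fixed startup overhead plus a constant cost per record; the model and all numbers are made-up assumptions for illustration, not measurements.

```python
def throughput(batch_size, per_batch_overhead_s, per_record_s):
    """Records per second under the assumed cost model: fixed overhead + linear cost."""
    return batch_size / (per_batch_overhead_s + batch_size * per_record_s)


def serving_updates(total_records, batch_size):
    """Number of serving-layer updates needed (one per batch), rounded up."""
    return -(-total_records // batch_size)


OVERHEAD_S = 2.0       # assumed job startup cost per batch
PER_RECORD_S = 0.001   # assumed processing cost per record

small = throughput(1_000, OVERHEAD_S, PER_RECORD_S)
large = throughput(100_000, OVERHEAD_S, PER_RECORD_S)

small_updates = serving_updates(1_000_000, 1_000)
large_updates = serving_updates(1_000_000, 100_000)

print(round(small, 1), round(large, 1), small_updates, large_updates)
```

Under these assumptions the small batches deliver fresher views but lower throughput and far more serving-layer updates.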

Answer 334

batch layer

Answer 335

serving layer

Answer 336

speed layer

Answer 337

query layer

Answer 338

Cloud computing

Answer 339

Distributed storage system

Answer 340

Processing system

Answer 341

Storage devices

Answer 342

Distributed Storage System

Answer 343

Cloud computing

Answer 344

Processing System

Answer 345

A dedicated storage layer helps store, pre-process and further integrate data with structured data without impacting the current storage infrastructure

Answer 346

The underlying idea is to be able to ingest large amounts of raw data and pre-process it in order to make it suitable for traditional enterprise systems

Answer 347

Is ideal for enriching the EDW with unstructured data

Answer 348

This generally involves the use of NoSQL databases such that the downstream applications can communicate directly with these databases using RESTful APIs

Answer 349

A dedicated storage layer helps store, pre-process and further integrate data with structured data without impacting the current storage infrastructure

Answer 350

Certain statistics are calculated by processing large amounts of data, or a statistical/machine learning model is run

Answer 351

Solution environment capable of storing high-volume and high-variety data in order to perform various analytics in isolation from other enterprise systems

Answer 352

Examples of functionality enhancement include personalized recommendations and discounts as well as targeted advertisements

Answer 353

Although analogous to the use of a cloud, this approach provides a better alternative in terms of uploading data to the cloud as well as data security and privacy issues

Answer 354

The underlying idea is to be able to ingest large amounts of raw data and pre-process it in order to make it suitable for traditional enterprise systems

Answer 355

Is not integrated with the EDW and is instead used directly to explore data and perform analytics

Answer 356

Keep the Big Data initiative separate from existing IT operations and systems

Answer 357

This configuration is generally opted for by enterprises that want to move towards predictive and prescriptive analytics

Answer 358

Although analogous to the use of a cloud, this approach provides a better alternative in terms of uploading data to the cloud as well as data security and privacy issues

Answer 359

A dedicated storage layer helps store, pre-process and further integrate data with structured data without impacting the current storage infrastructure

Answer 360

Generally, the ingested data is stored in the distributed file system, where it is enriched via batch processing and then stored in a NoSQL database for performing analytical queries
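
The flow described here (ingest, store, enrich in batch, load for querying) can be sketched in miniature as follows. A temporary local directory stands in for the distributed file system and a plain dict stands in for the NoSQL database; these stand-ins, the record fields and the function names are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def ingest(records, landing_dir):
    """Write raw records as JSON lines (stand-in for landing data on the file system)."""
    path = Path(landing_dir) / "raw.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def enrich_and_load(path, segment_lookup):
    """Batch step: add a segment field to each record, then load it into a key-value store."""
    store = {}  # stand-in for a NoSQL database keyed by record id
    with path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            record["segment"] = segment_lookup.get(record["customer"], "unknown")
            store[record["id"]] = record
    return store


raw_records = [
    {"id": "e1", "customer": "c1", "amount": 20},
    {"id": "e2", "customer": "c9", "amount": 5},
]

with tempfile.TemporaryDirectory() as landing_dir:
    raw_path = ingest(raw_records, landing_dir)
    kv_store = enrich_and_load(raw_path, {"c1": "premium"})

print(kv_store["e1"]["segment"], kv_store["e2"]["segment"])
```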

Answer 361

Such a solution is generally employed by enterprises that have just embarked on a Big Data journey

Answer 362

Once processed, the streaming data can be stored in disk-based storage, such as the distributed file system, for further analysis

Answer 363

This not only helps process voluminous data faster but also helps cater to infrequent or ad-hoc data processing requests

Answer 364

Although analogous to the use of a cloud, this approach provides a better alternative in terms of uploading data to the cloud as well as data security and privacy issues

Answer 365

The data can be imported into fit-for-purpose NoSQL databases, where it can be easily accessed in support of BI

Answer 366

Large data volumes are available and the data itself has not lost its value because it is kept unprocessed in its raw form

Answer 367

Based on the data storage requirements, a distributed file system or a NoSQL database can be used for data storage

Answer 368

Large amounts of transactional data from operational systems across the enterprise are collected

Answer 369

This not only helps process voluminous data faster but also helps cater to infrequent or ad-hoc data processing requests that require above-average storage and processing resources

Answer 370

Setting up a cluster in-house may result in under-utilization of processing resources, as it would not be utilized at all times

Answer 371

Is associated with the processing engine, storage device, resource manager and coordination engine mechanisms

Answer 372

Enable the processing of such datasets, which requires the use of a batch processing engine

Created by Alveiro Garcia over 7 years ago
