Big Data: Principles and Best Practices of Scalable Real-Time Data Systems
ISBN: 9789351198062
328 pages
For more information write to us at: acadmktg@wiley.com

Description
This book presents the Lambda Architecture, a scalable, easy-to-understand approach to big data systems that a small team can build and run. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm and NoSQL databases.
Part 1 Batch layer
2 Data model for Big Data
2.1 The properties of data
2.2 The fact-based model for representing data
2.3 Graph schemas
2.4 A complete data model for SuperWebAnalytics.com
2.5 Summary
3 Data model for Big Data: Illustration
3.1 Why a serialization framework?
3.2 Apache Thrift
3.3 Limitations of serialization frameworks
3.4 Summary
4 Data storage on the batch layer
4.1 Storage requirements for the master dataset
4.2 Choosing a storage solution for the batch layer
4.3 How distributed filesystems work
4.4 Storing a master dataset with a distributed filesystem
4.5 Vertical partitioning
4.6 Low-level nature of distributed filesystems
4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
4.8 Summary
5 Data storage on the batch layer: Illustration
5.1 Using the Hadoop Distributed File System
5.2 Data storage in the batch layer with Pail
5.3 Storing the master dataset for SuperWebAnalytics.com
5.4 Summary
6 Batch layer
6.1 Motivating examples
6.2 Computing on the batch layer
6.3 Recomputation algorithms vs. incremental algorithms
6.4 Scalability in the batch layer
6.5 MapReduce: a paradigm for Big Data computing
6.6 Low-level nature of MapReduce
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
6.8 Summary
7 Batch layer: Illustration
7.1 An illustrative example
7.2 Common pitfalls of data-processing tools
7.3 An introduction to JCascalog
7.4 Composition
7.5 Summary
8 An example batch layer: Architecture and algorithms
8.1 Design of the SuperWebAnalytics.com batch layer
8.2 Workflow overview
8.3 Ingesting new data
8.4 URL normalization
8.5 User-identifier normalization
8.6 Deduplicate pageviews
8.7 Computing batch views
8.8 Summary
9 An example batch layer: Implementation
9.1 Starting point
9.2 Preparing the workflow
9.3 Ingesting new data
9.4 URL normalization
9.5 User-identifier normalization
9.6 Deduplicate pageviews
9.7 Computing batch views
9.8 Summary
Part 2 Serving layer
10 Serving layer
10.1 Performance metrics for the serving layer
10.2 The serving layer solution to the normalization/denormalization problem
10.3 Requirements for a serving layer database
10.4 Designing a serving layer for SuperWebAnalytics.com
10.5 Contrasting with a fully incremental solution
10.6 Summary
11 Serving layer: Illustration
11.1 Basics of ElephantDB
11.2 Building the serving layer for SuperWebAnalytics.com
11.3 Summary
Part 3 Speed layer
12 Realtime views
12.1 Computing realtime views
12.2 Storing realtime views
12.3 Challenges of incremental computation
12.4 Asynchronous versus synchronous updates
12.5 Expiring realtime views
12.6 Summary
13 Realtime views: Illustration
13.1 Cassandra’s data model
13.2 Using Cassandra
13.3 Summary
14 Queuing and stream processing
14.1 Queuing
14.2 Stream processing
14.3 Higher-level, one-at-a-time stream processing
14.4 SuperWebAnalytics.com speed layer
14.5 Summary
15 Queuing and stream processing: Illustration
15.1 Defining topologies with Apache Storm
15.2 Apache Storm clusters and deployment
15.3 Guaranteeing message processing
15.4 Implementing the SuperWebAnalytics.com uniques-over-time speed layer
15.5 Summary
16 Micro-batch stream processing
16.1 Achieving exactly-once semantics
16.2 Core concepts of micro-batch stream processing
16.3 Extending pipe diagrams for micro-batch processing
16.4 Finishing the speed layer for SuperWebAnalytics.com
16.5 Pageviews over time; bounce-rate analysis
16.6 Another look at the bounce-rate-analysis example
16.7 Summary
17 Micro-batch stream processing: Illustration
17.1 Using Trident
17.2 Finishing the SuperWebAnalytics.com speed layer
17.3 Fully fault-tolerant, in-memory, micro-batch processing
17.4 Summary
18 Lambda Architecture in depth
18.1 Defining data systems
18.2 Batch and serving layers
18.3 Speed layer
18.4 Query layer
18.5 Summary