# The Data Science Handbook

ISBN: 9788126573332

416 pages

## Description

Finding a good data scientist has been likened to hunting for a unicorn. The required combination of software engineering skills, mathematical fluency, and business savvy is simply very hard to find in one person. On top of that, good data science is not just rote application of trainable skillsets, but rather requires the ability to think critically in all these areas. This book provides a crash course in data science, combining all the necessary skills into a unified discipline. The author describes the classic machine learning algorithms, including the mathematics needed to understand what's really going on. Classical statistics is taught so that readers learn to think critically about the interpretation of data and its common pitfalls. In addition, the basic software engineering and computer science skillsets that data scientists often lack are given a central place in the book. Visualization tools are reviewed, and their central importance in data science is highlighted.

Preface

1 Introduction: Becoming a Unicorn

1.1 Aren't Data Scientists Just Overpaid Statisticians?

1.2 How Is This Book Organized?

1.3 How to Use This Book?

1.4 Why Is It All in Python, Anyway?

1.5 Example Code and Datasets

1.6 Parting Words

Part I The Stuff You'll Always Use

2 The Data Science Road Map

2.1 Frame the Problem

2.2 Understand the Data: Basic Questions

2.3 Understand the Data: Data Wrangling

2.4 Understand the Data: Exploratory Analysis

2.5 Extract Features

2.6 Model

2.7 Present Results

2.8 Deploy Code

2.9 Iterating

2.10 Glossary

3 Programming Languages

3.1 Why Use a Programming Language? What Are the Other Options?

3.2 A Survey of Programming Languages for Data Science

3.3 Python Crash Course

3.4 Strings

3.5 Defining Functions

3.6 Python's Technical Libraries

3.7 Other Python Resources

3.8 Further Reading

3.9 Glossary

3a Interlude: My Personal Toolkit

4 Data Munging: String Manipulation, Regular Expressions and Data Cleaning

4.1 The Worst Dataset in the World

4.2 How to Identify Pathologies

4.3 Problems with Data Content

4.4 Formatting Issues

4.5 Example Formatting Script

4.6 Regular Expressions

4.7 Life in the Trenches

4.8 Glossary

5 Visualizations and Simple Metrics

5.1 A Note on Python's Visualization Tools

5.2 Example Code

5.3 Pie Charts

5.4 Bar Charts

5.5 Histograms

5.6 Means, Standard Deviations, Medians and Quantiles

5.7 Boxplots

5.8 Scatterplots

5.9 Scatterplots with Logarithmic Axes

5.10 Scatter Matrices

5.11 Heatmaps

5.12 Correlations

5.13 Anscombe's Quartet and the Limits of Numbers

5.14 Time Series

5.15 Further Reading

5.16 Glossary

6 Machine Learning Overview

6.1 Historical Context

6.2 Supervised versus Unsupervised

6.3 Training Data, Testing Data and the Great Boogeyman of Overfitting

6.4 Further Reading

6.5 Glossary

7 Interlude: Feature Extraction Ideas

7.1 Standard Features

7.2 Features That Involve Grouping

7.3 Preview of More Sophisticated Features

7.4 Defining the Feature You Want to Predict

8 Machine Learning Classification

8.1 What Is a Classifier and What Can You Do with It?

8.2 A Few Practical Concerns

8.3 Binary versus Multiclass

8.4 Example Script

8.5 Specific Classifiers

8.6 Evaluating Classifiers

8.7 Selecting Classification Cutoffs

8.8 Further Reading

8.9 Glossary

9 Technical Communication and Documentation

9.1 Several Guiding Principles

9.2 Slide Decks

9.3 Written Reports

9.4 Speaking: What Has Worked for Me

9.5 Code Documentation

9.6 Further Reading

9.7 Glossary

Part II Stuff You Still Need to Know

10 Unsupervised Learning: Clustering and Dimensionality Reduction

10.1 The Curse of Dimensionality

10.2 Example: Eigenfaces for Dimensionality Reduction

10.3 Principal Component Analysis and Factor Analysis

10.4 Scree Plots and Understanding Dimensionality

10.5 Factor Analysis

10.6 Limitations of PCA

10.7 Clustering

10.8 Further Reading

10.9 Glossary

11 Regression

11.1 Example: Predicting Diabetes Progression

11.2 Least Squares

11.3 Fitting Nonlinear Curves

11.4 Goodness of Fit: R² and Correlation

11.5 Correlation of Residuals

11.6 Linear Regression

11.7 LASSO Regression and Feature Selection

11.8 Further Reading

11.9 Glossary

12 Data Encodings and File Formats

12.1 Typical File Format Categories

12.2 CSV Files

12.3 JSON Files

12.4 XML Files

12.5 HTML Files

12.6 Tar Files

12.7 GZip Files

12.8 Zip Files

12.9 Image Files: Rasterized, Vectorized, and/or Compressed

12.10 It's All Bytes at the End of the Day

12.11 Integers

12.12 Floats

12.13 Text Data

12.14 Further Reading

12.15 Glossary

13 Big Data

13.1 What Is Big Data?

13.2 Hadoop: The File System and the Processor

13.3 Using HDFS

13.4 Example PySpark Script

13.5 Spark Overview

13.6 Spark Operations

13.7 Two Ways to Run PySpark

13.8 Configuring Spark

13.9 Under the Hood

13.10 Spark Tips and Gotchas

13.11 The MapReduce Paradigm

13.12 Performance Considerations

13.13 Further Reading

13.14 Glossary

14 Databases

14.1 Relational Databases and MySQL

14.2 Key-Value Stores

14.3 Wide Column Stores

14.4 Document Stores

14.5 Further Reading

14.6 Glossary

15 Software Engineering Best Practices

15.1 Coding Style

15.2 Version Control and Git for Data Scientists

15.3 Testing Code

15.4 Test-Driven Development

15.5 AGILE Methodology

15.6 Further Reading

15.7 Glossary

16 Natural Language Processing

16.1 Do I Even Need NLP?

16.2 The Great Divide: Language versus Statistics

16.3 Example: Sentiment Analysis on Stock Market Articles

16.4 Software and Datasets

16.5 Tokenization

16.6 Central Concept: Bag of Words

16.7 Word Weighting: TFIDF

16.8 nGrams

16.9 Stop Words

16.10 Lemmatization and Stemming

16.11 Synonyms

16.12 Part of Speech Tagging

16.13 Common Problems

16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding

16.15 Further Reading

16.16 Glossary

17 Time Series Analysis

17.1 Example: Predicting Wikipedia Page Views

17.2 A Typical Workflow

17.3 Time Series versus Time-Stamped Events

17.4 Resampling and Interpolation

17.5 Smoothing Signals

17.6 Logarithms and Other Transformations

17.7 Trends and Periodicity

17.8 Windowing

17.9 Brainstorming Simple Features

17.10 Better Features: Time Series as Vectors

17.11 Fourier Analysis: Sometimes a Magic Bullet

17.12 Time Series in Context: The Whole Suite of Features

17.13 Further Reading

17.14 Glossary

18 Probability

18.1 Flipping Coins: Bernoulli Random Variables

18.2 Throwing Darts: Uniform Random Variables

18.3 The Uniform Distribution and Pseudorandom Numbers

18.4 Nondiscrete, Noncontinuous Random Variables

18.5 Notation, Expectations, and Standard Deviation

18.6 Dependence, Marginal and Conditional Probability

18.7 Understanding the Tails

18.8 Binomial Distribution

18.9 Poisson Distribution

18.10 Normal Distribution

18.11 Multivariate Gaussian

18.12 Exponential Distribution

18.13 Log-Normal Distribution

18.14 Entropy

18.15 Further Reading

18.16 Glossary

19 Statistics

19.1 Statistics in Perspective

19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies

19.3 Hypothesis Testing: Key Idea and Example

19.4 Multiple Hypothesis Testing

19.5 Parameter Estimation

19.6 Hypothesis Testing: t-Test

19.7 Confidence Intervals

19.8 Bayesian Statistics

19.9 Naive Bayesian Statistics

19.10 Bayesian Networks

19.11 Choosing Priors: Maximum Entropy or Domain Knowledge

19.12 Further Reading

19.13 Glossary

20 Programming Language Concepts

20.1 Programming Paradigms

20.2 Compilation and Interpretation

20.3 Type Systems

20.4 Further Reading

20.5 Glossary

21 Performance and Computer Memory

21.1 Example Script

21.2 Algorithm Performance and Big O Notation

21.3 Some Classic Problems: Sorting a List and Binary Search

21.4 Amortized Performance and Average Performance

21.5 Two Principles: Reducing Overhead and Managing Memory

21.6 Performance Tip: Use Numerical Libraries When Applicable

21.7 Performance Tip: Delete Large Structures You Don't Need

21.8 Performance Tip: Use Built In Functions When Possible

21.9 Performance Tip: Avoid Superfluous Function Calls

21.10 Performance Tip: Avoid Creating Large New Objects

21.11 Further Reading

21.12 Glossary

Part III Specialized or Advanced Topics

22 Computer Memory and Data Structures

22.1 Virtual Memory, the Stack, and the Heap

22.2 Example C Program

22.3 Data Types and Arrays in Memory

22.4 Structs

22.5 Pointers, the Stack, and the Heap

22.6 Key Data Structures

22.7 Further Reading

22.8 Glossary

23 Maximum Likelihood Estimation and Optimization

23.1 Maximum Likelihood Estimation

23.2 A Simple Example: Fitting a Line

23.3 Another Example: Logistic Regression

23.4 Optimization

23.5 Gradient Descent and Convex Optimization

23.6 Convex Optimization

23.7 Stochastic Gradient Descent

23.8 Further Reading

23.9 Glossary

24 Advanced Classifiers

24.1 A Note on Libraries

24.2 Basic Deep Learning

24.3 Convolutional Neural Networks

24.4 Different Types of Layers. What the Heck Is a Tensor?

24.5 Example: The MNIST Handwriting Dataset

24.6 Recurrent Neural Networks

24.7 Bayesian Networks

24.8 Training and Prediction

24.9 Markov Chain Monte Carlo

24.10 PyMC Example

24.11 Further Reading

24.12 Glossary

25 Stochastic Modeling

25.1 Markov Chains

25.2 Two Kinds of Markov Chain, Two Kinds of Questions

25.3 Markov Chain Monte Carlo

25.4 Hidden Markov Models and the Viterbi Algorithm

25.5 The Viterbi Algorithm

25.6 Random Walks

25.7 Brownian Motion

25.8 ARIMA Models

25.9 Continuous Time Markov Processes

25.10 Poisson Processes

25.11 Further Reading

25.12 Glossary

25a Parting Words: Your Future as a Data Scientist

Index