Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

Simon Munzert

ISBN: 9788126570423

480 pages

INR 839


This book provides a unified framework for web scraping and information extraction from text data with R for the social sciences. It demonstrates how to use scraping tools, text mining, data management, visualization and publication software, and explains how to use R at every step of a research project. Numerous exercises and examples illustrate each technique. R code and solutions to the exercises are available on the book's supporting website.



1 Introduction

1.1 Case study: World Heritage Sites in Danger

1.2 Some remarks on web data quality

1.3 Technologies for disseminating, extracting and storing web data

1.4 Structure of the book


Part One A Primer on Web and Data Technologies


2 HTML

2.1 Browser presentation and source code

2.2 Syntax rules

2.3 Tags and attributes

2.4 Parsing


3 XML and JSON

3.1 A short example XML document

3.2 XML syntax rules

3.3 When is an XML document well-formed or valid?

3.4 XML extensions and technologies

3.5 XML and R in practice

3.6 A short example JSON document

3.7 JSON syntax rules

3.8 JSON and R in practice


4 XPath

4.1 XPath--a query language for web documents

4.2 Identifying node sets with XPath

4.3 Extracting node elements



5 HTTP

5.1 HTTP fundamentals

5.2 Advanced features of HTTP

5.3 Protocols beyond HTTP

5.4 HTTP in action



6 AJAX

6.1 JavaScript

6.2 XHR

6.3 Exploring AJAX with Web Developer Tools


7 SQL and relational databases

7.1 Overview and terminology

7.2 Relational Databases

7.3 SQL: a language to communicate with Databases

7.4 Databases in action


8 Regular expressions and essential string functions

8.1 Regular expressions

8.2 String processing

8.3 A word on character encodings


Part Two A Practical Toolbox for Web Scraping and Text Mining

9 Scraping the Web

9.1 Retrieval scenarios

9.2 Extraction strategies

9.3 Web scraping: Good practice

9.4 Valuable sources of inspiration


10 Statistical text processing

10.1 The running example: Classifying press releases of the British government

10.2 Processing textual data

10.3 Supervised learning techniques

10.4 Unsupervised learning techniques


11 Managing data projects

11.1 Interacting with the file system

11.2 Processing multiple documents/links

11.3 Organizing scraping procedures

11.4 Executing R scripts on a regular basis


Part Three A Bag of Case Studies

12 Collaboration networks in the US Senate

12.1 Information on the bills

12.2 Information on the senators

12.3 Analyzing the network structure

12.4 Conclusion


13 Parsing information from semi-structured documents

13.1 Downloading data from the FTP server

13.2 Parsing semi-structured text data

13.3 Visualizing station and temperature data


14 Predicting the 2014 Academy Awards using Twitter

14.1 Twitter APIs: Overview

14.2 Twitter-based forecast of the 2014 Academy Awards

14.3 Conclusion


15 Mapping the geographic distribution of names

15.1 Developing a data collection strategy

15.2 Website inspection

15.3 Data retrieval and information extraction

15.4 Mapping names

15.5 Automating the process


16 Gathering data on mobile phones

16.1 Page exploration

16.2 Scraping procedure

16.3 Graphical analysis

16.4 Data storage


17 Analyzing sentiments of product reviews

17.1 Introduction

17.2 Collecting the data

17.3 Analyzing the data

17.4 Conclusion



General Index

Package Index

Function Index