# Computational Topics by Chapter

## 1 Predicting Location via Indoor Positioning Systems

• string manipulation

• data structures and representation, including variable length observations

• aggregating data in ragged arrays

• exploratory data analysis and visualization

• modular functions

• debugging

• nearest neighbor methods

• cross-validation for parameter selection
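As a sketch of how the last two topics fit together, the following uses simulated two-dimensional data (not the chapter's signal-strength measurements) to choose the number of neighbors k for a simple nearest neighbor classifier by v-fold cross-validation; all names here are invented for illustration:

```r
# Choose k for a nearest neighbor classifier by 5-fold cross-validation.
# The data are simulated stand-ins, not the chapter's positioning data.
set.seed(1)
n = 200
x = matrix(rnorm(2 * n), ncol = 2)
y = ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "A", "B")

# Predict the class of each row of `test` from its k nearest training points
knnPredict = function(train, trainY, test, k) {
  apply(test, 1, function(p) {
    d = sqrt(rowSums((train - matrix(p, nrow(train), 2, byrow = TRUE))^2))
    names(which.max(table(trainY[order(d)[1:k]])))
  })
}

# Average misclassification rate over v folds for a given k
cvError = function(k, v = 5) {
  folds = sample(rep(1:v, length.out = n))
  mean(sapply(1:v, function(f) {
    pred = knnPredict(x[folds != f, ], y[folds != f],
                      x[folds == f, , drop = FALSE], k)
    mean(pred != y[folds == f])
  }))
}

ks = c(1, 3, 5, 7, 9)
errors = sapply(ks, cvError)
bestK = ks[which.min(errors)]   # the k with the smallest CV error rate
```

The k minimizing the cross-validated error rate is the one we would carry forward for prediction.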

## 2 Modeling Runners' Times in the Cherry Blossom Race

• Use regular expressions to extract and clean messy data from pre-formatted text tables and to create unique identifiers for matching records that belong to the same individual.

• Employ statistical techniques to identify bad data and to confirm these problems have been corrected.

• Visualize data that have a large number of observations (~150,000 records).

• Gain experience with the R formula language for plots and modeling.

• Fit piecewise linear models using least squares and non-parametric curves using local averaging.

• Compare data structures, e.g., a data frame and a list of data frames, for holding and working with longitudinal data. This includes the application of 'apply' functions such as tapply(), mapply(), sapply(), and lapply().

• Develop strategies for debugging code with recover() for browsing active function calls after an error.

• Scrape simple Web pages for text content.
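To give the flavor of the first and sixth bullets, this sketch applies a regular expression to a few invented lines resembling fixed-width race results, then uses tapply() to average run times within age decades:

```r
# Invented lines in the spirit of the race results (not the actual data)
lines = c("Jane Doe      34 1:32:15",
          "John Smith    47 1:45:02",
          "Ann Jones     35 1:29:50")

# Pull out the two-digit age and the trailing h:mm:ss time with regexps
age  = as.numeric(sub("^.* ([0-9]{2}) [0-9:]+$", "\\1", lines))
time = sub("^.* ", "", lines)

# Convert h:mm:ss to minutes
toMinutes = function(t) {
  parts = as.numeric(strsplit(t, ":")[[1]])
  60 * parts[1] + parts[2] + parts[3] / 60
}
minutes = sapply(time, toMinutes, USE.NAMES = FALSE)

decade = 10 * (age %/% 10)              # age group, e.g., 30 for ages 30-39
avgByDecade = tapply(minutes, decade, mean)
```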

## 3 Using Statistics to Identify Spam

The computational work in this chapter involves extensive text processing, which includes the following types of computations.

• Locate and process thousands of files by programmatically finding the names of the files and reading them into R in a general and automated manner.

• Manipulate strings and apply regular expressions to strings to turn the unstructured text in a message into structured information for analysis.

• Use cross-validation to select among competing models, e.g., by varying parameters in fitting a model, and evaluate the selected model using independent test data. This includes finding the Type I and II errors incurred when applying different model fits to test data.

• Efficiently and accurately compute with vectors containing large numbers of small values.

• Explore the email messages to design variables and develop intuition about which features may be useful in predicting spam and ham.

• Use classification trees and recursive partitioning.

• Apply debugging techniques.

• Work with complex data structures, e.g., lists of lists of vectors and data frames.

• Write modular functions to process the email files.

In computing the log likelihood ratio for a message, we used the following representation of this quantity to guide how we wrote the code:

$$\log \frac{P(\text{message} \mid \text{spam})}{P(\text{message} \mid \text{ham})} = \sum_{\text{words present}} \log \frac{P(\text{word} \mid \text{spam})}{P(\text{word} \mid \text{ham})} + \sum_{\text{words absent}} \log \frac{1 - P(\text{word} \mid \text{spam})}{1 - P(\text{word} \mid \text{ham})}$$

In other words, we first computed from the observed proportions in our training set the estimates of $P(\text{word} \mid \text{spam})$ and $P(\text{word} \mid \text{ham})$, along with their complements $1 - P(\text{word} \mid \text{spam})$ and $1 - P(\text{word} \mid \text{ham})$. Then, we took logs of these estimated probabilities and combined them to calculate the LLR for a particular message. That is, we selected which of these terms to include in the above sum according to whether each word in the bag of words was present or absent from that message. Given that our bag of words consists of more than 80,000 words, we want to consider whether there are faster or more accurate ways to carry out these computations.

The following are equivalent representations of the log likelihood ratio, writing $p_i$ and $q_i$ for the estimated probabilities of the $i$-th term under spam and ham, respectively:

$$\sum_i \log \frac{p_i}{q_i} \;=\; \sum_i \log p_i - \sum_i \log q_i \;=\; \log \frac{\prod_i p_i}{\prod_i q_i}$$

These alternative mathematical expressions each suggest a different approach to carrying out the computation of the log odds. We leave it as an exercise to write code for them and compare the results to our approach.

Why might these various alternatives not give us the same answer? A computer is a finite machine, meaning that it has only a fixed amount of space to store a number, so some numbers (e.g., irrational numbers, or fractions such as 1/3) can only be approximated. Additionally, the order of operations can matter. For example, if we have one large number and many small numbers, then adding up all of the small numbers first and then adding this total to the large number can produce a more accurate result than adding the small numbers one at a time to the large number. Below is an artificial example that makes this point:

```
smallNums = rep((1/2)^40, 2000000)
largeNum = 10000

print(sum(smallNums), digits = 20)
# [1] 1.8189894035458564758e-06

print(largeNum + sum(smallNums), digits = 20)
# [1] 10000.000001818989404

for (i in 1:length(smallNums)) {
  largeNum = largeNum + smallNums[i]
}
print(largeNum, digits = 20)
# [1] 10000
```

In our case, we are working with thousands of small numbers, such as the proportion of spam that contains a particular word. It might matter quite a bit how we compute the LLR.
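As a minimal illustration of why the choice matters, the following compares staying on the log scale with multiplying the raw probabilities first; the probabilities are invented stand-ins, not the chapter's actual estimates:

```r
# Two mathematically equivalent formulas for a log likelihood ratio,
# computed from 80,000 invented small probabilities.
set.seed(1)
p = runif(80000, 1e-8, 1e-3)   # stand-ins for P(word | spam)
q = runif(80000, 1e-8, 1e-3)   # stand-ins for P(word | ham)

llrSumOfLogs = sum(log(p)) - sum(log(q))   # stays on the log scale
llrLogOfProd = log(prod(p) / prod(q))      # both products underflow to 0

is.finite(llrSumOfLogs)   # TRUE
is.nan(llrLogOfProd)      # TRUE: 0/0 after underflow
```

Summing logs keeps every intermediate value in a representable range, whereas the product of tens of thousands of small probabilities underflows to zero long before the final division.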

## 4 Processing Robot and Sensor Log Files: Seeking a Circular Target

• Text processing of log files
• Visualization
• Non-linear least squares
• Numerical optimization
• Goodness-of-fit criteria
• Streaming data
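As one possible sketch of the least squares idea, using invented data and R's general-purpose optim() rather than whatever routine the chapter settles on, we can fit a circle to noisy points by minimizing squared distance-to-circle residuals:

```r
# Simulate noisy points on a circle with center (3, -1) and radius 2
set.seed(1)
theta = runif(100, 0, 2 * pi)
x = 3 + 2 * cos(theta) + rnorm(100, sd = 0.05)
y = -1 + 2 * sin(theta) + rnorm(100, sd = 0.05)

# Residual sum of squares for center (a, b) and radius r
circleRSS = function(par) {
  d = sqrt((x - par[1])^2 + (y - par[2])^2)
  sum((d - par[3])^2)
}

# Start from the centroid with a guessed radius; Nelder-Mead refines it
fit = optim(c(mean(x), mean(y), 1), circleRSS)
fit$par   # approximately (3, -1, 2)
```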

## 5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays

• Big Data strategies.
• Shell commands and pipes.
• Relational databases.
• Parallel computing.
• External data representation.
• The split-apply-combine approach.
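The split-apply-combine approach can be sketched in a few lines of R; the data frame below is a tiny invented stand-in for the real 12-gigabyte flight data, which would instead be processed in chunks or inside a database:

```r
# A tiny stand-in for the flight-delay data
flights = data.frame(
  carrier = c("AA", "AA", "UA", "UA", "WN"),
  delay   = c(12, -3, 45, 7, 0)
)

# Split by carrier, apply mean() to each piece, combine into a named vector
byCarrier = split(flights$delay, flights$carrier)
meanDelay = sapply(byCarrier, mean)
meanDelay   # AA = 4.5, UA = 26, WN = 0
```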

## 6 Pairs Trading

This case study ranges from accessing data directly from the Internet within R, to working with dates, writing fast code, simulating random processes, and "estimating" optimal parameter values for an objective function using a grid search.

## 7 Simulation Study of a Branching Process

• The Monte Carlo method, simulation, and random number generation using R's built-in probability distribution generators.
• Loops and recursion.
• Modular code and designing code in small testable steps.
• Debugging code via exploratory data analysis of the results.
• Data structures and deciding what representation to use for simulation results.
• Efficiency and profiling code.
• Three-dimensional visualizations and creating custom visualizations of simulation results.
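A minimal sketch of such a simulation, assuming a Poisson offspring distribution (the chapter's actual offspring and lifetime distributions may differ):

```r
# Simulate generation sizes of a branching process in which each
# individual independently has a Poisson(lambda) number of offspring.
set.seed(1)
branchSizes = function(lambda, maxGen = 20) {
  sizes = numeric(maxGen)
  sizes[1] = 1                       # start from a single ancestor
  for (g in 2:maxGen) {
    if (sizes[g - 1] == 0) break     # the process has died out
    sizes[g] = sum(rpois(sizes[g - 1], lambda))
  }
  sizes
}

# With lambda < 1 the process is subcritical and dies out with probability 1
extinct = replicate(1000, { s = branchSizes(0.8); s[length(s)] == 0 })
mean(extinct)   # typically very close to 1
```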

## 8 A Self-Organizing Dynamic System with a Phase Transition

We will map a mathematical description of a dynamic process to code. We will focus on writing small functions in a flexible, reusable manner. We will validate the functions and combine them to create higher-level functions. We'll use R's simple S3 object-oriented programming model to define classes and methods for working with the data structures from our simulation of the dynamic model. We'll explore making the code faster in several different ways, including writing vectorized code and using compiled C code from within R. Finally, we'll use a simulation study to investigate the behavior of the dynamic model for different ranges of the inputs and different outputs.

• Developing small functions that we combine together to implement the entire dynamic process.
• Computational efficiency.
• Choice of data structures to simplify and improve computations.
• Comparing loops and vectorized operations and striving for vectorized code.
• Matrix subsetting operations in R.
• Classes and S3 object-oriented programming in R.
• Profiling code to find where bottlenecks occur.
• The C programming language and interfacing to C code from R.
• Simulation.
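To illustrate the loop-versus-vectorized comparison, here is an invented neighbor lookup written both ways (this is not the chapter's actual update rule); the vectorized version uses a single matrix subsetting operation to shift every column at once:

```r
# Read each cell's right-hand neighbor (wrapping around), two ways.
set.seed(1)
grid = matrix(rbinom(100, 1, 0.5), 10, 10)

# Loop version: visit every cell individually
loopShift = matrix(0, 10, 10)
for (i in 1:10)
  for (j in 1:10)
    loopShift[i, j] = grid[i, if (j == 10) 1 else j + 1]

# Vectorized version: one column-subsetting operation shifts the whole grid
vecShift = grid[, c(2:10, 1)]

all(loopShift == vecShift)   # TRUE
```

On a 10-by-10 grid the difference is negligible, but over many cells and many simulation steps the vectorized form is dramatically faster in R.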

## 9 Simulating Blackjack

In this chapter, we cover the following:

• How to break down a complex problem into simple tasks.

• Ways to use data to make your functions simpler.

• Tips for developing robust functions in R.

• Strategies to make sure that our code actually works the way we expect it to work.

• Reference classes, for creating objects that can be modified in place.

## 10 Baseball: Exploring Data in a Relational Database

To explore the baseball data, we need to gain some experience working with a relational database. More specifically, we need to accomplish the following:

• Read the schema for a database to understand how the database is organized.
• Use basic SELECT statements, including the WHERE clause, to access and retrieve subsets of data stored in a single table.
• Merge and combine information within and across tables with clauses such as GROUP BY, HAVING, and LEFT OUTER JOIN.
• Retrieve data from a query in blocks and further process it in R.

We use R functions to interface with the database, and we compare how we might perform some or all of the computations in R. This comparison also helps us understand the pros and cons of working in one environment over the other.
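As a small sketch of this comparison, using a tiny invented table rather than the actual baseball database, a GROUP BY ... HAVING query has a base-R counterpart built from aggregate() and subsetting:

```r
# A tiny invented stand-in for a batting table
batting = data.frame(
  playerID = c("a", "a", "b", "b", "c"),
  HR       = c(10, 20, 5, 7, 40)
)

# SQL: SELECT playerID, SUM(HR) AS HR FROM batting
#      GROUP BY playerID HAVING SUM(HR) > 12;
totals = aggregate(HR ~ playerID, data = batting, sum)   # the GROUP BY
result = totals[totals$HR > 12, ]                        # the HAVING
result$playerID   # "a" "c" -- player "b" (12 HR) fails the HAVING condition
```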

## 11 CIA Factbook Mashup

• Data manipulation: extract data from different formats, merge data from multiple sources, and transform variables.

• Programmatic access to data available on the Web: we write code that uses HTTP to acquire the data.

• Programmatic data retrieval to create a reproducible record of the data acquisition process.

• Visualization: maps, color, plotting symbols, labels, and legends.

• Interactive visualization on Google Earth, including customized placemarks, pop-up windows, tool-tips, and legends.

• Modularity: design functions to create our visualization in pieces that correspond to different subtasks of the larger task of making a map, which facilitates testing and enables re-usability for different purposes.

• Tree structures in XML and XPath expressions to locate content.

## 12 Exploring Data Science Jobs with Web Scraping and Text Mining

• Scraping data from HTML pages.
• The basics of HTTP/Web requests.
• Using the XPath language and queries.
• Manipulating XML/HTML elements in R.
• Text processing, regular expressions, stop words, and word stemming.
• Reusable functions, modular software, and designing software for extensibility.
• Simple exploratory data analysis and visualization.
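A minimal sketch of the text processing step, using an invented posting and a tiny stop word list (a real analysis would use a much larger list and a stemming algorithm):

```r
# Lower-case, strip punctuation, tokenize, and drop stop words
post = "We seek a Data Scientist with experience in R and with Python."
words = strsplit(tolower(gsub("[[:punct:]]", "", post)), "[[:space:]]+")[[1]]
stopWords = c("we", "a", "with", "in", "and")
keep = words[!words %in% stopWords]
keep
# "seek" "data" "scientist" "experience" "r" "python"
```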