Computational Topics by Chapter

1 Predicting Location via Indoor Positioning Systems

2 Modeling Runners' Times in the Cherry Blossom Race

3 Using Statistics to Identify Spam

The computational work in this chapter involves extensive text processing, which includes the following types of computations.

In computing the log likelihood ratio for a message, we used the following representation of this quantity to guide how we wrote the code In other words, we first computed from the observed proportions in our training set the estimates to , , , and . Then, we took logs of these estimated probabilities and combined them to calculate the LLR for a particular message. That is, we selected which of these terms to include in the above sum, according to whether each word in the bag of words was present or absent from that message. Given our bag of words consists of more than 80,000 words, we want to consider whether there are faster or more accurate ways to carry out these computations.

The following are equivalent representations of the log likelihood ratio: These alternative mathematical expressions each suggest a different approach to carrying out the computation of the log odds. We leave it as an exercise to write code for them and compare the results to our approach.

Why might these various alternatives not give us the same answer? A computer is a finite state machine, meaning that it has only a fixed amount of space to store a number so some numbers can only be approximated, e.g., irrational numbers. Additionally, the order of operations can matter. For example, if we have one large number and many small numbers, then adding up all of the small numbers first and then adding this total to the large number can produce a more accurate result than adding the small numbers one at a time to the large number. Below is an artificial example that makes this point:

smallNums = rep((1/2)^40, 2000000)
largeNum = 10000

print(sum(smallNums), digits = 20)
[1] 1.8189894035458564758e-06
print(largeNum + sum(smallNums), digits = 20)
[1] 10000.000001818989404
for (i in 1:length(smallNums)) { largeNum = largeNum + smallNums[i] } print(largeNum, digits = 20)
[1] 10000

In our case, we are working with thousands of small numbers such as the proportion of spam that contains a particular work. It might matter quite a bit how we compute the LLR.

4 Processing Robot and Sensor Log Files: Seeking a Circular Target

5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays

6 Pairs Trading

This case study ranges from simple access to data directly from the Internet within R[bib:R], to working with dates, writing fast code, simulating random processes, and "estimating" optimal parameter values for an objective function using a grid search. These include:

7 Simulation Study of a Branching Process

8 A Self-Organizing Dynamic System with a Phase Transition

We will map a mathematical description of a dynamic process to code. We will focus on writing small functions in a flexible, reusable manner. We will validate the functions and combine them to create higher-level functions. We'll use R's[bib:R] simple S3 object-oriented programming model to define classes and methods for working with the data structures from our simulation of the dynamic model. We'll explore making the code faster in several different ways, including writing vectorized code and using compiled C code from within R. Finally, we'll use a simulation study to investigate the behavior of the dynamic model for different ranges of the inputs and different outputs.

9 Simulating Blackjack

In this chapter, we cover the following:

10 Baseball: Exploring Data in a Relational Database

To explore the baseball data, we need to gain some experience working with a relational database. More specifically, we need to accomplish the following:

We use R functions to interface with the database, and we compare how we might perform some or all of the computations in R. This comparison also helps us understand the pros and cons of working in one environment over the other.

11 CIA Factbook Mashup

12 Exploring Data Science Jobs with Web Scraping and Text Mining