The computational work in this chapter involves extensive text
processing, which includes the following types of computations.

Locate and process thousands of files by programmatically finding the
names of the files and reading them into R in a general and
automated manner.

Manipulate strings and apply regular expressions to strings to turn
the unstructured text in a message into structured information for
analysis.

Use cross-validation to select among competing models, e.g., by
varying parameters in fitting a model, and evaluate the selected model
using independent test data. This includes finding the Type I and II
errors incurred when applying different model fits to test data.

Efficiently and accurately compute with vectors containing large
numbers of small values.

Explore the email messages to design variables and develop intuition
for which features may be useful in predicting spam and ham.

Apply classification trees and recursive partitioning.

Use debugging techniques to find and fix problems in our code.

Work with complex data structures, e.g., lists of lists of vectors and data frames.

Write modular functions to process the email files.
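One of the computations above, finding the Type I and II error rates
when applying a classifier to test data, can be sketched in a few
lines of R. The truth and prediction vectors here are invented for
illustration, and we follow the convention of treating spam as the
positive class:

```r
# Invented truth and predictions for six test messages
truth = c("spam", "ham", "spam", "ham", "ham", "spam")
pred  = c("spam", "spam", "ham", "ham", "ham", "spam")

# Type I error: ham misclassified as spam;
# Type II error: spam misclassified as ham
typeI  = sum(pred == "spam" & truth == "ham")  / sum(truth == "ham")
typeII = sum(pred == "ham"  & truth == "spam") / sum(truth == "spam")
c(typeI = typeI, typeII = typeII)
```

With real data, `truth` would come from the known labels of the test
messages and `pred` from the fitted model's classifications.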

In computing the log likelihood ratio (LLR) for a message, we used the
following representation of this quantity to guide how we wrote the
code.
In other words, we first computed, from the observed proportions in our
training set, estimates of the probabilities that each word is present
in spam, present in ham, absent from spam, and absent from ham.
Then, we took logs of these estimated probabilities and combined them to
calculate the LLR for a particular message. That is, we selected which
of these terms to include in the above sum, according to whether
each word in the bag of
words was present in or absent from that message. Given that our bag of
words consists of more than 80,000 words, we want to consider whether
there are faster or more accurate ways to carry out these computations.
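As a concrete illustration, a minimal sketch of this computation might
look like the following. The word probabilities, variable names, and
function name are invented for illustration; they are not the
chapter's actual code:

```r
# Invented present-in-message probabilities for a tiny three-word bag;
# in practice these are estimated from training-set proportions
probPresentSpam = c(free = 0.20, meeting = 0.02, medicine = 0.05)
probPresentHam  = c(free = 0.03, meeting = 0.10, medicine = 0.001)

computeLLR = function(wordsInMsg, pSpam, pHam) {
  present = names(pSpam) %in% wordsInMsg
  # words present in the message contribute
  #   log p(present | spam) - log p(present | ham);
  # words absent contribute
  #   log p(absent | spam) - log p(absent | ham)
  sum(log(pSpam[present]) - log(pHam[present])) +
    sum(log(1 - pSpam[!present]) - log(1 - pHam[!present]))
}

computeLLR(c("free", "medicine"), probPresentSpam, probPresentHam)
```

A positive LLR favors spam, a negative one ham.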

The following are equivalent representations of the log likelihood ratio:
These alternative mathematical expressions each suggest a
different approach to carrying out the computation of the log
odds. We leave it as an exercise to write code for them and
compare the results to our approach.

Why might these various alternatives not give us the same answer? A
computer stores a number in a fixed amount of space, so some numbers
can only be approximated, e.g., irrational numbers and most decimal
fractions. Additionally, the order of
operations can matter. For example, if we have one large number and
many small numbers, then adding up all of the small numbers first and
then adding this total to the large number can produce a more accurate
result than adding the small numbers one at a time to the large
number. Below is an artificial example that makes this point:

smallNums = rep((1/2)^40, 2000000)
largeNum = 10000
print(sum(smallNums), digits = 20)
[1] 1.8189894035458564758e-06

print(largeNum + sum(smallNums), digits = 20)
[1] 10000.000001818989404

for (i in seq_along(smallNums)) {
  largeNum = largeNum + smallNums[i]
}
print(largeNum, digits = 20)
[1] 10000

In our case, we are working with thousands of small numbers, such as
the proportion of spam messages that contain a particular word, so it
may matter quite a bit how we compute the LLR.

This case study ranges from simple access to data directly from the
Internet within R[bib:R], to working with dates, writing fast code,
simulating random processes, and "estimating" optimal parameter values
for an objective function using a grid search.

We will map a mathematical description of a
dynamic process to code. We will focus on writing small functions in a
flexible, reusable manner. We will validate the functions and combine
them to create higher-level functions. We'll use R's[bib:R] simple S3
object-oriented programming model to define classes and methods for
working with the data structures from our simulation of the dynamic
model. We'll explore making the code faster in several different
ways, including writing vectorized code and using compiled *C* code
from within R. Finally, we'll use a simulation study to
investigate the behavior of the dynamic model for different ranges of
the inputs and different outputs.
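As a small illustration of the S3 model mentioned above, a class is
just an attribute attached to an object, and a method is an ordinary
function whose name combines the generic and the class. The class,
field, and object names here are invented, not those used in the
chapter:

```r
# Create an object and tag it with a class
simState = list(time = 0, pop = c(S = 99, I = 1))
class(simState) = "simState"

# A print method for the class, via the generic.class naming convention
print.simState = function(x, ...) {
  cat("simState at time", x$time, "with population:",
      paste(names(x$pop), "=", x$pop), "\n")
  invisible(x)
}

print(simState)   # dispatches to print.simState()
```

Typing `simState` at the prompt would trigger the same dispatch, since
auto-printing calls the `print` generic.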

The main computational topics include:

- Developing small functions that we combine to implement the entire dynamic process.
- Computational efficiency.
- Choice of data structures to simplify and improve computations.
- Comparing loops and vectorized operations and striving for vectorized code.
- Matrix subsetting operations in R.
- Classes and S3 object-oriented programming in R.
- Profiling code to find where bottlenecks occur.
- The *C* programming language and interfacing to *C* code from R.
- Simulation.
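To illustrate the loop-versus-vectorized comparison in the list above,
here is a small sketch; the function names are ours, not the
chapter's:

```r
# Element-wise squaring with an explicit loop
squareLoop = function(x) {
  out = numeric(length(x))
  for (i in seq_along(x)) out[i] = x[i]^2
  out
}

# The same computation, vectorized
squareVec = function(x) x^2

x = runif(1e6)
system.time(squareLoop(x))   # typically much slower
system.time(squareVec(x))
identical(squareLoop(x), squareVec(x))   # same answer either way
```

The timings vary by machine, but the vectorized version is usually
faster by an order of magnitude or more because the looping happens in
compiled code rather than interpreted R.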

In this chapter, we cover the following:

How to break down a complex problem into simple tasks.

Ways to use data to make your functions simpler.

Tips for developing robust functions in R.

Strategies to make sure that our code actually works as we expect.

Reference classes, for creating objects that can be modified in
place.
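As a brief illustration of modifying an object in place with a
reference class, consider the following sketch; the `Counter` class is
invented for illustration:

```r
# A reference class whose methods modify the object's fields in place
Counter = setRefClass("Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function() count <<- count + 1
  ))

ctr = Counter$new(count = 0)
ctr$increment()
ctr$increment()
ctr$count   # 2 -- no reassignment of ctr was needed
```

Unlike ordinary R objects, which are copied on modification, `ctr` is
updated by the method calls themselves.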

To explore the baseball data, we need to gain some experience working
with a relational database. More specifically, we need to accomplish
the following:

- Read the schema for a database to understand how the database is organized.
- Use basic *SELECT* statements, including the *WHERE* clause, to access and retrieve subsets of data stored in a single table.
- Merge and combine information within and across tables with clauses such as *GROUP BY*, *HAVING*, and *LEFT OUTER JOIN*.
- Retrieve data from a query in blocks and further process it in R.
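As a sketch of issuing such queries from R, the following assumes the
DBI and RSQLite packages are available; the table and column names are
invented for illustration, not taken from the baseball database:

```r
library(DBI)

# An in-memory SQLite database with a small invented table
con = dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "batting",
             data.frame(player = c("a", "a", "b"),
                        year = c(2000, 2001, 2000),
                        hits = c(150, 160, 90)))

# A SELECT with WHERE, GROUP BY, and HAVING clauses
res = dbGetQuery(con, "
  SELECT player, SUM(hits) AS totalHits
  FROM batting
  WHERE year >= 2000
  GROUP BY player
  HAVING SUM(hits) > 100")
res

dbDisconnect(con)
```

`dbGetQuery()` returns the result as an R data frame, which we can
then process with ordinary R functions.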

We use R functions to interface with the database, and we compare
how we might perform some or all of the computations in R. This
comparison also helps us understand the pros and cons of working in
one environment over the other.