We hand-selected the email messages belonging to the sample set in sampleEmail. Instead of this approach, use the sample() function to choose messages at random for the sample. Be sure to take files from all five directories of email. Read these files into R and test several functions with these new messages, e.g., getBoundary() and dropAttach() from the section called “Removing Attachments from the Message Body”, to make sure that they work properly.
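For instance, the random sample might be drawn as follows (a minimal sketch, assuming spamPath points to the top of the SpamAssassin corpus and the five directories sit under a messages subdirectory, as in the chapter):

    # list the five directories of email
    dirNames = list.files(file.path(spamPath, "messages"), full.names = TRUE)

    # draw, say, 10 messages at random from each directory and read them in
    sampleMsgs = lapply(dirNames, function(dir) {
      files = list.files(dir, full.names = TRUE)
      chosen = sample(files, size = min(10, length(files)))
      lapply(chosen, readLines)
    })

The messages in sampleMsgs can then be passed to getBoundary() and dropAttach() to check their behavior.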
In the text mining approach to detecting spam, we ignored all attachments when creating the set of words belonging to a message (see the section called “Removing Attachments from the Message Body”). Write a function to extract words from any plain text or HTML attachment and include these words in the set of a message's words. Try to reuse the findMsg() function and modify the dropAttach() function to accept an additional parameter that indicates whether or not the words in attachments are to be extracted. Does this change improve the classification?
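One possible shape for the revised function is sketched below; only the signature and the extra branch are shown, and extractWords is an invented name for the new parameter:

    dropAttach = function(body, boundary, extractWords = FALSE) {
      # ... separate the attachments from the first part of the body,
      #     as in the chapter's version ...
      newBody = body   # placeholder for the chapter's result
      if (extractWords) {
        # for plain text or HTML attachments, strip any markup and
        # append the remaining text to the body so its words are
        # picked up downstream, e.g.
        #   newBody = c(newBody, gsub("<[^>]+>", " ", attachLines))
      }
      newBody
    }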
The string manipulation functions in R can be used instead of regular expression functions for finding, changing, and extracting substrings from strings. These functions include: strsplit() to divide a string into pieces, substr() to extract a portion of a string, paste() to glue together multiple strings, and nchar(), which returns the number of characters in a string. Write your own version of getBoundary() (see the section called “Removing Attachments from the Message Body”) using these functions to extract the boundary string from the Content-Type header. Debug your function with the messages in sampleEmail.
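A regex-free version might look like the following sketch, which uses strsplit() with fixed = TRUE so that no pattern matching is involved (contentType is assumed to hold the value of the Content-Type header):

    getBoundaryAlt = function(contentType) {
      # everything we need follows the literal text 'boundary='
      parts = strsplit(contentType, "boundary=", fixed = TRUE)[[1]]
      if (length(parts) < 2) return(character(0))  # no boundary present

      rest = parts[2]
      # drop a leading double quote, if there is one
      if (nchar(rest) > 0 && substr(rest, 1, 1) == "\"")
        rest = substr(rest, 2, nchar(rest))
      # the boundary ends at the next quote or semicolon, if any
      rest = strsplit(rest, "\"", fixed = TRUE)[[1]][1]
      strsplit(rest, ";", fixed = TRUE)[[1]][1]
    }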
Write the dropAttach() function for the section called “Removing Attachments from the Message Body”. This function has two inputs, the body of a message and the boundary string that marks the location of the attachments. It returns the body without its attachments. Include in the return value the lines of the body that follow the first boundary string up to the string marking the first attachment, and the lines following the ending boundary string. Be sure to consider the idiosyncratic cases of no attachments and a missing ending boundary string.
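A sketch under the chapter's simplifying assumptions (body is a character vector of lines, boundary lines appear on lines of their own as --boundary and --boundary--, and the first part of the body is non-empty):

    dropAttach = function(body, boundary) {
      bString = paste0("--", boundary)
      bStringLocs = which(body == bString)

      # idiosyncratic case: fewer than two boundary lines, so there
      # is no attachment to drop
      if (length(bStringLocs) <= 1) return(body)

      eString = paste0("--", boundary, "--")
      eStringLoc = which(body == eString)[1]

      # idiosyncratic case: no closing boundary string
      if (is.na(eStringLoc))
        return(body[(bStringLocs[1] + 1):(bStringLocs[2] - 1)])

      # keep the first part, plus any lines after the closing boundary
      keep = (bStringLocs[1] + 1):(bStringLocs[2] - 1)
      n = length(body)
      if (eStringLoc < n) keep = c(keep, (eStringLoc + 1):n)
      body[keep]
    }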
Write the function findMsgWords() of the section called “Extracting Words from a Message Body”. This function takes as input the message body (with no attachments), and the return value is a vector of the unique words in the message. That is, we only track which words are in the message, not the number of times they appear there. This function should eliminate punctuation, digits, and blanks from the message. Consider whether it is simpler to split the string by blanks first and then process the punctuation, digits, etc. The function should convert capital letters to lower case and drop all stop words and words that are only one letter long. A vector of stop words is available in the tm package. Use the stopwords() function in this package to create the vector.
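A sketch of one possible implementation, substituting away punctuation and digits before splitting on blanks (the tm package supplies the stop words):

    library(tm)

    findMsgWords = function(msgBody, stopWords = stopwords("en")) {
      if (length(msgBody) == 0) return(character(0))

      text = tolower(paste(msgBody, collapse = " "))
      # replace punctuation and digits with blanks, then split on blanks
      text = gsub("[[:punct:][:digit:]]+", " ", text)
      words = strsplit(text, "[[:blank:]]+")[[1]]

      # drop one-letter words and stop words; keep each word only once
      words = words[nchar(words) > 1]
      unique(words[!(words %in% stopWords)])
    }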
Try to improve the text cleaning in findMsgWords() of the section called “Extracting Words from a Message Body” by stemming the words in the messages. That is, make plural words singular and reduce present and past tenses to their root words; e.g., run, ran, runs, and running all have the same “stem”. To do this, use the stemming functions available in the text mining package tm. Incorporate this stemming process into the findMsgWords() function. Then recreate the vectors of words for all the email and see if the classification improves.
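The stemming step can be bolted onto the function above with stemDocument() from tm (which relies on the SnowballC package); a minimal sketch:

    library(tm)

    findMsgWordsStemmed = function(msgBody, stopWords = stopwords("en")) {
      words = findMsgWords(msgBody, stopWords)
      # map each word to its stem, e.g. runs, running -> run, and
      # re-apply unique() since distinct words can share a stem
      unique(stemDocument(words))
    }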
Consider the treatment of URLs in the text cleaning in findMsgWords() of the section called “Extracting Words from a Message Body”. Notice that this function often turns a URL into gibberish. Should we drop URLs altogether from the messages, or should we try to keep each URL as one whole “word”? Why might these alternatives be better or worse than the approach taken in the section called “Extracting Words from a Message Body”? Try one of these alternatives and compare it to the approach of that section to see if it improves the classification.
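For the keep-the-URL alternative, one approach is to pull URLs out as whole tokens before the general cleaning shreds them; a rough sketch (the http(s) pattern is deliberately simple):

    extractURLs = function(msgBody) {
      text = paste(msgBody, collapse = " ")
      m = gregexpr("https?://[^[:space:]]+", text)
      unique(unlist(regmatches(text, m)))
    }

These tokens can then be appended to the cleaned word vector, or simply dropped, depending on which treatment you are testing.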
In the section called “Computational Considerations” we saw a few alternative mathematical expressions for the naïve Bayes approximation to the log likelihood ratio of the chance that a message is spam or ham. Each suggests a different approach to carrying out the computation. Create alternative versions of computeFreqs() and computeMsgOdds() in the section called “Implementing the Naïve Bayes Classifier” to calculate the log odds. Compare the accuracy of these alternatives to the approach used in the section called “Implementing the Naïve Bayes Classifier”. Also consider timing the computation with a call to system.time() to determine if one approach is much faster or slower than the others.
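For example, a timing comparison might look like the following, where computeMsgOddsAlt() stands in for your alternative implementation and the argument names are placeholders:

    # time the original and an alternative version on the same inputs
    t1 = system.time(odds1 <- computeMsgOdds(msgWordsList, probTable))
    t2 = system.time(odds2 <- computeMsgOddsAlt(msgWordsList, probTable))
    t1["elapsed"]
    t2["elapsed"]

    # the alternatives should agree numerically
    all.equal(odds1, odds2)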
The function computeFreqs() in the section called “Implementing the Naïve Bayes Classifier” uses as a default the bag of words constructed from the words passed to it in the argument wordsList. However, it is possible to supply a different bag of words via the bow parameter. When this happens, it may be the case that some words in wordsList are not found in the bag of words. This causes an error when running the function. Determine which lines of code in computeFreqs() are problematic and update the function to handle this situation. Be sure to test your code.
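The exact fix depends on how computeFreqs() indexes its table, but the defensive step has this general shape (a sketch; wordsList and bow are the names from the exercise):

    # drop words that do not appear in the supplied bag of words, so
    # that later indexing by word cannot fail
    wordsList = lapply(wordsList, function(words) words[words %in% bow])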
Use the typeIErrorRates() function in the section called “Classifying New Messages” as a guide to write a function that computes the Type II error rates, i.e., the proportion of spam messages that are misclassified as ham. As with the Type I error rates, convince yourself that the error rate is monotonic in τ, that it changes only at the values of the LLR in the provided messages, and that you only need to consider these values for the spam messages. This function, called typeIIErrorRates(), has the same inputs as the typeIErrorRates() function and returns the same structure. The only difference is that the rates returned are based on Type II errors.
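A sketch of one possible implementation, mirroring the sorting idea behind typeIErrorRates() (the input names llrVals and isSpam are assumptions about that function's signature):

    typeIIErrorRates = function(llrVals, isSpam) {
      o = order(llrVals)
      llrVals = llrVals[o]
      isSpam = isSpam[o]

      # Type II errors come from spam messages whose LLR falls at or
      # below the threshold tau
      spamIdx = which(isSpam)
      N = length(spamIdx)

      # when tau equals the i-th smallest spam LLR, i of the N spam
      # messages are classified as ham
      list(error = (1:N)/N, values = llrVals[spamIdx])
    }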
In the section called “Processing the Header”, we used the read.dcf() function to read the key:value data in the email headers. In this exercise, we use regular expressions to extract the keys and their values from the header. The first step in the process is to find the continuation lines for a value and collapse them with the first line of the value. These continuation lines start with either a blank space or a tab character. Use regular expressions to locate them, and then paste them to the first line of the value. Next, break the revised set of key:value strings into the keys and values. Again, use regular expressions to do this. Then create a named vector from these keys and values.
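A sketch of these two steps, assuming header is a character vector holding the lines of one message's header:

    processHeaderRE = function(header) {
      # continuation lines begin with a blank or a tab
      contIdx = grep("^[ \t]", header)
      # walk backwards so runs of continuation lines collapse correctly
      for (i in rev(contIdx))
        header[i - 1] = paste(header[i - 1], sub("^[ \t]+", "", header[i]))
      if (length(contIdx) > 0) header = header[-contIdx]

      # split each line into key and value at the first colon
      keys = sub(":.*$", "", header)
      vals = sub("^[^:]*:[ \t]*", "", header)
      names(vals) = keys
      vals
    }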
Write the processAttach() function for the section called “Processing Attachments”. This function has two inputs, the body of a message and the Content-Type value. It returns a list of two elements. The first is called body, and it contains the message body without its attachments. Be sure to consider the idiosyncratic cases of no attachments and a missing ending boundary string. The second element is a data frame called attachDF. This data frame has one row for each attachment in the message. There are two variables in this data frame, aLen and aType, which hold the number of lines and the MIME type of each attachment, respectively.
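A compact sketch, reusing getBoundary() from the chapter; the way the MIME type is read off a Content-Type line inside each attachment is an assumption about how attachment headers look:

    processAttach = function(body, contentType) {
      boundary = getBoundary(contentType)
      bLocs = which(body == paste0("--", boundary))
      eLoc = which(body == paste0("--", boundary, "--"))[1]
      n = length(body)

      # idiosyncratic case: no attachments at all
      if (length(bLocs) <= 1)
        return(list(body = body,
                    attachDF = data.frame(aLen = integer(0),
                                          aType = character(0))))
      # idiosyncratic case: missing ending boundary string
      if (is.na(eLoc)) eLoc = n + 1

      # attachments run from the second boundary line to the closing one
      starts = bLocs[-1]
      ends = c(bLocs[-(1:2)], eLoc) - 1
      aType = sapply(seq_along(starts), function(i) {
        ct = grep("^Content-Type", body[starts[i]:ends[i]], value = TRUE)
        if (length(ct) > 0) sub("^Content-Type: *([^;]*).*", "\\1", ct[1])
        else NA_character_
      })

      newBody = body[(bLocs[1] + 1):(bLocs[2] - 1)]
      if (eLoc < n) newBody = c(newBody, body[(eLoc + 1):n])
      list(body = newBody,
           attachDF = data.frame(aLen = ends - starts, aType = aType,
                                 stringsAsFactors = FALSE))
    }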
Write a function to handle an alternative approach to measuring yelling: count the number of lines in the email message text that consist entirely of capital letters. Carefully consider the case where the message body is empty. How would you modify your code to report a percentage rather than a count? In considering this modification, be sure to make clear how you handle empty lines, lines with no alphabetic characters, and messages with no text.
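A sketch of both versions, where msgBody is assumed to be a character vector of lines:

    # a "yelling" line has at least one letter and no lower-case letters
    countCapLines = function(msgBody) {
      if (length(msgBody) == 0) return(0)   # empty body: nothing to count
      sum(grepl("[[:alpha:]]", msgBody) & !grepl("[a-z]", msgBody))
    }

    # the percentage version must pick a denominator; here we count
    # only lines containing at least one alphabetic character
    percentCapLines = function(msgBody) {
      hasAlpha = grepl("[[:alpha:]]", msgBody)
      if (!any(hasAlpha)) return(0)         # no text at all
      100 * sum(hasAlpha & !grepl("[a-z]", msgBody)) / sum(hasAlpha)
    }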
Write alternative implementations for the isRe() function in the section called “Deriving Variables from the Email Message”. For one alternative, instead of only checking whether the subject of a message begins with Re:, look also for Fwd: Re:. For a second alternative, check for Re: anywhere in the subject, not just at the beginning of the string. Analyze the output from these three functions, including the original isRe() function in the section called “Deriving Variables from the Email Message”. How many more messages have a return value of TRUE for these alternatives, and are they all ham? Which do you think is most useful in predicting spam?
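Sketches of the three variants (the exact pattern used by the chapter's isRe() may differ slightly); each takes a character vector of subject lines and returns a logical vector:

    isRe         = function(subject) grepl("^Re:", subject)
    isReOrFwdRe  = function(subject) grepl("^(Re|Fwd: Re):", subject)
    isReAnywhere = function(subject) grepl("Re:", subject)

Comparing sum()s of their return values over all the subject lines shows how many additional messages each variant flags.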
Choose several of the ideas listed in Table 3.1, “Variable Definition Table” for deriving features from the email
and write functions for them. Be sure to check your code against what
you expect. Try writing one of these functions in two different ways
and compare the output from each. Use exploratory data analysis
techniques to check that your code works as expected. Does the output
of your function match the values in the corresponding columns of
emailDF? If not, why do you think this might be the
case? Does it appear that this derived variable will be useful in
identifying spam or ham?
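For instance, here are two ways to write a count of dollar signs in the message body (a numDlr-style variable; check Table 3.1 for the exact definition used there):

    # remove every character except $ and count what is left
    numDlr1 = function(msgBody) sum(nchar(gsub("[^$]", "", msgBody)))

    # find the matches explicitly with gregexpr()
    numDlr2 = function(msgBody) {
      m = gregexpr("\\$", paste(msgBody, collapse = ""))
      sum(unlist(m) > 0)
    }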
Consider other variables that are not listed in Table 3.1, “Variable Definition Table” that might be useful in the
classification problem. Write functions to derive them from the
messages in emailStruct and add them to
emailDF. Refit the classification tree with the
enhanced data frame. Were these new variables chosen to partition the
messages? Is the error in classification improved?
Carry out additional exploratory analysis as described in the section called “Checking Our Code for Errors”. Include in your analysis a
quantile-quantile plot of perCaps for the ham and spam.
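A minimal sketch, assuming emailDF contains the perCaps and isSpam columns used in the chapter:

    qqplot(emailDF$perCaps[!emailDF$isSpam],
           emailDF$perCaps[emailDF$isSpam],
           xlab = "ham quantiles", ylab = "spam quantiles",
           main = "perCaps: spam vs. ham")
    abline(a = 0, b = 1, lty = 2)   # reference line y = x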
Write code to handle the attachments in the message separately from
the text in the body of the message. Since each attachment has its
own header, try processing the header and body of the attachment
in a manner similar to the message header and body. Use the
processHeader()
function to do this. You may need to
revise processHeader()
to handle cases that arise in
the attachments and not in the main headers of the messages.
Consider the other parameters that can be used to control the recursive partitioning process. Read about them in the documentation for rpart.control(). Also, carry out an
Internet search for more information on how to tweak the
rpart()
tuning parameters. Experiment with values for
these parameters. Do the trees that result make sense with your
understanding of how the parameters are used? Can you improve the
prediction using them?
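For example, two fits with different control settings might be compared as follows (trainDF and the isSpam formula follow the chapter; the parameter values are arbitrary starting points):

    library(rpart)

    fitA = rpart(isSpam ~ ., data = trainDF, method = "class",
                 control = rpart.control(cp = 0.001, minsplit = 10))
    fitB = rpart(isSpam ~ ., data = trainDF, method = "class",
                 control = rpart.control(cp = 0.05, maxdepth = 5))

    # compare complexity and cross-validated error across the fits
    printcp(fitA)
    printcp(fitB)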
In the section called “Classifying New Messages” we used the test set that we had put aside both to select τ, the threshold for the log odds, and to evaluate the Type I and II errors incurred when we use this threshold. Ideally, we choose τ from another set of messages that is independent of both our training data and our test data. The method of cross-validation is designed to use the training set for both training and validating the model. Implement 5-fold cross-validation to choose τ and assess the error rate with our training data. To do this, follow these steps (a sketch of the fold construction appears after the list):
Use the sample()
function to permute the indices of the
training set, and organize these permuted indices into 5 equal-size
sets, called folds.
For each fold, take the corresponding subset from the training data to use as a 'test' set. Use the remaining messages in the training data as the training set. Apply the functions developed in the section called “Implementing the Naïve Bayes Classifier” to estimate the probabilities that a word occurs in a message given it is spam or ham, and use these probabilities to compute the log likelihood ratio for the messages in the fold.
Pool all of the LLR values from the messages in all of the folds, i.e., from all of the training data, and use these values and the typeIErrorRates() function to select a threshold that achieves a 1% Type I error.
Apply this threshold to our original test set and find its Type I and Type II errors.
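A sketch of the fold construction (numTrain, the number of training messages, is an assumption; any messages left over after dividing by 5 simply stay in every training split):

    k = 5
    permuted = sample(numTrain)          # permute the indices 1:numTrain
    foldSize = floor(numTrain / k)
    folds = split(permuted[1:(k * foldSize)], rep(1:k, each = foldSize))

    for (i in 1:k) {
      testIdx  = folds[[i]]              # hold-out fold
      trainIdx = setdiff(permuted, testIdx)
      # ... estimate the word probabilities from the messages in
      #     trainIdx, compute LLRs for the messages in testIdx, and
      #     save them for the pooled threshold selection ...
    }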
Often in statistics, when we combine two methods, the resulting hybrid outperforms the two pure methods. For example, consider a naïve Bayes
approach that incorporates the derived variables into the probability
calculation. For simplicity, you might try a few variables that
result from the important splits in the recursive partitioning tree
from Figure 3.8, “Tree for Partitioning Email to Predict Spam”, e.g., whether or not the
percentage of capitals in the message body exceeds 13%. These
variables have only two values as the splits are based on yes-no
questions. Develop a hybrid classifier that uses both the word vectors
and these additional features.
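One simple sketch of the hybrid: encode the binary derived variable as a pseudo-word that a message either contains or not, and rerun the existing machinery. Here wordsList and emailDF follow the chapter, the entries of wordsList are assumed to line up with the rows of emailDF, and the pseudo-word name is invented:

    wordsListHybrid = mapply(function(words, capsHi) {
        if (capsHi) c(words, "PSEUDOWORD_perCapsGt13") else words
      }, wordsList, emailDF$perCaps > 13, SIMPLIFY = FALSE)

computeFreqs() and computeMsgOdds() can then be applied to wordsListHybrid unchanged.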
An alternative to recursive partitioning is the k-nearest neighbor method. This method computes the distance between two messages using the values of the derived variables from the section called “Deriving Variables from the Email Message”, or some subset of them. Use the email in trainDF to find the k closest messages to each email in testDF. Then use these
neighbors to classify the test message as spam or ham. That is, use
the neighbors' classifications to vote on the classification for the
test message. Compare this method for predicting spam to the
recursive partitioning approach. Use both Type I and Type II errors in
making your comparison. Include a comparison of the two approaches
from a computational perspective. Is one much faster or easier to
implement?
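A sketch using knn() from the class package, assuming the derived variables in trainDF and testDF are numeric and that isSpam holds the class labels:

    library(class)

    vars = setdiff(names(trainDF), "isSpam")
    # scale the variables so no single one dominates the distance
    trainX = scale(trainDF[, vars])
    testX  = scale(testDF[, vars],
                   center = attr(trainX, "scaled:center"),
                   scale  = attr(trainX, "scaled:scale"))

    pred = knn(trainX, testX, cl = factor(trainDF$isSpam), k = 5)
    table(pred, testDF$isSpam)   # Type I and II errors are off the diagonal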