1. Predicting Location via Indoor Positioning Systems (15)
  2. Modeling Runners' Times in the Cherry Blossom Race (19)
  3. Using Statistics to Identify Spam (22)
  4. Processing Robot and Sensor Log Files: Seeking a Circular Target (17)
  5. Strategies for Analyzing a 12 Gigabyte Data Set: Airline Flight Delays (15)
  6. Pairs Trading (17)
  7. Simulation Study of a Branching Process (12)
  8. A Self-Organizing Dynamic System with a Phase Transition (35)
  9. Simulating Blackjack (11)
  10. Baseball: Exploring Data in a Relational Database (14)
  11. CIA Factbook Mashup (15)
  12. Exploring Data Science Jobs with Web Scraping and Text Mining (20)

1 Predicting Location via Indoor Positioning Systems

  1. Write the code to read the raw training data into the data structure in the first approach described in the section called “The Raw Data”. That is, the data structure is a data frame with a column for each MAC address that detected a signal. For the column name, use the last two characters of the MAC address, or some other unique identifier.

  2. Compare the size of two data structures: the data frame created in the section called “The Raw Data” and the data frame created in the previous problem. Which uses less memory? What is the dimension of each? How might this change with different numbers of devices in the building? different number of signals from the less commonly detected devices? Use object.size() and dim() to address these questions.
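
    For instance, a minimal sketch of this comparison might look like the following, where offline and offlineWide are placeholder names for the data frames built with the two approaches:

    # Report the memory footprint of each representation in megabytes.
    print(object.size(offline), units = "MB")
    print(object.size(offlineWide), units = "MB")

    # Compare dimensions: the long format has one row per detected signal,
    # the wide format has one column per detecting MAC address.
    dim(offline)
    dim(offlineWide)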

  3. Compare the time taken to read the raw data and create the data frame using the two approaches described in the section called “The Raw Data”. Do this for different size subsets of the data (chosen at random) and draw a curve of input size versus the time taken to read the data. Comment on the memory use and speed of the two approaches. Use system.time() and Rprof() to make these comparisons.

  4. Examine the time variable in the offline data. Any change over time in the characteristics of the signal may be caused by, e.g., reduced battery power in the measuring device as time goes by; also, measurements taken on different days may have been made by different people with different levels of accuracy. In addition, examination of time can give insight into how the experiment was carried out. Were the positions close to each other measured at similar times? Do you see any change in the signal strength variability or mean over time? Try controlling for other variables that might affect this relationship.

  5. Write the readData() function described in the section called “Creating a Function to Prepare the Data”. The arguments to this function are the file name, filename and the MAC addresses to retain, subMacs. Determine whether these parameters should have default values or not. The return value is the data frame described in the section called “Cleaning the Data and Building a Representation for Analysis”. Use the findGlobals() function available in codetools to check that the function is not relying on any global variables.
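
    A quick way to run this check is sketched below; readData() here is the function written for this exercise.

    library(codetools)

    # findGlobals() reports the global symbols readData() relies on.
    # Ideally only functions (e.g., read.table) appear here, not data objects.
    findGlobals(readData, merge = FALSE)$variables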

  6. In the section called “Distribution of Signal Strength” we calculated measures of center and location for the signal strengths at each location-angle-access point combination. (See for example Figure 1.9, “Comparison of Mean and Median Signal Strength”.) Another possible summary statistic we can calculate is the Kolmogorov-Smirnov test statistic for normality. If the signal strengths are roughly normal, then we expect the p-values to have a uniform distribution. This leads to about 5% of the p-values for the 8000 tests falling below 0.05.

  7. Write the surfaceSS() function that creates plots such as those in Figure 1.10, “Median Signal at Two Access Points and Two Angles”. This function has three arguments: data for the offline summary data frame, and mac and angle which supply, respectively, the MAC address and angle to select the subset of the data which we want smoothed and plotted.

  8. Consider the scatter plots in Figure 1.11, “Signal Strength vs. Distance to Access Point”. There appears to be curvature in the signal strength – distance relationship. Does a log transformation improve this relationship, i.e., make it linear? Note that the signals are negative values so we need to be careful if we want to take the log of signal strength.

  9. The floor plan for the building (see Figure 1.1, “Floor Plan of the Test Environment”) shows six access points. However, the data contain seven access points with roughly the expected number of signals (166 locations × 8 orientations × 110 replications = 146,080 measurements). With the signal strengths seen in the heat maps of Figure 1.10, “Median Signal at Two Access Points and Two Angles”, we matched the access points to the corresponding MAC addresses. However, two of the MAC addresses seem to be for the same access point. In the section called “Exploring MAC Address” we decided to keep the measurements from the MAC address 00:0f:a3:39:e1:c0 and to eliminate the 00:0f:a3:39:dd:cd address. Conduct a more thorough data analysis into these two MAC addresses. Did we make the correct decision? Does swapping out the one we kept for the one we discarded improve the prediction?

  10. Write the selectTrain() function described in the section called “Choice of Orientation”. This function has three parameters: angleNewObs, the angle of the new observation; signals, the training data, i.e., data in the format of offlineSummary; and m, the number of angles to include from signals. The function returns a data frame that matches trainSS, i.e., selectTrain() calls reshapeSS() (see the section called “Choice of Orientation” for this function definition).

  11. We use Euclidean distance to find the distance between the signal strength vectors. However, Euclidean distance is not robust in that it is sensitive to outliers. Consider other metrics such as the \(L_1\) (Manhattan) distance, i.e., the sum of the absolute values of the differences. Modify the findNN() function in the section called “Finding the Nearest Neighbors” to use this alternative distance. Does it improve the predictions?

  12. To predict location, we use the nearest neighbors to a set of signal strengths. We average the known values for these neighbors. However, a better predictor might be a weighted average, where the weights are inversely proportional to the “distance” (in signal strength) from the test observation. This allows us to include the points that are close, but to differentiate between them by how close they actually are. The weights might be \(1/d_i\) for the \(i\)-th closest neighboring observation, where \(d_i\) is the distance from our new test point to this reference point (in signal strength space). Implement this alternative prediction method. Does this improve the predictions? Use calcError() to compare this approach to the simple average.
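
    As an illustration, a sketch of the weighted average for a single test point follows; dists and neighborXY are hypothetical objects holding the signal strength distances to the k nearest training points and their known x-y positions.

    # Inverse-distance weights, rescaled to sum to 1.
    wts <- (1 / dists) / sum(1 / dists)

    # Weighted average of the neighbors' known positions
    # (each row of neighborXY is scaled by its weight).
    predXY <- colSums(neighborXY * wts)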

  13. In the section called “Cross-Validation and Choice of k” we used cross-validation to choose k, the number of neighbors. Another parameter to choose is the number of angles at which the signal strength was measured. Use cross-validation to select this value. You might also consider selecting the pair of parameters, i.e., k and the number of angles, simultaneously.

  14. The researchers who collected these data implemented a Bayesian approach to predicting location from signal strength. Their work is described in a paper which is available from http://www.informatik.uni-mannheim.de/pi4/publications/King2006g.pdf . Consider implementing this approach to building a statistical IPS.

  15. Other statistical techniques have been developed to predict indoor positions from wireless local area networks. These include [bib:Krishnan, bib:Madigan, and bib:Youssef]. Consider employing one of their approaches to building and testing a statistical IPS with the CRAWDAD data.


2 Modeling Runners' Times in the Cherry Blossom Race

  1. Write a function that uses read.fwf() to read the 28 text tables in MenTxt/ and WomenTxt/ into R. These are called 1999.txt, 2000.txt, etc. and are described in greater detail in the section called “Reading Tables of Race Results into R”. Examine the tables in a plain text editor to determine the start and end position of each column of interest (name, hometown, age, and gun and net time). Use statistics to explore the results and confirm that you have extracted the information from the correct positions in the text.

  2. Revise the extractVariables() function (see the section called “Reading Tables of Race Results into R”) to remove the rows in menTables that are blank. In addition, eliminate the rows that begin with a '*' or a '#' . You may find the following regular expression helpful for locating blank rows in a table

    grep("^[[:blank:]]*$", body)
    

    The pattern uses several meta characters. The ^ is an anchor for the start of the string, the $ anchors to the end of the string, the [[:blank:]] denotes the equivalence class of any space or tab character, and * indicates that the blank character can appear 0 or more times. All together the pattern ^[[:blank:]]*$ matches a string that contains any number of blanks from start to end.

  3. Find the record where the time is only 1.5. What happened? Determine how to handle the problem and which function needs to be modified: extractResTable() , extractVariables() , or cleanUp() . In your modification, include code to provide a warning message about the rows that are being dropped for having a time that is too small.

  4. Examine the head and tail of the 2006 men's file. Look at both the character matrix in the list called menResMat and the character vector in the list called menFiles (see the section called “Reading Tables of Race Results into R”). (Recall that the desired character matrix in menResMat and the character vector in menFiles both correspond to the element named "2006"). What is wrong with the hometown? Examine the header closely to figure out how this error came about. Modify the extractVariables() function to fix the problem.

  5. Write the convertTime() function described in the section called “Data Cleaning and Reformatting variables”. This function takes a string where time is in either the format hh:mm:ss or mm:ss. The return value is the time as a numeric value of the number of minutes. Design this function to take a character vector with multiple strings and return a numeric vector.
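
    A minimal sketch of one possible implementation (not necessarily identical to the one in the text) is:

    convertTime <- function(time) {
      # Split each "hh:mm:ss" or "mm:ss" string into numeric pieces.
      timePieces <- lapply(strsplit(time, ":"), as.numeric)

      # Two pieces mean mm:ss; three pieces mean hh:mm:ss.
      sapply(timePieces, function(x) {
        if (length(x) == 2) x[1] + x[2] / 60
        else 60 * x[1] + x[2] + x[3] / 60
      })
    }

    convertTime(c("2:15:00", "45:30"))   # 135.0  45.5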

  6. Modify the createDF() function in the section called “Data Cleaning and Reformatting variables” to handle the formatting problem with the 2006 male file. You will need to carefully inspect the raw text file in order to determine the problem. (Hint: ).

  7. Follow the approach developed in the section called “Reading Tables of Race Results into R” to read the files for the female runners and then process them using the functions in the section called “Data Cleaning and Reformatting variables” to create a data frame for analysis. You may need to generalize the createDF() and extractVariables() functions to handle additional oddities in the raw text files.

  8. Modify the call to the plot() function that created Figure 2.6, “Default Scatter Plot for Run Time vs. Age for Male Runners” to create Figure 2.7, “Revised Scatter Plot of Male Runners”. To do this, read the documentation for plot() to determine which parameters could be helpful; that is, ?plot.default contains helpful information about the commonly used graphical parameters.

  9. Modify the piecewise linear fit from the section called “Fitting Models to Average Performance” to include a hinge at 70. Examine the coefficients from the fit and compare the fitted curve to the loess curve. Does the additional hinge improve the fit? Is the piecewise linear model closer to the loess curve?

  10. We have seen that the 1999 runners were typically older than the 2012 runners. Compare the age distribution of the runners across all 14 years of the races. Use quantile-quantile plots, boxplots, and density curves to make your comparisons. How do the distributions change over the years? Was it a gradual change?

  11. Normalize each male runner's time by the fastest time for the runner of the same age. To do this, find the fastest runner for each year of age from 20 to 80. The tapply() function may be helpful here. Smooth these times using loess() , and find the smoothed time using predict() . Use these smoothed times to normalize each run time. Use density plots, quantile-quantile plots, and summary statistics to compare the distribution of the age-normalized times for the runners in 1999 and 2012. What do you find? Repeat the process for the women. Compare the women in 1999 to the women in 2012 and to the men in 1999 and 2012.
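
    One possible sketch of the normalization step follows; menRes, with columns age and runTime, is a hypothetical data frame of the male results.

    # Fastest time observed at each year of age.
    fastestByAge <- tapply(menRes$runTime, menRes$age, min, na.rm = TRUE)
    fastDF <- data.frame(age = as.numeric(names(fastestByAge)),
                         best = as.numeric(fastestByAge))
    fastDF <- subset(fastDF, age >= 20 & age <= 80)

    # Smooth the fastest times and evaluate the smooth at each runner's age.
    bestFit <- loess(best ~ age, data = fastDF)
    smoothBest <- predict(bestFit, newdata = data.frame(age = menRes$age))

    # Age-normalized time: each runner's time relative to the smoothed best.
    menRes$normTime <- menRes$runTime / smoothBest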

  12. Clean the strings in home in menRes to remove all leading and trailing blanks and multiple contiguous blanks. Also make all letters lower case and remove any punctuation such as '.' or ',' or ''' characters from the string. Assign the cleaned version of home into menRes. Call it homeClean.
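
    A sketch of this cleaning with gsub() and tolower() might be:

    homeClean <- tolower(menRes$home)
    homeClean <- gsub("[.,']", "", homeClean)                        # drop punctuation
    homeClean <- gsub("^[[:blank:]]+|[[:blank:]]+$", "", homeClean)  # trim ends
    homeClean <- gsub("[[:blank:]]+", " ", homeClean)                # collapse blanks
    menRes$homeClean <- homeClean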

  13. In the section called “Constructing a Record for an Individual Runner across Years” we created an id for a runner by pasting together name, year of birth and state. Consider using the home town instead of the state. What is the impact on the matching? How many runners have competed in at least 8 races using this new id? What if you reduced the number of races to 6? Should this additional restriction be used in the matching process?

  14. Further refine the set of athletes in the longitudinal analysis by dropping those IDs (see the section called “Constructing a Record for an Individual Runner across Years”) who have a large jump in time in consecutive races and who did not compete for two or more years in a row. How many unique IDs do you have when you include these additional restrictions? Does the longitudinal analysis in the section called “Modeling the Change in Running Time for Individuals” change?

  15. Follow the procedures developed in the section called “Constructing a Record for an Individual Runner across Years” to clean the female runners names and hometown and create longitudinal records for the females. Then follow the modeling of the section called “Modeling the Change in Running Time for Individuals” to investigate the run time – age relationship for women who competed in multiple races.

  16. Consider adapting a nonparametric curve fitting approach to the longitudinal analysis. Rice [bib:Rice] suggests modeling an individual's behavior as a combination of an average curve plus an individual curve. That is, the predicted performance for an individual comes from the sum of the average curve and the individual's curve: \(f_i(a) = \bar{f}(a) + g_i(a)\), where \(f_i(a)\) is the performance of individual \(i\) at age \(a\), \(\bar{f}\) is the average curve, and \(g_i\) is the individual's curve. He suggests a two-step process to do this: (a) take a robust average of all of the smoothed curves for the individuals; (b) subtract this average smoothed curve from the individual data points and smooth the residuals.

    Rather than using only the individual's run times to produce the individual's curve, Rice also suggests smoothing over a set of nearest neighbors' times. Here a nearest neighbor is a runner with similar times for similar age.

  17. In the section called “Scraping Race Results from the Web”, we discovered that the HTML file for the male 2000 results was so poorly formatted that htmlParse() was unable to fix it to allow us to extract the text table from the <pre> tag. In this exercise, we programmatically edit this HTML file so that we can use htmlParse() as desired. To do this, begin by reading the HTML file located at http://www.cherryblossom.org/cb003m.htm using readLines() . Carefully examine the HTML displayed in the section called “Scraping Race Results from the Web” and come up with a plan for correcting it. Consider whether you want to drop <font>s or close them properly. Once you have fixed the problem so that the <pre> tag contains the text table, pass your corrected HTML to htmlParse() . You may want to use a text connection to do this rather than writing the file to disk and reading it in.

  18. Revise the extractResTable() function in the section called “Scraping Race Results from the Web” so that it can read the male 2009 results. Carefully examine the raw HTML to determine how to find the information about the runners. Work with XPath to locate <div> and <pre> tags and extract the text value. The female 2009 results do not need this special handling. Modify the extractResTable() function to determine whether to perform the special processing of the <div> and <pre> tags.

  19. Revise the extractResTable() function in the section called “Scraping Race Results from the Web” so that it takes an additional parameter: file. Give the file parameter a default value of NULL. When NULL, the parsed results are returned from extractResTable() as a character vector. If not NULL, the results are written to the file named in file. The writeLines() function should be helpful here.


3 Using Statistics to Identify Spam

  1. We hand selected email to belong to the sample set in sampleEmail. Instead of this approach, use the sample function to choose messages at random for the sample. Be sure to take files from all five directories of email. Read these files into R and test several functions with these new messages, e.g., getBoundary() and dropAttach() from the section called “Removing Attachments from the Message Body”, to make sure that they work properly.

  2. In the text mining approach to detecting spam we ignored all attachments in creating the set of words belonging to a message (see the section called “Removing Attachments from the Message Body”). Write a function to extract words from any plain text or HTML attachment and include these words in the set of a message's words. Try to reuse the findMsg() function and modify the dropAttach() function to accept an additional parameter that indicates whether or not the words in attachments are to be extracted. Does this change improve the classification?

  3. The string manipulation functions in R can be used instead of regular expression functions for finding, changing, and extracting substrings from strings. These functions include: strsplit() to divide a string up into pieces, substr() to extract a portion of a string, paste() to glue together multiple strings, and nchar() , which returns the number of characters in a string. Write your own version of getBoundary() (see the section called “Removing Attachments from the Message Body”) using these functions to extract the boundary string from the Content-Type. Debug your function with the messages in sampleEmail.

  4. Write the dropAttach() function for the section called “Removing Attachments from the Message Body”. This function has two inputs, the body of a message and the boundary string that marks the location of the attachments. It returns the body without its attachments. Include in the return value the lines of the body that follow the first boundary string up to the string marking the first attachment and the lines following the ending boundary string. Be sure to consider the idiosyncratic cases of no attachments and a missing ending boundary string.

  5. Write the function findMsgWords() of the section called “Extracting Words from a Message Body”. This function takes as input the message body (with no attachments) and the return value is a vector of the unique words in the message. That is, we only track which words are in the message, not the number of times each appears there. This function should eliminate punctuation, digits, and blanks from the message. Consider whether it is simpler to split the string by blanks first and then process the punctuation, digits, etc. The function should convert capital letters to lower case, and drop all stop words and words that are only one letter long. A vector of stop words is available in the tm package. Use the stopwords() function in this package to create the vector.
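
    A rough sketch of one way to implement this, using the tm package only for its stop word list, is:

    library(tm)   # provides stopwords()

    findMsgWords <- function(msg, stopWords = stopwords("en")) {
      if (is.null(msg) || length(msg) == 0) return(character())

      # Lower case, replace punctuation and digits with blanks, then split.
      cleanMsg <- tolower(gsub("[[:punct:][:digit:]]+", " ", msg))
      words <- unlist(strsplit(cleanMsg, "[[:blank:]]+"))

      # Keep unique words longer than one character that are not stop words.
      words <- words[nchar(words) > 1]
      unique(words[!(words %in% stopWords)])
    }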

  6. Try to improve the text cleaning in findMsgWords() of the section called “Extracting Words from a Message Body” by stemming the words in the messages. That is, make plural words singular and reduce present and past tenses to their root words, e.g., run, ran, runs, running all have the same stem. To do this, use the stemming functions available in the text mining package tm. Incorporate this stemming process into the findMsgWords() function. Then recreate the vectors of words for all the email and see if the classification improves.

  7. Consider the treatment of URLs in the text cleaning in findMsgWords() of the section called “Extracting Words from a Message Body”. Notice that this function often turns a URL into gibberish. Should we drop URLs altogether from the messages, or should we try to keep each URL as one whole word? Why might these alternatives be better or worse than the approach taken in the section called “Extracting Words from a Message Body”? Try one of these alternatives and compare it to the approach of that section to see if it improves the classification.

  8. In the section called “Computational Considerations” we saw a few alternative mathematical expressions for the naïve Bayes approximation to the log likelihood ratio of the chance a message is spam or ham. Each suggests a different approach to carrying out the computation. Create alternative versions of computeFreqs() and computeMsgOdds() in the section called “Implementing the Naïve Bayes Classifier” to calculate the log odds. Compare the accuracy of these alternatives to the approach used in the section called “Implementing the Naïve Bayes Classifier”. Also consider timing the computation with a call to system.time() to determine if one approach is much faster or slower than the others.

  9. The function computeFreqs() in the section called “Implementing the Na´ve Bayes Classifier” uses as a default bag of words constructed from the words passed to it in the argument wordsList. However, it is possible to supply a different bag of words via the bow parameter. When this happens, it may be the case that some words in wordsList are not found in the bag of words. This causes an error when running the function. Determine which lines of code in computeFreqs() are problematic and update the function to handle this situation. Be sure to test your code.

  10. Use the typeIErrorRates() function in the section called “Classifying New Messages” as a guide to write a function that computes the Type II error rates, i.e., the proportion of spam messages that are misclassified as ham. As with the Type I error rates, convince yourself that the error rate is monotonic in τ, that it changes only at the values of the LLR in the provided messages, and that you only need to consider these values for the spam messages. This function, called typeIIErrorRates() , has the same inputs as the typeIErrorRates() function and returns the same structure. The only difference is that the rates returned are based on Type II errors.

  11. In the section called “Processing the Header”, we used the read.dcf() function to read the key: value data in the email headers. In this exercise, we use regular expressions to extract the keys and their values from the header.

    The first step in the process is to find the continuation lines for a value, and then collapse them with the first line of the value. These continuation lines start with either a blank space or a tab character. Use regular expressions to locate them. Then paste them to the first line of the value.

    Next break the revised set of key:value strings into the keys and values. Again use regular expressions to do this. Then create a named character vector from these keys and values, with the keys as the names and the values as the elements.
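
    A sketch of these two steps for a single header follows; header is a hypothetical character vector holding the raw header lines of one message.

    # Step 1: paste continuation lines (those starting with a space or tab)
    # onto the line above them, working from the bottom up.
    contIdx <- grep("^[ \t]", header)
    if (length(contIdx) > 0) {
      for (i in rev(contIdx))
        header[i - 1] <- paste(header[i - 1], sub("^[ \t]+", "", header[i]))
      header <- header[-contIdx]
    }

    # Step 2: split each "key: value" string at the first colon and
    # build a named character vector of values.
    keys <- sub("^([^:]+):.*$", "\\1", header)
    values <- sub("^[^:]+:[ \t]*", "", header)
    headerVec <- setNames(values, keys)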

  12. Write the processAttach() function for the section called “Processing Attachments”. This function has two inputs, the body of a message and the Content-Type value. It returns a list of two elements. The first is called body and it contains the message body without its attachments. Be sure to consider the idiosyncratic cases of no attachments and a missing ending boundary string. The second element is a data frame called attachDF. This data frame has one row for each attachment in the message. There are two variables in this data frame, aLen and aType, which hold the number of lines and the MIME type of each attachment, respectively.

  13. Write a function to handle an alternative approach to measure yelling: count the number of lines in the email message text that consist entirely of capital letters. Carefully consider the case where the message body is empty. How do you modify your code to report a percentage rather than a count? In considering this modification, be sure to make clear how you handle empty lines, lines with no alpha-characters, and messages with no text.

  14. Write alternative implementations for the isRe() function in the section called “Deriving Variables from the Email Message”. For one alternative, instead of only checking whether the subject of a message begins with Re:, look also for Fwd: Re:. For a second alternative, check for Re: anywhere in the subject, not just at the beginning of the string. Analyze the output from these three functions, including the original isRe() function in the section called “Deriving Variables from the Email Message”. How many more messages have a return value of TRUE for these alternatives, and are they all ham? Which do you think is most useful in predicting spam?
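
    For example, minimal sketches of the two alternatives, each operating on a character vector of subject lines, could be:

    # Alternative 1: the subject starts with "Re:" or "Fwd: Re:".
    isReFwd <- function(subject)
      grepl("^(fwd: *)?re:", subject, ignore.case = TRUE)

    # Alternative 2: "Re:" appears anywhere in the subject.
    isReAnywhere <- function(subject)
      grepl("re:", subject, ignore.case = TRUE)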

  15. Choose several of the ideas listed in Table 3.1, “Variable Definition Table” for deriving features from the email and write functions for them. Be sure to check your code against what you expect. Try writing one of these functions in two different ways and compare the output from each. Use exploratory data analysis techniques to check that your code works as expected. Does the output of your function match the values in the corresponding columns of emailDF? If not, why do you think this might be the case? Does it appear that this derived variable will be useful in identifying spam or ham?

  16. Consider other variables that are not listed in Table 3.1, “Variable Definition Table” that might be useful in the classification problem. Write functions to derive them from the messages in emailStruct and add them to emailDF. Refit the classification tree with the enhanced data frame. Were these new variables chosen to partition the messages? Is the error in classification improved?

  17. Carry out additional exploratory analysis as described in the section called “Checking Our Code for Errors”. Include in your analysis a quantile-quantile plot of perCaps for the ham and spam.

  18. Write code to handle the attachments in the message separately from the text in the body of the message. Since each attachment has its own header, try processing the header and body of the attachment in a manner similar to the message header and body. Use the processHeader() function to do this. You may need to revise processHeader() to handle cases that arise in the attachments and not in the main headers of the messages.

  19. Consider the other parameters that can be used to control the recursive partitioning process. Read the documentation for them in the rpart.control() documentation. Also, carry out an Internet search for more information on how to tweak the rpart() tuning parameters. Experiment with values for these parameters. Do the trees that result make sense with your understanding of how the parameters are used? Can you improve the prediction using them?

  20. In the section called “Classifying New Messages” we used the test set that we had put aside to both select τ, the threshold for the log odds, and to evaluate the Type I and II errors incurred when we use this threshold. Ideally, we choose τ from another set of messages that is independent of both our training data and our test data. The method of cross-validation is designed to use the training set for training and validating the model. Implement 5-fold cross-validation to choose τ and assess the error rate with our training data. To do this, follow the steps below (a sketch of step 1 appears after the list):

    1. Use the sample() function to permute the indices of the training set, and organize these permuted indices into 5 equal-size sets, called folds.

    2. For each fold, take the corresponding subset from the training data to use as a 'test' set. Use the remaining messages in the training data as the training set. Apply the functions developed in the section called “Implementing the Naïve Bayes Classifier” to estimate the probabilities that a word occurs in a message given it is spam or ham, and use these probabilities to compute the log likelihood ratio for the messages in the training set.

    3. Pool all of the LLR values from the messages in all of the folds, i.e., from all of the training data, and use these values and the typeIErrorRates() function to select a threshold that achieves a 1% Type I error.

    4. Apply this threshold to our original/real test set and find its Type I and Type II errors.
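
    A sketch of setting up the folds in step 1, assuming trainIdx holds the indices of the training messages, might be:

    # Permute the training indices and split them into 5 folds of
    # (nearly) equal size.
    k <- 5
    permIdx <- sample(trainIdx)
    folds <- split(permIdx, rep(1:k, length.out = length(permIdx)))

    # folds[[i]] serves as the held-out 'test' indices for the i-th fold;
    # the remaining indices form that fold's training set.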

  21. Often in statistics, when we combine two methods, the resulting hybrid outperforms the two pure methods. For example, consider a naïve Bayes approach that incorporates the derived variables into the probability calculation. For simplicity, you might try a few variables that result from the important splits in the recursive partitioning tree from Figure 3.8, “Tree for Partitioning Email to Predict Spam”, e.g., whether or not the percentage of capitals in the message body exceeds 13%. These variables have only two values as the splits are based on yes-no questions. Develop a hybrid classifier that uses both the word vectors and these additional features.

  22. An alternative to recursive partitioning is the nearest neighbor method. This method computes the distance between two messages using the values of the derived variables from the section called “Deriving Variables from the Email Message”, or some subset of them. Use the email in trainDF to find the closest messages to each email in testDF. Then use these neighbors to classify the test message as spam or ham. That is, use the neighbors' classifications to vote on the classification for the test message. Compare this method for predicting spam to the recursive partitioning approach. Use both Type I and Type II errors in making your comparison. Include a comparison of the two approaches from a computational perspective. Is one much faster or easier to implement?


4 Processing Robot and Sensor Log Files: Seeking a Circular Target

  1. When creating the two data frames in the function, we extract a subset of values from each record, create a matrix and then a data frame. This essentially involves creating at least three copies of the relevant values in memory. What are they? Are there better ways to do this? Consider the time, x, y data frame, for example. We could use lapply() , rather than sapply() , to return a list of vectors of length 3. We could then use as.data.frame(do.call(rbind, listOfVectors)) to create the data frame (see the sketch after this list). Does this reduce the amount of memory used? Are there other ways to arrange the computations to avoid having three copies of the data in memory at once? If so, are these faster or slower than our approach above in readLog() ?
  2. Did the robot find the target in that experiment?
  3. Instead of changing the color uniformly across observations, we can change it based on the time difference between consecutive records. This will help to convey velocity. Implement this and visualize the experiments.
  4. We have colored the points in each path using a different, but related, sequence of colors. In colorRamp() , we compute a vector of colors based on the number of records in the log. This means that the i-th color is different across the logs. Consider using the same set of colors across all logs. Does this convey more information, e.g., allow us to compare the number of records in each log? Does this improve the overall display?
  5. Create a variation of the plotLook() code that produces the plot in panel (b) of Figure 4.9, “Enhanced Display of a Look”. Divide the points into separate sub-groups that are either all equal to 2 meters or all less than 2 meters and use points() with type = "l" for each sub-group.
  6. How many different robots were used for these experiments? Are there any differences in their operating characteristics? That is, do different robots have a different error distribution?
  7. Can we vectorize the computations in getSegments() to avoid the for loop? Given that we are looping over the number of objects we see, not the 360 range values, does vectorizing significantly improve the performance? Think about how often we call getSegments() -- once for each look in each log file.
  8. The getSegments() function has similarities to the plotLook2() function (the function we developed earlier). Rewrite plotLook2() to call getSegments() . Is this a good idea? Does it simplify the code? avoid repeating the same computations in two places? avoid testing similar code?
  9. Test the function getWrappedSegments() for other looks, e.g., with no objects seen, more than 3 segments, different thresholds.
  10. Experiment with values of this distance cut-off to find a reasonable value for separating segments that correspond to two obstacles.
  11. Implement the evalSegment() function described above. Also, implement a function evalSegments() which takes a look and computes the segments and calls evalSegment() on each of these.
  12. Our fit function used the sum of squares between the radii. We could use other criteria such as absolute value, i.e., the \(L_1\) norm. Explore different criteria. How does this change the code we use to determine if the arc corresponds to part of a circle? How does it change how we classify different looks?
  13. Implement the geometric approach to estimating the center of the target and use that as the initial guess we pass to nlm() .
  14. Modify robot.evaluation() to return all of the segments that could be considered part of a circle, and not just the first one.
  15. If the robot is very close to the target, it is hard to identify the shape of the target as circular. The visible part of the circle looks close to a line because the robot can see fewer points on the arc of the circle. How can we adjust the classifier or the criteria for accepting a segment as the target to handle this case? How do we guard against classifying an obstacle or side as the circular target in these cases?
  16. In the streaming data case where we process one look at a time and then determine where to move to next, we'd like to be able to identify a circle and move towards it if we are not certain whether it actually is a circle. How can we provide a level of confidence or uncertainty in our classification of a segment as the circular target?
  17. Study the code in this case study and modify it to allow a user to substitute his/her own functions for each of the different components we might want to change. For example, allow a different function for fitting the circle, or metric for evaluating the fit for a given triple of parameters. Similarly, allow for a different mechanism for computing an initial estimate of the center of the circle for the call to nlm() .
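
    As referenced in exercise 1, a sketch of the lapply() variant follows; logRecords and vals3 are hypothetical names for the list of parsed records and the function that extracts the time, x, y triple from one record.

    # Extract a length-3 numeric vector (time, x, y) from each record,
    # bind the vectors into a matrix, and convert once to a data frame.
    listOfVectors <- lapply(logRecords, vals3)
    locations <- as.data.frame(do.call(rbind, listOfVectors))
    names(locations) <- c("time", "x", "y")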

5 Strategies for Analyzing a 12 Gigabyte Data Set: Airline Flight Delays

  1. Using the UNIX shell, create the AirlineDataAll.csv file without using a loop.
  2. Write a preprocessing script (using the shell, R, Python or any tools) to create a file that can be used with bigmemory, i.e., convert the non-numeric values to numeric values in some well-defined manner.
  3. How many flights were there for each day of the week?
  4. For each year, how many flights were there for each day of the week?
  5. For each year, how many of the tail codes are listed as NA?
  6. Which year had the greatest proportion of late flights? Is this result significant?
  7. Which flight day is best for minimizing departure delays? Which time of day?
  8. Which is the best day of the week to fly?
  9. Which is the best day of the month to fly?
  10. Are flights being given more time to reach their destination for later years?
  11. Which departure and arrival airport combination is associated with the worst delays?
  12. One of the examples in this section creates a vector, named planeStart, which gives the first month in which a plane with a given tail code appears in the data set. Estimate the amount of time the loop to create this vector will take to run sequentially, and in parallel.
  13. How many of the planes' ages are censored?
  14. How much do weather delays contribute to arrival delay?
  15. Along with age, which other variables in the airline delay data set contribute to arrival delays?

6 Pairs Trading

  1. Instead of using download.file() and copying the data to a local file, we can use getForm() in the RCurl package to fetch the data directly into an R session. We can then pass the resulting content to read.csv() using a textConnection() . This approach is sometimes necessary when we need more control over how the request is sent to get the Web page. For example, some Web sites require a login or some form of authentication. Write the code to retrieve and read the data using this approach.
  2. What is the purpose of the ... in the function definition?
  3. Modify combine2Stocks() to accept an additional parameter that controls whether it adds the ratio to the data frame.
  4. In addition to drawing the lines, we might also want to identify them. Add code to display the value of k on each line.
  5. Instead of using rep() to repeat the color "red", can we use R's recycling rule and merely specify
    col = c("darkgreen", "red", "red")
    
    and have this work correctly with a vector of values for k?
  6. We have used the colors red and darkgreen to indicate the lines for the average ratio and the two extremes, respectively. Why might the choice of colors green and red be bad? How would we define our function to allow the caller to specify different colors?
  7. Why don't we start at a[2] + 1 rather than the same day we closed the position?
  8. Can we search over the different values of k more intelligently? For example, can we find just the set of unique values which will lead to different positions?
  9. Are we in danger of overfitting our training data?
  10. Download historical stock price information for many different stocks. Then explore the pairs trading strategy for each pair of stocks. Explore the distribution of profits. Are there characteristics of stocks that seem to lead to larger gains? Does this provide insight into when pairs trading might work or fail?
  11. In one implementation of our code, we used X = matrix(NA, n, 2) rather than X = matrix(0, n, 2). Why is the second slightly more efficient?
  12. Write a function to plot two series on the same plot using two separate axes, one on the left and one on the right.
  13. How can we test the runSim() function? Implement these tests.
  14. In addition to the total profit for a simulation, we would like to examine the number of trading positions and the optimal value of k used for trading. Modify the functions to collect this information and explore the results.
  15. Rather than using the global variable counter in R's global environment, use a closure, in other words, lexical scoping, to manage a "private" variable that keeps the count of the number of times findNextPosition() is called. You do this by creating a function that returns a list with two functions. One of these functions increments the counter variable. The other function returns the current value of that counter. You can also add a third function that resets the counter. Use the incrementing function with trace() to count the number of calls to findNextPosition() (see the sketch after this list).
  16. Our pairs trading approach uses the mean and standard deviation from the training data for the thresholds for the future/test data. We could, however, update the thresholds after we receive the new stock prices, e.g., each day. We can use the same mechanism to determine the optimal value of k with the updated training set, created by adding the new observation to the previous training set. This then changes when we open or close a position in the future. Implement this approach and explore how the profit rate changes.
  17. We have used a symmetric trading rule, i.e., \(\bar{r} \pm k \sigma_r\). We could use an asymmetric rule with a different deviation from the mean above and below, i.e., \(\bar{r} + k_a\) and \(\bar{r} - k_b\). Implement a mechanism to find the optimal values for \(k_a\) and \(k_b\), and then explore the properties of that trading approach and how it compares to the symmetric approach.
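
    For exercise 15 above, a minimal sketch of such a closure is:

    makeCounter <- function() {
      count <- 0
      list(increment = function() count <<- count + 1,
           value     = function() count,
           reset     = function() count <<- 0)
    }

    counter <- makeCounter()
    # Count calls to findNextPosition() without touching the global environment.
    trace(findNextPosition, counter$increment, print = FALSE)
    # ... run the simulation ...
    counter$value()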

7 Simulation Study of a Branching Process

  1. Write an alternative function to genKids() (see the section called “Generating Offspring”), called genKidsB() , where the B stands for batch. In this function, generate the inter-arrival times in batches, where the size of the batch depends on the expected number of offspring for a job. The expected number depends on the rate and the birth and completion times of the parent. That is, for a job born at time \(t_b\) and completed at time \(t_c\), we expect it to have \(\lambda(t_c - t_b)\) offspring. The parameters to this function and their default values should be the same as genKids() .

  2. Write the function genKidsU() described in the section called “Considering Alternative Implementations”. This function generates the birth and completion times for the offspring of a job that is born at \(t_b\) and completes at \(t_c\). The number of children follows a Poisson(\(\lambda(t_c - t_b)\)) distribution. Once the number of children is known, their birth times can be generated according to the uniform distribution on the interval \((t_b, t_c)\). As mentioned in the section called “Considering Alternative Implementations”, these times are not generated in order so they need to be sorted. The parameters to genKidsU() and their default values should be the same as genKids() .
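
    A sketch of this approach follows; the parameter names bTime, cTime, lambda, and kappa, their default values, and the exponential completion times are assumptions chosen to match the chapter's setup.

    genKidsU <- function(bTime, cTime, lambda = 0.5, kappa = 0.3) {
      # The number of offspring is Poisson with mean lambda * (cTime - bTime);
      # given that count, the birth times are uniform on (bTime, cTime).
      nKids <- rpois(1, lambda * (cTime - bTime))
      births <- sort(runif(nKids, bTime, cTime))

      # Each offspring's lifetime is exponential with rate kappa.
      completes <- births + rexp(nKids, rate = kappa)
      data.frame(births = births, completes = completes)
    }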

  3. Develop a test suite to confirm that the recursive function genKidsR() (see the section called “Generating Offspring”) is consistent with the while-loop version of genKids() (also in the section called “Generating Offspring”).

  4. Use Rprof() to profile the genKidsU() function in the previous exercise. As in the section called “Profiling and Improving Our Code”, profile the code by calling genKidsU() 1000 times with the parameter settings: bTime = 1, cTime = 100, and the default values for the rates lambda and kappa. Also profile the code with one simulation, where cTime = 1000000. Do the profiles look different? Try improving the efficiency of your code based on the profile information. Additionally, does the code include calls to force() and deparse() ? Or some other unexpected functions? Can you determine why these functions are being called?

  5. Develop a set of test cases for genKidsV() (see the section called “Unit Testing”). Write code to check the output from genKidsV() for these test cases.

  6. Figure 7.4, “Visualization of a Randomly Generated Branching Process” is a custom visualization of the birth and completion times for a tree. Design an alternative custom visualization of the return value from familyTree() .

  7. Incorporate into familyTree() a limit on the time that the tree is observed. If we want to observe a process up until time \(t\), then those offspring with birth times after \(t\) are discarded. Also, the simulation stops once all of the observed births in a generation are after \(t\). See Figure 7.5, “Visualization of a Randomly Generated Branching Process Over a Fixed Time Interval” for a visualization of the first five generations of a simulated process that is truncated at time 8.

  8. Update the genKidsV() function developed in the section called “From One Job's Offspring to an Entire Generation” to return one data frame rather than a list of data frames. This data frame needs to include additional columns that supply the parent and offspring identifiers. See the section called “A Structure for the Function's Return Value” for an example of the modified return value.

  9. The branching process was summarized by two statistics: the number of generations and the number of offspring (see the section called “Replicating the Simulation”). Consider other summary statistics for the process. Incorporate them into exptOne() . Carry out a simulation study and create a visualization of the simulation that uses these additional statistics. Do they confirm the earlier findings? Do they offer any new insights?

  10. Carry out a simulation study to see if the re-parameterization suggested in the section called “Analyzing the Simulation Results” is appropriate. For example, fix \(\kappa\) to be 1, and run the simulation for various values of \(\lambda\). Compare the results to other simulations where \(\kappa\) is not 1, but the ratio \(\lambda/\kappa\) matches one of the values from the earlier simulation when \(\kappa\) was 1.

  11. Consider other probability functions to describe the lifetime of a process. Revise familyTree() (see the section called “The Family Tree: Simulating the Branching Process”) and genKidsV() (see the section called “From One Job's Offspring to an Entire Generation”) to take as an argument the random number generator for any probability distribution. The functions familyTree() and genKidsV() are to use this probability distribution (with arguments that may be specific to the distribution) to generate the completion times of the jobs.

  12. Redesign the simulation study, where rather than generating one branching process at a time, the processes are generated in a vectorized fashion. This may require rewriting genKidsV() (the section called “From One Job's Offspring to an Entire Generation”) and familyTree() (the section called “The Family Tree: Simulating the Branching Process”).


8 A Self-Organizing Dynamic System with a Phase Transition

  1. Does the function above handle all input cases? If not, which ones could occur that haven't been handled?
  2. What if we used
    sample(c("red", "blue"), sum(numCars), prob = numCars, 
           replace = TRUE)
    
    to generate the colors for the sampled locations? Is this qualitatively the same as the previous code with rep() ? If not, what is the difference? Will this matter for our simulations?
  3. Does this lead to an imbalance in the number of cars? Is this what we want?
  4. Add code to the createGrid() function to raise an error or warning as appropriate when the caller specifies infeasible inputs for the number of cars or for the dimensions of the grid.
  5. Should we allow the caller to use names on the vector of number of cars numCars to indicate the colors or values for the two types of cars in the matrix? Modify the createGrid() function to use these.
  6. The quotes around the color names (e.g., "blue") are distracting. Write a different version of print.BMLGrid() that produces the display without the quotes. Reuse existing functionality in R rather than writing code to format individual lines of output. Hint: when does R display character vectors/strings without quotes?
  7. Instead of using the names of the colors, use arrows such as ↑ and →.
  8. How would we change our plot.BMLGrid() function to allow the caller to override the axes = FALSE argument in the call to image() ? In other words, we want no axes to appear by default, but to allow the caller to optionally show them.
  9. Why should we use a data frame to represent this information? What are the alternatives? A matrix? A list? Three variables i, j and pos?
  10. Note that we cannot use an lapply() /sapply() call in place of the for loop above. Why?
  11. Develop tests for the getCarLocations() function.
  12. This information doesn't tell us how many times each function has been called. How could we calculate how often a function was called? Consider using trace() to gather this information. (See Chapter 6, Pairs Trading.)
  13. Why is the use of drop = FALSE important? Under what circumstances will it yield a different result than cbind(nextRows, nextCols)[w, ]?
  14. Modify the functions to use this approach of maintaining the locations of the cars across calls to moveCars() . Then determine how this changes the performance.
  15. When timing the computations, also vary the density of the cars and draw a three-dimensional plot showing the relationship between number of cells, density, and time.
  16. We can avoid these calls to memcpy, both simplifying the code and also making it slightly faster by not copying the contents of newGrid to grid. Make the appropriate changes and verify that the results are the same and correct.
  17. Implement a class and methods for the integer version of the BMLGrid matrix. Try to inherit methods as much as possible and avoid modifying existing code where possible. You can consider different class hierarchies or ancestry orders.
  18. Write the function to compare two grids and verify they are the same. Raise an error if they are not. The error should identify in what way the grids are not the same. You can do this with the error message and/or by specifying a class for the error. (See the help for condition in R.) A sketch of such a function appears after this list.
  19. Use the function to compare the outputs from crunBML() and runBML() for a variety of different grids to verify they give the same results for the same inputs.
  20. Look at the velocity time series for a larger grid with the same density.
  21. Is there a difference in velocity for the vertical and horizontal directions? Does the fact that the horizontal moves first make a difference?
  22. Recreate Figure 8.13, “Average velocity of cars by occupancy density”.
  23. Are there better ways to display the data from Figure 8.13, “Average velocity of cars by occupancy density”?
  24. How long did it take to compute these results? How can we reduce this time?
  25. How do the dimensions of the grid affect velocity? What if the two dimensions are quite different? What if one is a multiple of the other? What if the two dimensions are relatively prime?
  26. Create an R package with all of the code to simulate a BML process.
  27. Use R's byte-compiler function cmpfun() (in the compiler package) to compile our different implementations of the functions moveCars() , getCarLocations() , etc. Compare the performance of these compiled functions to our different approaches.
  28. Run multiple BML processes in parallel to explore its behaviour for different densities and dimensions. Explore the parallel package to do this.
  29. Make an animation of the BML process so that we could show this outside of R. By this, we mean a display that shows the sequence of grid states over time. This might be a video or an SVG file. You can show the changes to the grid at different time intervals.
  30. We might want to look at the time-series of the movement of individual cars across iterations. We could use this to see if particular cars are persistently stuck and others are not, or do they all go through periods of being stuck and then free-flowing. Change the code to allow capturing this information.
  31. If we look at displays of the BML model on, for example, Wikipedia, the patterns in the traffic flow appear in the opposite direction, i.e. from left to right. Why is this?
  32. Write code to find (contiguous) regions in a grid that have very low mobility, i.e., traffic jams. Can we characterize the shapes of these regions?
  33. For grid configurations that end in deadlock, how many iterations does it take to reach complete deadlock? How does this depend on the density of cars and the grid size?
  34. Now that we have the tools for performing experiments, we can explore different behaviors of different configurations. There are many different interesting features uncovered in [bib:Raissa2005]. Explore the intermediate phases. Use co-prime grid dimensions and see if the behavior changes in any way.
  35. We mentioned that the choice of red and blue as colors may not be a good one as some people have a rare form of color blindness that makes these two colors appear similar. Also, displaying colors on overhead projectors and other devices can give very different results than on computer screens. As a result, we may want to change the colors for the two different types of cars. What are better colors to use? Change the code in this case study so that the R user can specify different colors to represent the horizontal and vertical moving cars. Make certain that all of the code still yields the same results! In how many places does the caller have to specify the new colors? Ideally, they should only have to specify it in one place and the rest of the code should know which correspond to horizontal and vertical. How do we implement this?
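
    For exercise 18 above, a minimal sketch of such a comparison function (the name compareGrids() is illustrative) is:

    compareGrids <- function(g1, g2) {
      # Compare dimensions first, then cell-by-cell contents.
      if (!identical(dim(g1), dim(g2)))
        stop("grids differ in dimension: ", paste(dim(g1), collapse = "x"),
             " vs ", paste(dim(g2), collapse = "x"))

      mismatch <- which(g1 != g2)
      if (length(mismatch) > 0)
        stop("grids differ in ", length(mismatch), " cell(s), e.g., at index ",
             mismatch[1])

      invisible(TRUE)
    }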

9 Simulating Blackjack

  1. In the winnings() function in the section called “Blackjack Basics”, we chose to break the function down by the value of the dealer's cards first, and then the player's cards. What happens when we switch this order and start with the player's values? Is the function easier or harder to understand? Write this alternative version of the function and test it using the test code in the section called “Testing Functions” to make sure that your code works as expected.

  2. How does the terse version of the winnings() function in the section called “Blackjack Basics” work? Is it consistent with the original version? Test this new version of the function with the test cases developed in the section called “Testing Functions”. Is it easier to follow the logic in the first or second version of the winnings() function? Why? When do you think using code like this is appropriate?

  3. Consider how you might add checks to the hit() , dd() and splitPair() functions (see the section called “Creating Functions for the Player's Actions”) to ensure that they are being used properly. For example, add code to splitPair() so that an error message is returned if the player tries to split a hand when the two cards are not of equal value. Add other checks as well.

  4. Compile a set of test cases for the checks that you developed in the previous exercise. Write a function that calls these functions (hit() , dd() and splitPair() ) with the test cases and returns an informative message indicating whether the functions pass all of the tests and if not where problems occurred. Create other test cases and test functions for the play_hand() function.

  5. Implement the draw_n() method for the Shoe class of the section called “A More Accurate Card Dealer Shoe”. This function has one input: n, which indicates how many cards should be drawn from the shoe. It returns a vector of n cards taken from the cards field of the Shoe object. This vector of cards begins at the position indicated by the pos field. This method also increments pos by n so that we do not draw the same cards multiple times. It also calls the helper method decks_left() to see if it is time to reshuffle. If there is less than one deck left in the shoe then it calls the shuffle() method to shuffle the cards.

  6. Implement the hi_low() function from the section called “Counting Cards”. This function takes one input, which is a numeric vector cards, and it returns the Hi-Lo count (see Table 9.2, “Card Counting Strategies”). If you want an additional challenge, write the function without using an if statement or any loops. For example, consider taking advantage of the fact that the vector of cards is a numeric vector, and use that to subset into a vector of count values.
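
    One if-free sketch, assuming cards are represented as the numeric values 1 through 10 with face cards already mapped to 10 (as in the chapter's deck), is:

    hi_low <- function(cards) {
      # Hi-Lo counts: 2-6 count +1, 7-9 count 0, 10-valued cards and aces count -1.
      counts <- c(-1, 1, 1, 1, 1, 1, 0, 0, 0, -1)
      sum(counts[cards])
    }

    hi_low(c(2, 10, 5, 1))   # 1 - 1 + 1 - 1 = 0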

  7. Implement the any_count() function for counting cards (see the section called “Counting Cards”). This function has two arguments: a numeric vector cards which contains the cards played, and a numeric vector, strategy, which is of length 10 and contains the card values for one of the strategies in Table 9.2, “Card Counting Strategies”. Modify count_payoff() to use any_count() . This may require modifying the function definition to accept a strategy argument.

  8. Suppose a bet was larger than $1, say $4, how might the bet() function take this into consideration in deciding the size of the bet? Modify bet() (in the section called “Counting Cards”) so that bets are integers, but utilize a standard bet size. What impact does this change have on other functions, e.g., winnings() ? Is there a need to modify any of the earlier code?

  9. Implement other play options that are available in some games of blackjack, such as insurance and surrendering. Think about the play_hand() function: can you simply augment this function with code for these new actions, or do you need to fundamentally re-structure how the function works? See the section called “Strategies for Playing” for the play_hand() function.

  10. A major simplification in our simulation is that we have effectively given the gambler an infinite amount of money. How do things change if we also account for ruin, i.e., the gambler running out of money?

  11. Can you make the code faster? Use profiling techniques to find the slow parts of the code. Also, consider how you might use vectorization to play multiple games at once.


10 Baseball: Exploring Data in a Relational Database

  1. Which team lost the World Series each year? Do these teams also have high payrolls? Some argue that teams with lower payrolls make it into the postseason playoffs but typically don't win the World Series. Do you find any evidence of this?
  2. Do you see a relationship between the number of games won in a season and winning the World Series?
  3. Augment the team payrolls to include each team's name, division, and league. Create a visualization that includes this additional information.
  4. One might expect a team with old players to be paying these veteran players high salaries near the end of their careers. Teams with a large number of mature players would therefore have a large payroll. Is there any evidence supporting this?
  5. Examine the distribution of salaries of individual players over time for different teams.
  6. Not all of the people in the database are players, e.g., some are managers. How many are players? How many are managers? How many are both, or neither?
  7. What are the top ten collegiate producers of major league baseball players? How many colleges are represented in the database? Be careful in handling those records for players who did not attend college.
  8. Has the distribution of home runs for players increased over the years?
  9. Look at the distribution of how well batters do, e.g., their batting averages. Does this vary over the years? Do the same players excel each year? Is there clustering, e.g., a bi-modal distribution?
  10. Are Hall-of-Fame players, in general, inducted because of rare, outstanding performances, or are they rewarded for consistency over years?
  11. Do pitchers get better with age? Is there an improvement and then a fall-off in performance? Is this related to how old they are, or to the number of years they have been pitching? What about the league they are in? Do we have information about each of these factors, and how can we combine them to address the general question?
  12. How complete are the records for the earliest seasons recorded in this database? For example, we know that there is no salary information prior to 1985, but are all of the other tables complete?
  13. Are certain baseball parks better for hitting home runs? Can we tell from this data? Can we make inferences about this question?
  14. What is the distribution of the number of shut-outs for a team in a season?

11 CIA Factbook Mashup

  1. In the merge of infMortDF and popDF in the section called “Integrating Data from Different Sources” some rows in each data frame were excluded from the resulting data frame. Determine which rows in each data frame did not find a match in the other.

  2. The data frame codeMapDF contains a mapping of the ISO 3166 codes to the codes used in the CIA Factbook. It also contains the country name. Below are the first few observations in this data frame:

    head(codeMapDF)
    
      cia           name iso
    1  af    Afghanistan  AF
    2  ax       Akrotiri   -
    3  al        Albania  AL
    4  ag        Algeria  DZ
    5  aq American Samoa  AS
    6  an        Andorra  AD
    

    Use this data frame to fix the problem with the merging of the location data, which uses ISO codes, with the demographic data, which uses the CIA Factbook coding. Be sure the final data frame contains both codes and the country name from codeMapDF, as well as the original variables.

  3. Remake the map in Figure 11.6, “Map of Infant Mortality and Population” to include the countries for which there is latitude and longitude, even if there is no demographic information. To do this you need to modify the call to merge(); in particular, the all.x and all.y parameters are helpful here. In the map, use a special plotting symbol to denote those countries where demographic information is missing.
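
    A minimal sketch of the modified merge, using hypothetical names latLonDF (the location data) and demogDF (the demographic data) merged on a shared ISO code column; the actual data frame and column names come from your earlier work in the chapter:

    mapDF <- merge(latLonDF, demogDF, by = "iso", all.x = TRUE)
    noDemog <- is.na(mapDF$infMort)   # hypothetical column; TRUE where demographics are missing
    # use noDemog to pick a special plotting symbol for these countries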

  4. To determine the color of a circle in the map in Figure 11.6, “Map of Infant Mortality and Population”, we discretized the infant mortality rates. Investigate alternative cut points for this discretization that derive from standards used by international organizations such as the United Nations, World Health Organization, and World Bank. Remake the map to incorporate these external values.

  5. To determine the size of a circle in the map in Figure 11.6, “Map of Infant Mortality and Population”, we used a scaling of the square root of population. We found that the range in population was so large that when the largest countries had a reasonably sized disk, the disks for the smaller countries could not be seen. To address this problem we created a lower bound for the disk size so that, no matter how small the country, a disk of at least this minimum size would be plotted. Find an alternative scaling for population that instead caps the size of the larger countries. That is, for very large countries, no matter how large the population, the disk plotted on the map is no bigger than this cap. Remake the map using this approach to scaling the circles.
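
    One possible sketch of such a cap (the scale factor, cap value, and column names are hypothetical and would need tuning for the actual map):

    scaleFactor <- 4000   # divisor for sqrt(population); tune to the plotting region
    maxRadius <- 3        # upper bound on the disk radius for the largest countries
    radius <- pmin(sqrt(mapDF$pop) / scaleFactor, maxRadius)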

  6. The mapproj[bib:mapproj] library contains various projections of the 3D earth into a 2D map. Install this package and investigate the projections; to do this you will want to read some of the resources and tutorials available online. Remake the map in Figure 11.6, “Map of Infant Mortality and Population” using one of these projections.

  7. There are many other statistics available in the Factbook. Choose demographic information other than infant mortality to extract and visualize.

  8. Wikipedia tables offer another source of data. For example, Wikipedia has a list of infant mortality rates by country at http://en.wikipedia.org/wiki/List_of_countries_by_infant_mortality_rate. These data are from the United Nations World Population Prospects report (http://esa.un.org/unpd/wpp/Excel-Data/mortality.htm) and the CIA World Factbook. Another source of information is the World Bank, which provides infant mortality rates at http://data.worldbank.org/indicator/SP.DYN.IMRT.IN. Create a mashup with data from one of these alternative sources.

  9. XPath is a powerful tool for locating content in an XML document, and there are often many XPath expressions that we can use to access the information we want. For example, rather than use the id attribute of <field> to find the relevant data, try another approach, such as locating the desired nodes by filtering on the value of the name attribute.
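
    A hedged sketch of this alternative, assuming doc is the parsed Factbook and that the infant mortality <field> node carries a name attribute with the value "Infant mortality rate" (inspect the document to confirm the exact label):

    library(XML)
    doc <- xmlParse("factbook.xml")   # hypothetical local copy of the Factbook
    imrField <- getNodeSet(doc, "//field[@name = 'Infant mortality rate']")
    length(imrField)                  # expect a single matching <field> node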

  10. Use XPath to extract the mapping between the two types of country codes from the CIA Factbook. Once you have this mapping in a data frame, use it to re-merge the lat/lon data with the Factbook data. Then redo the world map; it should look like Figure 11.6, “Map of Infant Mortality and Population”. To assist you in this task, we provide a snippet of the Factbook that shows a portion of the table that contains the various codings for countries; a sketch of one possible extraction follows the snippet.

    <table title="" lettergrouped="1">
     <columnHeader colspan="1" title="Entity"/>
     <columnHeader colspan="1" title="FIPS 10"/>
     <columnHeader colspan="3" title="ISO 3166"/>
     <columnHeader colspan="1" title="Stanag"/>
     <columnHeader colspan="1" title="Internet"/>
     <columnHeader colspan="1" title="Comment"/>
    ...
    <row>
     <cell country="ch" content="China"/>
     <cell center="1" content="CH"/>
     <cell content="CN"/>
     <cell content="CHN"/>
     <cell content="156"/>
     <cell center="1" content="CHN"/>
     <cell center="1" content=".cn"/>
     <cell content="see also Taiwan"/>
    </row>
    ...
    <row>
     <cell country="gb" content="Gabon"/>
     <cell center="1" content="GB"/>
     <cell content="GA"/>
     <cell content="GAB"/>
     <cell content="266"/>
     <cell center="1" content="GAB"/>
     <cell center="1" content=".ga"/>
     <cell content=" "/>
    </row>
    
    <row>
     <cell country="sz" content="Switzerland"/>
     <cell center="1" content="SZ"/>
     <cell content="CH"/><cell content="CHE"/>
     <cell content="756"/>
     <cell center="1" content="CHE"/>
     <cell center="1" content=".ch"/>
     <cell content=" "/>
    </row>
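
    A hedged sketch of one possible extraction, assuming doc is the parsed Factbook and that the cross-reference rows are exactly those whose first <cell> carries a country attribute (as in the snippet above); the third <cell> in each row holds the two-letter ISO 3166 code:

    library(XML)
    rows <- getNodeSet(doc, "//row[cell/@country]")
    cellAttr <- function(r, i, attr = "content")
        xmlGetAttr(getNodeSet(r, "./cell")[[i]], attr)
    codeMapDF <- data.frame(cia  = sapply(rows, cellAttr, 1, "country"),
                            name = sapply(rows, cellAttr, 1),
                            iso  = sapply(rows, cellAttr, 3),
                            stringsAsFactors = FALSE)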
    

  11. In the section called “Extracting Latitude and Longitude from an HTML File” we saw that the latitudes and longitudes for the countries are available on the Web, embedded in the page at http://dev.maxmind.com/geoip/legacy/codes/average-latitude-and-longitude-for-countries/. A snippet of the HTML source for this page is shown in Figure 11.10, “Screenshot of the HTML Source of the MaxMind Web Page”. Extract these location coordinates from this HTML file.

    Figure 11.10. Screenshot of the HTML Source of the MaxMind Web Page

    This screenshot displays the HTML source of the Web page shown in Figure 11.9, “Screenshot of the MaxMind Web Page with Country Latitude and Longitude”. Notice that the latitudes and longitudes for the countries appear within a <pre> node in the HTML document.

    Examine the HTML source, and notice that the data are simply placed as plain text within a <pre> node in the document. If we can extract the contents of this <pre> node, then we can place this information in a data frame. Begin by parsing the HTML document with htmlParse(). Don't download the document; simply pass the function the URL, http://dev.maxmind.com/geoip/legacy/codes/average-latitude-and-longitude-for-countries/.

    Next, access the root of the document using xmlRoot(), and use an XPath expression to locate the <pre> nodes in the document. The getNodeSet() function should be useful here; it takes the XML tree (or subtree) and an XPath expression as input and returns a list of all nodes in the tree that match the expression.

    Once you have located these <pre> nodes, check to see how many were found. It should be only one. The latitude and longitude values are in the text content of this node. Extract the text content with the xmlValue() function. It should look something like:

    [1] "\n\"iso 3166 country\",\"latitude\",\"longitude\"\n
    AD,42.5000,1.5000\nAE,24.0000,54.0000\nAF...
    

    The content is one long character string containing all the data.

    Complete the exercise by reading the plain text in your character vector into a data frame. Use read.table() to do this; the text, skip, header, and sep parameters should be useful here.
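
    A minimal sketch of these steps (how many lines to skip, and whether the header row is picked up automatically, may differ slightly once you inspect the extracted text):

    library(XML)
    u <- "http://dev.maxmind.com/geoip/legacy/codes/average-latitude-and-longitude-for-countries/"
    doc <- htmlParse(u)
    pre <- getNodeSet(xmlRoot(doc), "//pre")
    length(pre)                  # should be 1
    txt <- xmlValue(pre[[1]])    # one long string containing all of the data
    latLonDF <- read.table(text = txt, header = TRUE, sep = ",",
                           stringsAsFactors = FALSE)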

  12. Collect all of the steps for generating the KML document (see the section called “Generating KML Directly”) into one function, called makeCIAPlot(). Consider generalizing it so that it can be used with other Factbook variables or other data.

  13. The plotting symbols used in Figure 11.1, “Display of Infant Mortality by Country on Google Earth” were created in R. Write a function as described in the section called “Creating Plotting Symbols” to create a set of PNG files that can be used as placemarks on Google Earth. Rather than simply making circles, consider making other shapes, including a miniature plot of the data. Also, use a different color scheme than the yellow-orange-red palette.

    This placemark function creates the icons by setting up a blank canvas that has a transparent background and no axes or labels. On this canvas, draw a circle or some other shape and fill it with the desired color. The function should take as input a vector of colors and create a set of circles (or some other shape) as PNG files, one for each color. The png(), plot(), and draw.circle() functions might be helpful. In addition, you might try making the symbols partially transparent so that when they overlap on the plot, they can still be seen.
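
    A minimal sketch of one such function (the icon size, radius, and file names are arbitrary choices; draw.circle() is from the plotrix package):

    library(plotrix)   # provides draw.circle()
    makeIcons <- function(colors, files = paste0("icon", seq_along(colors), ".png")) {
        for (i in seq_along(colors)) {
            png(files[i], width = 64, height = 64, bg = "transparent")
            par(mar = c(0, 0, 0, 0))
            plot(0, 0, type = "n", axes = FALSE, xlab = "", ylab = "",
                 xlim = c(-1, 1), ylim = c(-1, 1), asp = 1)
            draw.circle(0, 0, radius = 0.9, col = colors[i], border = NA)
            dev.off()
        }
        invisible(files)
    }
    # e.g., semi-transparent blues instead of the yellow-orange-red palette
    makeIcons(adjustcolor(c("lightblue", "steelblue", "navy"), alpha.f = 0.7))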

  14. Make an R plot for the pop-up window of each country. That is, for each country, make a plot that is country specific. Save this plot to a PNG file and add the file name to the <description> node, e.g., the content for the description might be something like

    Afghanistan:
    <img src = "af.png"/>
    

    The final result should be a KMZ file, i.e., a ZIP archive, that contains all of these PNG files as well as the KML document.

  15. Rewrite the addPlacemark() and makeStyleNode() functions to use the technique described in the section called “Efficiency in Generating KML from Strings”. That is, construct the placemarks by pasting strings together, and then convert the strings into XML nodes with a call to xmlParseAndAdd().
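
    A hedged sketch of the string-based approach for a single placemark; the node structure shown here is illustrative and should be matched to the KML the chapter actually generates:

    library(XML)
    addPlacemark <- function(lat, lon, id, style, parent) {
        txt <- paste0("<Placemark id='", id, "'>",
                      "<styleUrl>#", style, "</styleUrl>",
                      "<Point><coordinates>", lon, ",", lat, ",0",
                      "</coordinates></Point></Placemark>")
        xmlParseAndAdd(txt, parent)   # parse the string and add the new nodes under 'parent'
    }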


12 Exploring Data Science Jobs with Web Scraping and Text Mining

  1. Test the cy.getFreeFormWords() function on several posts as suggested above.
  2. Test the cy.getNextPageLink() function by calling it again with the result of the previous call to cy.getNextPageLink() .
  3. Modify getJobPosts() so that it optionally gives progress messages as it completes pages of results.
  4. Suppose we download 400 jobs using getJobPosts() and later decide to retrieve another 200. How do we avoid processing the first 400 jobs again?
  5. We could have made our getJobPosts() function entirely generic by having the caller specify a function for performing the initial search query. Do this. Do you think this is a worthwhile addition?
  6. Instead of having the caller specify individual functions, we could define a closure or use a reference class with methods for getting the next page, finding the URLs for the job postings, and reading an individual post. We could then extend this class for each target Web site. Do this.
  7. Enhance this function to use a single curl handle for all of the HTTP requests.
  8. Check that no posting has any skill repeated two or more times.
  9. Visualize the skills for Data Science job postings from dice.com and monster.com. Are they similar across Web sites? If so, combine the words across all posts and visualize these.
  10. Why do we use a for loop in getSalaryRange() rather than sapply() or lapply()?
  11. Implement the fixNum() function. It takes a number as a salary value and determines whether to multiply it by 1,000 to put it on the proper scale. This is used to adjust salaries that are reported as, say, $100K.
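
    One possible sketch; the threshold of 1,000 is an assumption about where genuine annual salaries start and may need adjusting for the postings you scrape:

    fixNum <- function(x) {
        # salaries reported like "$100K" come through as 100; real salaries are much larger
        ifelse(!is.na(x) & x < 1000, x * 1000, x)
    }
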
  12. Display all of the jobs from different sites geographically on a Google Earth or Google Maps display. Group the points to allow them to be toggled on and off. For example, group them by the Web sites on which they were found, or by general job category, or by salary range. See Chapter 11, CIA Factbook Mashup for information about Google Earth and KML.
  13. Compare job postings for different keywords or job titles and compare their characteristics in various dimensions.
  14. Find other job sites and scrape data from those. How do these compare with the sites we have looked at?
  15. Compare the salaries for Data Science postings for the two sites Dice and CyberCoders.
  16. Compare the data you scraped now to the data we obtained in 2013 (available on the Web site). How have the salaries and skills changed?
  17. Determine the date for each job posting and see how the demand for different technologies has changed over time.
  18. Use pairs of words, i.e., 2-grams, rather than single words to analyze the posts. What are common pairs of words?
  19. Use Natural Language Processing to analyze the words in sentences to identify metadata about the jobs from the free-form text.
  20. We assumed the set of stop words was available. There are many different sets of stop words and we can retrieve others. Extract the stop words from http://www.ranks.nl/resources/stopwords.html. Additionally, there is a collection of stop words for different languages in a zip archive at http://stop-words.googlecode.com/files/stop-words-collection-2011.11.21.zip. Use the RCurl and Rcompression packages to retrieve and extract the files for your language and create the set of stop words from these.