R!

Presenter Notes

So why a third language?

  • Languages suited to different aspects of data collection and analysis
    1. Python: data collection and transformation
    2. MongoDB: data storage
    3. R: data examination and analysis
  • While you can use Python or R to do almost everything, each has a comparative advantage
  • One reason for exposing you to each of these languages is to show you the advantages of different approaches

Let's review the big ideas from the languages we've learned so far

  • Python big ideas
    • Scripting-based language
    • Use interpreter for interactive programming tasks
    • Dynamic typing (variables are not declared with types)
    • Iterators
    • List comprehensions
    • Wrap other languages inside Python
    • Libraries to do just about anything web-related
  • MongoDB big ideas
    • Flexible record storage
    • Records are JSON-like objects
    • Indexes speed up just the queries you need
    • Natural interface to Python

R's big ideas

  • Data frames as the key object -- a cross between a table, array, and dictionary
  • Logical vectors used to access subsets of data
  • Categorical variables as factors
  • Functional programming paradigm a natural fit for data aggregation
  • Missing data has its own special value (NA)
  • Extensive libraries available due to vibrant open-source community

Leaving the 30,000-foot view for the weeds

  • To start R, simply type R at the command line
  • You get an interactive shell, just as in Python
  • Load in R code by running source("filename.txt")
  • You can also download R from http://www.r-project.org/ and run a local copy
  • Eventually I will show you a Python wrapper called rpy2, which is similar to pymongo but for R.
  • To get help on a function, type help(functionname)
  • To see some example uses of a function, type example(functionname)

R data types

  • R has many data types
    • vectors (one-dimensional collection of the same type)
    • arrays (multi-dimensional vectors, all of same type)
    • matrices (2-dimensional arrays, all of same type)
    • lists (sequence of objects of any type, most general)
    • data frames (list where each element is a vector of the same length)
  • Data frames are like tables (collections) in a database
    • We will work with data frames unless an R function demands something else
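
As a rough illustration of the types listed above, the sketch below builds one object of each kind. All variable names and values here are made up for the example and are not part of the class data.

```r
# A vector: one-dimensional, every element has the same type
amounts <- c(500, 270, 1000)

# A matrix: two-dimensional, every element has the same type
m <- matrix(1:6, nrow = 2, ncol = 3)

# An array: any number of dimensions, every element has the same type
a <- array(1:24, dim = c(2, 3, 4))

# A list: elements can be of any type, including vectors of different lengths
rec <- list(name = "Our Voice PAC", amount = 500, states = c("CA", "NC"))

# A data frame: a list of equal-length vectors, one per column
df <- data.frame(State = c("CA", "NC", "NV"), CAmount = amounts)

is.list(df)   # TRUE: a data frame is built on top of a list
dim(m)        # 2 3
nrow(df)      # 3
```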

R data frames

Recall that in H0 you created a file called allcont.csv containing all Super PAC contributions:

"FECID","CNumber","CName","CAmount","Date","State","ZIP","IndOrg"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91355,"IND"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91355,"IND"
763062,"C00497412","Our Voice PAC",270.0,20111027,"NC",28403,"IND"
763062,"C00497412","Our Voice PAC",1000.0,20110811,"NV",89519,"IND"
763062,"C00497412","Our Voice PAC",200.0,20110811,"TX",78216,"IND"
763062,"C00497412","Our Voice PAC",200.0,20110930,"TX",78216,"IND"
763062,"C00497412","Our Voice PAC",250.0,20110917,"OK",74137,"IND"
763062,"C00497412","Our Voice PAC",656.0,20110726,"OR",97601,"IND"
763062,"C00497412","Our Voice PAC",120.0,20111007,"CA",92109,"IND"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91381,"IND"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91381,"IND"

Code from today's class

Example R Session

Begin by importing the data:

> setwd('/home/qtw/public_html/data/')
> ac<-read.table('allcontMod.csv',header=T,sep=',')
> summary(ac)
     FECID             CNumber
 Min.   :761774   C00498097:339
 1st Qu.:762520   C00499020:235
 Median :763187   C00490045:199
 Mean   :763033   C00487470:130
 3rd Qu.:763735   C00487363: 95
 Max.   :763953   C00499731: 65
                  (Other)  :723
                                             CName        CAmount
 Americans for a Better Tomorrow, Tomorrow, Inc.:339   Min.   :  -1000
 FREEDOMWORKS FOR AMERICA                       :235   1st Qu.:    250
 RESTORE OUR FUTURE, INC.                       :199   Median :    500
 CLUB FOR GROWTH ACTION                         :130   Mean   :  35600
 American Crossroads                            : 95   3rd Qu.:  10000
 Make Us Great Again                            : 65   Max.   :5000000
 (Other)                                        :723
      Date              State          ZIP        IndOrg
 Min.   :20110701   CA     :290   Min.   : 1062   CCM:   7
 1st Qu.:20110914   TX     :187   1st Qu.:20006   COM:   9
 Median :20111107   FL     :151   Median :48236   IND:1481
 Mean   :20111051   DC     :148   Mean   :51735   ORG: 224
 3rd Qu.:20111213   NY     :134   3rd Qu.:84108   PAC:  65
 Max.   :20120104   MA     : 71   Max.   :99803
                    (Other):805   NA's   :   14

There are lots of ways to access a data frame

You can slice and dice by row or column index (indices start at 1; the row comes before the column):

> ach<-head(ac,10) #let's just look at the 1st 10 rows for example
> #subsets
> #let's get the first row
> ach[1,]
   FECID   CNumber         CName CAmount     Date State   ZIP IndOrg
1 763062 C00497412 Our Voice PAC     500 20110927    CA 91355    IND
> #rows 5-8
> ach[5:8,]
   FECID   CNumber         CName CAmount     Date State   ZIP IndOrg
5 763062 C00497412 Our Voice PAC     200 20110811    TX 78216    IND
6 763062 C00497412 Our Voice PAC     200 20110930    TX 78216    IND
7 763062 C00497412 Our Voice PAC     250 20110917    OK 74137    IND
8 763062 C00497412 Our Voice PAC     656 20110726    OR 97601    IND
>
> #just the first column
> ach[,1]
 [1] 763062 763062 763062 763062 763062 763062 763062 763062 763062 763062
> #columns 1,4,5
> ach[,c(1,4,5)]
    FECID CAmount     Date
1  763062     500 20110927
2  763062     500 20110927
3  763062     270 20111027
4  763062    1000 20110811
5  763062     200 20110811
6  763062     200 20110930
7  763062     250 20110917
8  763062     656 20110726
9  763062     120 20111007
10 763062     500 20110927

There are lots of ways to access a data frame

Columns can also be accessed like a dictionary:

> ach$State
 [1] CA CA NC NV TX TX OK OR CA CA
54 Levels:  AE AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY ... WY
> ach[,c('CAmount','State','ZIP')]
   CAmount State   ZIP
1      500    CA 91355
2      500    CA 91355
3      270    NC 28403
4     1000    NV 89519
5      200    TX 78216
6      200    TX 78216
7      250    OK 74137
8      656    OR 97601
9      120    CA 92109
10     500    CA 91381
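
If you want to try these access patterns without the full file, a small self-contained sketch follows. It assumes a cut-down copy of four rows and four columns from allcont.csv typed in as text; the name demo is just a placeholder.

```r
# Small data frame built from a few rows of allcont.csv (subset of columns)
csv_text <- 'CAmount,Date,State,ZIP
500.0,20110927,"CA",91355
270.0,20111027,"NC",28403
1000.0,20110811,"NV",89519
200.0,20110811,"TX",78216'
demo <- read.csv(text = csv_text)

demo[1, ]                      # first row, all columns
demo[2:3, ]                    # rows 2 through 3
demo[, 1]                      # first column, returned as a plain vector
demo[, c(1, 3)]                # columns 1 and 3, returned as a data frame
demo$State                     # one column by name
demo[, c("CAmount", "State")]  # several columns by name
```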

Now for something new: logical vectors

Suppose we are interested in finding out which records included contributions exceeding $500.

We can create a logical vector that holds TRUE or FALSE for each record based on a condition:

> ach$CAmount>500
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
> #suppose we want to keep track of this as a new column
> ach$G500<-ach$CAmount>500
> ach
    FECID   CNumber         CName CAmount     Date State   ZIP IndOrg  G500
1  763062 C00497412 Our Voice PAC     500 20110927    CA 91355    IND FALSE
2  763062 C00497412 Our Voice PAC     500 20110927    CA 91355    IND FALSE
3  763062 C00497412 Our Voice PAC     270 20111027    NC 28403    IND FALSE
4  763062 C00497412 Our Voice PAC    1000 20110811    NV 89519    IND  TRUE
5  763062 C00497412 Our Voice PAC     200 20110811    TX 78216    IND FALSE
6  763062 C00497412 Our Voice PAC     200 20110930    TX 78216    IND FALSE
7  763062 C00497412 Our Voice PAC     250 20110917    OK 74137    IND FALSE
8  763062 C00497412 Our Voice PAC     656 20110726    OR 97601    IND  TRUE
9  763062 C00497412 Our Voice PAC     120 20111007    CA 92109    IND FALSE
10 763062 C00497412 Our Voice PAC     500 20110927    CA 91381    IND FALSE

Logical vectors can be used to select subsets of records

The real power of logical vectors comes when they are used to access parts of a data frame:

> #suppose we only want the records with contributions > $500
> ach[ach$CAmount>500,]
   FECID   CNumber         CName CAmount     Date State   ZIP IndOrg G500
4 763062 C00497412 Our Voice PAC    1000 20110811    NV 89519    IND TRUE
8 763062 C00497412 Our Voice PAC     656 20110726    OR 97601    IND TRUE
> #now let's say we are only interested in the state they come from
> ach$State[ach$CAmount>500]
[1] NV OR
54 Levels:  AE AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY ... WY
> #now let's say we are only interested in the contribution amount
> ach$CAmount[ach$CAmount>500]
[1] 1000  656
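
The same idea can be sketched on plain vectors. The values below are the CAmount and State columns of the ten rows shown above, typed in by hand; the names amounts, states, and big are placeholders for this example.

```r
amounts <- c(500, 500, 270, 1000, 200, 200, 250, 656, 120, 500)
states  <- c("CA", "CA", "NC", "NV", "TX", "TX", "OK", "OR", "CA", "CA")

big <- amounts > 500   # logical vector, one TRUE/FALSE per record

which(big)             # positions of the TRUE entries: 4 8
sum(big)               # TRUE counts as 1, so this counts the matches: 2
states[big]            # "NV" "OR"
amounts[!big]          # ! negates the vector: the contributions of $500 or less
```

Storing the condition in a variable such as big keeps the subsetting lines short and makes it easy to reuse the same condition on several columns.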

Some more examples

  1. Contributions from Massachusetts
    • ac$CAmount[ac$State=='MA']
  2. Contributions from Committee C00497412
    • ac$CAmount[ac$CNumber=='C00497412']
  3. You can also apply functions to vectors (including the usual mean, median, sum)
    • For instance, mean of all contributions to C00497412
    • mean(ac$CAmount[ac$CNumber=='C00497412']) (answer: 1028.4)
    • For instance, mean of all contributions to C00497412 greater than $500
    • mean(ac$CAmount[ac$CNumber=='C00497412'&ac$CAmount>500]) (answer: 2099.17)
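
Here is a minimal sketch of combining conditions, using a made-up data frame (toy, with hypothetical committee numbers C1 and C2) so the arithmetic can be checked by hand. In R, & means "and" and | means "or", and both work element by element.

```r
toy <- data.frame(
  CNumber = c("C1", "C1", "C2", "C1", "C2"),
  State   = c("MA", "TX", "MA", "MA", "CA"),
  CAmount = c(100, 900, 600, 700, 50)
)

# both conditions must hold: contributions to C1 that exceed $500
c1_big <- toy$CNumber == "C1" & toy$CAmount > 500
mean(toy$CAmount[c1_big])                # (900 + 700) / 2 = 800

# either condition may hold: contributions from MA or CA
ma_or_ca <- toy$State == "MA" | toy$State == "CA"
sum(toy$CAmount[ma_or_ca])               # 100 + 600 + 700 + 50 = 1450

# length counts how many records match
length(toy$CAmount[toy$State == "MA"])   # 3
```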

Exercise 1

  1. Download the starter code from http://cs.wellesley.edu/~qtw/code/r_ex1.R
  2. Get all the records with contributions exceeding $500,000
  3. Report how many contributions from Texas there are (use the length function)
  4. Sum up the total amount of contributions to FEC ID 763780 (Mitt Romney's Super PAC)
  5. Sum up the total amount of contributions to FEC ID 763780 from Massachusetts

Solution to Exercise 1

R calls categorical variables factors

  • One categorical variable from our example data set is State
  • One natural question is to compare the contribution amounts from different states
  • We already did it for Massachusetts: sum(ac$CAmount[ac$State=='MA'])
  • Generalizing, we are computing a function on the subset of a numerical variable whose records share the same value of the categorical variable.
  • What if we wanted to get the contributions from all states?

We could compute it one-by-one:

contAL<-sum(ac$CAmount[ac$State=='AL'])
contAK<-sum(ac$CAmount[ac$State=='AK'])
contAZ<-sum(ac$CAmount[ac$State=='AZ'])
contAR<-sum(ac$CAmount[ac$State=='AR'])
contStates<-c(contAL,contAK,contAZ,contAR)
#BUT THERE IS A BETTER WAY!

Using tapply, we can get contributions by state in one line of code!

  • tapply is a function that takes three main arguments
    1. vector of numerical values
    2. vector of factor values
    3. a function to apply to the numerical values grouped by factor values

Hence, to get the total contributions by state, we write:

> superAll<-tapply(ac$CAmount,ac$State,sum)
> superAll
                     AE          AK          AL          AR          AZ
  260551.20      500.00    11250.00   519450.00  1018900.00   141150.00
         CA          CO          CT          DC          DE          FL
 4862354.60   507696.49  1473700.00  8625142.03   223000.00  3822840.65
         GA          GU          HI          IA          ID          IL
  231584.87      245.00      500.00     4820.00  1001000.00  1161820.64
         IN          KS          KY          LA          MA          MD
 1211700.00    13800.12    12457.36     8350.00  2269817.76   129125.00
         ME          MI          MN          MO          MS          MT
     400.00   472801.00   112113.32    12752.00    53324.16     2432.49
         NC          ND          NE          NH          NJ          NM
   64770.00     3241.00      880.00     9000.00   188300.00   106510.00
         NV          NY          OH          OK          OR          PA
 1182621.36  8340193.26    49900.00  1737428.96   123387.56   590160.00
         RI          SC          SD          TN          TX          UT
    2250.00    55125.00     5500.00     8100.00 16919837.50  4093144.80
         VA          VT          WA          WI          WV          WY
  944348.00      650.00   253450.00   352501.00     4600.00   381000.00
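
To see that tapply really does the same work as the one-by-one approach, here is a small sketch on made-up data (five hypothetical contributions from three states), where the totals are easy to verify by hand.

```r
amounts <- c(100, 250, 50, 400, 300)
states  <- factor(c("MA", "TX", "MA", "CA", "TX"))

# the long way: one subset and one sum per state
by_hand <- c(CA = sum(amounts[states == "CA"]),
             MA = sum(amounts[states == "MA"]),
             TX = sum(amounts[states == "TX"]))

# the short way: group amounts by state and apply sum to each group
by_state <- tapply(amounts, states, sum)
by_state                        # CA 400, MA 150, TX 550

# any function of a vector can be used in place of sum
tapply(amounts, states, mean)   # CA 400, MA 75, TX 275
```

The result is named by factor level, so a single state can be looked up with by_state["MA"].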

Using tapply with logical vectors

Suppose we want to see state-level contributions for Stephen Colbert's Super PAC (FEC ID 761774)

> superStephen<-tapply(ac$CAmount[ac$FECID==761774],ac$State[ac$FECID==761774],sum)
> superStephen
               AE       AK       AL       AR       AZ       CA       CO
      NA   250.00   500.00       NA   500.00   850.00 20752.00  2531.49
      CT       DC       DE       FL       GA       GU       HI       IA
  550.00       NA       NA  2013.98  3430.00       NA   250.00  1570.00
      ID       IL       IN       KS       KY       LA       MA       MD
      NA  2251.00  1000.00   750.00       NA   500.00 12400.00  1300.00
      ME       MI       MN       MO       MS       MT       NC       ND
  400.00  1801.00   600.00   952.00   250.00   432.49   500.00       NA
      NE       NH       NJ       NM       NV       NY       OH       OK
      NA   250.00   500.00  1550.00  2000.00  5576.00   900.00   878.96
      OR       PA       RI       SC       SD       TN       TX       UT
  557.68  2010.00       NA       NA       NA   250.00  3467.00       NA
      VA       VT       WA       WI       WV       WY
 1766.00       NA  2850.00   751.00  3000.00       NA

What fraction of Super PAC contributions go to Stephen Colbert's Super PAC?

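Just divide the two vectors; R divides them element by element, so each state's Colbert total is divided by that state's overall total: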

> superStephen/superAll
                       AE           AK           AL           AR           AZ
          NA 0.5000000000 0.0444444444           NA 0.0004907253 0.0060219625
          CA           CO           CT           DC           DE           FL
0.0042678911 0.0049862271 0.0003732103           NA           NA 0.0005268281
          GA           GU           HI           IA           ID           IL
0.0148109848           NA 0.5000000000 0.3257261411           NA 0.0019374763
          IN           KS           KY           LA           MA           MD
0.0008252868 0.0543473535           NA 0.0598802395 0.0054629936 0.0100677638
          ME           MI           MN           MO           MS           MT
1.0000000000 0.0038092136 0.0053517281 0.0746549561 0.0046883064 0.1777972366
          NC           ND           NE           NH           NJ           NM
0.0077196233           NA           NA 0.0277777778 0.0026553372 0.0145526242
          NV           NY           OH           OK           OR           PA
0.0016911584 0.0006685696 0.0180360721 0.0005058969 0.0045197425 0.0034058560
          RI           SC           SD           TN           TX           UT
          NA           NA           NA 0.0308641975 0.0002049074           NA
          VA           VT           WA           WI           WV           WY
0.0018700733           NA 0.0112448215 0.0021304904 0.6521739130           NA
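
A small sketch of why this works, on made-up data (the pac column of committee ids is hypothetical). Subsetting a factor keeps all of its levels, so both tapply results have one entry per state in the same order, and states with no matching records come back as NA.

```r
amounts <- c(100, 250, 50, 400, 300)
states  <- factor(c("MA", "TX", "MA", "CA", "TX"))
pac     <- c(1, 2, 1, 2, 2)   # hypothetical committee ids

all_by_state  <- tapply(amounts, states, sum)
pac1_by_state <- tapply(amounts[pac == 1], states[pac == 1], sum)

levels(states[pac == 1])   # still "CA" "MA" "TX", even though only MA remains
pac1_by_state              # CA NA, MA 150, TX NA

# element-by-element division; NA divided by anything stays NA
frac <- pac1_by_state / all_by_state
frac                       # CA NA, MA 1, TX NA
```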

Did you notice the NAs?

  • R has a special value that represents missing data called NA
  • Missing data is unavoidable in data collection
  • The question is how to handle it
  • Replacing it with 0s is usually a bad idea
  • Most functions in R are aware of NAs and give the user control over how to handle them

For instance:

> vals<-c(3,4,5,2,3,4,NA,6,8,NA)
> mean(vals)
[1] NA
> mean(vals,na.rm=TRUE)
[1] 4.375
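
A few more ways to work with NAs, sketched on the same vals vector. These are only some common options, not a complete list.

```r
vals <- c(3, 4, 5, 2, 3, 4, NA, 6, 8, NA)

is.na(vals)                # logical vector: TRUE where the value is missing
sum(is.na(vals))           # number of missing values: 2
vals[!is.na(vals)]         # keep only the values that are present

sum(vals)                  # NA: one missing value makes the total unknown
sum(vals, na.rm = TRUE)    # 35: drop the NAs first
mean(vals[!is.na(vals)])   # 4.375, same as mean(vals, na.rm = TRUE)

NA > 3                     # NA: comparisons with a missing value are missing too
```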

Recall the combined CSV created in H1

We can import this file into R:

> rs<-read.table('regSuperCensusMod.csv',header=T,sep=',')
> rsi<-rs[rs$IndOrg=="IND",]  #only consider individual contributions
> summary(rsi)
         Candidate      RegularSuper        CNumber
 barack obama :28864   Regular:40600   C00431445:28845
 mitt romney  :10470   Super  :  218   C00431171:10469
 newt gingrich: 1484                   C00496497: 1266
                                       C00490045:  159
                                       C00495861:   41
                                       C00507525:   18
                                       (Other)  :   20
                      CName          CAmount             Date
 Obama for America       :28845   Min.   : -30800   Min.   :20110302
 Romney for President    :10469   1st Qu.:    250   1st Qu.:20110501
 Newt 2012               : 1266   Median :    500   Median :20110527
 RESTORE OUR FUTURE, INC.:  159   Mean   :   1376   Mean   :20110543
 Priorities USA Action   :   41   3rd Qu.:   2500   3rd Qu.:20110621
 Winning Our Future      :   18   Max.   :1000000   Max.   :20111231
 (Other)                 :   20
     State            ZIP         IndOrg          PopSt
 CA     : 7593   Min.   :    16   IND:40818   Min.   :  563626
 NY     : 4534   1st Qu.: 16502   ORG:    0   1st Qu.: 5988927
 TX     : 2456   Median : 45243   PAC:    0   Median :12702379
 MA     : 2452   Mean   : 48788               Mean   :15688506
 FL     : 2330   3rd Qu.: 85016               3rd Qu.:19378102
 (Other):21383   Max.   : 99999               Max.   :37253956
 NA's   :   70   NA's   :   101               NA's   :     261
    FrWhite            FrBlack           USADiversity
 Min.   :  0.2274   Min.   :  0.00407   Min.   :  0.1100
 1st Qu.:  0.4533   1st Qu.:  0.06171   1st Qu.:  0.4053
 Median :  0.5834   Median :  0.11849   Median :  0.5866
 Mean   :  0.6036   Mean   :  0.12576   Mean   :  0.5486
 3rd Qu.:  0.7565   3rd Qu.:  0.15862   3rd Qu.:  0.6618
 Max.   :  0.9442   Max.   :  0.50709   Max.   :  0.8106
 NA's   :261.0000   NA's   :261.00000   NA's   :261.0000

Let's try examining the data

We can compare Super PAC to regular contributions by candidate:

> #first look at the overall contributions by candidate
> tapply(rsi$CAmount,rsi$Candidate,sum)
 barack obama   mitt romney newt gingrich
     23557099      29411808       3181415
> tapply(rsi$CAmount[rsi$Candidate=="barack obama"],
    rsi$RegularSuper[rsi$Candidate=="barack obama"],sum)
 Regular    Super
23111899   445200
> tapply(rsi$CAmount[rsi$Candidate=="barack obama"],
    rsi$RegularSuper[rsi$Candidate=="mitt romney"],sum)
Error in tapply(rsi$CAmount[rsi$Candidate == "barack obama"], rsi$RegularSuper[rsi$Candidate ==  :
  arguments must have same length
> #whoops, common mistake: changed the candidate in only one of the two subsets. fixed now
> tapply(rsi$CAmount[rsi$Candidate=="mitt romney"],
    rsi$RegularSuper[rsi$Candidate=="mitt romney"],sum)
 Regular    Super
17243855 12167953
> tapply(rsi$CAmount[rsi$Candidate=="newt gingrich"],
    rsi$RegularSuper[rsi$Candidate=="newt gingrich"],sum)
Regular   Super
1101165 2080250
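
One way to make the mismatched-length mistake harder to commit is to write the condition only once. The sketch below uses a made-up data frame (toy, with hypothetical candidates "a" and "b"); it is one possible habit, not the only fix.

```r
toy <- data.frame(
  Candidate    = c("a", "a", "b", "b", "b"),
  RegularSuper = factor(c("Regular", "Super", "Regular", "Regular", "Super")),
  CAmount      = c(10, 20, 30, 40, 50)
)

# option 1: store the logical vector and reuse it in both arguments
is_b <- toy$Candidate == "b"
res1 <- tapply(toy$CAmount[is_b], toy$RegularSuper[is_b], sum)
res1   # Regular 70, Super 50

# option 2: subset the whole data frame first, then call tapply on the subset
b_only <- toy[is_b, ]
res2 <- tapply(b_only$CAmount, b_only$RegularSuper, sum)
res2   # same totals
```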

Can we use tapply again on a second factor?

It is tiresome to repeat the same call for each candidate just to break the totals down by a second categorical variable. What if there were many more candidates?

One option is to create a new combined factor indicating candidate name plus whether super or regular:

> rsi$candSuper<-factor(paste(rsi$Candidate,rsi$RegularSuper))
> tapply(rsi$CAmount,rsi$candSuper,sum)
 barack obama Regular    barack obama Super   mitt romney Regular
             23111899                445200              17243855
    mitt romney Super newt gingrich Regular   newt gingrich Super
             12167953               1101165               2080250
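
Another option, not used in the session above, is to pass tapply a list of two factors, which returns a table with one row per candidate and one column per contribution type. The sketch below compares both approaches on a made-up data frame (toy, with hypothetical candidates "a" and "b").

```r
toy <- data.frame(
  Candidate    = c("a", "a", "b", "b", "b"),
  RegularSuper = c("Regular", "Super", "Regular", "Regular", "Super"),
  CAmount      = c(10, 20, 30, 40, 50)
)

# combined factor, as above: one flat vector with four entries
combined <- factor(paste(toy$Candidate, toy$RegularSuper))
flat <- tapply(toy$CAmount, combined, sum)
flat   # a Regular 10, a Super 20, b Regular 70, b Super 50

# list of two grouping variables: a 2 x 2 table
two_way <- tapply(toy$CAmount, list(toy$Candidate, toy$RegularSuper), sum)
two_way
two_way["b", "Regular"]   # 70
```

The table form makes it easier to pull out a single row or column, while the combined factor keeps everything in one flat vector.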

Exercise 2

  1. Download starter code from http://cs.wellesley.edu/~qtw/code/r_ex2.R
  2. Report the mean contributions broken down by candidate and regular/super PAC
  3. Report the mean fraction of black population for contributions broken down by candidate and regular/super PAC (do not normalize by contribution size)
  4. Report the total contributions from Massachusetts broken down by candidate and regular/super PAC

Solution to Exercise 2
