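Start R by typing R at the command line.
Run a script from within R with source("filename.txt").
From Python, R can be used through rpy2, which is similar to pymongo but for R.
To get help on a function: help(functionname)
To see examples of its use: example(functionname)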
Recall that in H0 you created a file called allcont.csv
containing all Super PAC contributions:
"FECID","CNumber","CName","CAmount","Date","State","ZIP","IndOrg"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91355,"IND"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91355,"IND"
763062,"C00497412","Our Voice PAC",270.0,20111027,"NC",28403,"IND"
763062,"C00497412","Our Voice PAC",1000.0,20110811,"NV",89519,"IND"
763062,"C00497412","Our Voice PAC",200.0,20110811,"TX",78216,"IND"
763062,"C00497412","Our Voice PAC",200.0,20110930,"TX",78216,"IND"
763062,"C00497412","Our Voice PAC",250.0,20110917,"OK",74137,"IND"
763062,"C00497412","Our Voice PAC",656.0,20110726,"OR",97601,"IND"
763062,"C00497412","Our Voice PAC",120.0,20111007,"CA",92109,"IND"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91381,"IND"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91381,"IND"
Begin by importing the data:
> setwd('/home/qtw/public_html/data/')
> ac<-read.table('allcontMod.csv',header=T,sep=',')
> summary(ac)
FECID CNumber
Min. :761774 C00498097:339
1st Qu.:762520 C00499020:235
Median :763187 C00490045:199
Mean :763033 C00487470:130
3rd Qu.:763735 C00487363: 95
Max. :763953 C00499731: 65
(Other) :723
CName CAmount
Americans for a Better Tomorrow, Tomorrow, Inc.:339 Min. : -1000
FREEDOMWORKS FOR AMERICA :235 1st Qu.: 250
RESTORE OUR FUTURE, INC. :199 Median : 500
CLUB FOR GROWTH ACTION :130 Mean : 35600
American Crossroads : 95 3rd Qu.: 10000
Make Us Great Again : 65 Max. :5000000
(Other) :723
Date State ZIP IndOrg
Min. :20110701 CA :290 Min. : 1062 CCM: 7
1st Qu.:20110914 TX :187 1st Qu.:20006 COM: 9
Median :20111107 FL :151 Median :48236 IND:1481
Mean :20111051 DC :148 Mean :51735 ORG: 224
3rd Qu.:20111213 NY :134 3rd Qu.:84108 PAC: 65
Max. :20120104 MA : 71 Max. :99803
(Other):805 NA's : 14
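To try the import step without the course file, the sketch below reads three of the sample rows from an inline string through the text argument of read.table. The name toy is a placeholder. stringsAsFactors=TRUE is set explicitly on the assumption that the session above ran on an R version older than 4.0, where text columns became factors by default (which would explain the level counts in the summary).

```r
# Three of the sample rows from above, kept inline so no file is needed.
csv_text <- '"FECID","CNumber","CName","CAmount","Date","State","ZIP","IndOrg"
763062,"C00497412","Our Voice PAC",500.0,20110927,"CA",91355,"IND"
763062,"C00497412","Our Voice PAC",270.0,20111027,"NC",28403,"IND"
763062,"C00497412","Our Voice PAC",1000.0,20110811,"NV",89519,"IND"'

# `toy` is a hypothetical name; text= replaces the file name argument.
# stringsAsFactors = TRUE is an assumption that mimics pre-4.0 defaults.
toy <- read.table(text = csv_text, header = TRUE, sep = ",",
                  stringsAsFactors = TRUE)

str(toy)              # column types: numbers stay numeric, text becomes factor
summary(toy$CAmount)  # five-number summary plus mean for one column
```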
You can slice and dice by row or column index (indices start at 1; the row index comes before the column index):
> ach<-head(ac,10) #let's just look at the 1st 10 rows for example
> #subsets
> #lets get the first row
> ach[1,]
FECID CNumber CName CAmount Date State ZIP IndOrg
1 763062 C00497412 Our Voice PAC 500 20110927 CA 91355 IND
> #rows 5-8
> ach[5:8,]
FECID CNumber CName CAmount Date State ZIP IndOrg
5 763062 C00497412 Our Voice PAC 200 20110811 TX 78216 IND
6 763062 C00497412 Our Voice PAC 200 20110930 TX 78216 IND
7 763062 C00497412 Our Voice PAC 250 20110917 OK 74137 IND
8 763062 C00497412 Our Voice PAC 656 20110726 OR 97601 IND
>
> #just the first column
> ach[,1]
[1] 763062 763062 763062 763062 763062 763062 763062 763062 763062 763062
> #columns 1,4,5
> ach[,c(1,4,5)]
FECID CAmount Date
1 763062 500 20110927
2 763062 500 20110927
3 763062 270 20111027
4 763062 1000 20110811
5 763062 200 20110811
6 763062 200 20110930
7 763062 250 20110917
8 763062 656 20110726
9 763062 120 20111007
10 763062 500 20110927
Columns can also be accessed like a dictionary:
> ach$State
[1] CA CA NC NV TX TX OK OR CA CA
54 Levels: AE AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY ... WY
> ach[,c('CAmount','State','ZIP')]
CAmount State ZIP
1 500 CA 91355
2 500 CA 91355
3 270 NC 28403
4 1000 NV 89519
5 200 TX 78216
6 200 TX 78216
7 250 OK 74137
8 656 OR 97601
9 120 CA 92109
10 500 CA 91381
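The same indexing forms can be replayed on a small hand-built data frame. This is only a sketch: toy is a hypothetical name, and its values are copied from the first five sample rows.

```r
# Toy data frame built from the first five sample rows (hypothetical name).
toy <- data.frame(
  CAmount = c(500, 500, 270, 1000, 200),
  State   = c("CA", "CA", "NC", "NV", "TX"),
  ZIP     = c(91355, 91355, 28403, 89519, 78216),
  stringsAsFactors = TRUE
)

first_row   <- toy[1, ]                      # row 1, all columns
rows_2_to_4 <- toy[2:4, ]                    # rows 2 through 4
amounts     <- toy[, 1]                      # column 1 as a plain vector
by_name     <- toy[, c("CAmount", "State")]  # columns selected by name

print(first_row)
print(dim(rows_2_to_4))
print(amounts)
print(names(by_name))
```

Leaving the row position empty (before the comma) keeps all rows; leaving the column position empty keeps all columns.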
Suppose we are interested in finding out which records included contributions exceeding $500.
We can create a logical vector that holds TRUE or FALSE for each record, based on a condition:
> ach$CAmount>500
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
> #suppose we want to keep track of this as a new column
> ach$G500<-ach$CAmount>500
> ach
FECID CNumber CName CAmount Date State ZIP IndOrg G500
1 763062 C00497412 Our Voice PAC 500 20110927 CA 91355 IND FALSE
2 763062 C00497412 Our Voice PAC 500 20110927 CA 91355 IND FALSE
3 763062 C00497412 Our Voice PAC 270 20111027 NC 28403 IND FALSE
4 763062 C00497412 Our Voice PAC 1000 20110811 NV 89519 IND TRUE
5 763062 C00497412 Our Voice PAC 200 20110811 TX 78216 IND FALSE
6 763062 C00497412 Our Voice PAC 200 20110930 TX 78216 IND FALSE
7 763062 C00497412 Our Voice PAC 250 20110917 OK 74137 IND FALSE
8 763062 C00497412 Our Voice PAC 656 20110726 OR 97601 IND TRUE
9 763062 C00497412 Our Voice PAC 120 20111007 CA 92109 IND FALSE
10 763062 C00497412 Our Voice PAC 500 20110927 CA 91381 IND FALSE
The real power of logical vectors comes when used to access parts of a data frame:
> #suppose we only want the records with contributions > $500
> ach[ach$CAmount>500,]
FECID CNumber CName CAmount Date State ZIP IndOrg G500
4 763062 C00497412 Our Voice PAC 1000 20110811 NV 89519 IND TRUE
8 763062 C00497412 Our Voice PAC 656 20110726 OR 97601 IND TRUE
> #now let's say we are only interested in the state they come from
> ach$State[ach$CAmount>500]
[1] NV OR
54 Levels: AE AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY ... WY
> #now let's say we are only interested in the contribution amount
> ach$CAmount[ach$CAmount>500]
[1] 1000 656
Contributions from Massachusetts (State MA)
ac$CAmount[ac$State=='MA']
Contributions from Committee C00497412
ac$CAmount[ac$CNumber=='C00497412']
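Logical vectors can also be combined and counted. The sketch below uses a hypothetical toy data frame with amounts and states taken from the first eight sample rows. It shows which, sum on a logical vector, and the element-wise operators & and |.

```r
# Hypothetical toy data: amounts and states from the first eight sample rows.
toy <- data.frame(
  CAmount = c(500, 500, 270, 1000, 200, 200, 250, 656),
  State   = c("CA", "CA", "NC", "NV", "TX", "TX", "OK", "OR")
)

big <- toy$CAmount > 500   # logical vector, one entry per row
print(which(big))          # row positions where the condition holds
print(sum(big))            # TRUE counts as 1, so this counts the matches

# Combine conditions element-wise with & (and) and | (or).
ca_or_tx <- toy$State == "CA" | toy$State == "TX"
print(toy$CAmount[ca_or_tx & toy$CAmount >= 500])
```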
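You can also apply functions to vectors (including the usual mean, median, sum).
Mean contribution to Committee C00497412:
mean(ac$CAmount[ac$CNumber=='C00497412'])
(answer: 1028.4)
Mean of the contributions over $500 to Committee C00497412:
mean(ac$CAmount[ac$CNumber=='C00497412'&ac$CAmount>500])
(answer: 2099.17)
(To count the matching contributions rather than average them, use the length function.)
Total contributions from a single State, for example MA:
sum(ac$CAmount[ac$State=='MA'])
To get the total for every state, we could compute them one by one: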
contAL<-sum(ac$CAmount[ac$State=='AL'])
contAK<-sum(ac$CAmount[ac$State=='AK'])
contAZ<-sum(ac$CAmount[ac$State=='AZ'])
contAR<-sum(ac$CAmount[ac$State=='AR'])
contStates<-c(contAL,contAK,contAZ,contAR)
#BUT THERE IS A BETTER WAY!
With tapply, we can get contributions by state in one line of code!
tapply is a function that takes three arguments: the vector of values, the factor to group them by, and the function to apply to each group.
Hence, to get the total contributions by state, we write:
> superAll<-tapply(ac$CAmount,ac$State,sum)
> superAll
                     AE          AK          AL          AR          AZ
  260551.20      500.00    11250.00   519450.00  1018900.00   141150.00
         CA          CO          CT          DC          DE          FL
 4862354.60   507696.49  1473700.00  8625142.03   223000.00  3822840.65
         GA          GU          HI          IA          ID          IL
  231584.87      245.00      500.00     4820.00  1001000.00  1161820.64
         IN          KS          KY          LA          MA          MD
 1211700.00    13800.12    12457.36     8350.00  2269817.76   129125.00
         ME          MI          MN          MO          MS          MT
     400.00   472801.00   112113.32    12752.00    53324.16     2432.49
         NC          ND          NE          NH          NJ          NM
   64770.00     3241.00      880.00     9000.00   188300.00   106510.00
         NV          NY          OH          OK          OR          PA
 1182621.36  8340193.26    49900.00  1737428.96   123387.56   590160.00
         RI          SC          SD          TN          TX          UT
    2250.00    55125.00     5500.00     8100.00 16919837.50  4093144.80
         VA          VT          WA          WI          WV          WY
  944348.00      650.00   253450.00   352501.00     4600.00   381000.00
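A self-contained sketch of the same tapply pattern on eight made-up records (amount and state are hypothetical vectors, not the course data). It also checks that the grouped total matches the one-state-at-a-time sum.

```r
# Hypothetical vectors standing in for ac$CAmount and ac$State.
amount <- c(500, 500, 270, 1000, 200, 200, 250, 656)
state  <- factor(c("CA", "CA", "NC", "NV", "TX", "TX", "OK", "OR"))

# One call: split amount by state, then apply sum to each group.
totals <- tapply(amount, state, sum)
print(totals)

# The grouped result agrees with the one-by-one approach.
stopifnot(totals[["CA"]] == sum(amount[state == "CA"]))

# Any summary function can take the place of sum.
counts <- tapply(amount, state, length)
means  <- tapply(amount, state, mean)
print(counts)
print(means)
```

The result is a named vector, so a single group can be looked up by name, as in totals[["TX"]].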
Suppose we want to see state-level contributions for Stephen Colbert's Super PAC (FECID 761774):
> superStephen<-tapply(ac$CAmount[ac$FECID==761774],ac$State[ac$FECID==761774],sum)
> superStephen
               AE       AK       AL       AR       AZ       CA       CO
      NA   250.00   500.00       NA   500.00   850.00 20752.00  2531.49
      CT       DC       DE       FL       GA       GU       HI       IA
  550.00       NA       NA  2013.98  3430.00       NA   250.00  1570.00
      ID       IL       IN       KS       KY       LA       MA       MD
      NA  2251.00  1000.00   750.00       NA   500.00 12400.00  1300.00
      ME       MI       MN       MO       MS       MT       NC       ND
  400.00  1801.00   600.00   952.00   250.00   432.49   500.00       NA
      NE       NH       NJ       NM       NV       NY       OH       OK
      NA   250.00   500.00  1550.00  2000.00  5576.00   900.00   878.96
      OR       PA       RI       SC       SD       TN       TX       UT
  557.68  2010.00       NA       NA       NA   250.00  3467.00       NA
      VA       VT       WA       WI       WV       WY
 1766.00       NA  2850.00   751.00  3000.00       NA
To get the fraction of each state's Super PAC contributions that went to Colbert's Super PAC, just divide the two vectors:
> superStephen/superAll
                       AE           AK           AL           AR           AZ
          NA 0.5000000000 0.0444444444           NA 0.0004907253 0.0060219625
          CA           CO           CT           DC           DE           FL
0.0042678911 0.0049862271 0.0003732103           NA           NA 0.0005268281
          GA           GU           HI           IA           ID           IL
0.0148109848           NA 0.5000000000 0.3257261411           NA 0.0019374763
          IN           KS           KY           LA           MA           MD
0.0008252868 0.0543473535           NA 0.0598802395 0.0054629936 0.0100677638
          ME           MI           MN           MO           MS           MT
1.0000000000 0.0038092136 0.0053517281 0.0746549561 0.0046883064 0.1777972366
          NC           ND           NE           NH           NJ           NM
0.0077196233           NA           NA 0.0277777778 0.0026553372 0.0145526242
          NV           NY           OH           OK           OR           PA
0.0016911584 0.0006685696 0.0180360721 0.0005058969 0.0045197425 0.0034058560
          RI           SC           SD           TN           TX           UT
          NA           NA           NA 0.0308641975 0.0002049074           NA
          VA           VT           WA           WI           WV           WY
0.0018700733           NA 0.0112448215 0.0021304904 0.6521739130           NA
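Division of two vectors is element by element, matched by position rather than by name. It works here because both vectors come from tapply over the same State factor, so the states appear in the same order. A small sketch using four of the values printed above (all_totals and some_totals are hypothetical names):

```r
# Four entries copied from the two tapply results above.
all_totals  <- c(AE = 500, AK = 11250, HI = 500, ME = 400)
some_totals <- c(AE = 250, AK = 500,   HI = 250, ME = 400)

# Guard: positional division is only meaningful if the names line up.
stopifnot(identical(names(some_totals), names(all_totals)))

share <- some_totals / all_totals
print(round(share, 4))

# Sort to see which states gave the largest share.
print(sort(share, decreasing = TRUE))
```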
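NA marks a missing value (above, states with no contributions to Colbert's Super PAC). Functions such as mean return NA if any element is NA, unless na.rm=TRUE is given.
For instance: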
> vals<-c(3,4,5,2,3,4,NA,6,8,NA)
> mean(vals)
[1] NA
> mean(vals,na.rm=TRUE)
[1] 4.375
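A few related tools, sketched on the same vals vector: is.na finds the missing entries, and extra arguments given to tapply are passed on to the summary function. The grouping factor grp is made up for illustration.

```r
vals <- c(3, 4, 5, 2, 3, 4, NA, 6, 8, NA)

n_missing <- sum(is.na(vals))           # count of missing entries
total     <- sum(vals, na.rm = TRUE)    # sum over the non-missing entries
m         <- mean(vals[!is.na(vals)])   # same as mean(vals, na.rm = TRUE)
print(c(n_missing, total, m))

# Hypothetical grouping: first five values in "a", last five in "b".
grp <- factor(rep(c("a", "b"), each = 5))

# Arguments after the function (here na.rm) are forwarded to mean.
by_grp <- tapply(vals, grp, mean, na.rm = TRUE)
print(by_grp)
```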
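The next examples use a second file, regSuperCensusMod.csv, which holds both regular and Super PAC contributions by candidate. We can import this file into R: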
> rs<-read.table('regSuperCensusMod.csv',header=T,sep=',')
> rsi<-rs[rs$IndOrg=="IND",] #only consider individual contributions
> summary(rsi)
Candidate RegularSuper CNumber
barack obama :28864 Regular:40600 C00431445:28845
mitt romney :10470 Super : 218 C00431171:10469
newt gingrich: 1484 C00496497: 1266
C00490045: 159
C00495861: 41
C00507525: 18
(Other) : 20
CName CAmount Date
Obama for America :28845 Min. : -30800 Min. :20110302
Romney for President :10469 1st Qu.: 250 1st Qu.:20110501
Newt 2012 : 1266 Median : 500 Median :20110527
RESTORE OUR FUTURE, INC.: 159 Mean : 1376 Mean :20110543
Priorities USA Action : 41 3rd Qu.: 2500 3rd Qu.:20110621
Winning Our Future : 18 Max. :1000000 Max. :20111231
(Other) : 20
State ZIP IndOrg PopSt
CA : 7593 Min. : 16 IND:40818 Min. : 563626
NY : 4534 1st Qu.: 16502 ORG: 0 1st Qu.: 5988927
TX : 2456 Median : 45243 PAC: 0 Median :12702379
MA : 2452 Mean : 48788 Mean :15688506
FL : 2330 3rd Qu.: 85016 3rd Qu.:19378102
(Other):21383 Max. : 99999 Max. :37253956
NA's : 70 NA's : 101 NA's : 261
FrWhite FrBlack USADiversity
Min. : 0.2274 Min. : 0.00407 Min. : 0.1100
1st Qu.: 0.4533 1st Qu.: 0.06171 1st Qu.: 0.4053
Median : 0.5834 Median : 0.11849 Median : 0.5866
Mean : 0.6036 Mean : 0.12576 Mean : 0.5486
3rd Qu.: 0.7565 3rd Qu.: 0.15862 3rd Qu.: 0.6618
Max. : 0.9442 Max. : 0.50709 Max. : 0.8106
NA's :261.0000 NA's :261.00000 NA's :261.0000
We can compare Super PAC to regular contributions by candidate:
> #first look at the overall contributions by candidate
> tapply(rsi$CAmount,rsi$Candidate,sum)
barack obama mitt romney newt gingrich
23557099 29411808 3181415
> tapply(rsi$CAmount[rsi$Candidate=="barack obama"],
rsi$RegularSuper[rsi$Candidate=="barack obama"],sum)
Regular Super
23111899 445200
> tapply(rsi$CAmount[rsi$Candidate=="barack obama"],
rsi$RegularSuper[rsi$Candidate=="mitt romney"],sum)
Error in tapply(rsi$CAmount[rsi$Candidate == "barack obama"], rsi$RegularSuper[rsi$Candidate == :
arguments must have same length
> #whoops, common mistake: changed the candidate in only one of the two subsets, so their lengths differ. fixed now
> tapply(rsi$CAmount[rsi$Candidate=="mitt romney"],
rsi$RegularSuper[rsi$Candidate=="mitt romney"],sum)
Regular Super
17243855 12167953
> tapply(rsi$CAmount[rsi$Candidate=="newt gingrich"],
rsi$RegularSuper[rsi$Candidate=="newt gingrich"],sum)
Regular Super
1101165 2080250
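The length error above comes from building the two subsets with different conditions. One way to avoid it, sketched here, is to compute the logical vector once and reuse it. rsi_toy and sum_by_type are hypothetical names, and the numbers are made up rather than taken from the course data.

```r
# Made-up stand-in for rsi with two candidates.
rsi_toy <- data.frame(
  Candidate    = c("a", "a", "b", "b", "b"),
  RegularSuper = c("Regular", "Super", "Regular", "Regular", "Super"),
  CAmount      = c(100, 50, 200, 25, 300),
  stringsAsFactors = TRUE
)

# Hypothetical helper: the mask `keep` is built once, so the values and
# the grouping factor are always subset with the same condition.
sum_by_type <- function(df, candidate) {
  keep <- df$Candidate == candidate
  tapply(df$CAmount[keep], df$RegularSuper[keep], sum)
}

res_a <- sum_by_type(rsi_toy, "a")
res_b <- sum_by_type(rsi_toy, "b")
print(res_a)
print(res_b)
```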
It is tiresome to repeat the call for each candidate just to break the totals down by a second categorical variable. What if there were many more candidates?
One option is to create a new combined factor indicating candidate name plus whether super or regular:
> rsi$candSuper<-factor(paste(rsi$Candidate,rsi$RegularSuper))
> tapply(rsi$CAmount,rsi$candSuper,sum)
barack obama Regular barack obama Super mitt romney Regular
23111899 445200 17243855
mitt romney Super newt gingrich Regular newt gingrich Super
12167953 1101165 2080250
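Another option, not shown in the session above, is to pass tapply a list of two factors, which returns a two-way table instead of one long vector. A minimal sketch on made-up data (rsi_toy is a hypothetical name):

```r
# Made-up stand-in for rsi with two candidates.
rsi_toy <- data.frame(
  Candidate    = c("a", "a", "b", "b", "b"),
  RegularSuper = c("Regular", "Super", "Regular", "Regular", "Super"),
  CAmount      = c(100, 50, 200, 25, 300)
)

# A list of grouping variables gives one dimension per variable:
# rows are candidates, columns are Regular/Super.
totals <- tapply(rsi_toy$CAmount,
                 list(rsi_toy$Candidate, rsi_toy$RegularSuper),
                 sum)
print(totals)

# Look up a single cell by row and column name.
print(totals["b", "Super"])
```

The combined-factor approach gives a flat named vector, while the list form keeps the two variables on separate axes, which can be easier to read when there are many candidates.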