An important part of data analysis is being able to plot the data in a way that communicates an aspect of the data set we are interested in. Often, seeing what I’m interested in requires faceted plots broken down by groups in my data set. I thought that today I’d put up a short post on creating faceted plots using three plotting packages.

  • matplotlib – very flexible, but has more of a learning curve than the other two; ggplot and seaborn both use it under the hood. http://matplotlib.org/
  • seaborn – makes great-looking plots with much simpler syntax. http://stanford.edu/~mwaskom/software/seaborn/
  • ggplot – anyone coming from a background in R will be familiar with this package. It also creates plots with a very concise syntax. The folks over at yhat have ported this plotting package to Python; it is relatively new, but they have open sourced it and seem to be adding features all the time. https://github.com/yhat/ggplot

So before we can plot anything we need some data! The data I’ll be using today is available from a fivethirtyeight GitHub repo here: https://github.com/fivethirtyeight/uber-tlc-foil-response/tree/master/uber-trip-data. The data is a collection of Uber pickups in New York City from 2014; I am using the Sept-2014 data set. I thought it would be interesting to group the data by day and hour and take a look at the total number of rides throughout the day for each day in Sept-2014. One other quick note: all of the plots are a bit small since they were plotted in an IPython notebook which was pasted into this post for convenience. If you plot them yourself from the terminal you will be able to resize them to a more readable size.

Imports/Packages needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from ggplot import *
import seaborn as sns
# line below allows inline plotting in ipython notebooks
%matplotlib inline 
plt.rcParams['figure.figsize'] = (20.0, 8.0) # make plots bigger by default

Importing the data

First I want to load the data and extract the day of the month and the hour of the day for each pickup.

parser =  lambda x: datetime.strptime(x,'%m/%d/%Y %H:%M:%S') ## needed for pandas to recognize this date format

uber = pd.read_csv('uber-raw-data-sep14.csv', parse_dates=['Date/Time'], date_parser=parser)
uber['Day'] = uber['Date/Time'].apply(lambda x: x.day)
uber['Hour'] = uber['Date/Time'].apply(lambda x: x.hour)
uber.head() # peek at the first few rows, shown below
Date/Time Lat Lon Base Day Hour
0 2014-09-01 00:01:00 40.2201 -74.0021 B02512 1 0
1 2014-09-01 00:01:00 40.7500 -74.0027 B02512 1 0
2 2014-09-01 00:03:00 40.7559 -73.9864 B02512 1 0
3 2014-09-01 00:06:00 40.7450 -73.9889 B02512 1 0
4 2014-09-01 00:11:00 40.8145 -73.9444 B02512 1 0

Grouping the data

I am interested in looking at the number of Uber pickups during each hour of each day; to get that information I need to group and aggregate the data.

grouped = uber.groupby(['Day','Hour'],as_index = False).aggregate(len)
grouped = grouped[['Day','Hour','Lat']]
grouped.columns = ['Day','Hour','Count']
#Take a look at how the data looks now
grouped.head()
Day Hour Count
0 1 0 699
1 1 1 490
2 1 2 363
3 1 3 333
4 1 4 261

Plotting using matplotlib (functionality built into Pandas)

col_num = 7 # one for each day of the week
row_num = 5 # one row per week
weekend_days = [6, 7, 13, 14, 20, 21, 27, 28] # these are the dates of weekend days in Sept 2014
color_dict = { True  : 'red', False : 'blue'} # give weekend days a different color, just for fun
row, col = 0, 1 
fig, axes = plt.subplots(nrows=row_num, ncols=col_num, sharex= True, sharey= True)
for name, group in grouped.groupby('Day'):
    group.plot('Hour', 'Count', ax=axes[row,col], legend=False, color = color_dict[name in weekend_days])
    axes[row,col].set_ylim(0, 3500) # set y limit
    axes[row,col].set_ylabel('Rides/Hr') # set y label
    axes[row,col].annotate('9/'+str(name), xy= (1, 3000)) # write date in corner of each plot
    col += 1
    if col%col_num == 0: 
        row += 1
        col  = 0 

Plotting using ggplot (still matplotlib under the hood)

ggplot(grouped, aes(x='Hour',y='Count')) +\
geom_line() +\
facet_wrap('Day', ncol = 7, nrow = 5, scales = 'fixed')

Plotting using Seaborn (still matplotlib under the hood)

grid = sns.FacetGrid(grouped, col="Day", col_wrap=7)
grid.map(plt.plot, "Hour", "Count") # draw Count vs. Hour in each facet

Alright, that’s it for today. I hope you find these examples useful. It’s much easier to quickly look at things using ggplot and seaborn, but matplotlib lets you specify very precisely what you would like. It is also worthwhile getting comfortable with matplotlib because the other packages are built on top of it, so if you are using ggplot or seaborn and want to change something about your plot, you can alter it with the usual matplotlib syntax.



Data Mining the California Solar Statistics with R: Part V

Building a Shiny App to explore the model and the data

About the Shiny App

In my previous post I built several models to try to predict the amount of residential solar installed per county by quarter as a function of solar insolation, price of solar electricity, county population and county median income. To explore the data and the model predictions I’ve built a Shiny app in RStudio. I could not install RStudio on the hosting service that I have now, so if you want to check it out you’ll need to head over to


The app allows you to look at the actual installations vs. predicted installations by year and quarter. Additionally, I included a bar plot of the same data as the map, which makes it a bit easier to read. I also created an interactive scatter plot where you can see the effect of the different predictors on the total installed residential solar by county. Rather than post the code for the Shiny app here, I have hosted it on my GitHub at
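For anyone curious about the structure, here is a minimal sketch of the kind of Shiny layout such an app could use. The input names and plotting code are hypothetical placeholders, not the actual app code; it assumes the installsByYearCountyQuarter data frame from the earlier posts is loaded.

library(shiny)
library(ggplot2)

## ui: pick a year and quarter, show installs by county (hypothetical input/output names)
ui <- fluidPage(
  titlePanel("Residential solar installs by county"),
  sidebarLayout(
    sidebarPanel(
      selectInput("year", "Year", choices = 2009:2013),
      selectInput("quarter", "Quarter", choices = c("Winter", "Spring", "Summer", "Autumn"))
    ),
    mainPanel(plotOutput("installPlot"))
  )
)

## server: filter the merged data set to the selected year/quarter and draw a bar plot
server <- function(input, output) {
  output$installPlot <- renderPlot({
    d <- subset(installsByYearCountyQuarter,
                year == as.numeric(input$year) & quarter == input$quarter)
    ggplot(d, aes(x = County, y = Total.kW)) +
      geom_bar(stat = "identity") +
      theme_bw() +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
  })
}

shinyApp(ui = ui, server = server)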


Examining the effect of solar subsidies on the total amount of residential solar installed from 2009-2013

In the previous posts I have been using the subsidized cost of solar installs as a predictor, but now I would like to predict how much residential solar would have been installed if no CA subsidies had been given. To do this I first need to create a variable for the actual up-front cost of solar paid.

##load previous data
##create variable for cost/watt
actualCostByQuarter = ddply(solarData, .(year,quarter),summarise, cost=mean(Total.Cost/(CEC.PTC.Rating*1000),na.rm=TRUE)) ## unsubsidized cost/W; use the group's own columns inside summarise
##merge with data set
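The merge itself isn't shown above; a sketch of one way it could look, assuming the goal is to replace the subsidized cost column with the unsubsidized cost before predicting (column names assumed):

## swap the subsidized cost for the unsubsidized cost/W
colnames(actualCostByQuarter)[colnames(actualCostByQuarter) == "cost"] = "cost.nosub"
installsByYearCountyQuarter = merge(installsByYearCountyQuarter, actualCostByQuarter, by = c("year", "quarter"))
installsByYearCountyQuarter$cost = installsByYearCountyQuarter$cost.nosub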

Now that I have the subsidy-free cost as a predictor, I can use the random forest model to predict how much residential solar would have been installed if there had been no subsidies.

noSubPreds = round(sum(predict(solarForest, newdata = installsByYearCountyQuarter))) ## predictions with no subsidies
SubPreds   = round(sum(solarForest$predicted)) ## predictions with subsidies
actual = round(sum(installsByYearCountyQuarter$Total.kW)) ## actual residential installs from 2009-2013
TotalSolarInstalled = c(noSubPreds,SubPreds,actual)
type = c('Without CA subsidies (predicted)','With CA subsidies (predicted)','With CA subsidies (actual)') 
finale = data.frame(TotalSolarInstalled,type) 
ggplot(finale,aes(type,TotalSolarInstalled))+geom_bar(stat="identity")+theme_bw()+ylab('Total residential solar installed from 2009-2013 (kW)') +
 geom_text(aes(label = TotalSolarInstalled), vjust=-0.5, position = position_dodge(0.9), size = 4) ##add labels


This bar chart really surprised me. The random forest model predicts that if no CA subsidies existed then ~ 25% less residential solar would have been installed between 2009 and 2013. That is quite a dramatic effect. Judging by my analysis, the Go Solar California program has helped California take a big step towards a future that is less dependent on fossil fuels.

Well, I think that about wraps things up for this project. If you have any ideas for analysis I didn’t think of or have any comments or questions about what I’ve done please don’t hesitate to reach out to me.


Data Mining the California Solar Statistics with R: Part IV

Predicting the residential solar power installations by county by quarter in CA from 2009-2013

So far I have gathered three data sets and combined them into one, which I will now use to try to predict the amount of residential solar installed by county by quarter in CA from 2009-2013. The three data sets I am using are: the California Solar Statistics, which is a record of applications for state incentives for solar power installations; the American Community Survey, for information about the median income and population of each county; and the NASA Surface meteorology and Solar Energy release, which contains information about the average amount of sun shining on a given latitude and longitude. The dependent variable I would like to predict is the installed solar in a given county per quarter in kW (the variable is named Total.kW). The predictors are as follows:
1. population (has yearly changes) called pop (#)
2. median income (has yearly changes) called income ($)
3. cost of solar (has quarterly changes) called cost ($)
4. solar insolation (no changes, annual average) called insolation (kWh/m²/day)

Getting Started

To try to predict the installed solar power, I will start with the simplest and easiest-to-interpret linear models and then work towards more complex models that improve prediction accuracy but make it harder to infer what the model is doing. Step 1 is always to load the packages I will be using and the .RData file which has all my data frames from the last several posts.

require(boot) ## needed for cross validation
require(gam) ##model for generalized additive models
require(splines) ##splines model, will use with gam
require(randomForest) ##random forest model used later in this post
require(gbm) ##gradient boosted tree model used later in this post
load("CaSolarIII.RData") ###Load data from last post

Individual Linear Models

To start out, I am fitting the installed solar power against each predictor individually to get an idea of how strongly the dependent variable depends on each of the predictors.

##fit predictors one at a time against total install of residential kW of PV/county/quarter
pop.fit = lm(Total.kW ~  pop , data = installsByYearCountyQuarter) ##fit against population
cost.fit = lm(Total.kW ~  cost , data = installsByYearCountyQuarter) ##fit against cost of PV system
income.fit = lm(Total.kW ~  income , data = installsByYearCountyQuarter) ##county median income
insolation.fit = lm(Total.kW ~  insolation , data = installsByYearCountyQuarter) ##solar insolation by county

Below you can see the fits of the response against each individual predictor. It is clear that none of them alone can be an accurate predictor for this problem, which is not surprising. The low R-squared values indicate that none of the predictors can explain more than 46% of the variance in the data.

par(mfrow=c(2,2)) ##set plot mode to a 2 x 2 grid
##plots of each fit
plot(Total.kW ~ pop , data = installsByYearCountyQuarter)
abline(pop.fit, col="red")
text(7000000,8000,paste("Rsq = ",round(summary(pop.fit)$r.squared,2)))
plot(Total.kW ~ cost , data = installsByYearCountyQuarter)
abline(cost.fit, col="red")
text(7.0,8000,paste("Rsq = ",round(summary(cost.fit)$r.squared,2)))
plot(Total.kW ~ income , data = installsByYearCountyQuarter)
abline(income.fit, col="red")
text(80000,8000,paste("Rsq = ",round(summary(income.fit)$r.squared,2)))
plot(Total.kW ~ insolation , data = installsByYearCountyQuarter)
abline(insolation.fit, col="red")
text(5.0,9500,paste("Rsq = ",round(summary(insolation.fit)$r.squared,2)))
legend(4.5,8000,c("data","fit"),pch=c(1,NA),lty=c(NA,1),col=c("black","red")) ## add legend to lower left plot

[Figure: Total.kW plotted against each individual predictor with its linear fit]

Combined Linear Models

Next let's fit two more models: one linear model that takes into account all of the predictors at the same time, and another that also includes all of their interactions.

###fit linear model to data set
comb.fit = lm(Total.kW ~  pop + cost + income + insolation , data = installsByYearCountyQuarter)
## Call:
## lm(formula = Total.kW ~ pop + cost + income + insolation, data = installsByYearCountyQuarter)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3167.3  -270.6   -64.6   124.7  8289.6 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.807e+03  6.689e+02  -2.702  0.00702 ** 
## pop          4.182e-04  1.582e-05  26.432  < 2e-16 ***
## cost        -2.235e+02  2.537e+01  -8.809  < 2e-16 ***
## income       6.846e-03  1.700e-03   4.028 6.07e-05 ***
## insolation   6.149e+02  1.273e+02   4.831 1.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 696.6 on 988 degrees of freedom
## Multiple R-squared:  0.5183, Adjusted R-squared:  0.5164 
## F-statistic: 265.8 on 4 and 988 DF,  p-value: < 2.2e-16

Looking at the summary above, you can see that each of the predictors is significant. The combined model improves on the population-only model by increasing the adjusted R-squared to 51.6%. Now let's take a look at how much better we can do by adding the interaction terms.

comb.fit.w.int = lm(Total.kW ~  pop + cost + income + insolation +
                      pop:cost + pop:income + pop:insolation +
                      cost:income + cost:insolation + income:insolation, data = installsByYearCountyQuarter)
## Call:
## lm(formula = Total.kW ~ pop + cost + income + insolation + pop:cost + 
##     pop:income + pop:insolation + cost:income + cost:insolation + 
##     income:insolation, data = installsByYearCountyQuarter)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2217.2  -202.8   -21.0    75.3  7484.3 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.230e+04  4.769e+03  -2.579 0.010050 *  
## pop                1.159e-02  1.358e-03   8.534  < 2e-16 ***
## cost               2.506e+03  5.999e+02   4.177 3.22e-05 ***
## income            -2.218e-01  5.193e-02  -4.270 2.14e-05 ***
## insolation         2.791e+03  9.438e+02   2.957 0.003181 ** 
## pop:cost          -2.513e-04  1.468e-05 -17.119  < 2e-16 ***
## pop:income         3.639e-09  2.413e-09   1.508 0.131871    
## pop:insolation    -1.821e-03  2.404e-04  -7.573 8.41e-14 ***
## cost:income        4.027e-03  1.578e-03   2.552 0.010854 *  
## cost:insolation   -5.426e+02  1.182e+02  -4.592 4.97e-06 ***
## income:insolation  3.781e-02  1.015e-02   3.726 0.000206 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 566.5 on 982 degrees of freedom
## Multiple R-squared:  0.6833, Adjusted R-squared:  0.6801 
## F-statistic: 211.9 on 10 and 982 DF,  p-value: < 2.2e-16

[Figure: actual vs. predicted Total.kW for the combined linear models]
The interaction terms have varying degrees of significance, but all of them appear to be significant except for the interaction between population and income. This model improves the R-squared to 68%; later I will look at the 10-fold cross-validation error to see whether this improvement is just due to overfitting the training data. Looking at the predicted values next to the actual values on the plots above, you can tell that we are moving in the right direction with the model.
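The code that produced the predicted-vs-actual plots isn't shown above; a minimal sketch of how such a comparison could be drawn (not the original plotting code):

## predicted vs. actual for the two combined linear models
par(mfrow = c(1, 2))
plot(installsByYearCountyQuarter$Total.kW, predict(comb.fit),
     xlab = "Actual Total.kW", ylab = "Predicted Total.kW", main = "No interactions")
abline(0, 1, col = "red") ## perfect-prediction line
plot(installsByYearCountyQuarter$Total.kW, predict(comb.fit.w.int),
     xlab = "Actual Total.kW", ylab = "Predicted Total.kW", main = "With interactions")
abline(0, 1, col = "red")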

Generalized Additive Models

For the next step, to increase complexity and hopefully accuracy, I am using a generalized additive model with natural splines (four degrees of freedom per predictor) to fit the data.

gam.comb.fit = lm(Total.kW ~  ns(pop,4) + ns(cost,4) + ns(income,4) + ns(insolation,4) , data = installsByYearCountyQuarter)
## Call:
## lm(formula = Total.kW ~ ns(pop, 4) + ns(cost, 4) + ns(income, 
##     4) + ns(insolation, 4), data = installsByYearCountyQuarter)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2124.4  -215.7    -0.2   175.3  7391.0 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           31.08     188.80   0.165  0.86928    
## ns(pop, 4)1          224.98      73.70   3.053  0.00233 ** 
## ns(pop, 4)2         3958.79     171.19  23.125  < 2e-16 ***
## ns(pop, 4)3         3893.32     179.21  21.726  < 2e-16 ***
## ns(pop, 4)4         3162.82     155.95  20.281  < 2e-16 ***
## ns(cost, 4)1        -312.25     111.93  -2.790  0.00538 ** 
## ns(cost, 4)2        -441.49     105.38  -4.190 3.05e-05 ***
## ns(cost, 4)3        -550.15     188.12  -2.924  0.00353 ** 
## ns(cost, 4)4        -498.50      75.32  -6.618 5.99e-11 ***
## ns(income, 4)1       200.41     165.60   1.210  0.22649    
## ns(income, 4)2        83.57     142.85   0.585  0.55866    
## ns(income, 4)3        96.63     357.44   0.270  0.78697    
## ns(income, 4)4        67.98     126.75   0.536  0.59183    
## ns(insolation, 4)1  -103.04     179.52  -0.574  0.56612    
## ns(insolation, 4)2   323.45     159.79   2.024  0.04322 *  
## ns(insolation, 4)3   557.13     309.44   1.800  0.07210 .  
## ns(insolation, 4)4  -121.77     140.43  -0.867  0.38608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 605.1 on 976 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.6351 
## F-statistic: 108.9 on 16 and 976 DF,  p-value: < 2.2e-16

Adding this much flexibility to the model appears to have reduced the significance of income and insolation; I suspect that at this point we are overfitting the data. The median residual is reduced by this more flexible model, but the adjusted R-squared has actually decreased compared to the previous model because so many degrees of freedom have been added. The gam package has a nice function that lets you see the fit against each predictor and its standard error very easily; below is an example.

par(mfrow=c(2,2)) ##set plot mode to a 2 x 2 grid

[Figure: fitted term and standard error band for each predictor]
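The call that generates these per-predictor plots isn't included above; a sketch of how it might look, assuming the same formula is refit with gam() so that the gam package's plot method can be used:

## refit with gam() and plot each term with its standard error (one panel per predictor)
gam.fit = gam(Total.kW ~ ns(pop, 4) + ns(cost, 4) + ns(income, 4) + ns(insolation, 4),
              data = installsByYearCountyQuarter)
plot(gam.fit, se = TRUE)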
Now let's check how accurate each of our models is so far using 10-fold cross-validation.

set.seed(72) ## set seed for reproducibility
cv.error=as.data.frame(matrix(data=NA, nrow = 7,  ncol=10)) ##data frame which will hold cv error for each model
for (i in 1:10){
  cv.pop.fit = glm(Total.kW ~  pop , data = installsByYearCountyQuarter)
  cv.cost.fit = glm(Total.kW ~  cost , data = installsByYearCountyQuarter)
  cv.income.fit = glm(Total.kW ~  income , data = installsByYearCountyQuarter)
  cv.insolation.fit = glm(Total.kW ~  insolation , data = installsByYearCountyQuarter)
  cv.comb.fit = glm(Total.kW ~  pop + cost + income + insolation , data = installsByYearCountyQuarter)
  cv.comb.fit.w.int = glm(Total.kW ~  (pop + cost + income + insolation)^2 , data = installsByYearCountyQuarter) ##all two-way interactions
  cv.gam.comb.fit = glm(Total.kW ~  ns(pop,4) + ns(cost,4) + ns(income,4) + ns(insolation,4) , data = installsByYearCountyQuarter)
  ##10-fold cross-validated MSE for each model (boot::cv.glm, raw delta estimate)
  cv.error[1,i] = cv.glm(installsByYearCountyQuarter, cv.pop.fit, K=10)$delta[1]
  cv.error[2,i] = cv.glm(installsByYearCountyQuarter, cv.cost.fit, K=10)$delta[1]
  cv.error[3,i] = cv.glm(installsByYearCountyQuarter, cv.income.fit, K=10)$delta[1]
  cv.error[4,i] = cv.glm(installsByYearCountyQuarter, cv.insolation.fit, K=10)$delta[1]
  cv.error[5,i] = cv.glm(installsByYearCountyQuarter, cv.comb.fit, K=10)$delta[1]
  cv.error[6,i] = cv.glm(installsByYearCountyQuarter, cv.comb.fit.w.int, K=10)$delta[1]
  cv.error[7,i] = cv.glm(installsByYearCountyQuarter, cv.gam.comb.fit, K=10)$delta[1]
}
levels=c("population only","cost only","income only","insolation only","all 4 combined","all 4 +\n interactions","all 4 combined\n-splines")
bp = barplot(rowMeans(cv.error),width=1.0,space=0.1,col="red",xlab="",ylab="Mean Squared Error")
text(bp, labels = levels, srt = 45, adj = c(1.0,1.0), xpd = TRUE, cex=0.9)

[Figure: mean 10-fold cross-validated MSE for each model]
The cross-validated mean squared error results indicate that adding the interaction terms was not useful, but the GAM with splines did successfully reduce the MSE relative to the plain linear model.

Random Forests & Gradient Boosted Tree Models

Now let's look at an extremely flexible model, the random forest. I'm only going to use the 4 original predictors and not the interaction terms. Random forest models are easily implemented in R, as you can see below. Variable importance plots can be used to determine which predictors reduce the SSE the most; this is useful since you don't have p-values as you do in a linear regression.

solarForest=randomForest(Total.kW ~ pop + cost + income + insolation , data = installsByYearCountyQuarter, mtry=2, ntree=5000,importance = TRUE)
varImpPlot(solarForest,type=1) ##variable importance plot

[Figure: variable importance plot for the random forest]
As in the linear model, population and cost are the most important predictors. In this example I chose to fit 5,000 decision trees in the random forest, but looking at the plot below I probably could have gone with as few as 140 trees. The error on the y-axis of the plot below is the mean squared error for the out-of-bag samples, so it is fair to compare these results with our cross-validated MSE on the linear fits. The random forest achieves an MSE below 90,000, while the best previous model only achieved an MSE of ~375,000, roughly a 4x improvement over the GAM with splines.
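The code for the error-versus-trees plot isn't shown above; plotting the fitted randomForest object gives the out-of-bag MSE as a function of the number of trees, which is one way a plot like the one described could be produced:

plot(solarForest, main = "Out-of-bag MSE vs. number of trees") ## the OOB error stabilizes long before 5000 trees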

Another powerful tool based on decision trees is boosting; this is provided by the gbm package in R.

solarBoostedForest=gbm(Total.kW ~ pop + cost + income + insolation , data = installsByYearCountyQuarter, distribution="gaussian", n.trees=5000,interaction.depth = 2,shrinkage=.05, cv.folds = 10)

[Figure: relative influence of each predictor in the boosted tree model]

##                   var   rel.inf
## pop               pop 69.818739
## cost             cost 19.901584
## income         income  5.995389
## insolation insolation  4.284288

The boosted tree model has also picked up that population and cost are the most important predictors. Let's take a look and see how the random forest and boosting models compare. The boosted model's error is the 10-fold cross-validation error, while the random forest's is the out-of-bag error.
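Only the legend call from that comparison plot survives below; a sketch of how the rest of it might look, using the per-tree error vectors stored on each fitted object (a guess at the original plotting code, ending with the legend call that follows):

## out-of-bag MSE by number of trees for the random forest
plot(solarForest$mse, type = "l", xlab = "Number of trees", ylab = "Mean Squared Error",
     ylim = range(c(solarForest$mse, solarBoostedForest$cv.error)))
## 10-fold CV error by iteration for the boosted trees
points(seq_along(solarBoostedForest$cv.error), solarBoostedForest$cv.error, col = "red", pch = 19, cex = 0.3)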

legend(2000,160000,c("randomForest","gradient boosted trees"),lty=c(1,NA),pch=c(NA,19),col=c("black","red"))

[Figure: random forest OOB error and boosted tree CV error vs. number of trees]
For this example the random forest outperformed the boosted trees; however, I only spent a little time optimizing the shrinkage factor and interaction depth, so it is possible that with the right parameters the boosted trees could have done better.

Finally, let's take a look at the residuals for our best fit (the random forest) and see how they compare to the actual total solar power installed. The model does quite well when the installed power is less than ~1,000 kW. As the size increases the model becomes less accurate; this is likely because the data is much more sparse in this range.

plot((installsByYearCountyQuarter$Total.kW - solarForest$predicted) ~ installsByYearCountyQuarter$Total.kW) ## residuals vs. actual installed kW

[Figure: random forest residuals vs. actual installed kW]

Wrapping up

Today I've taken the merged data set and fit it with various linear models, with and without interaction terms and splines. I also fit the data with a random forest, which was a vast improvement over the other models. In my next post I'll use this random forest to try to predict how many solar installations would have occurred over this same time span if the incentive program had not existed. I'd love to hear any questions or comments you may have about this project.


Data Mining the California Solar Statistics with R: Part III


Today I want to combine the California Solar Statistics with information about the annual solar insolation in each county, as well as information about the population and median income. These can then be used as predictors in the models I'll build in the next post. Information about the annual solar insolation can be found in a NASA data set, http://eosweb.larc.nasa.gov/sse/. This data set lists the annual insolation incident on a horizontal surface in kWh/m²/day for the entire globe at 1-degree latitude/longitude resolution. The population and median income data can be downloaded from the American Community Survey, which is run by the Census Bureau, http://www.census.gov/acs/www/.

Annual insolation in California by county

Ideally I would like to average the annual solar insolation over the area of each county and assign that value to the county; however, the insolation data is much too sparse for that, not to mention that finding a way to average anything over those odd county shapes would be very difficult. What I settled on is to assign the annual solar insolation of each county's county seat to the entire county, which should capture most of the variation across the state. To find the latitude and longitude of each county seat I used the Google Maps API that is built into a great package called ggmap. Once I have the latitude and longitude of each county seat, I just look up the average insolation at that latitude and longitude.

##load text file containing solar insolation data
avgAnnualInsolation=read.csv("nasaSolarData.txt",sep = " ", skip = 13 , header = T)
##remove monthly totals, only need lat,lon and annual insolation
avgAnnualInsolation = avgAnnualInsolation[,c(1,2,15)]
##subset the data so that only california area is included
avgAnnualInsolation = subset(avgAnnualInsolation, Lat > 32 & Lat < 43 & Lon > -125 & Lon < -113)

The following code is what I wrote to look up the latitude and longitude using the Google Maps API.

##I saved the CA county seats in a text file and read it into the countySeats variable
##remove "County" from the end of the county name
countySeats$County=tolower(gsub(" County", "" ,countySeats$County ))
##add latitude, longitude and annual solar insolation variables to the county seat data frame
##This for loop uses the geocode function of ggmap to look up the latitude and longitude of each county seat,
##then the insolation from the NASA data set at that lat/lon pair is assigned to the county
for ( counties in 1:nrow(countySeats)){
  seat =  geocode(paste(countySeats$Seat[counties],", CA"))
  countySeats$lat[counties] = seat$lat
  countySeats$lon[counties] = seat$lon
  countySeats$insolation[counties]=avgAnnualInsolation$Ann[avgAnnualInsolation$Lat == round(seat$lat) & 
                                                           avgAnnualInsolation$Lon == round(seat$lon)]
  print(paste(seat$lat," ",seat$lon,countySeats$insolation[counties]))
}
##save the data to a csv file so that I only have to do this once

Now that I have the insolation per county I can merge that data with my California map and plot the results.
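The load and merge steps are only hinted at by the comments in the next block; a sketch of what the merge might look like, assuming the CA map data frame from Part I (with its County and sort columns) and the countySeats data frame built above:

## merge insolation onto the map polygons and restore the original drawing order
CASUN = merge(CA, countySeats, all.x = TRUE, sort = FALSE, by = "County")
CASUN = CASUN[order(CASUN$sort), ]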

##load the data
##merge the insolation data with our state map from post 1
##plot the data
ggplot(CASUN ,aes(x = long, y = lat,group=group,fill=insolation)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name=expression(paste("kWh/"*m^2*"/day")),colours = brewer.pal(8,"YlOrRd"),limits=c(min(CASUN$insolation),max(CASUN$insolation)))+
  ggtitle("Insolation incident on a horizontal surface")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ 
  coord_fixed(ratio = 1)

[Figure: map of annual insolation incident on a horizontal surface by CA county]

As you would expect, southern California gets much more sun than northern California, and the coast gets less sun than inland areas because the coast is often cloudier.

Population and median income in California by county

As I mentioned before, this data is available from the American Community Survey. They have data for each county available from 2009-2013; the data sets from 2007 and 2008 did not include every county. The data was a bit tedious to get: I had to search around to find it, download it one year at a time, and then combine the years in an Excel file. I'll put these files in my public Dropbox folder so that anyone else interested doesn't have to go through the same tiresome exercise that I did.

##load population by county, data collected from the American Community Survey - census.gov, data available from 2009-2013
pop2013=subset(pop,year == 2013)
###Subset out 2013 as an example to plot
caPop2013 = merge(CA,pop2013,all.x=TRUE,sort=FALSE, by="County")
##must be sorted after merging for plot to work
##load income by county, data collected from the American Community Survey - census.gov, data available from 2009-2013
income2013=subset(income,year == 2013)
caInc2013 = merge(CA,income2013,all.x=TRUE,sort=FALSE, by="County")
##create plots using grid and ggplot2 packages
caPopPlot = ggplot(caPop2013 ,aes(x = long, y = lat,group=group,fill=pop)) +
  geom_polygon(colour = "black", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Population",colours = brewer.pal(8,"Blues"),limits=c(min(caPop2013$pop),max(caPop2013$pop)))+
  ggtitle("2013 population by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ 
  coord_fixed(ratio = 1)

caIncPlot = ggplot(caInc2013 ,aes(x = long, y = lat,group=group,fill=income)) +
  geom_polygon(colour = "black", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Median\n Income ($)",colours = brewer.pal(8,"Greens"),limits=c(min(caInc2013$income),max(caInc2013$income)))+
  ggtitle("2013 median income by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ 
  coord_fixed(ratio = 1)
##create a new page for plot
##tell viewport that you want a 1 row x 2 column layout
pushViewport(viewport(layout = grid.layout(1, 2)))
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
##specify where each plot should be located
print(caPopPlot, vp = vplayout(1, 1))
print(caIncPlot, vp = vplayout(1, 2))

[Figure: 2013 population by county and 2013 median income by county]
I did not realize how concentrated California's population is around Los Angeles until I saw this plot. Financially, the Bay Area has the highest median income. Regions around Los Angeles, San Diego and Sacramento also have elevated incomes compared to the rest of the state.

The effective cost of solar by county

In my last post we looked at how the cost of solar decreased over the duration of this program. Now I would like to calculate what I am calling the “effective cost,” which is the cost of the system minus the incentive received. I should also note that the owners of these systems would have been eligible for the federal investment tax credit (ITC) as well, but the incentive amounts for that are not included in this data set and are thus not included in my analysis.

##create a new variable for effective cost, which subtracts the incentive from the total cost
##group installs by quarter and year, calculate average cost for each quarter by year
costByQuarter = ddply(solarData, .(year,quarter),summarise, cost=mean(Effective.Cost,na.rm=TRUE))
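The line that actually creates the effective-cost column is missing above; a sketch of what it might look like, assuming the working data set stores the incentive in a column called Incentive.Amount (the real column name may differ):

## effective cost = total reported cost minus the CSI incentive received (incentive column name assumed)
solarData$Effective.Cost = solarData$Total.Cost - solarData$Incentive.Amount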


##use ddply to group install kW/county/quarter
installsByYearCountyQuarter=ddply(solarData, .(year, County, quarter),summarise, 
                                  Total.kW = sum(na.omit(CSI.Rating)))
##only include 2009-2013, because that is the range we have all the variables for
installsByYearCountyQuarter=subset(installsByYearCountyQuarter,year > 2008 & year < 2014)
##Merge our external data sets with the california solar statistics set 

Now that we've got our combined data set together, I want to look for correlations between the variables.
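The correlation plot code isn't included here; a quick sketch of one way to look at the pairwise relationships between the response and the predictors (not necessarily the plot shown below):

## pairwise scatter plots of the response and the four predictors
pairs(installsByYearCountyQuarter[, c("Total.kW", "pop", "cost", "income", "insolation")])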


[Figure: correlation plot of Total.kW against the predictors]
Looking at the correlation plot, the trends seem to be what I would expect: the total installed kW of PV increases with increasing population, increasing insolation and decreasing cost. The trend between installed kW and income is not as clear as I would have guessed.

Wrapping up

Well, it's been a lot of work over these last three posts, but now I'm ready to apply some machine learning algorithms to the data set I've pulled together. I am eager to see how accurately I'll be able to predict the total installed residential PV per quarter using these predictors. Check in next time to see the results!


population by year/county – https://dl.dropboxusercontent.com/u/18368377/population.csv
median income by year/county – https://dl.dropboxusercontent.com/u/18368377/income.csv
CA county seats – https://dl.dropboxusercontent.com/u/18368377/caCountySeats.csv
This R markdown file – https://dl.dropboxusercontent.com/u/18368377/CaSolarIII.Rmd


Data Mining the California Solar Statistics with R: Part II


In today's post I'll be working some more with the working data set from California Solar Statistics. Last time I imported the data, cleaned it up a bit, grouped it by county and year, and made some plots to look at how residential solar installations have been changing from year to year by county as well as statewide. Now I would like to look a bit closer at the data set and try to answer some further questions about the residential solar market in California. The questions I would like to answer with R and this data set are:
1. What companies are installing the most panels?
2. What brand solar panels are being installed the most?
3. What time of year are the panels being installed?
4. How has the cost of solar energy per watt changed during the time range in the data set?
5. What is the average incentive given to customers installing residential systems?

Question 1: What companies are installing the most panels?

First, I need to load the variables I created in my first post and load the necessary packages.

##Load the dataset that I made in Part I
##I really like the tableau palettes that are a part of the ggthemes package 

Now I need to group the systems by who installed them; then I will take the top 5 installers from each year and take a look at how many systems they have installed between 2007 and 2013. I actually grouped the installers by their license number rather than their name because there are some input errors in the data: sometimes the installer name may be capitalized while other times it may not be, and sometimes it may say LLC at the end while other times it doesn't, so grouping by license number helps me get around that problem. Then I assign the most frequently occurring name for each installer as the name that goes with the license number.

##use plyr package to aggregate data by the installer's contractor license #; I use the license # instead of the name because R is case sensitive and the
##contractor names aren't put in exactly the same way every time
installers=ddply(solarData,.(Contractor.License.Number, year),summarise, 
                 installs = length(year),
                 installer = names(which.max(table(Contractor.Company.Name)))) ## most frequent spelling of the name; the contractor-name column is assumed here and may differ
##sort the data by # of installs
installers = installers[order(installers$year,installers$installs,decreasing = TRUE),]
##some installers' names are very long, cut off to 20 chars
installers$installer = strtrim(installers$installer, 20)
## Solar City was listed as 3 different names in different years, combine them all to just say "SolarCity"
installers$installer[installers$installer=="SolarCity Corporatio" |installers$installer== "USB Solar City Owner" ]="SolarCity"
##want to find top 5 installers for each year from 2007-2013
##create a data frame to store the top 5 installers from each year, initialize the data frame with 2007 data then row bind on all the rest of the years
topinstallers = head(installers[installers$year==2007,],n=5)
for (years in 2008:2013){
  topinstallers = rbind(topinstallers, head(installers[installers$year==years,],n=5))
}
##plot the # of installs by top installers using ggplot

[Figure: number of installs per year by the top residential installers]

These results are not too surprising; SolarCity has been installing far more panels than any other installer. By 2013 they were installing about twice as many as the nearest competitor. I suspected that SolarCity would be the leading installer on residences, but I did not realize by how much.

Question 2: What brand solar panels are being installed the most?

The approach I took for this section is almost identical to question 1, mostly swapping out variable names.

##I can recycle the code I used to get the top installers by year to look at the top panels by year just by changing variable names
##aggregate the data
panelsByYear=ddply(solarData,.(PV.Module.1.Manufacturer, year),summarise, installs=length(PV.Module.1.Manufacturer))
##sort it by # of installs
panelsByYear=panelsByYear[order(panelsByYear$year,panelsByYear$installs,decreasing = TRUE),]
##select the top 5 from each year
topPanels = head(panelsByYear[panelsByYear$year==2007,],n=5)
for (years in 2008:2013){
  topPanels = rbind(topPanels, head(panelsByYear[panelsByYear$year==years,],n=5))
}
##plot the # of panel installs by the top module manufacturers
  scale_color_tableau("tableau20",name = "Module Manufacturer")    

[Figure: number of projects per year using panels from the top module manufacturers]

The plot above shows the number of projects each manufacturer's panels are used in, not the total number of panels sold or the total kW of panels sold, though that would be interesting to look at…maybe in a future post. It is interesting that in most years more homeowners chose Sunpower panels than any other brand. This could be because Sunpower sells the most efficient panels on the market, so if you have a small roof and want to generate as much power as possible, Sunpower is a good way to go. Also, it is fitting for Sunpower to get much of the California business since it is a spin-off from Stanford. I find it interesting that in 2013 Trina Solar panels were used in more projects than any other manufacturer despite Trina not being one of the top 5 manufacturers in the previous 6 years.

Question 3: What time of year are the panels being installed?

I decided to group the system installs by the quarter they were installed in rather than the month, so I need to create a new variable called quarter in my solarData data set. I assigned Jan, Feb, March as Winter; April, May, June as Spring and so on. Once the quarter variable exists, it is easy to aggregate and plot the data using ddply and ggplot.

##create a variable for the install quarter
##assign quarter based on month installed
solarData$quarter[solarData$month == 3 |solarData$month == 1 |solarData$month == 2 ] = "Winter"
solarData$quarter[solarData$month == 6 |solarData$month == 4 |solarData$month == 5 ] = "Spring"
solarData$quarter[solarData$month == 9 |solarData$month == 7 |solarData$month == 8 ] = "Summer"
solarData$quarter[solarData$month == 12 |solarData$month == 10 |solarData$month == 11 ] = "Autumn"
##Reorder the quarters, want chronological not alphabetical ordering when I plot
solarData$quarter = factor(solarData$quarter, levels = c("Winter","Spring","Summer","Autumn"))
##aggregate number of installs by year and by quarter (the summary column name is assumed to be installs)
installsByQuarter = ddply(solarData, .(year,quarter),summarize,
                          installs = length(quarter))
#plot the data
  theme_bw(base_size= 16)+

[Figure: number of installs in each quarter, by year]

I took the install date to be the date that the application was submitted for reimbursement, so there is probably some lag there, but since I group by quarter most projects end up in the right quarter. I would have thought there might be a trend in the data where more projects are installed after tax refunds are received in the spring, or perhaps after year-end bonuses. Also, I would have thought summer could have more installs because of the longer days. But looking at the above plot I can't find a consistent trend, so I thought I'd look at the overall projects installed by quarter instead of by year. The boxplot below shows this result; the median number of installs per quarter is very similar for each quarter.
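The boxplot code isn't shown here; a sketch of how it might be drawn from the aggregated data, assuming the count column is named installs:

## one point per year for each quarter; the boxes summarize the spread across years
ggplot(installsByQuarter, aes(x = quarter, y = installs)) +
  geom_boxplot() +
  theme_bw(base_size = 16) +
  xlab("") + ylab("Installs per quarter")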


[Figure: boxplot of the number of installs per quarter across years]

Question 4: How has the cost of solar energy per watt changed during the time range in the data set?

According to the Go Solar California web site, http://www.gosolarcalifornia.ca.gov/about/csi.php, customer-owned system costs are often reported differently from business-owned system costs in this data set. I have already refined the data set so that all the systems are on residences, not businesses or power plants, but some of them are owned by third parties like SolarCity. So to make sure I am using accurate cost numbers, I will further refine the data set so that all systems are customer owned.

resOwned = subset(solarData,System.Owner.Sector == "Residential")
##create vector for cost
##calculate cost
##group installs by quarter and year, calculate average cost for each quarter by year
costByQuarter = ddply(resOwned, .(year,quarter),summarise, cost=mean(cost,na.rm=TRUE))
##plot the results
  ylab("cost ( $/W )")+
  theme_bw(base_size= 16)+

[Figure: average $/W of customer-owned residential systems by quarter, 2007-2013]

To determine the $/W for these projects I took the total cost reported and divided it by the alternating current output power of the module. This analysis does not take into account any changes in the consumer price index over this time period. This is probably the most interesting result from this data set so far: by 2013 the cost of solar had dropped to nearly 50% of the 2007 cost. This has been fueled by larger-scale manufacturing, dramatically cheaper poly-Si prices (the biggest single cost component of modules) and higher module efficiencies, to name a few.
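The line that computes the cost column isn't shown above; a sketch of what it might look like, assuming CSI.Rating holds the AC rating of the system in kW:

## cost per AC watt for each customer-owned residential system (column roles assumed)
resOwned$cost = resOwned$Total.Cost / (resOwned$CSI.Rating * 1000)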

Question 5: What is the average incentive given to customers installing residential systems?

Since I have already created this subset, all I need to do is plot the results.

  theme_bw(base_size = 16)+
  xlab("Incentive Amount ( $ )")

[Figure: histogram of incentive amounts for residential systems]

It looks like the majority of residential installs received about a $1,000 incentive for their system, but there is a long tail to this distribution of installs which received much larger incentives. It's surprising that there are many systems with incentives in the $100,000 range; it's hard to believe that these could be residential systems. Maybe I'll take a closer look at this later. Perhaps it will be more interesting to look at the incentives as a function of the total cost of the system. This can be seen in the plot below.

  theme_bw(base_size = 16)+
  xlab("Total Cost ( $ )")+
  ylab("Incentive Amount ( $ )")+
  scale_color_continuous_tableau(name="Nameplate Rating ( kW )",limits=c(0,25))+

[Figure: incentive amount vs. total system cost, colored by nameplate rating]

##grab all the systems where cost = incentive

There is a very large spread of incentives for a given system cost, but the general relationship is as you would expect: as the system cost increases, so does the incentive amount. According to the California Solar Initiative, incentive amounts are determined by system size, location, angle and tilt. There are several different incentive programs in this data, which could also help explain the large variation. One thing that surprised me when I created this figure was that for many systems the incentive amount was identical to the cost; in other words, they were totally free! I took the subset of the data where the cost is the same as the incentive, and almost all of those systems were part of a program called “Single-family Affordable Solar Homes,” or SASH, a program for low-income single-family households which gives larger incentives than the other programs. The mean of the incentive amount is $5,194 and the median incentive is $2,433.
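A sketch of how that free-system subset might be pulled out, assuming hypothetical column names Incentive.Amount and Program (the real column names in the working data set may differ):

## systems where the incentive covered the entire cost (column names assumed)
freeSystems = subset(resOwned, Total.Cost == Incentive.Amount)
table(freeSystems$Program) ## which incentive program these systems belong to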

Wrapping up

Well, I have answered the questions that I was initially curious about and found a few surprises along the way. In my next post, I will merge this data set with some other publicly available data sets on population and income by county as well as annual solar insolation and fit a regression model to try to understand which factors affect the number of residential solar installations in California. I would love to hear any questions/comments anyone has about the data set and/or approach I took to analyzing the data.


Data Mining the California Solar Statistics with R: Part I



Today I’m taking a look at the data set available from California Solar Statistics, https://www.californiasolarstatistics.ca.gov/. This data set lists all the applications for state incentives for both residential and commercial systems; it contains information about the PV (photovoltaic) system size, location, cost, incentive amount, panel and inverter manufacturer and a lot more. If you want more details, check out their web page. I’m interested in taking a look at this data set and seeing what we can learn about residential solar installations in CA.

Getting Started – Cleaning the data

The working data sets have already been screened for input errors, but the data still isn’t in the format I am looking for: I would like to see the total installed kW of solar by county by year. Additionally, the field which contains the install county has inconsistent nomenclature. For example, some values for county will say “Los Angeles County” while others will just say “Los Angeles”; this is a problem because it will create more counties than exist when I try to group by county. Also, the data currently contains canceled applications, and I’m only interested in taking a look at installed systems. In this first block of code I’m going to get the data into a more usable format for my purposes.

##First I'll load the packages I plan on using (all of them are used later in this post)
require(plyr)         ## ddply for aggregating the data
require(ggplot2)      ## plotting and the CA county map
require(RColorBrewer) ## color palettes for the maps
require(lubridate)    ## year() and month() for extracting date parts
require(grid)         ## viewport() for panel plots
##load data, this may take a min
solarData=read.csv(file = "WorkingDataSet_4-15-2015.csv")
## We're interested in the residential data
solarData=subset(solarData,Host.Customer.Sector == "Residential")
## some labels are redundant, remove "County" to avoid this using the gsub command
solarData$Host.Customer.Physical.Address.County=gsub( " County","",solarData$Host.Customer.Physical.Address.County)
##There are two instances where no county was listed, I am removing them here
solarData=solarData[solarData$Host.Customer.Physical.Address.County != "",]
#remove cancelled applications, we're interested in installed systems
solarData=solarData[solarData$First.Cancelled.Date == "",]
#only keep installed applications; if people have filed to receive their incentives, I am counting it as installed
solarData=solarData[solarData$First.Incentive.Claim.Request.Review.Date != "",]
#extract install year and month variables
##using a package called lubridate with the functions year and month to extract year and month as variables from the data

Next I want to group the data by county and year. The plyr package is great for aggregating data like this; it can be done with just a few lines of code. For more about the plyr package see http://plyr.had.co.nz/.

#Get data by county
countyData = ddply(solarData, .(tolower(Host.Customer.Physical.Address.County),year ),summarize,
      Systems = length(na.omit(year)),
      Total.Size = sum(na.omit(CSI.Rating)))  
##rename column name to "County" from "Host.Customer.Physical.Address.County" this will be important when I want to merge that data set with another one
colnames(countyData)[1] ="County"

Now that I’ve got the data set in the format I’m looking for, I’ll use the ggplot2 package to load a county map of CA. Later we’ll be able to plot the solar install data we have per county on this map. I create a variable called “sort” so that after I merge this data set with the solar data I can put it back in the original order of my CA data frame; the merge function doesn’t keep the order of either of your data frames. This is important because I won’t be able to plot the data by county correctly if it isn’t in the same order that it came from the ggplot package in.

Loading the California Map from ggplot2

CA = map_data("county", "california") ## load the CA county polygons from ggplot2 (a sketch of the elided load)
##create a sort variable so that after we merge the data sets, we can put everything back in the proper order to plot
for (i in 1:nrow(CA)){
  CA$sort[i] = i
}
colnames(CA)[6] ="County"

I also need to subset the data so that each year is its own data set if I want to compare statewide solar install plots by year. To do this I’m merging the county labels from my CA data frame with the county labels from my countyData data frame.

countyData2007=subset(countyData,year == 2007)
CA2007 = merge(CA,countyData2007,all.x=TRUE,sort=FALSE, by="County")

countyData2008=subset(countyData,year == 2008)
CA2008 = merge(CA,countyData2008,all.x=TRUE,sort=FALSE, by="County")

countyData2009=subset(countyData,year == 2009)
CA2009 = merge(CA,countyData2009,all.x=TRUE,sort=FALSE, by="County")

countyData2010=subset(countyData,year == 2010)
CA2010 = merge(CA,countyData2010,all.x=TRUE,sort=FALSE, by="County")

countyData2011=subset(countyData,year == 2011)
CA2011 = merge(CA,countyData2011,all.x=TRUE,sort=FALSE, by="County")

countyData2012=subset(countyData,year == 2012)
CA2012 = merge(CA,countyData2012,all.x=TRUE,sort=FALSE, by="County")

countyData2013=subset(countyData,year == 2013)
CA2013 = merge(CA,countyData2013,all.x=TRUE,sort=FALSE, by="County")

countyData2014=subset(countyData,year == 2014)
CA2014 = merge(CA,countyData2014,all.x=TRUE,sort=FALSE, by="County")

##Data set for the installs statewide
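The aggregation behind that statewide data set isn't shown; a sketch of how it might be built, assuming the same ddply pattern used above and the CAtotal column referenced by the pCA plot below:

## total residential kW installed statewide per year
CA_comb_data = ddply(solarData, .(year), summarize, CAtotal = sum(na.omit(CSI.Rating)))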

Plotting the data with ggplot2 and grid

Next I want to visualize the data. For this I’ll use two great packages: ggplot2, along with the viewport function of the grid package, which makes it easy to make very flexible panel plots.

## create countywide plots for 2007, 2010 and 2013
p2007 = ggplot(CA2007 ,aes(x = long, y = lat,group=group,fill=Total.Size)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Installed \nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+
  ggtitle("2007 by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))

p2010 = ggplot(CA2010 ,aes(x = long, y = lat,group=group,fill=Total.Size)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Installed \nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+
  ggtitle("2010 by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))

p2013 = ggplot(CA2013 ,aes(x = long, y = lat,group=group,fill=Total.Size)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Installed \nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+
  ggtitle("2013 by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))

## create statewide plot
pCA = ggplot(CA_comb_data,aes(x=year,y=CAtotal/1000))+
  geom_line() + geom_point() + ## assumed geoms; the original plotting lines are missing
  theme_bw(base_size = 16)+
  ylab("CA annually installed\n residential solar (MW)")

Below is the code required to generate the panel plot.

##create a new page for plot
##tell viewport that you want 2 rows x 3 cols
pushViewport(viewport(layout = grid.layout(2, 3)))
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
##specify where each plot should be located
print(p2007, vp = vplayout(1, 1))
print(p2010, vp = vplayout(1, 2))
print(p2013, vp = vplayout(1, 3))
##plots can take up more than one spot in your layout
print(pCA, vp=vplayout(2,1:3))


Looking at the plots by county, it’s clear that the applications are more concentrated in southern CA. In a later post we’ll try to figure out how much of this is due to the high population in southern CA and how much is due to the increased solar insolation. One thing that looks really strange/surprising to me is the drop in applications in 2014; during this period the cost of solar was dropping and the economy was improving. I reached out to the California Energy Commission to ask them about the drop in applications for incentives, and they explained that while the CSI (California Solar Initiative) program was intended to last until 2016, the total budget for the program was starting to dry up in 2014 due to the larger-than-anticipated growth in solar. So the drop in applications isn’t related to a decline in PV, just a lack of budget. It would be interesting to see how much the incentives running out changed the number of PV system installs, but I haven’t been able to find a source for that data.

In my next post we’ll take a closer look at this data set and see what else we can learn.
