Project Part I
01/27/2023
This dataset contains weather data and solar irradiance data along with power generation for a
solar photovoltaic array. I have set aside a randomly selected test dataset consisting of 20% of the data,
and a training dataset consisting of the remaining 80% of the data. My main concern with the data is
that the source provides little to no context on where the data points come from. It is
unclear whether they were all sampled at the same location or at different locations, or what the time
interval is between samples (e.g., whether they were taken at the same time each day or at different
times). All of the data provided seems realistic, but the lack of context is a reason for concern. Perhaps
through analysis we will be able to glean insight into some of these questions, such as whether the data
is from one location or multiple locations.
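As a minimal sketch of the 80/20 random split described above, the following uses only the Python standard library. The function name, the fixed seed, and the stand-in rows are assumptions made for illustration; the original analysis may have used a different tool or procedure.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly partition rows into (train, test) lists.

    The seed is fixed only so the illustration is reproducible.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    # Preserve the original row order within each partition.
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Stand-in for the dataset rows (hypothetical, not the real data).
rows = list(range(100))
train, test = train_test_split(rows)
```

Fixing the seed keeps the test set identical across runs, which matters if the test data is to stay untouched until the final evaluation.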
With the increasing penetration of renewable energy on the electric grid, there is a growing
need for predicting the power generation of these sources, which are unpredictable and subject to the
weather conditions. This dataset will hopefully be used to glean insight into how different weather and
solar irradiance conditions impact solar power generation. This information can then be used to predict
solar power generation in the future using weather forecasts, which will be critical to proper power
planning and electric grid operation. One potential issue with using a model derived from a dataset for
prediction is that there are scientific equations relating to solar irradiance that can be used to calculate
power directly, something I’ve worked on at the Pacific Northwest National Laboratory (PNNL).
While it’s important to explore both methods of predicting solar power generation, my worry is that
the solar power generation equations will prove more accurate than predictive models based on
historical data.
One covariate that I wish had been collected in this dataset is the location of each measurement,
whether it be a categorical variable such as the city or state, or a continuous variable like the
coordinates. It’s important to know if these data points come from the same location or not. The time of
day and date of measurement would also be nice to have, as different times of day and times of year
correspond to different levels of solar irradiance. While the effects of these other variables are likely
(hopefully) captured in the covariates that are present, I would have collected those fields as well to
provide a more informative dataset.
The main response variable of interest is the power generation, ‘generated_power_kw’, which is
a continuous variable. Being able to predict the power generation given the other covariates is
important in grid planning, as mentioned above, which makes it the logical choice to model as a
continuous response variable. There isn’t an obvious choice for a binary response variable in this
dataset, but I am going to use ‘precipitation’ as the binary response variable, with a value of 0
corresponding to 0.0 precipitation, and a value of 1 corresponding to > 0.0 precipitation from the
‘total_precipitation_sfc’ column. The covariates, mostly containing solar irradiance or weather data,
should be able to accurately predict whether or not it is raining.
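The binary encoding described above can be sketched as follows. The function name and the sample values are hypothetical; only the rule itself (0 for exactly 0.0 precipitation, 1 for anything greater) comes from the text.

```python
def precipitation_indicator(total_precipitation_sfc):
    """Return 1 if any precipitation was recorded, otherwise 0."""
    return 1 if total_precipitation_sfc > 0.0 else 0


# Made-up readings standing in for the 'total_precipitation_sfc' column.
sample = [0.0, 0.3, 0.0, 1.2]
labels = [precipitation_indicator(value) for value in sample]
```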
Based on my prior experience in the field, there are a few variables present that I wouldn’t expect
to impact solar power generation. Namely, pressure and wind speed theoretically should not impact
solar power generation, as the solar power generation equations rely solely on solar irradiance and
temperature data. I’m not going to exclude these variables, however, as it’s possible that they are
correlated with power generation (although this would likely be due to interactions with other covariates).
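To illustrate why only irradiance and temperature enter such a calculation, here is one common simplified form: rated power scaled by irradiance relative to standard test conditions, with a linear temperature correction. This is a sketch under stated assumptions and not necessarily the equations referred to above. The function name, the rated power, and the temperature coefficient of -0.4% per degree Celsius are hypothetical, and the formula uses cell temperature, which may differ from the air temperature recorded in the dataset.

```python
def pv_power_kw(irradiance_w_m2, cell_temp_c, rated_kw,
                temp_coeff_per_c=-0.004):
    """Estimate PV output with a simplified linear-temperature model.

    Assumed reference conditions: 1000 W/m^2 irradiance, 25 C cell
    temperature. The temperature coefficient is an illustrative value.
    """
    reference_irradiance = 1000.0  # W/m^2
    reference_temp = 25.0          # degrees C
    power = (rated_kw
             * (irradiance_w_m2 / reference_irradiance)
             * (1.0 + temp_coeff_per_c * (cell_temp_c - reference_temp)))
    # Output cannot be negative.
    return max(power, 0.0)


# A hypothetical 5 kW array at reference conditions and on a hotter day.
at_reference = pv_power_kw(1000.0, 25.0, rated_kw=5.0)
hot_day = pv_power_kw(1000.0, 45.0, rated_kw=5.0)
```

Pressure and wind speed do not appear anywhere in this form, which is consistent with the expectation that any correlation they show with power would arise indirectly.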
Scott Underwood
01/27/2023
Plotting the covariates against the binary response variable ‘precipitation’ in Figure 2 yields less
obvious results, but a few correlations can still be seen. Because the precipitation variable is binary,
jitter was added to show the density of the points. In particular, it appears that
‘mean_sea_level_pressure_MSL’ and ‘shortwave_radiation_backwards_sfc’ are negatively correlated
with ‘precipitation’ (higher values correspond to a value of zero for precipitation) and
‘relative_humidity_2_m_above_gnd’ appears to be positively correlated. These make sense, as
precipitation tends to come with higher humidity levels and lower solar intensity and pressure. I’d also
expect to see a correlation between some of the cloud cover variables and precipitation, but they don’t
show an obvious pattern in the plots displayed.
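One way to put a number on the direction of these relationships is the Pearson correlation between a covariate and the 0/1 precipitation indicator (equivalent to the point-biserial correlation). The sketch below uses made-up humidity values purely for illustration; they are not taken from the dataset, and the function name is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: higher humidity paired with precipitation = 1.
humidity = [35.0, 40.0, 55.0, 80.0, 90.0, 95.0]
precip = [0, 0, 0, 1, 1, 1]
r = pearson_r(humidity, precip)  # positive sign means positive correlation
```

A positive sign corresponds to the humidity pattern described above, while a negative sign would correspond to the pressure and shortwave radiation patterns.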
One concern with the data after looking at some of the plots is the abnormal patterns in some
of the covariate values. For example, azimuth has a couple of gaps between 150 and 200 where there are
no data points, which seems unlikely given the density of the rest of the range. Additionally, the low and
medium cloud cover variables show a strong vertical line around 10%. These abnormal behaviors may
indicate some sort of bias in the data collection and will be something to look for as we investigate the
dataset further.
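A simple way to check for gaps like the ones seen in azimuth is to bin the values and list the bins that contain no points. The sketch below assumes a fixed bin width and uses made-up azimuth values; the function name, bin width, and range are hypothetical choices for illustration.

```python
def empty_bins(values, low, high, bin_width):
    """Return (start, end) for each fixed-width bin in [low, high) with no values."""
    n_bins = int((high - low) / bin_width)
    counts = [0] * n_bins
    for value in values:
        if low <= value < high:
            index = min(int((value - low) // bin_width), n_bins - 1)
            counts[index] += 1
    return [(low + i * bin_width, low + (i + 1) * bin_width)
            for i, count in enumerate(counts) if count == 0]


# Made-up azimuth values with nothing between 160 and 180.
azimuth = [105, 115, 125, 135, 145, 155, 185, 195, 205, 215]
gaps = empty_bins(azimuth, low=100, high=220, bin_width=10)
```

Running the same check at a few bin widths helps separate real gaps from artifacts of the binning.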
Looking at the mean value of all the covariates for precipitation levels of 0 and 1 reveals trends
that weren’t visible in the plots in Figure 2. Namely, total cloud cover is more than twice as high for
precipitation level 1 as for precipitation level 0 (79 vs. 30). There are other slight differences between
the mean values of covariates between the two precipitation levels, but none that jump out that
weren’t mentioned in the analysis of the plots. Looking at the variance within each level, shortwave
radiation has roughly 2.5 times as much variance at precipitation level 0 as at precipitation level 1
(7,881 vs. 3,194). Additionally, generated power has about twice as much variance at
precipitation level 0 as at precipitation level 1 (89,251 vs. 43,183).
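The per-level means and variances discussed above can be computed with a small grouping helper. This is a sketch using the standard library and made-up cloud cover values, not the dataset itself; the function name is an assumption, and it uses the sample variance, which may or may not match the convention behind the figures quoted above.

```python
from collections import defaultdict
from statistics import mean, variance


def summarize_by_level(values, levels):
    """Group values by level and return the mean and sample variance of each group."""
    groups = defaultdict(list)
    for value, level in zip(values, levels):
        groups[level].append(value)
    return {
        level: {"mean": mean(group), "variance": variance(group)}
        for level, group in groups.items()
    }


# Made-up cloud cover percentages paired with precipitation levels.
cloud_cover = [10.0, 30.0, 50.0, 70.0, 80.0, 90.0]
precip = [0, 0, 0, 1, 1, 1]
summary = summarize_by_level(cloud_cover, precip)
```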
While there is concern about the lack of knowledge of how the data was collected and
where/when it is from, the dataset appears informative overall. There also appear to
be relevant correlations, minimal null or NA values, and realistic-looking data. In further parts of the
project, we will look at uncovering more correlations and interactions between covariates and response
variables and use this information to inform predictions about solar power generation.