Paper Review 2

By: Chengyi (Jeff) Chen__

Article: A Large-Scale Study on Predicting and Contextualizing Building Energy Usage


What is the main problem/task addressed by the paper?

To use recent parametric and non-parametric Machine Learning models to provide normative feedback about individual household energy usage relative to other similar households in Cambridge, MA through a graphical end user-interface called EnergyView which the authors had developed.


What was done before, and how does this paper improve on it?

… our work builds most directly upon the studies, mentioned above, that highlight the importance of normative feedback for improving energy efficiency …

  • The paper builds on the work of others within the Energy community and seeks to provide a basis of comparison of one’s individual household consumption of energy use amongst others in their area.

… several companies that work in this area, but as they do not share the details of their models or data, it is difficult to know what methods they employ …

  • There is a lack of information regarding what ML / statistical models have been successful in this application even though there have been many for-profit / private companies that are working in this area, so the goal of the research paper was to explore algorithms, mainly Linear Regression and Gaussian Process Regression that can help predict individual households’ current energy consumption given features about where they are living at. The authors of the paper are trying to also make the EnergyView GUI available publicly to allow volunteers to input their own data which can improve the model’s performance, while the current system should only be made available to authorized parties because of the privacy issues associated with displaying the predicted energy consumption rates for everyone.

… Several academic studies exist that examine individual home residential energy usage at a high resolution, (e.g., (Berges et al. 2009)), but this work is roughly orthogonal to this paper, as we here consider much lower resolution data, but for many more houses. Likewise, there exist several studies on building energy consumption at a highly aggregate level (Berry 2005), but again these differ from this work as we consider a data set obtained directly from a large number of individual buildings.

  • I am honestly unsure about the difference between High resolution and Low resolution energy usage data, but the paper does highlight the uniqueness and importance of using a dataset that has specific energy usage features for each building in that area rather than general / average energy usage features for buildings in that area.


What is the one cool technique/idea/finding that was learned from this paper?

The primary techniques used in the paper deal with Regression, in the parametric form of Linear Regression and non-parametric form of Gaussian Process Regression, under different likelihood distributions like Normal, Laplace, T - distribution likelihoods.

An interesting observation found during the preliminary analysis of the dataset was that plotting the number of houses against their total energy consumption on a logarithmic scale on the x-axis displayed a Gaussian-like distribution which suggested a log-normal distribution at first. However, under the different likelihoods used later on in the analysis and modelling, the log Student-t distribution seemed to model the distribution better (P.s. Honestly really unsure about what this really means).

A bulk of the research paper was focused on explaining the rationale behind their feature selection process. This ranged from first using “only the first 9 features decrease the RMSE as measured via cross validation to a statistically significant degree”, to Lasso / L1 norm regularization technique for forward feature selection. There was an additional artificial feature that was added, the Laplacian Eigenvectors that were derived after using the k-Nearest neighbours algorithm to plot the 2D building locations (? They didn’t explain in detail how it was done, so I’m a tad confused about which features they used and such…), but unfortunately, did not improve the model’s performance when added.

Greedy forward selection also removes correlated features because adding a mnew feature that is correlated will not decrease the RMSE that the greedy forward selection minimizes on.


What part of the paper was difficult to understand?

A few key terms mostly specific to the math / statistics behind the models used were hard to digest:

  1. Student-t distributed error term

  2. Laplace distributed error term

  3. Non-Gaussian likelihoods

  4. Maximum marginal likelihood

  5. EM Algorithm

Areas of confusion: I am pretty confused as to how the different distributions - Student-t, Gaussian, Laplace - came into play when training and executing the regression models.


Brainstorming ideas: What generalization or extension of the paper could be done? Or what are other applications that can benefit from the described approach? What more could have been done in this application space that could leverage AI? Does this give you an idea about a possible project a student could do?

As seen from the k-means algorithm used in the previous review paper, we could first use k-means on this dataset to find clusters that have similar energy consumption patterns and then run the regression models explained in this research study on those specific clusters, giving us k different regression models that are specifically tuned to each of those clusters. From here, we could analyze and derive patterns as to which clusters are the most inefficient at energy usage and possibly find out the root cause as to why these buildings are so wasteful in energy consumption - the demographic of people living in those areas or maybe the actual buildings are outdated and require higher priority in servicing. This could also serve as a highly valuable piece of data for for-profit companies that are developing solutions to reduce energy consumption so as to reduce electricity payment of these households.

Furthermore, in order to supplement this research, if a time series dataset - like how the total energy consumption in individual households evolved over time - was available, we could use other ML / Deep learning (Recurrent Neural Network) / Statistical model like Seasonal Autoregressive Integrated Moving Average model to predict the energy consumption in the next time period and provide more accurate electricity bill forecasts for individual households and maybe encourage the decrease in electricity usage.