View on GitHub

power-outage-predictor

Power Outage Predictor

Introduction

Power outages continue to be some of the most disruptive infrastructure failures in the United States, given our reliance on electricity for basic services, work, and other essential components of our daily lives. They affect millions of Americans yearly. The Outage dataset, with 1,534 rows and 56 columns, highlights details and metadata about outages in the Continental US from 2000 to 2016. Some of these features include outage causes, number of customers affected, economic and geographic indictors, grid and utility data, and more.

The question we aim to answer is: Can we predict the duration of a power outage using a combination of climate, economic, demographic, and grid-related features?

This is a regression problem, with our target variable being the column ‘OUTAGE.DURATION’. The relevant features we wanted to include are:

With this question, we aim to discover which features are most predictive of how long outages last — and identify which features tend to cause longer interruptions.

Data Cleaning and Exploratory Data Analysis

Cleaning Steps

After loading the excel file to our notebook, we formatted appropriately, setting the header file to the right row, and dropping the placeholder column in the dataset. We also set the index to be a column that the dataset already had as an index, called OBS. We also kept only 13 columns that we were interested in exploring for our regression task, including economic data, urbanization metrics, and climate/geographical data. Furthermore, to reduce skew and modeling noise, we decided to eliminate outliers, or outages that lasted longer than three days (4320 minutes). This left us with 1230 rows, compared to our original 1534. Lastly, all the columns had dtype=’OBJECT’, so we converted the columns with numeric values to have numeric types.

Below is the head of our cleaned dataset:

Univariate Analysis

The histogram above shows the distribution OUTAGE.DURATION under 3 days in the US. Even after removing all the entries with an outage duration longer than 3 days, the data is still heavily right-skewed, with nearly 40% of our data having an outage duration of under 200 minutes.

Bivariate Analysis

This scatter plot shows the relationship Utility Sector contribution to State GDP and Outage Duration. We theorized that if the contribution to the utility sector is a high percentage, then outage duration would be lower due to the larger investment in infrastructure. However, from the data, we saw that there wasn’t a strong linear relationship between the two fields, and that it would not be a good predictor for outage duration alone.

Grouped Table

CAUSE.CATEGORY Avg Duration (min) Median Duration (min) Event Count Avg Customers Affected Median Customers Affected
fuel supply emergency 2283.5 2283.5 2 0.5 0.5
severe weather 1700.1 1500 487 153252 100000
public appeal 920.3 394 15 9426.3 7935
system operability disruption 543.8 191 81 210562 69000
equipment failure 389.7 184 26 109223 49250
intentional attack 308.2 44.5 190 1875.3 0
islanding 237.1 115.5 34 6169.1 2342.5

This grouped table shows the aggregated outage and customers affected statistics by the CAUSE.CATEGORY column, which has types like severe weather, equipment failure, intentional attack, and others listed in the table above. Severe weather accounts for most of the events and a high mean outage duration. Fuel Supply Emergency accounts for a the longest average duration, but only had 2 events, suggesting that those events are not frequent but extreme. Intentional attacks had the second highest frequency of occurrences, but a low median and mean duration, potentially indicating that these types of outages could be contained and fixed faster.

Imputation

For our task, we had to impute 12 values for the TOTAL.PRICE column, and 395 values for the CUSTOMERS.AFFECTED column.

For customers affected, we saw that the distribution was heavily right-skewed, and knew that the median would be more robust against outliers, better reflecting the typical outage event. The figure below shows the distribution of the column before and after the imputation, with the taller spike caused by the 395 values imputed with the median.

For total price, we knew that the field was the average monthly electricity price in the corresponding U.S. state of the event. Since there were also only 12 missing values, we decided to impute these by going with the median of the other TOTAL.PRICE values of entries in the same state to preserve regional pricing differences. The figure below shows the difference in the distribution before and after the imputation, with there being little to no difference between the two.

Framing a Prediction Problem

Our goal is to predict the duration of a power outage in minutes, using only information known at the start of the outage This is a regression problem, as the response variable OUTAGE.DURATION is numerical and continuous.

We chose outage duration as the prediction target because it is a critical piece of information for both utility companies and the public. Estimating how long an outage might last helps with operational decisions, communication with customers, and improving future response strategies.

Included Features:

All features used in the model are available at the time of prediction:

Evaluation Metrics:

We use two metrics to evaluate performance:

The features that we chose to include are known only at the start of the outage, which is crucial for real-world prediction.

Baseline Model

For our baseline, we trained a Linear Regression model to predict outage duration. Our dataset contained both quantitative and categorical features, so we used a full sklearn Pipeline to preprocess and train the model in one step.

Features Used

We selected 11 total features for our baseline model:

These categorical features were encoded using OneHotEncoding with drop='first' to avoid multicollinearity.

Model Evaluation

We trained our model using an 80/20 train-test split. The baseline results were:

Although the baseline model’s absolute error is high with an average nearly 12 hours, this is expected due to the high variance and skew in outage durations, given our histogram from earlier. Some outages last only minutes, while others extend over multiple days.

Final Model

For our final model, we aimed to improve on our baseline linear regression by introducing new engineered features, applying more appropriate scaling methods to certain features, and experimenting with more flexible models that could capture non-linear relationships in the data.

Engineered Features and Transformations

StandardScaler

We used a StandardScaler on

This column is already on a percentage scale, but it has low variance with a lot of the utility sector’s contribution to the GDP being between 1% and 3%. Standardization ensures that it is on the same scale as other inputs, particularly helpful for regularized models like Lasso.

QuantileTransformer

We applied a QuantileTransformer with a normal output distributionto:

We chose these variables as they had highly skewed distributions. The quantile transformer helps smooth extreme values and reshapes the data into a standard normal distribution, which makes learning more stable.

Modeling approach

We trained and evaluated two advanced models: Lasso Regression and Random Forest Regression.

Lasso Regression

We used GridSearchCV to find the best alpha for lasso regression.

Top Weighted Features for Lasso

Feature Coefficient
U.S._STATE_Michigan 1107.60
CAUSE.CATEGORY_severe weather 1074.04
CAUSE.CATEGORY_fuel supply emergency 814.37
U.S._STATE_Minnesota 725.66
U.S._STATE_Pennsylvania 650.61

Lasso helped us identify the most important predictors by assigning zero weight to less relevant features, however it struggled with non-linear relationships with features, and we felt that was more important for prediction as multiple factors combined for outage duration, rather than one feature having a strict importance over the other.

Random Forest Regressor

GridSearchCV was used to tune max_depth, min_samples_split, and n_estimators.

5 Most Important Features for Random Forest:

Feature Importance
CAUSE.CATEGORY_severe weather 0.301
CUSTOMERS.AFFECTED_MED_IMPUTED 0.154
TOTAL.PRICE_IMPUTED 0.099
UTIL.CONTRI 0.096
PC.REALGSP.STATE 0.068

Comparison with Baseline Model

Model MAE (mins) MSE (sq. mins)
Baseline (Linear) 720.22 985,250.22
Lasso Regression 708.42 973,790.75
Random Forest 652.68 922,291.54

The Random Forest model did better than both Lasso and the Baseline model on MAE and MSE, which we prioritized as evaluation metrics since they one more intuitively measured the average prediction error in minutes and the other was robust to outliers.

The Random Forest model also captures nonlinear relationships and interactions between features, which was not possible with the Baseline linear model or Lasso. With our multitude of inpt features, this was the ideal choice for the final model. This,combined with its strong MAE performance and better interpretability with feature importance scores, made Random Forest the best candidate for our final model.