Power Outage Predictor
Introduction
Power outages continue to be some of the most disruptive infrastructure failures in the United States, given our reliance on electricity for basic services, work, and other essential components of our daily lives. They affect millions of Americans yearly. The Outage dataset, with 1,534 rows and 56 columns, highlights details and metadata about outages in the Continental US from 2000 to 2016. Some of these features include outage causes, number of customers affected, economic and geographic indictors, grid and utility data, and more.
The question we aim to answer is: Can we predict the duration of a power outage using a combination of climate, economic, demographic, and grid-related features?
This is a regression problem, with our target variable being the column ‘OUTAGE.DURATION’. The relevant features we wanted to include are:
CUSTOMERS.AFFECTED
: Number of customers impactedCAUSE.CATEGORY
: High-level cause of the outage (e.g., severe weather, equipment failure)UTIL.CONTRI
: Utility sector’s contribution to the state GDPPOPPCT_URBAN
: Population percentage of the state that is urbanAREAPCT_URBAN
: Area percentage of the state that is UrbanTOTAL.PRICE
: Average electricity price in the stateCLIMATE.CATEGORY
: Weather climate classification (e.g. hot, cold)U.S._STATE
: State where the outage occurred
With this question, we aim to discover which features are most predictive of how long outages last — and identify which features tend to cause longer interruptions.
Data Cleaning and Exploratory Data Analysis
Cleaning Steps
After loading the excel file to our notebook, we formatted appropriately, setting the header file to the right row, and dropping the placeholder column in the dataset. We also set the index to be a column that the dataset already had as an index, called OBS. We also kept only 13 columns that we were interested in exploring for our regression task, including economic data, urbanization metrics, and climate/geographical data. Furthermore, to reduce skew and modeling noise, we decided to eliminate outliers, or outages that lasted longer than three days (4320 minutes). This left us with 1230 rows, compared to our original 1534. Lastly, all the columns had dtype=’OBJECT’, so we converted the columns with numeric values to have numeric types.
Below is the head of our cleaned dataset:
Univariate Analysis
The histogram above shows the distribution OUTAGE.DURATION
under 3 days in the US. Even after removing all the entries with an outage duration longer than 3 days, the data is still heavily right-skewed, with nearly 40% of our data having an outage duration of under 200 minutes.
Bivariate Analysis
This scatter plot shows the relationship Utility Sector contribution to State GDP and Outage Duration. We theorized that if the contribution to the utility sector is a high percentage, then outage duration would be lower due to the larger investment in infrastructure. However, from the data, we saw that there wasn’t a strong linear relationship between the two fields, and that it would not be a good predictor for outage duration alone.
Grouped Table
CAUSE.CATEGORY | Avg Duration (min) | Median Duration (min) | Event Count | Avg Customers Affected | Median Customers Affected |
---|---|---|---|---|---|
fuel supply emergency | 2283.5 | 2283.5 | 2 | 0.5 | 0.5 |
severe weather | 1700.1 | 1500 | 487 | 153252 | 100000 |
public appeal | 920.3 | 394 | 15 | 9426.3 | 7935 |
system operability disruption | 543.8 | 191 | 81 | 210562 | 69000 |
equipment failure | 389.7 | 184 | 26 | 109223 | 49250 |
intentional attack | 308.2 | 44.5 | 190 | 1875.3 | 0 |
islanding | 237.1 | 115.5 | 34 | 6169.1 | 2342.5 |
This grouped table shows the aggregated outage and customers affected statistics by the CAUSE.CATEGORY
column, which has types like severe weather, equipment failure, intentional attack, and others listed in the table above. Severe weather accounts for most of the events and a high mean outage duration. Fuel Supply Emergency accounts for a the longest average duration, but only had 2 events, suggesting that those events are not frequent but extreme. Intentional attacks had the second highest frequency of occurrences, but a low median and mean duration, potentially indicating that these types of outages could be contained and fixed faster.
Imputation
For our task, we had to impute 12 values for the TOTAL.PRICE
column, and 395 values for the CUSTOMERS.AFFECTED
column.
For customers affected, we saw that the distribution was heavily right-skewed, and knew that the median would be more robust against outliers, better reflecting the typical outage event. The figure below shows the distribution of the column before and after the imputation, with the taller spike caused by the 395 values imputed with the median.
For total price, we knew that the field was the average monthly electricity price in the corresponding U.S. state of the event. Since there were also only 12 missing values, we decided to impute these by going with the median of the other TOTAL.PRICE values of entries in the same state to preserve regional pricing differences. The figure below shows the difference in the distribution before and after the imputation, with there being little to no difference between the two.
Framing a Prediction Problem
Our goal is to predict the duration of a power outage in minutes, using only information known at the start of the outage This is a regression problem, as the response variable OUTAGE.DURATION
is numerical and continuous.
We chose outage duration as the prediction target because it is a critical piece of information for both utility companies and the public. Estimating how long an outage might last helps with operational decisions, communication with customers, and improving future response strategies.
Included Features:
All features used in the model are available at the time of prediction:
CUSTOMERS.AFFECTED
: Number of customers impactedCAUSE.CATEGORY
: High-level cause of the outage (e.g., severe weather, equipment failure)UTIL.CONTRI
: Utility sector’s contribution to the state GDPPOPPCT_URBAN
: Population percentage of the state that is urbanAREAPCT_URBAN
: Area percentage of the state that is UrbanTOTAL.PRICE
: Average electricity price in the stateCLIMATE.CATEGORY
: Weather climate classification (e.g. hot, cold)U.S._STATE
: State where the outage occurred
Evaluation Metrics:
We use two metrics to evaluate performance:
- Mean Absolute Error/MAE: Reports average absolute prediction error in minutes. It is easy to interpret but less sensitive to outliers. We preferred MAE to MSE as our data has a lot of outliers and there is a skew in the distribution, so it is robust to these outliers.
- Mean Squared Error/MSE: Penalizes large errors heavily and is useful for model comparisons.
The features that we chose to include are known only at the start of the outage, which is crucial for real-world prediction.
Baseline Model
For our baseline, we trained a Linear Regression model to predict outage duration. Our dataset contained both quantitative and categorical features, so we used a full sklearn Pipeline to preprocess and train the model in one step.
Features Used
We selected 11 total features for our baseline model:
- Quantitative Features :
CUSTOMERS.AFFECTED_MED_IMPUTED
UTIL.CONTRI
PC.REALGSP.STATE
PC.REALGSP.USA
TOTAL.PRICE_IMPUTED
POPPCT_URBAN
AREAPCT_URBAN
POPDEN_URBAN
- Categorical features:
CAUSE.CATEGORY
CLIMATE.CATEGORY
U.S._STATE
These categorical features were encoded using OneHotEncoding
with drop='first'
to avoid multicollinearity.
Model Evaluation
We trained our model using an 80/20 train-test split. The baseline results were:
- Mean Absolute Error: 720.22 minutes
- Mean Squared Error: 985,250.22 sq. minutes
Although the baseline model’s absolute error is high with an average nearly 12 hours, this is expected due to the high variance and skew in outage durations, given our histogram from earlier. Some outages last only minutes, while others extend over multiple days.
Final Model
For our final model, we aimed to improve on our baseline linear regression by introducing new engineered features, applying more appropriate scaling methods to certain features, and experimenting with more flexible models that could capture non-linear relationships in the data.
Engineered Features and Transformations
URBAN_DENSITY_RATIO
- Calculated as the percent of population living in urban areas divided by the percent of land that is urban, or the ratio betweenPOPPCT_URBAN
andAREAPCT_URBAN
. This feature captures the population in cities relative to their land use, and could affect infrastructure resilience, and in turn outage duration. We applied tailored preprocessing techniques to specific features:
StandardScaler
We used a StandardScaler
on
UTIL.CONTRI
: Utility sector’s contribution to the state GDP
This column is already on a percentage scale, but it has low variance with a lot of the utility sector’s contribution to the GDP being between 1% and 3%. Standardization ensures that it is on the same scale as other inputs, particularly helpful for regularized models like Lasso.
QuantileTransformer
We applied a QuantileTransformer
with a normal output distributionto:
TOTAL.PRICE_IMPUTED
URBAN_DENSITY_RATIO
We chose these variables as they had highly skewed distributions. The quantile transformer helps smooth extreme values and reshapes the data into a standard normal distribution, which makes learning more stable.
Modeling approach
We trained and evaluated two advanced models: Lasso Regression and Random Forest Regression.
Lasso Regression
We used GridSearchCV
to find the best alpha for lasso regression.
- Hyperparameter tuning:
GridSearchCV
over 10 values of alpha (log-spaced) - Best alpha: 2.15
- Performance:
- MAE: 708.42 mins
- MSE: 973,790.75 sq. mins
Top Weighted Features for Lasso
Feature | Coefficient |
---|---|
U.S._STATE_Michigan | 1107.60 |
CAUSE.CATEGORY_severe weather | 1074.04 |
CAUSE.CATEGORY_fuel supply emergency | 814.37 |
U.S._STATE_Minnesota | 725.66 |
U.S._STATE_Pennsylvania | 650.61 |
Lasso helped us identify the most important predictors by assigning zero weight to less relevant features, however it struggled with non-linear relationships with features, and we felt that was more important for prediction as multiple factors combined for outage duration, rather than one feature having a strict importance over the other.
Random Forest Regressor
GridSearchCV
was used to tune max_depth, min_samples_split, and n_estimators.
- Hyperparameter tuning:
GridSearchCV
withmax_depth
:[10, 20, None]
n_estimators
:[100, 200]
- Best parameters:
max_depth=10
,min_samples_split=2
,n_estimators=200
- Performance:
- MAE: 652.68 mins
- MSE: 922,291.54 sq. mins
5 Most Important Features for Random Forest:
Feature | Importance |
---|---|
CAUSE.CATEGORY_severe weather | 0.301 |
CUSTOMERS.AFFECTED_MED_IMPUTED | 0.154 |
TOTAL.PRICE_IMPUTED | 0.099 |
UTIL.CONTRI | 0.096 |
PC.REALGSP.STATE | 0.068 |
Comparison with Baseline Model
Model | MAE (mins) | MSE (sq. mins) |
---|---|---|
Baseline (Linear) | 720.22 | 985,250.22 |
Lasso Regression | 708.42 | 973,790.75 |
Random Forest | 652.68 | 922,291.54 |
The Random Forest model did better than both Lasso and the Baseline model on MAE and MSE, which we prioritized as evaluation metrics since they one more intuitively measured the average prediction error in minutes and the other was robust to outliers.
The Random Forest model also captures nonlinear relationships and interactions between features, which was not possible with the Baseline linear model or Lasso. With our multitude of inpt features, this was the ideal choice for the final model. This,combined with its strong MAE performance and better interpretability with feature importance scores, made Random Forest the best candidate for our final model.