Predicting Christmas Spending | NJD

Predicting Christmas Spending

May 29, 2024 | Categories: Inference Models
Last Modified: July 23, 2024, 3:52 a.m.

Back in December 2023, I wanted to predict how much Americans would spend on Christmas gifts before the National Retail Federation released its figure. Going in, spending appeared to be on an upward trajectory:


[Figure: Christmas gift spending by year]


As we can see, with year as the predictor, the trend looks nonlinear, even on a log scale, so there are many different approaches we could take. This post documents each approach I took, evaluating how each performed when I wrote it and then how each did now that we know how much Americans actually spent.


Models


With year as a predictor, time-series models seemed like the natural choice. However, the growth over time could instead be modeled as a function of several predictors from which that growth is derived. So, I decided a good place to start was a simple linear regression model alongside its Bayesian counterpart, followed by two time-series models: one that uses only the observed values and another that also takes predictors into account. Each is covered in depth below.


Predictors


For predictors, I chose the following based on some preliminary research (and by that, I mean my basic knowledge of economics paired with some online articles):



Leaving these predictors untransformed gave the best preliminary correlation (this may have been a bad decision on my part, as my search was not exhaustive). This leaves us with the following:


[Figure: the untransformed predictors]


However, in examining the predictors, I noticed some collinearity via a heatmap:


[Figure: correlation heatmap of the predictors]


So, we also examine the use of PCA in our OLS model.
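
As a rough illustration of why PCA helps with collinear predictors, here is a minimal sketch using only NumPy. The data is synthetic (the post's actual predictors are not reproduced here), the variable names are made up for the example, and the 95% variance cutoff is an arbitrary assumption, not necessarily what was used for the model in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the predictors: x2 is nearly a copy of x1,
# so the two are strongly collinear. These are NOT the post's data.
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardize each column so no predictor dominates by scale alone.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                  # principal component scores
explained = s**2 / np.sum(s**2)    # share of variance per component

# Keep enough components to explain 95% of the variance (arbitrary cutoff).
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
X_pca = scores[:, :k]              # decorrelated inputs for the regression

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)
print("corr(x1, x2) before PCA:", round(corr_before[0, 1], 3))
print("max |off-diagonal corr| after PCA:", np.abs(corr_after - np.eye(3)).max())
print("components kept:", k)
```

The component scores are uncorrelated by construction, which is what removes the collinearity; the price is that the regression coefficients no longer map directly onto the original predictors.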


Prediction


OLS


First, we fit a baseline linear regression on the raw data. As a quick reminder, OLS assumes:


$$Y|X \sim N(\beta X, \sigma^2)$$


for some estimated \( \sigma^2 \). The results:


| Model | R-Squared | MSE | Predicted Value ($ billions) | Lower Bound | Upper Bound |
|---|---|---|---|---|---|
| Linear Regression, non-PCA | 0.99 | 370.67 | 890.06 | 838.64 | 941.49 |
| Linear Regression, PCA | 0.96 | 863.65 | 891.76 | 754.64 | 1028.87 |


As expected, the PCA model has a higher MSE and a lower R-squared, but it seems to generalize well, with a slightly higher predicted value that sits closer to the trend. It also produces a wider 95% CI. Both models predict far less than the trend would suggest.
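
For readers who want to see how the quantities in the table above are computed, here is a minimal NumPy sketch of an OLS fit on made-up data. The data, the coefficient values, and the new observation `x_new` are all hypothetical, and the interval uses a normal approximation (1.96) rather than the exact t quantile. Whether the bounds in the table are prediction or confidence intervals is not stated, so treat the interval here as one plausible choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 20 observations, 2 predictors, known coefficients plus noise.
n, p = 20, 2
X = rng.normal(size=(n, p))
beta_true = np.array([5.0, 2.0, -1.0])      # intercept, slope 1, slope 2
A = np.column_stack([np.ones(n), X])        # design matrix with intercept column
y = A @ beta_true + rng.normal(scale=0.5, size=n)

# OLS estimate: the beta that minimizes ||y - A beta||^2.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

resid = y - A @ beta_hat
mse = np.mean(resid**2)
r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

# Unbiased estimate of sigma^2 uses n - (p + 1) degrees of freedom.
sigma2_hat = np.sum(resid**2) / (n - p - 1)

# Point prediction and an approximate 95% prediction interval for a new row.
x_new = np.array([1.0, 0.3, -0.2])          # leading 1 is the intercept term
y_pred = x_new @ beta_hat
se_pred = np.sqrt(sigma2_hat * (1.0 + x_new @ np.linalg.inv(A.T @ A) @ x_new))
lower = y_pred - 1.96 * se_pred             # normal approximation
upper = y_pred + 1.96 * se_pred

print(f"R^2={r2:.3f}  MSE={mse:.3f}  pred={y_pred:.2f}  [{lower:.2f}, {upper:.2f}]")
```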


Bayesian Linear Regression


Bayesian Linear Regression has a similar set-up, with:


$$Y|X, \beta \sim N(\beta X, \sigma^2) $$


But we also place priors on these parameters. Typically, we would use noninformative priors, but with pymc we can make better use of existing distributions. That is, we use priors of:


$$\sigma^2 \sim C$$
$$\beta_0 \sim N(\hat\beta_0, 1)$$
$$\beta_{1:p} \sim N(\hat\beta_{1:p}, I)$$


We center the priors on the OLS estimates because we know that, under noninformative priors, the posterior should be centered near the OLS prediction.
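
To make the role of the prior centers concrete, here is a small sketch that swaps the pymc sampler for the closed-form conjugate posterior. It assumes \( \sigma^2 \) is known and fixed, which is a simplification: the model in this post puts a prior on \( \sigma^2 \) and samples it. The data and the helper `posterior` are hypothetical. Under these assumptions, a Gaussian prior centered on the OLS estimate gives a posterior mean equal to the OLS estimate, while a zero-centered prior shrinks it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data (not the post's): intercept plus 2 predictors.
n = 25
A = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = A @ np.array([3.0, 1.5, -0.5]) + rng.normal(scale=1.0, size=n)

beta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

def posterior(prior_mean, prior_cov, sigma2):
    """Posterior N(mean, cov) of beta for a Gaussian prior and known sigma^2."""
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + A.T @ A / sigma2)
    post_mean = post_cov @ (prior_prec @ prior_mean + A.T @ y / sigma2)
    return post_mean, post_cov

sigma2 = 1.0            # assumed known here; the post puts a prior on it instead
prior_cov = np.eye(3)   # identity prior covariance, as in the priors above

# Prior centered on the OLS estimate: N(beta_hat, I).
mean_ols_prior, _ = posterior(beta_ols, prior_cov, sigma2)

# For contrast, a zero-centered prior pulls the estimate toward 0.
mean_zero_prior, _ = posterior(np.zeros(3), prior_cov, sigma2)

print("OLS estimate:            ", beta_ols.round(3))
print("posterior, OLS-centered: ", mean_ols_prior.round(3))
print("posterior, zero-centered:", mean_zero_prior.round(3))
```

In the sampled model, the prior on \( \sigma^2 \) and Monte Carlo error mean the posterior summary will not match OLS exactly.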


Running this setup on 4 chains gives us convergence on all chains for each distribution:


[Figure: results of the 4-chain run]


So, we get a slightly lower value than OLS (which is usual, since Bayesian estimates tend to be more conservative than their OLS counterparts as a result of the priors).


ARIMA and SARIMAX


Next, we make use of Auto-Regressive Integrated Moving Average models, or ARIMA for short. These models combine 3 separate practices that are usual in time series: autoregression (AR), differencing (the I, for "integrated"), and moving averages (MA).



From playing around with the data and convergence, I found that I got the best results with ARIMA(3, 3, 3): an autoregressive part that looks at the past 3 points, 3 rounds of differencing, and a moving-average part that uses the past 3 iid shocks. This is represented as:


$$\left(1 - \sum_{i=1}^3 \phi_i L^i\right) (1-L)^3 Y_t = \left(1 + \sum_{i=1}^3 \theta_i L^i\right) \epsilon_t$$


Where "L" is what is called the "Lag operator".


For example:


$$L Y_t = Y_{t - 1}$$


$$L^2 Y_t = Y_{t - 2}$$
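
The lag operator, and the differencing term \( (1-L)^3 \) built from it, can be sketched in a few lines of NumPy. The helper `lag` and the toy series are made up for illustration.

```python
import numpy as np

y = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0])  # toy series: t^2

def lag(series, k=1):
    """Apply L^k: entry t of the result is series[t - k] (NaN where undefined)."""
    out = np.full(series.shape, np.nan)
    out[k:] = series[:len(series) - k]
    return out

# (1 - L) Y_t = Y_t - Y_{t-1}: a first difference.
d1 = y - lag(y, 1)

# (1 - L)^3 Y_t = Y_t - 3 Y_{t-1} + 3 Y_{t-2} - Y_{t-3} (binomial expansion).
d3 = y - 3 * lag(y, 1) + 3 * lag(y, 2) - lag(y, 3)

print(lag(y, 1))   # first entry is NaN, then the series shifted by one step
print(d1[1:])      # first differences of t^2: 3, 5, 7, 9, 11, 13
print(d3[3:])      # all zeros: a quadratic vanishes after three differences
```

Differencing three times removes any trend up to a quadratic, which is the job of the "integrated" part of the model.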


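When expanded out, the degree-3 autoregressive polynomial multiplies the degree-3 differencing polynomial, so lags of \( Y \) up to 6 appear, and we get that:


$$Y_t = f(Y_{t-1}, \dots, Y_{t-6}, \epsilon_{t-1}, \epsilon_{t-2}, \epsilon_{t-3};\ \phi_1, \phi_2, \phi_3, \theta_1, \theta_2, \theta_3) + \epsilon_t$$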


The coefficients \( \phi_i \) and \( \theta_i \) are then estimated by maximizing the log-likelihood, and the fitted model gives us our predictions.
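
To see what "expanded out" means in practice, the sketch below multiplies the autoregressive and differencing polynomials and uses the result for a one-step forecast. The coefficient values and the history are invented for illustration (the fitted values are not given in this post), and `one_step_forecast` is a hypothetical helper, not a library function. It also skips the estimation step entirely: in a real fit, the coefficients come from the likelihood maximizer.

```python
import numpy as np

# Illustrative coefficients only; the fitted values are not given in the post.
phi = np.array([0.5, -0.2, 0.1])     # AR coefficients phi_1..phi_3
theta = np.array([0.3, 0.1, -0.1])   # MA coefficients theta_1..theta_3

# Polynomials in L, stored as coefficients of L^0, L^1, L^2, ...
ar_poly = np.concatenate([[1.0], -phi])        # 1 - phi_1 L - phi_2 L^2 - phi_3 L^3
diff_poly = np.array([1.0, -3.0, 3.0, -1.0])   # (1 - L)^3
lhs_poly = np.convolve(ar_poly, diff_poly)     # their product, degree 6

def one_step_forecast(y_hist, eps_hist, lhs=lhs_poly, ma=theta):
    """E[Y_t | past], given past values and past shocks (oldest first).

    Moves every lagged term of the left-hand polynomial to the right-hand
    side and sets the unknown current shock eps_t to its mean, 0.
    """
    y_lags = y_hist[::-1][:len(lhs) - 1]   # Y_{t-1}, Y_{t-2}, ...
    eps_lags = eps_hist[::-1][:len(ma)]    # eps_{t-1}, eps_{t-2}, ...
    return -lhs[1:] @ y_lags + ma @ eps_lags

y_hist = np.array([700.0, 720.0, 745.0, 775.0, 810.0, 850.0])  # made-up values
eps_hist = np.zeros(3)                                         # assume no past shocks
print("coefficients on L^0..L^6:", lhs_poly.round(3))
print("one-step forecast:", round(one_step_forecast(y_hist, eps_hist), 2))
```

The product polynomial has seven coefficients, which is why the forecast needs the six most recent values of the series and not just three.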


This is great, but what if we also want to include some of our linear predictors from before? For this, we have to use a Seasonal Auto-Regressive Integrated Moving Average with Exogenous Regressors (SARIMAX) model. Now, to be completely honest, I only understand this at a basic level, but essentially (assuming no seasonality), it ends up being similar to ARIMA with the addition of a regression on the predictors \( X_t \). So, it will be:


$$Y_t = f(Y_{t-1}, \dots, Y_{t-6}, \epsilon_{t-1}, \epsilon_{t-2}, \epsilon_{t-3};\ \phi_1, \phi_2, \phi_3, \theta_1, \theta_2, \theta_3, \boldsymbol{\beta}, X_t) + \epsilon_t$$


Running both, we get some interesting results:


| Name | Mean ($ billions) | Lower Bound | Upper Bound |
|---|---|---|---|
| ARIMA | 963.957 | 928.775 | 999.138 |
| SARIMAX | 971.21 | 960.691 | 981.728 |


We get much more realistic predictions and, on top of that, even tighter bounds. We can already hypothesize that this will be the best model, but let's see how all of them did.


Results


Spending ended up being $964.4 billion.


Comparing each result, it is clear that ARIMA did the best by far at predicting 2023 spending. Plotting with 95% confidence intervals (or, for the Bayesian Linear Regression, a credible interval):


[Figure: each model's prediction with its 95% interval]


And looking at the residuals:


[Figure: residuals for each model]


ARIMA did the best by far, getting within $0.5 billion of the true value. Does this mean that it is better than SARIMAX? Maybe not: it could be that it just happened to be the better estimate for last year, and that may not hold going forward. That being said, if we want to predict next year's spending, ARIMA is much easier to use, since it needs only the past spending values and no forecasts of the predictors.
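
The comparison can be reproduced directly from the numbers in the tables above. This short script computes each model's error against the reported $964.4 billion and checks whether the model's interval covered it. The Bayesian model is left out because its numbers appear only in a figure.

```python
actual = 964.4  # reported 2023 spending, in billions of dollars

# (point prediction, lower bound, upper bound) copied from the tables above.
predictions = {
    "Linear Regression, non-PCA": (890.06, 838.64, 941.49),
    "Linear Regression, PCA": (891.76, 754.64, 1028.87),
    "ARIMA": (963.957, 928.775, 999.138),
    "SARIMAX": (971.21, 960.691, 981.728),
}

summary = {}
for name, (pred, lower, upper) in predictions.items():
    summary[name] = {
        "error": pred - actual,                # negative means under-prediction
        "covered": lower <= actual <= upper,   # did the interval contain the truth?
    }

best = min(summary, key=lambda name: abs(summary[name]["error"]))

for name, row in summary.items():
    print(f"{name:28s} error={row['error']:+8.3f}  covered={row['covered']}")
print("closest model:", best)
```

Note that SARIMAX's much tighter interval still contains the true value, so its point estimate being further off does not by itself count against it.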


Conclusions and Next Year's Prediction


Running the ARIMA model again, we find that next year could be the first in which American Christmas spending surpasses $1 trillion.