Time series prediction using VAR in R

In this article I describe the main approach to building multivariate time series models. Univariate time series models are popular, but they ignore information carried by other related series.

Roman Kyrychenko https://www.linkedin.com/in/kirichenko17roman/
08-14-2019

ARIMA, exponential smoothing and similar models are popular for time series forecasting. But when you want to improve such a model, it is difficult to add new predictors: that is a simple task for methods like linear regression, but not for classical time series methods. The typical approach decomposes a series into trend, seasonality and noise, which means the forecast is based only on past values of the same feature. In reality, current values depend not only on the variable's own history but also on other variables.

For example, suppose we are interested in bitcoin trends. We can build a simple ARIMA model; for that we need only the bitcoin trend data. But bitcoin searches also depend on other trends such as blockchain, mining and nvidia, which makes intuitive sense. Imagine that one week many people search for a good nvidia GPU for bitcoin mining. The next week the same people (who now have good nvidia GPUs) search for mining, because they don't know how to do it. Then they look for blockchain, because that is where they will mine bitcoins. So googling for nvidia GPUs drives googling for mining, and googling for mining drives googling for blockchain. This is a theoretical model of user behavior in Google search, and it can be implemented with a so-called VAR model. What is that?

VAR is an abbreviation for vector autoregression model. Roughly speaking, a VAR is an extended version of linear regression that contains lagged values of all the variables as predictors, plus terms describing trend and seasonality.

\[ y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + e_t \]

Even the appearance of the formula is very reminiscent of multiple linear regression.
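To make the analogy concrete, here is a minimal sketch (base R only, with simulated data rather than the Google Trends series used below) fitting one equation of a two-variable VAR(1) with plain `lm()` — each variable is regressed on the lagged values of *all* variables:

```r
# Simulate a two-variable VAR(1): x_t and y_t each depend on
# the previous values of both series, plus noise.
set.seed(1)
n <- 200
x <- numeric(n)
y <- numeric(n)
for (t in 2:n) {
  x[t] <- 0.5 * x[t - 1] + 0.2 * y[t - 1] + rnorm(1)
  y[t] <- 0.3 * x[t - 1] + 0.4 * y[t - 1] + rnorm(1)
}

# The equation for x_t is just a multiple linear regression
# on x_{t-1} and y_{t-1} (intercept c plus one row of A_1):
fit_x <- lm(x[-1] ~ x[-n] + y[-n])
coef(fit_x)
```

A full VAR simply stacks one such regression per variable; the `vars` package used later automates this.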

So let’s try to put this model into practice!

Suppose you have a multivariate time series like this:


suppressPackageStartupMessages({
  require(dplyr)
  require(gt)
  require(ggplot2)
  require(hrbrthemes)
})

bitcoin_trend <- purrr::map_dfr(
  c("bitcoin", "mining", "gpu", "money", "nvidia", "musk", 
    "startup", "blockchain", "python", "petroleum", "credit", "iphone"),
  ~ trendyy::get_interest(trendyy::trendy(., 
                                          from = "2015-01-01", 
                                          to = "2019-08-01"))
) %>% 
  select(date, hits, keyword) %>% 
    mutate(date = lubridate::date(date))

head(bitcoin_trend) %>% 
  gt() %>%
  tab_header(
    title = md("**Google trends**")
  ) %>%
  fmt_date(
    columns = vars(date),
    date_style = 8
  ) %>%
  tab_style(
    style = cells_styles(
      bkgd_color = "white", text_align = "center",
      text_size = 3, text_font = "PT Sans"),
    locations = cells_data(
      columns = vars("date", "hits", "keyword")
  )) %>%
  tab_style(
    style = cells_styles(
      bkgd_color = "white", text_weight = "bold", 
      text_size = 3, text_font = "PT Sans"),
    locations = cells_column_labels(
      columns = vars("date", "hits", "keyword")
  )) 
Google trends
date hits keyword
January 4, 2015 3 bitcoin
January 11, 2015 3 bitcoin
January 18, 2015 2 bitcoin
January 25, 2015 2 bitcoin
February 1, 2015 2 bitcoin
February 8, 2015 2 bitcoin

This is data from Google Trends collected with the trendyy package.

Let's convert this data to wide format:


bitcoin_trend %>% tidyr::spread(keyword, hits) %>% head() %>% gt() %>%
  tab_header(
    title = md("**Google trends**")
  ) %>%
  fmt_date(
    columns = vars(date),
    date_style = 8
  ) %>%
  tab_style(
    style = cells_styles(
      bkgd_color = "white", text_align = "center",
      text_size = 3, text_font = "PT Sans"),
    locations = cells_data(
      columns = vars("date", "bitcoin", "mining", "gpu", "money", "nvidia", "musk", 
                     "startup", "blockchain", "python", "petroleum", "credit", "iphone")
  )) %>%
  tab_style(
    style = cells_styles(
      bkgd_color = "white", text_weight = "bold", 
      text_size = 3, text_font = "PT Sans"),
    locations = cells_column_labels(
      columns = vars("date", "bitcoin", "mining", "gpu", "money", "nvidia", "musk", 
                     "startup", "blockchain", "python", "petroleum", "credit", "iphone")
    ))
Google trends
date bitcoin blockchain credit gpu iphone mining money musk nvidia petroleum python startup
January 4, 2015 3 2 87 57 48 35 80 7 76 79 38 80
January 11, 2015 3 3 87 54 45 37 76 7 72 80 41 81
January 18, 2015 2 3 82 57 43 37 78 8 75 82 41 81
January 25, 2015 2 2 86 57 43 36 79 9 75 79 42 81
February 1, 2015 2 2 89 55 43 37 79 6 71 79 45 82
February 8, 2015 2 2 86 53 41 38 76 7 80 78 44 84

So, we have 12 time series collected from Google Trends. Notice the data is weekly, so we set season_duration to 52 (the number of weeks in a year). We also want to predict values one year ahead, so we set the forecast window to 52 as well (the equivalent of one year in our data set):


window <- 52
season_duration <- 52

For fast and convenient processing we need to convert this data from a data frame to a multivariate time series format. We can do it as follows:


time_series <- tsibble::as_tsibble(
  bitcoin_trend, 
  index = date, 
  key = keyword, 
  regular = TRUE
  ) %>% 
    tidyr::spread(keyword, hits)  %>% 
  as.ts(frequency = season_duration)

So, now we can create an initial VAR model with default parameters using all variables. VAR models are provided in R by the vars package. The VAR function has a lot of parameters, but in my experience the defaults give the best result in most cases. The main parameter is the lag order p. I have not yet seen a case where increasing it did not worsen the result, so I keep the default value.
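If you want to check this claim on your own data rather than take it on faith, the vars package provides `VARselect`, which compares lag orders by several information criteria. A sketch using the `Canada` data set that ships with the package (the same call works on the `time_series` object built above):

```r
library(vars)

# Canada: quarterly macro series bundled with the vars package,
# used here only as a self-contained stand-in for time_series.
data(Canada)

# Compare lag orders 1..8 under AIC, HQ, SC and FPE; lower is better.
lag_choice <- VARselect(Canada, lag.max = 8, type = "const")
lag_choice$selection  # recommended p under each criterion
```

When the criteria disagree, the more conservative SC (BIC) choice is usually the safer default for forecasting.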


var_model <- vars::VAR(time_series) 

summary(var_model)$varresult$iphone %>% 
  broom::tidy() %>% 
  mutate_if(is.numeric, round, 3) %>% 
  gt() %>%
  tab_header(
    title = md("**VAR summary (iphone)**")
  ) %>%
  tab_style(
    style = cells_styles(
      bkgd_color = "white", text_align = "center",
      text_size = 10, text_font = "PT Sans"),
    locations = cells_data(
      columns = vars("term", "estimate", "std.error", "statistic", "p.value")
  )) %>%
  tab_style(
    style = cells_styles(
      bkgd_color = "white", text_weight = "bold", 
      text_size = 10, text_font = "PT Sans"),
    locations = cells_column_labels(
      columns = vars("term", "estimate", "std.error", "statistic", "p.value")
  )) %>%
  data_color(
    columns = vars(p.value),
    colors = scales::col_numeric(
      palette = c("#1a9850", "#d73027"),
      domain = c(0, 1)
    )
  )
VAR summary (iphone)
term estimate std.error statistic p.value
bitcoin.l1 -0.047 0.079 -0.590 0.556
blockchain.l1 0.060 0.090 0.671 0.503
credit.l1 -0.010 0.106 -0.097 0.922
gpu.l1 -0.364 0.138 -2.642 0.009
iphone.l1 0.546 0.055 9.939 0.000
mining.l1 0.264 0.139 1.894 0.059
money.l1 -0.158 0.117 -1.347 0.179
musk.l1 0.068 0.052 1.313 0.190
nvidia.l1 0.104 0.056 1.851 0.066
petroleum.l1 -0.298 0.099 -3.004 0.003
python.l1 -0.060 0.051 -1.179 0.239
startup.l1 0.127 0.106 1.203 0.230
const 50.388 14.357 3.510 0.001
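Before reading too much into individual coefficients, it is worth checking the residuals for leftover autocorrelation: if the lag order is too low, the errors will still carry structure. A sketch with the vars package (again using its bundled `Canada` data for self-containment; `var_model` from above can be tested the same way):

```r
library(vars)
data(Canada)

# Fit a VAR(1) with a constant, mirroring the default used above.
model <- VAR(Canada, p = 1, type = "const")

# Portmanteau test, H0: no serial correlation in the residuals.
# A small p-value suggests the chosen lag order is too low.
serial.test(model, lags.pt = 16, type = "PT.asymptotic")
```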

The output resembles that of a linear regression. Let's visualize the prediction:


var_model %>% 
  forecast::forecast(window) %>% 
  forecast::autoplot() + 
  scale_y_continuous(limits = c(0, 100)) +
  theme_ipsum(base_family = "PT Sans") +
  theme(
    panel.grid = element_line(linetype = "dashed")
  ) + facet_wrap(~series, scales = "fixed", ncol = 2)
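Finally, the causal story from the beginning (nvidia searches leading mining searches, mining leading blockchain) can be tested formally with a Granger-causality test on the fitted model. A sketch using the bundled `Canada` data; on the model above you would call `vars::causality(var_model, cause = "nvidia")` with the trend keyword of interest:

```r
library(vars)
data(Canada)
model <- VAR(Canada, p = 1, type = "const")

# H0: the "cause" variable does not Granger-cause the others,
# i.e. its lags add no predictive power beyond their own past.
causality(model, cause = "e")$Granger
```

A small p-value is evidence that the variable's past values help predict the rest of the system, which is exactly the kind of lead-lag relationship the VAR was built to capture.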