In this article I describe a practical approach to building multivariate time series models. Univariate time series models are popular, but they don't take other influencing factors into account.
ARIMA, exponential smoothing and the like are well-known models for time series forecasting. But when you want to improve such a model, it is hard to add new predictors to it. That is a simple task for methods like linear regression, but not for classical time series methods. The typical approach decomposes a series into trend, seasonality and noise - all components of the single variable we are interested in. In other words, we predict a series only from past values of the same feature. But current values depend not only on that variable, but on other variables as well.
For example, say we are interested in the bitcoin trend. We can build a simple ARIMA model; for that we only need the bitcoin trend data. But the bitcoin trend also depends on other trends such as blockchain, mining and nvidia. That makes intuitive sense. Imagine that one week many people search for a good nvidia GPU for bitcoin mining. The next week the same people (who now have good nvidia GPUs) search for mining, because they don't yet know how to do it. Then they look for blockchain, because that is where they will mine their bitcoins. So googling for nvidia GPUs drives googling for mining, and googling for mining drives googling for blockchain. This is a theoretical model of user behaviour in Google, and such a model can be implemented using a so-called VAR model. What is it?
VAR stands for vector autoregression. Roughly speaking, a VAR is an extended version of linear regression that contains lagged values of the variables together with terms describing trend and seasonality.
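For a VAR(p) model the formula looks like this:

$$y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + e_t$$

where $y_t$ is the vector of all series at time $t$, $c$ is a constant, the $A_i$ are coefficient matrices and $e_t$ is an error term.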
Even the appearance of the formula is very reminiscent of multiple linear regression.
So let’s try to put this model into practice!
Suppose you have a multivariate time series like this:
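The series are collected from Google Trends with the `gtrendsR` package, one keyword at a time:

```r
suppressPackageStartupMessages({
  require(dplyr)
  require(ggplot2)
  require(hrbrthemes)
})

# interest over time for a single search term
get_interest <- function(x) {
  gtrendsR::gtrends(
    x, time = "2017-01-01 2025-01-01", onlyInterest = TRUE
  )$interest_over_time %>%
    as_tibble()
}

bitcoin_trend <- purrr::map_dfr(
  c("bitcoin", "mining", "gpu", "money", "nvidia", "musk",
    "startup", "blockchain", "python", "petroleum", "credit", "iphone"),
  get_interest
) %>%
  select(date, hits, keyword) %>%
  mutate(date = lubridate::date(date))

head(bitcoin_trend)
```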
So, we have 12 time series collected from Google Trends. Notice that over a multi-year range Google Trends returns monthly data, therefore we set the season_duration value to 12 (the number of months in a year). We also want to predict values one year ahead, so we set the forecast window to 12 as well (the equivalent of one year in our data set):
```r
window <- 12
season_duration <- 12
```
For convenient and fast processing we need to convert this data from a data frame into a multivariate time series object. We can do it as follows:
```r
time_series <- tsibble::as_tsibble(
  bitcoin_trend,
  index = date,
  key = keyword,
  regular = FALSE
) %>%
  as.ts(frequency = 12)
```
So, now we can create an initial VAR model with default parameters using all the variables. VAR models are provided in R by the vars package. The VAR function has a lot of parameters, but from my experience I can say that in most cases the default parameters give the best result. The main parameter is the lag order p. I have not yet seen a case where increasing it did not hurt the result, so I keep the default value.
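Let's fit the model on all series and plot a one-year-ahead forecast. The summary output resembles a set of linear regressions, one equation per variable:

```r
var_model <- vars::VAR(time_series)

# coefficients of the bitcoin equation
summary(var_model)$varresult$bitcoin %>%
  broom::tidy() %>%
  mutate_if(is.numeric, round, 3)

# forecast one year ahead and plot every series
var_model %>%
  forecast::forecast(window) %>%
  forecast::autoplot() +
  scale_y_continuous(limits = c(0, 100)) +
  theme_ipsum(base_family = "Lato") +
  theme(panel.grid = element_line(linetype = "dashed")) +
  facet_wrap(~series, scales = "fixed", ncol = 2)
```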
It doesn't look very interesting: all the predicted trends are practically linear. To fix this I suggest the following steps:

- add cross-validation for quality checks;
- add seasonality to the model;
- determine the best variables for bitcoin trend prediction through cross-validation (extra variables can hurt a forecast as often as they help).
Let's start with cross-validation. Cross-validation for time series models doesn't look like random or k-fold cross-validation for ordinary data. We cannot shuffle the observations; we have to take the time order into account: the train split may only contain earlier dates than the test split. We implement this logic using rolling origin cross-validation.
To split the data into train/test sets we use the rolling_origin function from the rsample package. It has four main arguments: data (the series to split), assess (the size of the test split), initial (the minimal size of the train split) and skip (the difference in size between consecutive train splits).
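For our data the splits are created like this; the `train` and `test` lists built here are used later by the variable selection function:

```r
rolling_cv <- rsample::rolling_origin(
  time_series %>% as_tibble(),
  assess  = window,
  initial = window + round(window / 2),
  skip    = round(window / 2)
)

# list of training data frames, one per split
train <- rolling_cv$splits %>% purrr::map(rsample::analysis)
# for the test part we keep only the first column (the bitcoin series)
test  <- rolling_cv$splits %>% purrr::map(~rsample::assessment(.)[, 1])
```

Each subsequent train split is larger, while the test split always covers the next `window` observations.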
Now we come to the main part of the post: a function that does cross-validation and variable selection for the VAR model.

I came up with a greedy approach to this problem. It has a few steps:

- initialize best_score (RMSE) and the current iteration score;
- initialize the list of good variables;
- first iteration: select the single variable with which the model produces the best result, then update best_score and best_current;
- subsequent iterations: select one more variable with which the model produces the best result; if the new current score is better we update best_score, otherwise we stop.

Such an approach finds the combination of variables that best explains the dependent feature.
The quality metric I use is the cross-validation RMSE score. For forecasting it is more important that the model performs well on the freshest split, so I weight the per-split results and give more weight to the split with the most up-to-date data. I do this with the softmax function: with 5 splits, the last split gets weight 0.64 and the first only about 0.01.
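You can check the weights yourself:

```r
softmax <- function(z) exp(z) / sum(exp(z))
round(softmax(1:5), 4)
#> [1] 0.0117 0.0317 0.0861 0.2341 0.6364
```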
To select good features with this greedy approach I create and run the following function:
```r
cv_search <- function(depend_variable, train_data, season = season_duration) {
  # candidate predictors: every column except the dependent variable
  independ_variables <- colnames(train_data[[1]])[colnames(train_data[[1]]) != depend_variable]
  good_independ <- NULL
  best_score <- 1000
  best_current <- 1000
  # note: relies on the global `test` list and forecast `window` defined above
  while (best_score == best_current) {
    # weighted cross-validation RMSE for every remaining candidate
    g <- purrr::map_dbl(independ_variables, function(x) {
      k <- train_data %>%
        purrr::map(~vars::VAR(
          ts(.[, c(depend_variable, x, good_independ)]),
          p = 1, season = season
        ))
      g <- purrr::map2_dbl(
        purrr::map(k, ~round(forecast::forecast(., window)$forecast[[depend_variable]]$mean)),
        test %>% purrr::map(~.[[depend_variable]]),
        ~Metrics::rmse(.x, .y)
      )
      # more weight for the splits with fresher data
      g %>% weighted.mean(w = softmax(c(1:length(g))))
    })
    result <- tibble(variable = independ_variables, rmse = g)
    best_current <- min(result$rmse, na.rm = TRUE)
    if (best_score > best_current) {
      best_score <- best_current
      cat(paste0("New best score = ", round(best_score, 3), "\n"))
      good_independ <- c(good_independ, result$variable[which.min(result$rmse)])
      cat(paste0("Added new feature - ", result$variable[which.min(result$rmse)], "\n"))
      independ_variables <- independ_variables[!(independ_variables %in% good_independ)]
    }
  }
  c(depend_variable, good_independ)
}

good_features <- cv_search("bitcoin", train)
```
```
New best score = 6.896
Added new feature - mining
```
This function returns a vector of variables that best explain the dependent feature through vector autoregression. A VAR model with these variables gives an RMSE score of 13.545.

Let's take a look at the time series with these variables:
```r
time_series[, good_features] %>% head()
```
```
         bitcoin mining
Jan 2017       6     38
Feb 2017       5     42
Mar 2017       7     42
Apr 2017       6     41
May 2017      15     49
Jun 2017      14     63
```
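Let's create a model with only the best features, this time with a seasonal component plus constant and trend terms. With the selected features the bitcoin equation roughly takes the form $bitcoin_t = a_1 bitcoin_{t-1} + a_2 mining_{t-1} + const + trend + seasonality + e_t$:

```r
better_var_model <- vars::VAR(
  time_series[, good_features],
  p = 1,
  type = "both",              # constant and trend terms
  season = season_duration,   # monthly seasonal dummies
  ic = "FPE"
)

# coefficients of the bitcoin equation
summary(better_var_model)$varresult$bitcoin %>%
  broom::tidy() %>%
  mutate_if(is.numeric, round, 3)
```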
Although we have used other variables, the strongest and most statistically significant influence on the bitcoin trend still comes from the bitcoin trend itself.
Now we can visualize the forecast built on the best features:
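```r
better_var_model %>%
  forecast::forecast(window) %>%
  forecast::autoplot() +
  scale_y_continuous(limits = c(0, 100)) +
  theme_ipsum(base_family = "Lato") +
  theme(panel.grid = element_line(linetype = "dashed")) +
  facet_wrap(~series, scales = "fixed", ncol = 2)
```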
Compared to the first model this looks much more attractive. Our model not only predicts the overall trend but also takes fluctuations within the year into account. But the main result is that we took the effects of other variables into account in the forecast, which traditional time series forecasting methods do not provide. However, there is still a potentially better method for predicting multivariate series - neural networks. They are better suited to large datasets, though, which we will definitely try in the future.
Citation
BibTeX citation:
@online{kyrychenko2019,
author = {Kyrychenko, Roman},
title = {Time Series Prediction Using {VAR} in {R}},
date = {2019-08-14},
url = {https://randomforest.run/posts/var-time-series-analysis-using-r/var-time-series-analysis-using-r.html},
langid = {en}
}