Analysing voters flows without using sociological data

In this article, we describe an approach to analyze voters flows using non-negative least squares and only election data.

Roman Kyrychenko https://www.linkedin.com/in/kirichenko17roman/
08-02-2019

Note: in this piece I describe in the fist place the approach for analysis of voters flows. The results of the analysis are better described in the other piece by me here (ukrainian).

It’s interesting how electorate flows from one candidate/party to another in next elections. Sometimes good flow visualizations which are also called alluvial diagrams or flows charts, that show flows of electorate, can appear in mass-media. But for some visualization, we need data from sociological surveys to contain questions about old and new choices. Usually, some surveys provide only questions about current electoral sympathies.

Sometimes after elections, we wonder: who voted for that stupid and populistic man? But analysis can show that they are some people who earlier voted for much better candidates and parties.

But I have good news. We can estimate voters flows without using sociological data. We need only election data aggregated to polling stations level. This data helps us to model all voters flows between elections given that the borders of polling stations did not change.

Suppose we know how people at a certain polling station voted in earlier elections and we know how they voted in a new poll.

Suppose it looks like this:


suppressPackageStartupMessages({
  require(dplyr)
  require(ggplot2)
  require(hrbrthemes)
  require(ggalluvial)
  require(gt)
})

parliament_election_2019_by_vd <- readr::read_csv("https://cutt.ly/87rsCe", col_types = readr::cols())
president_election_2019_1_by_vd <- readr::read_csv("https://cutt.ly/f4PlNS", col_types = readr::cols())

parliament_election_2019_by_vd %>%
  head() %>%
  gt::gt()
polling_station_number parliament_voters Oppobloc Strength and Honor Fatherland OP for Life European Solidarity Ukrainian Strategy Servant of the People Radical Party Party of Shariy Voice Freedom Other parties district_number
50130 798 13 46 65 40 61 191 287 33 4 30 5 23 11
50131 1005 41 63 92 39 99 194 356 25 18 39 12 27 11
50132 567 12 39 41 28 47 113 210 17 7 20 16 17 11
50133 1130 7 74 82 61 111 293 378 25 10 50 16 23 11
50134 99 1 2 11 2 7 37 32 2 1 2 1 1 11
50135 1316 10 91 103 47 131 332 405 30 20 85 18 44 11

president_election_2019_1_by_vd %>%
  head() %>%
  gt::gt()
polling_station_number Spoiled bulletins president_voters Boyko Yuriy Vilkul Oleksandr Hrytsenko Anatoliy Zelensky Volodymyr Koshulynsky Ruslan Liashko Oleg Poroshenko Petro Smeshko Igor Tymoshenko Yuliya Other candidates district_number
50130 20 1121 32 11 73 232 17 45 370 120 169 32 11
50131 17 1358 47 33 97 319 23 46 365 147 213 51 11
50132 12 841 41 10 48 204 12 35 247 95 104 26 11
50133 12 1594 55 29 94 320 31 61 524 186 231 52 11
50134 0 140 1 1 12 36 3 3 39 14 26 5 11
50135 18 1661 49 22 109 347 16 38 511 265 224 62 11

This is the data from Ukrainian presidential (31.03.2019) and parlamentary (21.07.2019) elections.

The first round of the presidential election was conducted earlier than the parliamentary election. Therefore we can use presidential election data as predictors and parliamentary election data as response variables. So we can make a simple linear regression using this data. For example, a model for estimating support for the Voice party. We have an equation like this:

\[ Voice = intepcept + k1 \times Petro Poroshenko + k2 \times Volodymyr Zelensky + k3 \times Anatoliy Grytsenko + k4 \times Yuriy Boyko \]

Let’s create such eqution in R using lm function from base R:


lm(Voice ~ `Boyko Yuriy` + `Vilkul Oleksandr` + `Hrytsenko Anatoliy` +
  `Zelensky Volodymyr` + `Koshulynsky Ruslan` + `Liashko Oleg` +
  `Poroshenko Petro` + `Smeshko Igor` + `Tymoshenko Yuliya` +
  `Spoiled bulletins`,
  data = parliament_election_2019_by_vd %>%
    left_join(
      president_election_2019_1_by_vd,
      by = c("polling_station_number", "district_number")
    )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = vars(coefficient),
    decimals = 3
  )
candidate coefficient
(Intercept) −1.583
Boyko Yuriy 0.005
Vilkul Oleksandr −0.006
Hrytsenko Anatoliy 0.518
Zelensky Volodymyr −0.005
Koshulynsky Ruslan 0.237
Liashko Oleg −0.102
Poroshenko Petro 0.212
Smeshko Igor −0.172
Tymoshenko Yuliya −0.064
Spoiled bulletins 0.015

Seems good. We received adjusted R-squared almost 91%. But in this case, we are interested in the right interpretation of regression coefficients, not in good explanation of the variance of the depending variable. Ok, now we see that the support of the party is associated with the support of Anatoliy Hrytsenko, Ruslan Koshulynsky and Petro Poroshenko (positive coefficients), and not associated with support of other candidates (negative coefficients). But this coefficient doesn’t answer us who voted for the Voice party For example, Petro Poroshenko has coefficient 0.21, but it doesn’t mean that 21% of Petro Poroshenko voters voted in parliament election for the Voice. And Yulia Tymoshenko’s coefficient -0.06 doesn’t say, that -6% voters of Tymoshenko voted for the Voice, that is for sure :)

We can try to leave only variables with positive coefficients and make new model:


lm(Voice ~ `Boyko Yuriy` + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` +
  `Poroshenko Petro` + `Spoiled bulletins`,
  data = parliament_election_2019_by_vd %>%
    left_join(
      president_election_2019_1_by_vd,
      by = c("polling_station_number", "district_number")
    )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = vars(coefficient),
    decimals = 3
  )
candidate coefficient
(Intercept) −7.143
Boyko Yuriy 0.007
Hrytsenko Anatoliy 0.443
Koshulynsky Ruslan 0.305
Poroshenko Petro 0.182
Spoiled bulletins −0.762

Bad idea. New coefficients are different from old and the intercept is also different. But the result is closer to what you want to get. Let’s continue our analysis.

Maybe the reason for our failures lies in the intercept. We can force it to 0 so:


lm(I(Voice - 0) ~ 0 + `Boyko Yuriy` + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` +
  `Poroshenko Petro` + `Spoiled bulletins`,
  data = parliament_election_2019_by_vd %>% left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = vars(coefficient),
    decimals = 3
  )
candidate coefficient
Boyko Yuriy −0.006
Hrytsenko Anatoliy 0.436
Koshulynsky Ruslan 0.284
Poroshenko Petro 0.172
Spoiled bulletins −0.996

We need again to remove variables with negative coefficients. Let’s do this:


lm(I(Voice - 0) ~ 0 + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` + `Poroshenko Petro`,
  data = parliament_election_2019_by_vd %>% left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = vars(coefficient),
    decimals = 3
  )
candidate coefficient
Hrytsenko Anatoliy 0.413
Koshulynsky Ruslan 0.334
Poroshenko Petro 0.124

Much better. But why we can immediately train the model with only positive coefficients and intercept 0 without recursive removing variables with negative coefficients?

The reason lies in the method that is used for the estimation of linear regression. This method is called least squares method. This method minimizes mean squared error and estimated coefficients can vary from -∞ to +∞. Also, intercept can vary from -∞ to +∞. We need a model with non-negative coefficients.

Our goal is to get some linear regression that has only non-negative coefficients and intercept 0. Why is that so? Such regression model describes our dependent variable as the sum of independent variables, which is weighted on their coefficients. Fortunately, it is a method for estimation of linear regression. It is called the non-negative least squares method.

In R it is provided by nnls package. It is pretty straightforward.


xy <- parliament_election_2019_by_vd %>%
  left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  ) %>%
  mutate_all(~ ifelse(is.na(.), 0, .))

x <- xy %>%
  select(
    `Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`,
    `Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`,
    `Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`,
    `Spoiled bulletins`
  ) %>%
  as.matrix()

y <- xy %>% pull(Voice)

fit <- nnls::nnls(x, y)

tibble(
  candidate = colnames(x),
  coefficient = round(fit$x, 3)
) %>%
  filter(coefficient != 0) %>%
  gt::gt()
candidate coefficient
Hrytsenko Anatoliy 0.413
Koshulynsky Ruslan 0.334
Poroshenko Petro 0.124

This is the same result as we received earlier with a simple linear regression and removing negative coefficients and forcing intercept to 0.

But the sum of the coefficients does not equal 1 (only = 0.87). What’s the reason?

The reason is the different voters turnout. Presidential election voters turnout was 62%, parliament election voters turnout was 49%. We should weight it.

We can build a model not for absolute value, for the percentage of support and after converting back to absolute value. So:


xy <- parliament_election_2019_by_vd %>%
  left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  ) %>%
  mutate_all(~ ifelse(is.na(.), 0, .))

x <- xy %>%
  select(
    `Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`,
    `Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`,
    `Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`,
    `Spoiled bulletins`
  ) %>%
  mutate_all(~ . / xy$president_voters) %>%
  as.matrix()

x[is.na(x)] <- 0

y <- xy %>%
  mutate(Voice = Voice / xy$parliament_voters) %>%
  pull(Voice)
y[is.na(y)] <- 0

fit <- nnls::nnls(x, y)

tibble(
  candidate = colnames(x),
  coefficient = round(fit$x, 3)
) %>%
  filter(coefficient != 0) %>%
  gt::gt()
candidate coefficient
Hrytsenko Anatoliy 0.464
Koshulynsky Ruslan 0.409
Poroshenko Petro 0.110

So we have coefficients, but how do we convert them to an absolute value? As simple as follows:


tibble(
  candidate = colnames(x),
  votes = round(x %*% diag(fit$x) * xy$president_voters) %>% colSums(),
  votes_percent = votes / sum(votes)
) %>%
  filter(votes != 0) %>%
  gt::gt() %>% 
  fmt_percent(
    columns = vars(votes_percent),
    decimals = 2
  )
candidate votes votes_percent
Hrytsenko Anatoliy 600541 57.31%
Koshulynsky Ruslan 122192 11.66%
Poroshenko Petro 325091 31.03%

How accurate is this model? We can measure rmse for the model:


cat(paste0("RMSE: ", round(Metrics::rmse(y, fit$fitted), 3)))

RMSE: 0.037

We can create much more accurate model using ensemble of non negative regressions created for certain distiricts.


xy <- parliament_election_2019_by_vd %>%
  left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  ) %>%
  mutate_all(~ ifelse(is.na(.), 0, .))

big_x <- xy %>%
  select(`Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`, 
         `Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`, 
         `Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`, 
         `Spoiled bulletins`) %>%
  mutate_all(~ . / xy$president_voters) %>%
  as.matrix()

big_x[is.na(big_x)] <- 0

big_y <- xy %>%
  mutate(Voice = Voice / xy$parliament_voters) %>%
  pull(Voice)
big_y[is.na(big_y)] <- 0

flow_from_first_round_to_parliament <- purrr::map_dfr(unique(xy$district_number), function(x) {
  y <- big_y[xy$district_number == x]
  y <- ifelse(is.na(y), 0, y)
  fit <- nnls::nnls(big_x[xy$district_number == x, ], y)
  tibble(
    district_number = x,
    from = colnames(big_x),
    coef = ifelse(fit$x > 1, 1, fit$x),
    all_votes = unname(colSums(big_x[xy$district_number == x, ] * 
                             xy$president_voters[xy$district_number == x])),
    predicted_votes = round(unname(colSums(big_x[xy$district_number == x, ] %*% 
                                   diag(coef) * 
                                   xy$parliament_voters[xy$district_number == x]))),
    rmse = Metrics::rmse(y, fit$fitted)
  )
})

flow_from_first_round_to_parliament %>%
  group_by(from) %>%
  summarise(predicted_votes = sum(predicted_votes)) %>%
  arrange(desc(predicted_votes)) %>%
  gt::gt()
from predicted_votes
Poroshenko Petro 282249
Hrytsenko Anatoliy 189904
Smeshko Igor 127531
Zelensky Volodymyr 106922
Koshulynsky Ruslan 51083
Vilkul Oleksandr 20576
Tymoshenko Yuliya 14977
Spoiled bulletins 12354
Boyko Yuriy 7027
Liashko Oleg 2498

cat(paste0("Mean rmse: ", round(mean(flow_from_first_round_to_parliament$rmse), 3)))

Mean rmse: 0.017

The new model is twice better than the old one without ensemble!

So, let’s create a model for all parties and visualize it.


big_y <- xy %>%
  select(Oppobloc:`Other parties`) %>%
  mutate_all(~ . / xy$parliament_voters)
big_y[is.na(big_y)] <- 0

purrr::map_dfr(colnames(big_y), function(to) {
  purrr::map_dfr(unique(xy$district_number), function(x) {
    y <- big_y[[to]][xy$district_number == x]
    y <- ifelse(is.na(y), 0, y)
    fit <- nnls::nnls(big_x[xy$district_number == x, ], y)
    tibble(
      to = to,
      district_number = x,
      from = colnames(big_x),
      coef = ifelse(fit$x > 1, 1, fit$x),
      all_votes = unname(colSums(big_x[xy$district_number == x, ] * 
                               xy$president_voters[xy$district_number == x])),
      predicted_votes = round(unname(colSums(big_x[xy$district_number == x, ] %*% 
                                     diag(coef) * 
                                     xy$parliament_voters[xy$district_number == x]))),
      rmse = Metrics::rmse(y, fit$fitted)
    )
  })
}) %>%
  group_by(from, to) %>%
  summarise(predicted_votes = sum(predicted_votes)) %>%
  arrange(desc(predicted_votes)) %>%
  ggplot(aes(y = predicted_votes, axis1 = from, axis2 = to)) +
  geom_alluvium(aes(fill = from), width = 1 / 12) +
  geom_stratum(width = 1 / 6, fill = "white", color = "black") +
  geom_text(stat = "stratum", label.strata = TRUE, family = "PT Sans", size = 3) +
  scale_fill_manual(values = c(
    "#1f78b4", "#ff7f00", "#6a3d9a", "#ffff99", "#e31a1c", 
    "#cab2d6", "#fdbf6f", "#fb9a99", "#a6cee3", "#b2df8a"
  )) +
  scale_x_discrete(
    limits = c("Presidential election\n(31.03.2019)", "Parliamentary election\n(21.07.2019)"), 
    expand = c(.0, .0)
  ) +
  labs(
    title = "Voters flow from presidential election to parliamentary election",
    caption = "Using Central Election Commissions data"
  ) +
  theme_ipsum(base_family = "PT Sans") +
  theme(
    legend.position = "none", 
    panel.grid = element_blank(), 
    axis.title = element_blank(),
    axis.text.y = element_blank()
  )