Analysing voters flows without using sociological data

nnls

flow analysis

ggplot2

visualizations

election data

election statistics

In this article, we describe an approach to analyze voters flows using non-negative least squares and only election data.

Author

Roman Kyrychenko

Published

August 2, 2019

Note: in this piece I describe in the first place the approach for analysis of voters flows. The results of the analysis are better described in the other piece by me here (ukrainian).

It’s interesting how electorate flows from one candidate/party to another in next elections. Sometimes good flow visualizations which are also called alluvial diagrams or flows charts, that show flows of electorate, can appear in mass-media. But for some visualization, we need data from sociological surveys to contain questions about old and new choices. Usually, some surveys provide only questions about current electoral sympathies.

Sometimes after elections, we wonder: who voted for that stupid and populistic man? But analysis can show that they are some people who earlier voted for much better candidates and parties.

But I have good news. We can estimate voters flows without using sociological data. We need only election data aggregated to polling stations level. This data helps us to model all voters flows between elections given that the borders of polling stations did not change.

Suppose we know how people at a certain polling station voted in earlier elections and we know how they voted in a new poll.

Suppose it looks like this:

Code

suppressPackageStartupMessages({
  require(dplyr)
  require(ggplot2)
  require(hrbrthemes)
  require(ggalluvial)
  require(gt)
})

parliament_election_2019_by_vd <- readr::read_csv("https://raw.githubusercontent.com/RomanKyrychenko/ukrainian_elections/master/data/parliament_election_2019_by_vd_en.csv", col_types = readr::cols())
president_election_2019_1_by_vd <- readr::read_csv("https://raw.githubusercontent.com/RomanKyrychenko/ukrainian_elections/master/data/president_election_2019_1_by_vd_en.csv", col_types = readr::cols())

parliamentary election data

Code

parliament_election_2019_by_vd %>%
  head() %>%
  gt::gt()

polling_station_number	parliament_voters	Oppobloc	Strength and Honor	Fatherland	OP for Life	European Solidarity	Ukrainian Strategy	Servant of the People	Radical Party	Party of Shariy	Voice	Freedom	Other parties	district_number
50130	798	13	46	65	40	61	191	287	33	4	30	5	23	11
50131	1005	41	63	92	39	99	194	356	25	18	39	12	27	11
50132	567	12	39	41	28	47	113	210	17	7	20	16	17	11
50133	1130	7	74	82	61	111	293	378	25	10	50	16	23	11
50134	99	1	2	11	2	7	37	32	2	1	2	1	1	11
50135	1316	10	91	103	47	131	332	405	30	20	85	18	44	11

presidential election data

Code

president_election_2019_1_by_vd %>%
  head() %>%
  gt::gt()

polling_station_number	Spoiled bulletins	president_voters	Boyko Yuriy	Vilkul Oleksandr	Hrytsenko Anatoliy	Zelensky Volodymyr	Koshulynsky Ruslan	Liashko Oleg	Poroshenko Petro	Smeshko Igor	Tymoshenko Yuliya	Other candidates	district_number
50130	20	1121	32	11	73	232	17	45	370	120	169	32	11
50131	17	1358	47	33	97	319	23	46	365	147	213	51	11
50132	12	841	41	10	48	204	12	35	247	95	104	26	11
50133	12	1594	55	29	94	320	31	61	524	186	231	52	11
50134	0	140	1	1	12	36	3	3	39	14	26	5	11
50135	18	1661	49	22	109	347	16	38	511	265	224	62	11

This is the data from Ukrainian presidential (31.03.2019) and parlamentary (21.07.2019) elections.

The first round of the presidential election was conducted earlier than the parliamentary election. Therefore we can use presidential election data as predictors and parliamentary election data as response variables. So we can make a simple linear regression using this data. For example, a model for estimating support for the Voice party. We have an equation like this:

\[ Voice = intepcept + k1 \times Petro Poroshenko + k2 \times Volodymyr Zelensky + k3 \times Anatoliy Grytsenko + k4 \times Yuriy Boyko \]

Let’s create such eqution in R using lm function from base R:

Code

lm(Voice ~ `Boyko Yuriy` + `Vilkul Oleksandr` + `Hrytsenko Anatoliy` +
  `Zelensky Volodymyr` + `Koshulynsky Ruslan` + `Liashko Oleg` +
  `Poroshenko Petro` + `Smeshko Igor` + `Tymoshenko Yuliya` +
  `Spoiled bulletins`,
  data = parliament_election_2019_by_vd %>%
    left_join(
      president_election_2019_1_by_vd,
      by = c("polling_station_number", "district_number")
    )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = c(coefficient),
    decimals = 3
  )

candidate	coefficient
(Intercept)	−1.583
Boyko Yuriy	0.005
Vilkul Oleksandr	−0.006
Hrytsenko Anatoliy	0.518
Zelensky Volodymyr	−0.005
Koshulynsky Ruslan	0.237
Liashko Oleg	−0.102
Poroshenko Petro	0.212
Smeshko Igor	−0.172
Tymoshenko Yuliya	−0.064
Spoiled bulletins	0.015

Seems good. We received adjusted R-squared almost 91%. But in this case, we are interested in the right interpretation of regression coefficients, not in good explanation of the variance of the depending variable. Ok, now we see that the support of the party is associated with the support of Anatoliy Hrytsenko, Ruslan Koshulynsky and Petro Poroshenko (positive coefficients), and not associated with support of other candidates (negative coefficients). But this coefficient doesn’t answer us who voted for the Voice party For example, Petro Poroshenko has coefficient 0.21, but it doesn’t mean that 21% of Petro Poroshenko voters voted in parliament election for the Voice. And Yulia Tymoshenko’s coefficient -0.06 doesn’t say, that -6% voters of Tymoshenko voted for the Voice, that is for sure :)

We can try to leave only variables with positive coefficients and make new model:

Code

lm(Voice ~ `Boyko Yuriy` + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` +
  `Poroshenko Petro` + `Spoiled bulletins`,
  data = parliament_election_2019_by_vd %>%
    left_join(
      president_election_2019_1_by_vd,
      by = c("polling_station_number", "district_number")
    )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = c(coefficient),
    decimals = 3
  )

candidate	coefficient
(Intercept)	−7.143
Boyko Yuriy	0.007
Hrytsenko Anatoliy	0.443
Koshulynsky Ruslan	0.305
Poroshenko Petro	0.182
Spoiled bulletins	−0.762

Bad idea. New coefficients are different from old and the intercept is also different. But the result is closer to what you want to get. Let’s continue our analysis.

Maybe the reason for our failures lies in the intercept. We can force it to 0 so:

Code

lm(I(Voice - 0) ~ 0 + `Boyko Yuriy` + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` +
  `Poroshenko Petro` + `Spoiled bulletins`,
  data = parliament_election_2019_by_vd %>% left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = c(coefficient),
    decimals = 3
  )

candidate	coefficient
Boyko Yuriy	−0.006
Hrytsenko Anatoliy	0.436
Koshulynsky Ruslan	0.284
Poroshenko Petro	0.172
Spoiled bulletins	−0.996

We need again to remove variables with negative coefficients. Let’s do this:

Code

lm(I(Voice - 0) ~ 0 + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` + `Poroshenko Petro`,
  data = parliament_election_2019_by_vd %>% left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  )
) %>%
  broom::tidy() %>%
  mutate(term = stringr::str_remove_all(term, "`")) %>% 
  rename(
    candidate = term,
    coefficient = estimate
  ) %>% 
  select(candidate, coefficient) %>% 
  gt() %>%
  fmt_number(
    columns = c(coefficient),
    decimals = 3
  )

candidate	coefficient
Hrytsenko Anatoliy	0.413
Koshulynsky Ruslan	0.334
Poroshenko Petro	0.124

Much better. But why we can immediately train the model with only positive coefficients and intercept 0 without recursive removing variables with negative coefficients?

The reason lies in the method that is used for the estimation of linear regression. This method is called least squares method. This method minimizes mean squared error and estimated coefficients can vary from -∞ to +∞. Also, intercept can vary from -∞ to +∞. We need a model with non-negative coefficients.

Our goal is to get some linear regression that has only non-negative coefficients and intercept 0. Why is that so? Such regression model describes our dependent variable as the sum of independent variables, which is weighted on their coefficients. Fortunately, it is a method for estimation of linear regression. It is called the non-negative least squares method.

In R it is provided by nnls package. It is pretty straightforward.

Code

xy <- parliament_election_2019_by_vd %>%
  left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  ) %>%
  mutate_all(~ ifelse(is.na(.), 0, .))

x <- xy %>%
  select(
    `Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`,
    `Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`,
    `Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`,
    `Spoiled bulletins`
  ) %>%
  as.matrix()

y <- xy %>% pull(Voice)

fit <- nnls::nnls(x, y)

tibble(
  candidate = colnames(x),
  coefficient = round(fit$x, 3)
) %>%
  filter(coefficient != 0) %>%
  gt::gt()

candidate	coefficient
Hrytsenko Anatoliy	0.413
Koshulynsky Ruslan	0.334
Poroshenko Petro	0.124

This is the same result as we received earlier with a simple linear regression and removing negative coefficients and forcing intercept to 0.

But the sum of the coefficients does not equal 1 (only = 0.87). What’s the reason?

The reason is the different voters turnout. Presidential election voters turnout was 62%, parliament election voters turnout was 49%. We should weight it.

We can build a model not for absolute value, for the percentage of support and after converting back to absolute value. So:

Code

xy <- parliament_election_2019_by_vd %>%
  left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  ) %>%
  mutate_all(~ ifelse(is.na(.), 0, .))

x <- xy %>%
  select(
    `Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`,
    `Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`,
    `Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`,
    `Spoiled bulletins`
  ) %>%
  mutate_all(~ . / xy$president_voters) %>%
  as.matrix()

x[is.na(x)] <- 0

y <- xy %>%
  mutate(Voice = Voice / xy$parliament_voters) %>%
  pull(Voice)
y[is.na(y)] <- 0

fit <- nnls::nnls(x, y)

tibble(
  candidate = colnames(x),
  coefficient = round(fit$x, 3)
) %>%
  filter(coefficient != 0) %>%
  gt::gt()

candidate	coefficient
Hrytsenko Anatoliy	0.464
Koshulynsky Ruslan	0.409
Poroshenko Petro	0.110

So we have coefficients, but how do we convert them to an absolute value? As simple as follows:

Code

tibble(
  candidate = colnames(x),
  votes = round(x %*% diag(fit$x) * xy$president_voters) %>% colSums(),
  votes_percent = votes / sum(votes)
) %>%
  filter(votes != 0) %>%
  gt::gt() %>% 
  fmt_percent(
    columns = c(votes_percent),
    decimals = 2
  )

candidate	votes	votes_percent
Hrytsenko Anatoliy	600541	57.31%
Koshulynsky Ruslan	122192	11.66%
Poroshenko Petro	325091	31.03%

How accurate is this model? We can measure rmse for the model:

Code

cat(paste0("RMSE: ", round(Metrics::rmse(y, fit$fitted), 3)))

RMSE: 0.037

We can create much more accurate model using ensemble of non negative regressions created for certain districts.

Code

xy <- parliament_election_2019_by_vd %>%
  left_join(
    president_election_2019_1_by_vd,
    by = c("polling_station_number", "district_number")
  ) %>%
  mutate_all(~ ifelse(is.na(.), 0, .))

big_x <- xy %>%
  select(`Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`, 
         `Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`, 
         `Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`, 
         `Spoiled bulletins`) %>%
  mutate_all(~ . / xy$president_voters) %>%
  as.matrix()

big_x[is.na(big_x)] <- 0

big_y <- xy %>%
  mutate(Voice = Voice / xy$parliament_voters) %>%
  pull(Voice)
big_y[is.na(big_y)] <- 0

flow_from_first_round_to_parliament <- purrr::map_dfr(unique(xy$district_number), function(x) {
  y <- big_y[xy$district_number == x]
  y <- ifelse(is.na(y), 0, y)
  fit <- nnls::nnls(big_x[xy$district_number == x, ], y)
  tibble(
    district_number = x,
    from = colnames(big_x),
    coef = ifelse(fit$x > 1, 1, fit$x),
    all_votes = unname(colSums(big_x[xy$district_number == x, ] * 
                             xy$president_voters[xy$district_number == x])),
    predicted_votes = round(unname(colSums(big_x[xy$district_number == x, ] %*% 
                                   diag(coef) * 
                                   xy$parliament_voters[xy$district_number == x]))),
    rmse = Metrics::rmse(y, fit$fitted)
  )
})

flow_from_first_round_to_parliament %>%
  group_by(from) %>%
  summarise(predicted_votes = sum(predicted_votes)) %>%
  arrange(desc(predicted_votes)) %>%
  gt::gt()

from	predicted_votes
Poroshenko Petro	282249
Hrytsenko Anatoliy	189904
Smeshko Igor	127531
Zelensky Volodymyr	106922
Koshulynsky Ruslan	51083
Vilkul Oleksandr	20576
Tymoshenko Yuliya	14977
Spoiled bulletins	12354
Boyko Yuriy	7027
Liashko Oleg	2498

Code

cat(paste0("Mean rmse: ", round(mean(flow_from_first_round_to_parliament$rmse), 3)))

Mean rmse: 0.017

The new model is twice better than the old one without ensemble!

So, let’s create a model for all parties and visualize it.

Code

big_y <- xy %>%
  select(Oppobloc:`Other parties`) %>%
  mutate_all(~ . / xy$parliament_voters)
big_y[is.na(big_y)] <- 0

purrr::map_dfr(colnames(big_y), function(to) {
  purrr::map_dfr(unique(xy$district_number), function(x) {
    y <- big_y[[to]][xy$district_number == x]
    y <- ifelse(is.na(y), 0, y)
    fit <- nnls::nnls(big_x[xy$district_number == x, ], y)
    tibble(
      to = to,
      district_number = x,
      from = colnames(big_x),
      coef = ifelse(fit$x > 1, 1, fit$x),
      all_votes = unname(colSums(big_x[xy$district_number == x, ] * 
                               xy$president_voters[xy$district_number == x])),
      predicted_votes = round(unname(colSums(big_x[xy$district_number == x, ] %*% 
                                     diag(coef) * 
                                     xy$parliament_voters[xy$district_number == x]))),
      rmse = Metrics::rmse(y, fit$fitted)
    )
  })
}) %>%
  group_by(from, to) %>%
  summarise(predicted_votes = sum(predicted_votes)) %>%
  arrange(desc(predicted_votes)) %>%
  ggplot(aes(y = predicted_votes, axis1 = from, axis2 = to)) +
  geom_alluvium(aes(fill = from), width = 1 / 12) +
  geom_stratum(width = 1 / 6, fill = "white", color = "black") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), family = "PT Sans", size = 3) +
  scale_fill_manual(values = c(
    "#1f78b4", "#ff7f00", "#6a3d9a", "#ffff99", "#e31a1c", 
    "#cab2d6", "#fdbf6f", "#fb9a99", "#a6cee3", "#b2df8a"
  )) +
  scale_x_discrete(
    limits = c("Presidential election\n(31.03.2019)", "Parliamentary election\n(21.07.2019)"), 
    expand = c(.0, .0)
  ) +
  labs(
    title = "Voters flow from presidential election to parliamentary election",
    caption = "Using Central Election Commissions data"
  ) +
  theme_ipsum(base_family = "PT Sans") +
  theme(
    legend.position = "none", 
    panel.grid = element_blank(), 
    axis.title = element_blank(),
    axis.text.y = element_blank()
  )

This is a flow chart that shows how many people voted first for one candidate and then voted for one party. It’s interesting, when the voters of one particular candidate vote then for other various parties. For example, the voters of Igor Smeshko at the presidential election voted for his party Strength and Honor at the parliamentary election but some of them also voted for European Solidarity, the Voice and Servant of the People.

So the method of non negative least squares is a good way to analyse the voters flow. And yet, it has its flaws:

we can use this method when borders of polling stations don’t change;
this method provides that at both elections the same people vote;
this method doesn’t take account for possible fraud;
the model records only the general patterns of flow, not the personal voting history of each and every voter.

I mean, sociological surveys is better method to analyse voters flows, but non negative regression is a good way to analyse this when there are no such sociological studies.

Citation

BibTeX citation:

@online{kyrychenko2019,
  author = {Kyrychenko, Roman},
  title = {Analysing Voters Flows Without Using Sociological Data},
  date = {2019-08-02},
  url = {https://randomforest.run/posts/analysing-voters-flows},
  langid = {en}
}

For attribution, please cite this work as:

Kyrychenko, Roman. 2019. “Analysing Voters Flows Without Using Sociological Data.” August 2, 2019. https://randomforest.run/posts/analysing-voters-flows.