In this article, we describe an approach to analyze voters flows using non-negative least squares and only election data.
Note: in this piece I describe in the fist place the approach for analysis of voters flows. The results of the analysis are better described in the other piece by me here (ukrainian).
It’s interesting how electorate flows from one candidate/party to another in next elections. Sometimes good flow visualizations which are also called alluvial diagrams or flows charts, that show flows of electorate, can appear in mass-media. But for some visualization, we need data from sociological surveys to contain questions about old and new choices. Usually, some surveys provide only questions about current electoral sympathies.
Sometimes after elections, we wonder: who voted for that stupid and populistic man? But analysis can show that they are some people who earlier voted for much better candidates and parties.
But I have good news. We can estimate voters flows without using sociological data. We need only election data aggregated to polling stations level. This data helps us to model all voters flows between elections given that the borders of polling stations did not change.
Suppose we know how people at a certain polling station voted in earlier elections and we know how they voted in a new poll.
Suppose it looks like this:
suppressPackageStartupMessages({
require(dplyr)
require(ggplot2)
require(hrbrthemes)
require(ggalluvial)
require(gt)
})
parliament_election_2019_by_vd <- readr::read_csv("https://cutt.ly/87rsCe", col_types = readr::cols())
president_election_2019_1_by_vd <- readr::read_csv("https://cutt.ly/f4PlNS", col_types = readr::cols())
parliament_election_2019_by_vd %>%
head() %>%
gt::gt()
polling_station_number | parliament_voters | Oppobloc | Strength and Honor | Fatherland | OP for Life | European Solidarity | Ukrainian Strategy | Servant of the People | Radical Party | Party of Shariy | Voice | Freedom | Other parties | district_number |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50130 | 798 | 13 | 46 | 65 | 40 | 61 | 191 | 287 | 33 | 4 | 30 | 5 | 23 | 11 |
50131 | 1005 | 41 | 63 | 92 | 39 | 99 | 194 | 356 | 25 | 18 | 39 | 12 | 27 | 11 |
50132 | 567 | 12 | 39 | 41 | 28 | 47 | 113 | 210 | 17 | 7 | 20 | 16 | 17 | 11 |
50133 | 1130 | 7 | 74 | 82 | 61 | 111 | 293 | 378 | 25 | 10 | 50 | 16 | 23 | 11 |
50134 | 99 | 1 | 2 | 11 | 2 | 7 | 37 | 32 | 2 | 1 | 2 | 1 | 1 | 11 |
50135 | 1316 | 10 | 91 | 103 | 47 | 131 | 332 | 405 | 30 | 20 | 85 | 18 | 44 | 11 |
president_election_2019_1_by_vd %>%
head() %>%
gt::gt()
polling_station_number | Spoiled bulletins | president_voters | Boyko Yuriy | Vilkul Oleksandr | Hrytsenko Anatoliy | Zelensky Volodymyr | Koshulynsky Ruslan | Liashko Oleg | Poroshenko Petro | Smeshko Igor | Tymoshenko Yuliya | Other candidates | district_number |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50130 | 20 | 1121 | 32 | 11 | 73 | 232 | 17 | 45 | 370 | 120 | 169 | 32 | 11 |
50131 | 17 | 1358 | 47 | 33 | 97 | 319 | 23 | 46 | 365 | 147 | 213 | 51 | 11 |
50132 | 12 | 841 | 41 | 10 | 48 | 204 | 12 | 35 | 247 | 95 | 104 | 26 | 11 |
50133 | 12 | 1594 | 55 | 29 | 94 | 320 | 31 | 61 | 524 | 186 | 231 | 52 | 11 |
50134 | 0 | 140 | 1 | 1 | 12 | 36 | 3 | 3 | 39 | 14 | 26 | 5 | 11 |
50135 | 18 | 1661 | 49 | 22 | 109 | 347 | 16 | 38 | 511 | 265 | 224 | 62 | 11 |
This is the data from Ukrainian presidential (31.03.2019) and parlamentary (21.07.2019) elections.
The first round of the presidential election was conducted earlier than the parliamentary election. Therefore we can use presidential election data as predictors and parliamentary election data as response variables. So we can make a simple linear regression using this data. For example, a model for estimating support for the Voice party. We have an equation like this:
\[ Voice = intepcept + k1 \times Petro Poroshenko + k2 \times Volodymyr Zelensky + k3 \times Anatoliy Grytsenko + k4 \times Yuriy Boyko \]
Let’s create such eqution in R using lm function from base R:
lm(Voice ~ `Boyko Yuriy` + `Vilkul Oleksandr` + `Hrytsenko Anatoliy` +
`Zelensky Volodymyr` + `Koshulynsky Ruslan` + `Liashko Oleg` +
`Poroshenko Petro` + `Smeshko Igor` + `Tymoshenko Yuliya` +
`Spoiled bulletins`,
data = parliament_election_2019_by_vd %>%
left_join(
president_election_2019_1_by_vd,
by = c("polling_station_number", "district_number")
)
) %>%
broom::tidy() %>%
mutate(term = stringr::str_remove_all(term, "`")) %>%
rename(
candidate = term,
coefficient = estimate
) %>%
select(candidate, coefficient) %>%
gt() %>%
fmt_number(
columns = vars(coefficient),
decimals = 3
)
candidate | coefficient |
---|---|
(Intercept) | −1.583 |
Boyko Yuriy | 0.005 |
Vilkul Oleksandr | −0.006 |
Hrytsenko Anatoliy | 0.518 |
Zelensky Volodymyr | −0.005 |
Koshulynsky Ruslan | 0.237 |
Liashko Oleg | −0.102 |
Poroshenko Petro | 0.212 |
Smeshko Igor | −0.172 |
Tymoshenko Yuliya | −0.064 |
Spoiled bulletins | 0.015 |
Seems good. We received adjusted R-squared almost 91%. But in this case, we are interested in the right interpretation of regression coefficients, not in good explanation of the variance of the depending variable. Ok, now we see that the support of the party is associated with the support of Anatoliy Hrytsenko, Ruslan Koshulynsky and Petro Poroshenko (positive coefficients), and not associated with support of other candidates (negative coefficients). But this coefficient doesn’t answer us who voted for the Voice party For example, Petro Poroshenko has coefficient 0.21, but it doesn’t mean that 21% of Petro Poroshenko voters voted in parliament election for the Voice. And Yulia Tymoshenko’s coefficient -0.06 doesn’t say, that -6% voters of Tymoshenko voted for the Voice, that is for sure :)
We can try to leave only variables with positive coefficients and make new model:
lm(Voice ~ `Boyko Yuriy` + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` +
`Poroshenko Petro` + `Spoiled bulletins`,
data = parliament_election_2019_by_vd %>%
left_join(
president_election_2019_1_by_vd,
by = c("polling_station_number", "district_number")
)
) %>%
broom::tidy() %>%
mutate(term = stringr::str_remove_all(term, "`")) %>%
rename(
candidate = term,
coefficient = estimate
) %>%
select(candidate, coefficient) %>%
gt() %>%
fmt_number(
columns = vars(coefficient),
decimals = 3
)
candidate | coefficient |
---|---|
(Intercept) | −7.143 |
Boyko Yuriy | 0.007 |
Hrytsenko Anatoliy | 0.443 |
Koshulynsky Ruslan | 0.305 |
Poroshenko Petro | 0.182 |
Spoiled bulletins | −0.762 |
Bad idea. New coefficients are different from old and the intercept is also different. But the result is closer to what you want to get. Let’s continue our analysis.
Maybe the reason for our failures lies in the intercept. We can force it to 0 so:
lm(I(Voice - 0) ~ 0 + `Boyko Yuriy` + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` +
`Poroshenko Petro` + `Spoiled bulletins`,
data = parliament_election_2019_by_vd %>% left_join(
president_election_2019_1_by_vd,
by = c("polling_station_number", "district_number")
)
) %>%
broom::tidy() %>%
mutate(term = stringr::str_remove_all(term, "`")) %>%
rename(
candidate = term,
coefficient = estimate
) %>%
select(candidate, coefficient) %>%
gt() %>%
fmt_number(
columns = vars(coefficient),
decimals = 3
)
candidate | coefficient |
---|---|
Boyko Yuriy | −0.006 |
Hrytsenko Anatoliy | 0.436 |
Koshulynsky Ruslan | 0.284 |
Poroshenko Petro | 0.172 |
Spoiled bulletins | −0.996 |
We need again to remove variables with negative coefficients. Let’s do this:
lm(I(Voice - 0) ~ 0 + `Hrytsenko Anatoliy` + `Koshulynsky Ruslan` + `Poroshenko Petro`,
data = parliament_election_2019_by_vd %>% left_join(
president_election_2019_1_by_vd,
by = c("polling_station_number", "district_number")
)
) %>%
broom::tidy() %>%
mutate(term = stringr::str_remove_all(term, "`")) %>%
rename(
candidate = term,
coefficient = estimate
) %>%
select(candidate, coefficient) %>%
gt() %>%
fmt_number(
columns = vars(coefficient),
decimals = 3
)
candidate | coefficient |
---|---|
Hrytsenko Anatoliy | 0.413 |
Koshulynsky Ruslan | 0.334 |
Poroshenko Petro | 0.124 |
Much better. But why we can immediately train the model with only positive coefficients and intercept 0 without recursive removing variables with negative coefficients?
The reason lies in the method that is used for the estimation of linear regression. This method is called least squares method. This method minimizes mean squared error and estimated coefficients can vary from -∞ to +∞. Also, intercept can vary from -∞ to +∞. We need a model with non-negative coefficients.
Our goal is to get some linear regression that has only non-negative coefficients and intercept 0. Why is that so? Such regression model describes our dependent variable as the sum of independent variables, which is weighted on their coefficients. Fortunately, it is a method for estimation of linear regression. It is called the non-negative least squares method.
In R it is provided by nnls package. It is pretty straightforward.
xy <- parliament_election_2019_by_vd %>%
left_join(
president_election_2019_1_by_vd,
by = c("polling_station_number", "district_number")
) %>%
mutate_all(~ ifelse(is.na(.), 0, .))
x <- xy %>%
select(
`Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`,
`Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`,
`Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`,
`Spoiled bulletins`
) %>%
as.matrix()
y <- xy %>% pull(Voice)
fit <- nnls::nnls(x, y)
tibble(
candidate = colnames(x),
coefficient = round(fit$x, 3)
) %>%
filter(coefficient != 0) %>%
gt::gt()
candidate | coefficient |
---|---|
Hrytsenko Anatoliy | 0.413 |
Koshulynsky Ruslan | 0.334 |
Poroshenko Petro | 0.124 |
This is the same result as we received earlier with a simple linear regression and removing negative coefficients and forcing intercept to 0.
But the sum of the coefficients does not equal 1 (only = 0.87). What’s the reason?
The reason is the different voters turnout. Presidential election voters turnout was 62%, parliament election voters turnout was 49%. We should weight it.
We can build a model not for absolute value, for the percentage of support and after converting back to absolute value. So:
xy <- parliament_election_2019_by_vd %>%
left_join(
president_election_2019_1_by_vd,
by = c("polling_station_number", "district_number")
) %>%
mutate_all(~ ifelse(is.na(.), 0, .))
x <- xy %>%
select(
`Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`,
`Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`,
`Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`,
`Spoiled bulletins`
) %>%
mutate_all(~ . / xy$president_voters) %>%
as.matrix()
x[is.na(x)] <- 0
y <- xy %>%
mutate(Voice = Voice / xy$parliament_voters) %>%
pull(Voice)
y[is.na(y)] <- 0
fit <- nnls::nnls(x, y)
tibble(
candidate = colnames(x),
coefficient = round(fit$x, 3)
) %>%
filter(coefficient != 0) %>%
gt::gt()
candidate | coefficient |
---|---|
Hrytsenko Anatoliy | 0.464 |
Koshulynsky Ruslan | 0.409 |
Poroshenko Petro | 0.110 |
So we have coefficients, but how do we convert them to an absolute value? As simple as follows:
tibble(
candidate = colnames(x),
votes = round(x %*% diag(fit$x) * xy$president_voters) %>% colSums(),
votes_percent = votes / sum(votes)
) %>%
filter(votes != 0) %>%
gt::gt() %>%
fmt_percent(
columns = vars(votes_percent),
decimals = 2
)
candidate | votes | votes_percent |
---|---|---|
Hrytsenko Anatoliy | 600541 | 57.31% |
Koshulynsky Ruslan | 122192 | 11.66% |
Poroshenko Petro | 325091 | 31.03% |
How accurate is this model? We can measure rmse for the model:
cat(paste0("RMSE: ", round(Metrics::rmse(y, fit$fitted), 3)))
RMSE: 0.037
We can create much more accurate model using ensemble of non negative regressions created for certain distiricts.
xy <- parliament_election_2019_by_vd %>%
left_join(
president_election_2019_1_by_vd,
by = c("polling_station_number", "district_number")
) %>%
mutate_all(~ ifelse(is.na(.), 0, .))
big_x <- xy %>%
select(`Boyko Yuriy`, `Vilkul Oleksandr`, `Hrytsenko Anatoliy`,
`Zelensky Volodymyr`, `Koshulynsky Ruslan`, `Liashko Oleg`,
`Poroshenko Petro`, `Smeshko Igor`, `Tymoshenko Yuliya`,
`Spoiled bulletins`) %>%
mutate_all(~ . / xy$president_voters) %>%
as.matrix()
big_x[is.na(big_x)] <- 0
big_y <- xy %>%
mutate(Voice = Voice / xy$parliament_voters) %>%
pull(Voice)
big_y[is.na(big_y)] <- 0
flow_from_first_round_to_parliament <- purrr::map_dfr(unique(xy$district_number), function(x) {
y <- big_y[xy$district_number == x]
y <- ifelse(is.na(y), 0, y)
fit <- nnls::nnls(big_x[xy$district_number == x, ], y)
tibble(
district_number = x,
from = colnames(big_x),
coef = ifelse(fit$x > 1, 1, fit$x),
all_votes = unname(colSums(big_x[xy$district_number == x, ] *
xy$president_voters[xy$district_number == x])),
predicted_votes = round(unname(colSums(big_x[xy$district_number == x, ] %*%
diag(coef) *
xy$parliament_voters[xy$district_number == x]))),
rmse = Metrics::rmse(y, fit$fitted)
)
})
flow_from_first_round_to_parliament %>%
group_by(from) %>%
summarise(predicted_votes = sum(predicted_votes)) %>%
arrange(desc(predicted_votes)) %>%
gt::gt()
from | predicted_votes |
---|---|
Poroshenko Petro | 282249 |
Hrytsenko Anatoliy | 189904 |
Smeshko Igor | 127531 |
Zelensky Volodymyr | 106922 |
Koshulynsky Ruslan | 51083 |
Vilkul Oleksandr | 20576 |
Tymoshenko Yuliya | 14977 |
Spoiled bulletins | 12354 |
Boyko Yuriy | 7027 |
Liashko Oleg | 2498 |
cat(paste0("Mean rmse: ", round(mean(flow_from_first_round_to_parliament$rmse), 3)))
Mean rmse: 0.017
The new model is twice better than the old one without ensemble!
So, let’s create a model for all parties and visualize it.
big_y <- xy %>%
select(Oppobloc:`Other parties`) %>%
mutate_all(~ . / xy$parliament_voters)
big_y[is.na(big_y)] <- 0
purrr::map_dfr(colnames(big_y), function(to) {
purrr::map_dfr(unique(xy$district_number), function(x) {
y <- big_y[[to]][xy$district_number == x]
y <- ifelse(is.na(y), 0, y)
fit <- nnls::nnls(big_x[xy$district_number == x, ], y)
tibble(
to = to,
district_number = x,
from = colnames(big_x),
coef = ifelse(fit$x > 1, 1, fit$x),
all_votes = unname(colSums(big_x[xy$district_number == x, ] *
xy$president_voters[xy$district_number == x])),
predicted_votes = round(unname(colSums(big_x[xy$district_number == x, ] %*%
diag(coef) *
xy$parliament_voters[xy$district_number == x]))),
rmse = Metrics::rmse(y, fit$fitted)
)
})
}) %>%
group_by(from, to) %>%
summarise(predicted_votes = sum(predicted_votes)) %>%
arrange(desc(predicted_votes)) %>%
ggplot(aes(y = predicted_votes, axis1 = from, axis2 = to)) +
geom_alluvium(aes(fill = from), width = 1 / 12) +
geom_stratum(width = 1 / 6, fill = "white", color = "black") +
geom_text(stat = "stratum", label.strata = TRUE, family = "PT Sans", size = 3) +
scale_fill_manual(values = c(
"#1f78b4", "#ff7f00", "#6a3d9a", "#ffff99", "#e31a1c",
"#cab2d6", "#fdbf6f", "#fb9a99", "#a6cee3", "#b2df8a"
)) +
scale_x_discrete(
limits = c("Presidential election\n(31.03.2019)", "Parliamentary election\n(21.07.2019)"),
expand = c(.0, .0)
) +
labs(
title = "Voters flow from presidential election to parliamentary election",
caption = "Using Central Election Commissions data"
) +
theme_ipsum(base_family = "PT Sans") +
theme(
legend.position = "none",
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text.y = element_blank()
)