In this post, I show how to reveal similarity between objects using graph analysis and a similarity score, with United Nations General Assembly voting from 1992 to 2018 as the example.
We love networks and graphs, but it is often unclear how to run a useful graph analysis on the data we care about. The idea is straightforward: we need a rule that classifies the relationship between each pair of objects in the dataset as “has a connection” or “has no connection”.
I demonstrate this on the United Nations General Assembly voting data.
For graph analysis and visualization, I use the tidygraph and ggraph libraries. They provide flexible, tidyverse-style graph processing.
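I found the United Nations GA voting data here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379
I also wanted to show the gross domestic product per capita of each UN country. For this, I used a dataset with World Bank GDP data from here: https://data.worldbank.org/indicator/ny.gdp.pcap.cd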
suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(tidygraph)
  library(ggraph)
})
# Load the voting data (the completeVotes data frame used below) from
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379
load("~/UN-73new.RData")
# Gross domestic product per capita data from
# https://data.worldbank.org/indicator/ny.gdp.pcap.cd
gdp <- readxl::read_excel("~/gdp.xls") %>%
  select(`Country Name`, `Country Code`, `1960`:`2018`) %>%
  tidyr::gather("year", "gdp", -`Country Name`, -`Country Code`) %>%
  filter(!is.na(gdp)) %>%
  mutate(year = as.numeric(year)) %>%
  group_by(`Country Name`) %>%
  # keep the most recent year with a non-missing GDP value for each country
  slice_max(year, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(
    Region = countrycode::countrycode(`Country Code`, origin = "iso3c", destination = "continent")
  ) %>%
  filter(!is.na(Region))
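We also need a little data cleaning. I removed records with vote codes 8 and 9 (absences and non-members in this dataset's coding) and kept only votes cast from 1992 onward. For simplicity, I recoded every vote other than “yes” to zero.
The similarity score is the cosine similarity that widyr::pairwise_similarity computes between the countries' vote vectors, not a literal percentage of equal votes. With votes coded as 0 or 1, a score of 0 means that two nations never voted “yes” on the same resolution, and a score of 1 means that they voted “yes” on exactly the same resolutions.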
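To make the score concrete, here is a minimal sketch in base R. It assumes the score is cosine similarity, which is what widyr::pairwise_similarity computes as far as I know; the vote vectors and the helper cosine_similarity() are made up for illustration and are not part of the analysis.

```r
# Cosine similarity between two 0/1 vote vectors (1 = "yes", 0 = anything else).
# Hypothetical helper, written only for this illustration.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up votes of four countries on five resolutions.
votes_a <- c(1, 1, 0, 1, 0)
votes_b <- c(1, 1, 0, 1, 0)  # same "yes" pattern as A
votes_c <- c(0, 0, 1, 0, 1)  # never votes "yes" together with A
votes_d <- c(1, 0, 0, 0, 0)  # shares one "yes" with A

sim_ab <- cosine_similarity(votes_a, votes_b)  # 1
sim_ac <- cosine_similarity(votes_a, votes_c)  # 0
sim_ad <- cosine_similarity(votes_a, votes_d)  # 1 / sqrt(3), about 0.577

# For comparison: the plain share of identical votes also counts
# resolutions where neither country voted "yes".
share_equal_ad <- mean(votes_a == votes_d)     # 0.6
```

The last two numbers differ because cosine similarity only rewards shared “yes” votes, while the share of equal votes also rewards shared non-“yes” votes.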
un <- completeVotes %>%
  ungroup() %>%
  select(Country, date, unres, importantvote, vote) %>%
  filter(
    vote != 9,
    vote != 8,
    date >= "1992-01-01"
  ) %>%
  mutate(
    vote = ifelse(vote == 1, 1, 0) # 1 = "yes", every other vote becomes 0
  )
To find countries that vote alike, we need to find a similarity score between their voting at the General Assembly:
cor_mat <- un %>%
  select(unres, Country, vote) %>%
  # keep one vote per country and resolution
  distinct(Country, unres, .keep_all = TRUE) %>%
  widyr::pairwise_similarity(Country, unres, vote)
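Now we need to decide how to define connections between countries. I offer two approaches: connect each country to its three most similar countries, or connect every pair of countries whose similarity is above a fixed threshold (0.8 here).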
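Both rules can be tried on a tiny made-up similarity table before applying them to the real data. This is only a sketch: the countries A to D and their scores are hypothetical, and I keep the single best match per country instead of three so that the toy table stays small.

```r
library(dplyr)

# Hypothetical symmetric similarities between four countries.
pairs <- data.frame(
  item1 = c("A", "A", "A", "B", "B", "C"),
  item2 = c("B", "C", "D", "C", "D", "D"),
  similarity = c(0.95, 0.85, 0.40, 0.90, 0.30, 0.50)
)

# Store every pair in both directions, like the long table in cor_mat.
toy_sim <- rbind(
  pairs,
  data.frame(item1 = pairs$item2, item2 = pairs$item1, similarity = pairs$similarity)
)

# Rule 1: each country is linked to its most similar partner.
top_edges <- toy_sim %>%
  group_by(item1) %>%
  slice_max(similarity, n = 1) %>%
  ungroup()

# Rule 2: keep only pairs with similarity above a fixed threshold.
threshold_edges <- toy_sim %>%
  filter(similarity > 0.8)

# Countries that still have at least one edge under each rule.
top_nodes <- union(top_edges$item1, top_edges$item2)
threshold_nodes <- union(threshold_edges$item1, threshold_edges$item2)
```

In this toy table the first rule keeps D in the graph even though its best score is only 0.5, while the threshold rule leaves D without any edge.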
Let’s try the first approach:
gr <- igraph::graph_from_data_frame(
  cor_mat %>%
    group_by(item1) %>%
    top_n(3, similarity) # the 3 most similar countries for each country
)
graph <- as_tbl_graph(gr) %>%
  # attach GDP and region to each node by ISO3 country code
  left_join(
    gdp %>% rename(country = `Country Name`), by = c("name" = "Country Code")
  ) %>%
  mutate(
    country = ifelse(is.na(country), countrycode::countrycode(name, origin = "iso3c", "country.name"), country)
  ) %>%
  filter(!is.na(country)) %>%
  mutate(
    Region = ifelse(is.na(Region), countrycode::countrycode(name, origin = "iso3c", "continent"), Region)
  )
Let’s visualize this graph:
ggraph(graph, layout = 'kk', maxiter = 10000) +
  geom_node_point(aes(size = gdp, color = Region), alpha = 0.5) +
  geom_edge_fan(alpha = 0.5, show.legend = FALSE, width = 0.05) +
  geom_node_text(aes(label = country), repel = TRUE, size = 1.5, show.legend = FALSE) +
  scale_size(range = c(0.01, 10), name = "Gross Domestic Product per Capita", guide = guide_legend(
    title.position = "top",
    label.position = "bottom")) +
  scale_color_manual(values = c(
    "#e41a1c",
    "#377eb8",
    "#4daf4a",
    "#984ea3",
    "#ff7f00"
  ), name = "Continent", guide = guide_legend(
    title.position = "top",
    label.position = "bottom")) +
  labs(
    title = "United Nations General Assembly Votes 1992-2018",
    subtitle = "Each country connected with 3 most similar countries by voting",
    caption = "Data: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379"
  ) +
  hrbrthemes::theme_ipsum(base_family = "Lato") +
  theme(
    panel.grid = element_blank(),
    legend.position = "bottom",
    axis.title = element_blank(),
    axis.text = element_blank()
  )
Figure 1: United Nations General Assembly Votes 1992-2018 (3 friends)
You can see that there are two clusters of nations.
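A visual impression of clusters can be cross-checked with a community detection algorithm. The sketch below is not part of my analysis: it runs igraph::cluster_fast_greedy, which I picked only as one deterministic option, on a made-up edge list of two triangles joined by a single bridge, and it treats the edges as undirected.

```r
library(igraph)

# Hypothetical edge list: two tightly knit triangles (A-B-C and D-E-F)
# connected by one bridge between C and D.
toy_edges <- data.frame(
  from = c("A", "A", "B", "D", "D", "E", "C"),
  to   = c("B", "C", "C", "E", "F", "F", "D")
)

# Build an undirected graph, since the greedy modularity method
# used below expects undirected edges.
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Greedily merge groups of nodes while modularity keeps increasing.
communities <- cluster_fast_greedy(toy_graph)

# One community id per vertex, in the same order as V(toy_graph).
group_id <- as.integer(membership(communities))
names(group_id) <- V(toy_graph)$name

n_groups <- length(unique(group_id))
clusters <- split(names(group_id), group_id)
```

On the real graph, the same call would need the edges to be treated as undirected first, and the detected groups may or may not match what the eye sees in the plot.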
Will the results we get with the second approach be different? Let’s see:
gr <- igraph::graph_from_data_frame(
  cor_mat %>%
    group_by(item1) %>%
    filter(similarity > 0.8) # we change only this part
)
graph <- as_tbl_graph(gr) %>%
  # attach GDP and region to each node by ISO3 country code
  left_join(
    gdp %>% rename(country = `Country Name`), by = c("name" = "Country Code")
  ) %>%
  mutate(
    country = ifelse(is.na(country), countrycode::countrycode(name, origin = "iso3c", "country.name"), country)
  ) %>%
  filter(!is.na(country)) %>%
  mutate(
    Region = ifelse(is.na(Region), countrycode::countrycode(name, origin = "iso3c", "continent"), Region)
  )
Let’s visualize this graph:
ggraph(graph, layout = 'kk', maxiter = 10000) +
  geom_node_point(aes(size = gdp, color = Region), alpha = 0.5) +
  geom_edge_fan(alpha = 0.5, show.legend = FALSE, width = 0.01) +
  geom_node_text(aes(label = country), repel = TRUE, size = 1.5, show.legend = FALSE) +
  scale_size(range = c(0.01, 10), name = "Gross Domestic Product per Capita", guide = guide_legend(
    title.position = "top",
    label.position = "bottom")) +
  scale_color_manual(values = c(
    "#e41a1c",
    "#377eb8",
    "#4daf4a",
    "#984ea3",
    "#ff7f00"
  ), name = "Continent", guide = guide_legend(
    title.position = "top",
    label.position = "bottom")) +
  labs(
    title = "United Nations General Assembly Votes 1992-2018",
    subtitle = "Similarity above 80%",
    caption = "Data: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379"
  ) +
  hrbrthemes::theme_ipsum(base_family = "Lato") +
  theme(
    panel.grid = element_blank(),
    legend.position = "bottom",
    axis.title = element_blank(),
    axis.text = element_blank()
  )
Figure 2: United Nations General Assembly Votes 1992-2018 (with threshold)
The result looks different, but the same two clusters appear as in the first graph. Some countries are missing from this network because they do not have a similarity above 0.8 with any other country.
Thus, both approaches are useful, but the first graph looks better and represents all countries.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kyrychenko (2019, Nov. 10). Random Forest: United Nations General Assembly Voting graph analysis. Retrieved from http://randomforest.run/posts/united-nations-general-assembly-voting-graph-analysis/
BibTeX citation
@misc{kyrychenko2019united,
  author = {Kyrychenko, Roman},
  title = {Random Forest: United Nations General Assembly Voting graph analysis},
  url = {http://randomforest.run/posts/united-nations-general-assembly-voting-graph-analysis/},
  year = {2019}
}