European values clustering using R

Which European countries are similar in their values?

Roman Kyrychenko
02-18-2019


Sometimes you need to analyze geographically distributed data. The regional breakdown is itself a factor that may affect the variables in one way or another. What methods for analyzing geographic data does R offer? I will try to answer this question using data from the European Social Survey (ESS).

First of all, I was interested in the geographical distribution of the values held by the population of Europe. I wanted to find out in which regions of Europe people are more conservative and in which they are more liberal, where the values of hedonism dominate and where the values of restraint do, and so on.

But my primary interest was to determine how much national borders in Europe matter here. In other words, are the values that dominate in neighboring regions of different countries more similar than those of distant regions within one country?

The ESS data helped me answer these questions.

In this article, I describe an approach to clustering geographically constrained data. Data from the European Social Survey are well suited to this task.

Data for spatial analysis

For the spatial analysis of the ESS data, it was essential to combine them with geographic polygons, which I took from GADM. The problem was that the regional divisions in the ESS and in the GADM shapefiles sometimes differed, so I aligned them manually. In addition, the division was in places so detailed that some regions had too few respondents; such regions (for example, in Ukraine, Albania, and Turkey) were merged into larger ones.

You can find the ESS data here or in the /datahub section of this site. The GADM data are available in the same section.

Because my goal was to look at the distribution of values in Europe, I converted the survey variables to Schwartz values.

The ESS uses the Schwartz scale to characterize personal values. In its classical form, it includes ten components:

  1. power
  2. achievement
  3. hedonism
  4. stimulation
  5. self-direction
  6. universalism
  7. benevolence
  8. conformity
  9. tradition
  10. security

In the ESS, these values are measured by 21 variables. There is a separate instruction for summarizing them into the ten values. In essence, for each respondent we compute the mean of the variables that correspond to a given value and subtract the mean of that respondent's answers to all of the value questions. This makes sense because the scale is relative: the values are set against one another. It also handles respondents who say that every description is like them, which would otherwise distort the data.

These ten values can be further aggregated into four higher-order values. This method is scientifically substantiated and described in the ESS documentation:
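The centering step described above can be sketched in a few lines of base R. This is only an illustration on invented data: the item names (`pow1`, `hed1`, and so on), the three values shown, and the helper `centre_values()` are hypothetical and are not the ESS variable names or the code from ess_prepare.R.

```r
# Toy answers of three respondents to six items (item names are invented)
answers <- data.frame(
  pow1 = c(2, 6, 3), pow2 = c(4, 6, 1),
  hed1 = c(5, 6, 2), hed2 = c(3, 6, 4),
  sec1 = c(6, 6, 5), sec2 = c(4, 6, 3)
)

# Which items belong to which value (hypothetical mapping)
items <- list(
  power    = c("pow1", "pow2"),
  hedonism = c("hed1", "hed2"),
  security = c("sec1", "sec2")
)

centre_values <- function(df, items) {
  # Mean of a respondent's answers over all value items
  overall <- rowMeans(df[unlist(items)], na.rm = TRUE)
  # Mean per value, one column per value
  raw <- sapply(items, function(cols) rowMeans(df[cols], na.rm = TRUE))
  # Subtract the respondent's own mean from each of their value scores
  raw - overall
}

scores <- centre_values(answers, items)
print(scores)
```

The second respondent gives the same answer to every item, so all of their centered scores are zero: a uniform response style carries no information about which values matter more to that person.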

  1. Openness to Change
  2. Self-Transcendence
  3. Conservation
  4. Self-Enhancement [1]

Some of the transformations of the geographic data are saved in the ess_prepare.R script.
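As a rough sketch of how ten value scores could be collapsed into the four higher-order values, one can average the constituent values. The grouping below follows the usual reading of the Schwartz model, and placing hedonism under Openness to Change is my assumption here (it sits on the border with Self-Enhancement); the numbers are invented and the exact rule used in ess_prepare.R may differ.

```r
# Toy centered scores of two regions on the ten values (numbers are invented)
values <- data.frame(
  power = c(-0.8, -0.2), achievement = c(-0.4, 0.0),
  hedonism = c(0.2, -0.6), stimulation = c(-0.2, -0.8),
  self_direction = c(0.6, 0.2), universalism = c(0.8, 0.4),
  benevolence = c(1.0, 0.6), conformity = c(-0.6, 0.2),
  tradition = c(-0.4, 0.4), security = c(0.0, 0.8)
)

# Assumed grouping of the ten values into four higher-order values
higher_order <- list(
  `Openness to Change` = c("self_direction", "stimulation", "hedonism"),
  `Self-Transcendence` = c("universalism", "benevolence"),
  Conservation         = c("security", "conformity", "tradition"),
  `Self-Enhancement`   = c("power", "achievement")
)

# Each higher-order score is the mean of its constituent values
aggregated <- sapply(higher_order, function(cols) rowMeans(values[cols]))
print(round(aggregated, 2))
```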

Values area

To begin, let's see how the ten values of the Schwartz scale are distributed geographically. I rescaled them (code below) to the range from 0 (barely characteristic) to 5 (highly characteristic). Some values are expressed roughly uniformly in all corners of Europe, while others differ significantly.

To create the map, I computed the median of each of the ten values for each region and plotted it.


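library(dplyr)    # data manipulation verbs and the %>% pipe
library(ggplot2)  # fortify() and all plotting functions

# `ess`, `reg` and `regions_gSimplify` are assumed to come from ess_prepare.R

# Lookup table of ESS regions and the countries they belong to
cnt <- ess %>% 
  dplyr::select(region, cntry) %>% 
  distinct(.keep_all = TRUE)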

# Keep regions that also exist in the shapefile, take the latest ESS round
# available for each region, and compute the regional medians
tbl <- ess %>% 
  filter(region %in% cnt$region[cnt$region %in% reg$id]) %>% 
  group_by(region) %>% 
  top_n(1, round_year) %>% 
  summarise(
    `Openness to Change` = median(`Openness to Change`, na.rm = TRUE),
    Conservation = median(Conservation, na.rm = TRUE),
    `Self-Enhancement` = median(`Self-Enhancement`, na.rm = TRUE),
    `Self-Trancendence` = median(`Self-Trancendence`, na.rm = TRUE),
    security = median(security, na.rm = TRUE),
    conformity = median(conformity, na.rm = TRUE),
    tradition = median(tradition, na.rm = TRUE),
    benevolence = median(benevolence, na.rm = TRUE),
    universalism = median(universalism, na.rm = TRUE),
    self_direction = median(self_direction, na.rm = TRUE),
    stimulation = median(stimulation, na.rm = TRUE),
    hedonism = median(hedonism, na.rm = TRUE),
    achievement = median(achievement, na.rm = TRUE),
    power = median(power, na.rm = TRUE)
) 

# Rescale the ten basic values (columns 6 to 15) to the range from 0 to 5
tbl[6:15] <- tbl[6:15] %>% 
  as.matrix() %>% 
  BBmisc::normalize("range", range = c(0, 5))

# Turn the polygons into a data frame and attach the regional scores
regions_gSimplify_df <- fortify(regions_gSimplify, 
                                region = "id")  %>% 
  left_join(
    left_join(
      reg[reg$id %in% cnt$region,]@data %>% 
        mutate(id = as.character(id)), tbl, 
      by = c("id" = "region")
    ), by = "id")
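The rescaling to the 0 to 5 range used above is a min-max transformation. Here is a minimal base R sketch of the idea for a single vector; `rescale_range()` is a hypothetical helper written for this illustration, not the implementation inside BBmisc, and the medians are invented.

```r
# Linearly map a numeric vector onto [lower, upper]
rescale_range <- function(x, lower = 0, upper = 5) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant vector has no spread; put it in the middle of the range
    return(x * 0 + (lower + upper) / 2)
  }
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

# Invented regional medians of one value
medians <- c(-0.8, -0.3, 0.2, 0.7)
rescaled <- rescale_range(medians)
print(round(rescaled, 2))
```

The smallest median is mapped to 0 and the largest to 5, so the colors of the map show where a region stands relative to the other regions rather than the raw score.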

As we can see, the value of power clearly divides Europe into West and East. The values of preservation, hedonism and stimulation divide Europe in a similar way. For the other values there is no such clear division, although in general the main differences run along the West-East (counting Scandinavia as part of the West) and North-South axes.


# Reshape to long format: one row per polygon point and value, for faceting
regions_gSimplify_wide <- regions_gSimplify_df %>% 
  select(long, lat, order, hole, piece, id, 
         group, security, conformity, tradition, 
         benevolence, universalism, self_direction, 
         stimulation, hedonism, achievement, power) %>% 
  tidyr::gather(key = value, value = score, security, 
                conformity, tradition, benevolence, 
                universalism, self_direction, 
                stimulation, hedonism, 
                achievement, power)

# One small map per value, filled by the rescaled regional median
p <- ggplot(regions_gSimplify_wide, aes(map_id = id)) +
  geom_map(map = regions_gSimplify_wide, 
           aes(fill = score)) + 
  expand_limits(x = regions_gSimplify_wide$long, 
                y = regions_gSimplify_wide$lat) +
  scale_fill_gradient(low = "#fff5eb", high = "#7f2704") +
  theme_void() + 
  coord_map("ortho", orientation = c(41, 22, -10), 
            xlim = c(-10, 43), ylim = c(33, 70)) + 
  theme(
    legend.position = "bottom", 
    strip.text = element_text(family = "PT Sans", 
                              size = 5, 
                              face = "bold"),
    legend.title = element_blank(), 
    legend.text = element_text(family = "PT Sans", 
                               size = 5),
    plot.margin = unit(c(0, 0, 0, 0),"cm")
  ) +
  guides(fill = guide_legend(
    title.position = "left", 
    ncol = 6, 
    keywidth = 0.5, 
    keyheight = 0.5,
    label.position = "bottom")
  ) + 
  facet_wrap(~value, ncol = 4)

ggsave("map.svg", plot = p, device = "svg", 
       width = 13, height = 10, units = "cm", dpi = 300)

Figure 1: European values map (Schwartz scale)

Spatial clustering

To check these impressions, we will run a cluster analysis of the data and visualize it. Keep in mind that the clustering has to account not only for the values but also for the geographic proximity of the regions. Most clustering algorithms, such as k-means, form clusters from the data alone and in a partly random way. We need a model that prefers to group geographically close units. That is why the usual approach to clustering does not suit us; we need an option that takes geography into account. And such an option exists.

I used the ClustGeo package to cluster spatially constrained data.

The method implemented in this library adapts hierarchical cluster analysis to the task of grouping georeferenced data. Essentially, it is a combination of two hierarchical models, one based on the data and the other on an analysis of geographic neighborhood.
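To give a feeling for how two dissimilarities can be combined, here is a simplified base R sketch on invented data. It mixes the squared, rescaled dissimilarities with a weight alpha and runs Ward-type clustering on the result. This is my approximation of the idea, not the package's code; the function `mix_and_cluster()` and the six toy regions are hypothetical.

```r
# Six toy regions (invented): a value score and a position on a line
score    <- c(r1 = 0.10, r2 = 0.20, r3 = 0.90, r4 = 0.15, r5 = 0.80, r6 = 0.85)
position <- c(r1 = 1, r2 = 2, r3 = 3, r4 = 7, r5 = 8, r6 = 9)

D0 <- dist(score)     # dissimilarity in values
D1 <- dist(position)  # dissimilarity in geography

mix_and_cluster <- function(D0, D1, alpha) {
  n <- attr(D0, "Size")
  # Rescale both to [0, 1] so that alpha weighs comparable quantities
  D0 <- D0 / max(D0)
  D1 <- D1 / max(D1)
  # Convex combination of squared dissimilarities, Ward-type agglomeration
  delta <- ((1 - alpha) * D0^2 + alpha * D1^2) / (2 * n)
  hclust(delta, method = "ward.D")
}

values_only <- cutree(mix_and_cluster(D0, D1, alpha = 0), k = 2)
geo_only    <- cutree(mix_and_cluster(D0, D1, alpha = 1), k = 2)
print(rbind(values_only, geo_only))
```

With alpha = 0 the regions are grouped purely by their scores (r4 joins r1 and r2 even though it lies far away), while with alpha = 1 they are grouped purely by position. Values of alpha in between trade one kind of homogeneity for the other.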


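# D0: dissimilarity between regions in the space of the ten values
# (Minkowski distance with the default p = 2, i.e. Euclidean)
D0 <- tbl[6:15] %>% 
  as.matrix() %>% 
  dist(method = "minkowski")

# Positions of the regions from tbl among the polygons of the shapefile
idx <- sapply(
  (tbl %>% select(region) %>% 
     left_join(
       regions_gSimplify@data %>% 
         mutate(id = as.character(id)), 
       by = c("region" = "id")
     ) %>% 
     distinct() %>% pull(region)) %>% 
    unique(), function(x) {
       which(x == as.character(regions_gSimplify$id %>% unique()))
     }
) %>% unlist() %>% unique()

# Binary contiguity matrix: 1 if two polygons are neighbors, 0 otherwise
A <- spdep::nb2mat(
  spdep::poly2nb(
    regions_gSimplify
  ), style = "B", zero.policy = TRUE)

diag(A) <- 1          # every region counts as its own neighbor
A <- A[idx, idx]      # keep only the regions present in tbl
colnames(A) <- rownames(A) <- tbl$region
D1 <- as.dist(1 - A)  # D1: 0 for neighbors, 1 otherwise

# Hierarchical clustering that mixes D0 and D1 with weight alpha
fit <- ClustGeo::hclustgeo(D0, D1, alpha = 0.18)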

# Dendrogram with the two cut heights (0.01 and 0.005) shown as dashed lines
ggdendro::ggdendrogram(fit, rotate = TRUE, 
                       theme_dendro = FALSE) +
  ylab("") +
  xlab("") +
  geom_hline(yintercept = 0.01, 
             linetype = "dashed") +
  geom_hline(yintercept = 0.005, 
             linetype = "dashed") +
  theme_minimal() + theme(
    axis.text.y = element_text(family = "PT Sans", 
                               size = 5),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
  )

Figure 2: European values hierarchical clustering

The dendrogram clearly shows two large clusters, each of which can in turn be divided into a couple of smaller ones. We will define two thresholds: one to separate the two large clusters and one to split each of them into 2-3 smaller clusters. Suitable threshold values are 0.01 and 0.005.

The ClustGeo package can also determine the best balance between the two hierarchical clustering models. This is done with the choicealpha function, which helps find the optimal alpha value (the model above already uses it). We not only compute it but also visualize it:
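Cutting one tree at two heights gives nested partitions. The toy example below (invented one-dimensional data, average linkage chosen only for illustration) shows the mechanics with base R's hclust and cutree, including the "cluster.subcluster" labels.

```r
# Six invented points forming three tight pairs, two of them close together
x <- c(a = 0, b = 0.1, c = 1, d = 1.1, e = 5, f = 5.2)
tree <- hclust(dist(x), method = "average")

coarse <- cutree(tree, h = 3)    # high threshold: few large clusters
fine   <- cutree(tree, h = 0.5)  # low threshold: more, smaller clusters

# Renumber the fine clusters within each coarse cluster and build labels
nested <- ave(fine, coarse, FUN = function(v) as.numeric(as.factor(v)))
labels <- paste0(coarse, ".", nested)
print(data.frame(point = names(x), coarse, fine, labels))
```

Because both partitions come from the same tree, every fine cluster lies entirely inside one coarse cluster, which is what makes labels such as 1.1 and 1.2 meaningful.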


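# Cut the tree at both thresholds and label the subclusters
# within each large cluster (1.1, 1.2, ..., 2.1, ...)
tbl <- tbl %>% mutate(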
  two_cluster_solution = cutree(fit, h = 0.01),
  five_cluster_solution = cutree(fit, h = 0.005)
) %>% group_by(two_cluster_solution) %>% 
  mutate(
    five_cluster_solution = paste0(two_cluster_solution, 
      ".", as.numeric(as.factor(five_cluster_solution)))
) %>% ungroup()

# Normalized explained inertia of both models over a grid of alpha values
ClustGeo::choicealpha(D0, D1, 
                      range.alpha = seq(0, 0.5, by = 0.01),
                      K = 33, graph = FALSE)$Qnorm %>% 
  as_tibble(rownames = "alpha") %>% 
  rename(`D0 model` = "Q0norm",`D1 model` = "Q1norm") %>% 
  mutate(alpha = readr::parse_number(alpha)) %>% 
  tidyr::gather("clustering", "explained inertia", - alpha) %>% 
  ggplot(aes(alpha, `explained inertia`, color = clustering)) + 
  geom_path() +
  scale_color_manual(values = c("#a6cee3", "#b2df8a")) +
  theme_minimal() + theme(
    axis.text = element_text(family = "PT Sans", size = 9),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(colour = "black", 
                                    linetype = "dashed", 
                                    size = 0.05)
  )

Figure 3: Selection of threshold

Based on the graph above, the optimal alpha is 0.18.
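To make the idea behind such a plot more concrete, here is a base R sketch on invented data. For each alpha it builds a partition, then measures how much of the total inertia the partition explains in terms of values (Q0) and in terms of geography (Q1). The functions `partition()`, `inertia()` and `explained()` are hypothetical helpers, the mixing rule is my simplification, and the normalization by the value at alpha = 0 and alpha = 1 is an assumption about how the normalized curves are obtained.

```r
# Six toy regions (invented): a value score and a position on a line
score    <- c(r1 = 0.10, r2 = 0.20, r3 = 0.90, r4 = 0.15, r5 = 0.80, r6 = 0.85)
position <- c(r1 = 1, r2 = 2, r3 = 3, r4 = 7, r5 = 8, r6 = 9)
D0 <- dist(score)
D1 <- dist(position)

# Partition into k clusters for a given alpha (simplified Ward-type mixing)
partition <- function(D0, D1, alpha, k) {
  n <- attr(D0, "Size")
  delta <- ((1 - alpha) * (D0 / max(D0))^2 + alpha * (D1 / max(D1))^2) / (2 * n)
  cutree(hclust(delta, method = "ward.D"), k = k)
}

# Pseudo-inertia of a group of units, from pairwise dissimilarities
inertia <- function(D, members) {
  m <- as.matrix(D)
  sum(m[members, members]^2) / (2 * length(members) * nrow(m))
}

# Share of the total inertia explained by a partition
explained <- function(D, clusters) {
  groups <- split(seq_along(clusters), clusters)
  within <- sum(sapply(groups, function(g) inertia(D, g)))
  1 - within / inertia(D, seq_along(clusters))
}

alphas <- seq(0, 1, by = 0.1)
Q <- as.data.frame(t(sapply(alphas, function(a) {
  cl <- partition(D0, D1, a, k = 2)
  c(alpha = a, Q0 = explained(D0, cl), Q1 = explained(D1, cl))
})))

# Normalize each curve by its value at the extreme that favors it
Q$Q0norm <- Q$Q0 / Q$Q0[1]
Q$Q1norm <- Q$Q1 / Q$Q1[nrow(Q)]
print(round(Q, 2))
```

A reasonable alpha is one where geographic cohesion (Q1) has risen noticeably while the homogeneity of values (Q0) has dropped only a little.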

Map of European values

After that, all that remains is to put the cluster assignments on the map.


# Rebuild the map data so that it includes the cluster columns added to tbl
regions_gSimplify_df <- fortify(regions_gSimplify, 
                                region = "id")  %>% 
  left_join(
    left_join(
      reg[reg$id %in% cnt$region,]@data %>% 
        mutate(id = as.character(id)), tbl, 
      by = c("id" = "region")
    ), by = "id")

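# Fill colors: shades of blue for the subclusters of cluster 1,
# shades of green for the subclusters of cluster 2
clusters_fill <- list(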
  cluster_1 = c(
    "#9ecae1",
    "#6baed6",
    "#3182bd"),
  cluster_2 = c(
   "#74c476",
    "#006d2c")
) %>% unlist()

p2 <- ggplot() +
  geom_map(data = regions_gSimplify_df %>% 
             filter(!is.na(five_cluster_solution)),
           map = regions_gSimplify_df %>% 
             filter(!is.na(five_cluster_solution)), 
           aes(map_id = id, fill = five_cluster_solution)) + 
  expand_limits(x = regions_gSimplify_df$long, 
                y = regions_gSimplify_df$lat) +
  theme_void() + 
  scale_fill_manual(values = unname(clusters_fill),
                    na.value = "lightgrey") +
  coord_map("gilbert", 
            xlim = c(-10, 50), ylim = c(33, 71)) + 
  geom_text(aes(x=-20, y = 70, 
                label = "European values clusters", 
                hjust = 0, vjust=1), 
            family = "PT Sans", 
            color = "black", size = 5)+
  geom_text(aes(x=-19, y = 68.6, 
                label = "Based on ESS data", 
                hjust = 0, vjust = 1), 
            family = "PT Sans", color = "black", size = 3)+
  theme(
    legend.position = "bottom", 
    legend.title = element_text(family = "PT Sans", 
                                size = 14, face = "bold"), 
    legend.text = element_text(family = "PT Sans", 
                               size = 13)
  ) +
  guides(fill = guide_legend(
    title = "values cluster",
    title.position = "left", ncol = 6, 
    keywidth = 2, keyheight = 2,
    label.position = "bottom")
  )

ggsave("cluster.svg", plot = p2, device = "svg", 
       width = 27, height = 20, units = "cm", dpi = 300)

Figure 4: European values clustering map (Schwartz scale)

As we can see, our assumption about the differences between the West and the East has been confirmed, although there are interesting nuances. For example, Italy turned out to be closer to Eastern Europe than to the West. It is also interesting that Estonia was assigned to the Western European cluster.


  1. Detailed calculation instructions can be found here: https://www.europeansocialsurvey.org/docs/methodology/ESS_computing_human_values_scale.pdf

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Kyrychenko (2019, Feb. 18). Random Forest: European values clustering using R. Retrieved from http://randomforest.run/posts/spatial-clustering/

BibTeX citation

@misc{kyrychenko2019european,
  author = {Kyrychenko, Roman},
  title = {Random Forest: European values clustering using R},
  url = {http://randomforest.run/posts/spatial-clustering/},
  year = {2019}
}