Methodology

Social Connectedness Index

Step 1: Create Table
We use a snapshot of Facebook users and their friendship networks to measure the intensity of connectedness between locations. Locations are assigned to users based on their information and activity on Facebook, including the stated city on their Facebook profile, and device and connection information. Our primary measure of Social Connectedness between two locations i and j is:

    Social Connectedness_{i,j} = FB_Connections_{i,j} / (FB_Users_i × FB_Users_j)

Here, FB_Users_i and FB_Users_j are the number of Facebook users in locations i and j, and FB_Connections_{i,j} is the number of Facebook friendship connections between the two.
Social Connectedness_{i,j}, therefore, measures the relative probability of a Facebook friendship link between a given Facebook user in location i and a given Facebook user in location j. Put differently, if this measure is twice as large, a Facebook user in i is about twice as likely to be connected with a given Facebook user in j.
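
As a minimal sketch of this measure, the function below reads it as the number of friendship connections divided by the product of the two user counts, which is consistent with the "relative probability" interpretation above. The function name and the counts are hypothetical and are not taken from the released data.

```python
def social_connectedness(fb_connections_ij: int, fb_users_i: int, fb_users_j: int) -> float:
    """Raw (unscaled) connectedness between locations i and j.

    Assumed form: friendship connections divided by the product of user counts.
    """
    if fb_users_i <= 0 or fb_users_j <= 0:
        raise ValueError("user counts must be positive")
    return fb_connections_ij / (fb_users_i * fb_users_j)


# Hypothetical counts: doubling the connections doubles the measure.
base = social_connectedness(500, 10_000, 20_000)
doubled = social_connectedness(1_000, 10_000, 20_000)
```

With the user counts held fixed, `doubled` is twice `base`, matching the "twice as likely" reading of the measure.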
In each dataset, we scale the measure so that its maximum value is 1,000,000,000 (by dividing the original measure by the dataset maximum and multiplying by 1,000,000,000) and its lowest possible value is 1. We also round the measure to the nearest integer.
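
The scaling step can be sketched as follows. This is an illustration under assumptions, not the released pipeline: the "lowest possible value of 1" is read here as a floor applied after rounding, and the function name and input values are hypothetical.

```python
def scale_sci(raw: dict, max_value: int = 1_000_000_000) -> dict:
    """Rescale raw connectedness values so the largest becomes max_value.

    Values are rounded to the nearest integer and floored at 1
    (the floor is an assumed reading of "lowest possible value of 1").
    """
    top = max(raw.values())
    return {pair: max(1, round(value / top * max_value)) for pair, value in raw.items()}


# Hypothetical raw values for three location pairs.
raw = {("A", "A"): 4e-6, ("A", "B"): 1e-6, ("B", "B"): 2e-15}
scaled = scale_sci(raw)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer; the source does not say how ties are handled.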
We also take several steps to preserve user privacy:
  1. We remove all locations with fewer than 100 active users, except for countries, which are only included if they have more than 50,000 active users.
  2. We add random N(0,1) noise (rounded to the nearest integer) to the number of friendships between each pair of locations. The number of friendships after this noise is added cannot be less than 0.
  3. The SCI presented here is the average SCI across 10 draws of 99% of the population of active Facebook users.
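
The first two privacy steps can be sketched as below. The thresholds come from the list above; the function names, the seed, and the sample count are hypothetical. The third step (averaging over repeated 99% draws of users) needs user-level data and is not sketched here.

```python
import random

MIN_USERS_LOCATION = 100     # locations with fewer than 100 active users are removed
MIN_USERS_COUNTRY = 50_000   # countries need more than 50,000 active users


def keep_location(active_users: int, is_country: bool) -> bool:
    """Step 1: decide whether a location is large enough to be included."""
    if is_country:
        return active_users > MIN_USERS_COUNTRY
    return active_users >= MIN_USERS_LOCATION


def add_noise(friendships: int, rng: random.Random) -> int:
    """Step 2: add rounded N(0,1) noise; the result cannot be less than 0."""
    noise = round(rng.gauss(0.0, 1.0))
    return max(0, friendships + noise)


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
noisy = [add_noise(2, rng) for _ in range(1000)]
```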

We exclude the following areas: Afghanistan, Western Sahara, China, Cuba, Iraq, Israel, Iran, North Korea, Russia, Syria, Somalia, South Sudan, Sudan, Venezuela, Yemen, Crimea, Jammu and Kashmir, Donetsk, Luhansk, Sevastopol, West Bank, and Gaza.


Data Release – August 2020

The data within this folder include this measure calculated for different geographical areas as of August 2020. Each dataset has three columns: user_loc, fr_loc, and scaled_sci. The measure is symmetric, and each dataset includes both the i to j row and the j to i row for every location pair. The folder also includes a number of potentially relevant academic papers.
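
A minimal sketch of reading one of these datasets and checking that both directions of each pair agree. The column names come from the description above; the tab delimiter, the function names, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample in the three-column layout described above.
SAMPLE = (
    "user_loc\tfr_loc\tscaled_sci\n"
    "AA\tAA\t1000000000\n"
    "AA\tBB\t1200\n"
    "BB\tAA\t1200\n"
    "BB\tBB\t900000000\n"
)


def read_sci(handle, delimiter: str = "\t") -> dict:
    """Read rows into a {(user_loc, fr_loc): scaled_sci} mapping."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {(row["user_loc"], row["fr_loc"]): int(row["scaled_sci"]) for row in reader}


def asymmetric_pairs(sci: dict) -> list:
    """Return pairs whose reverse row is missing or has a different value."""
    return [pair for pair, value in sci.items() if sci.get((pair[1], pair[0])) != value]


sci = read_sci(io.StringIO(SAMPLE))
problems = asymmetric_pairs(sci)
```

For a real file, pass an open file handle instead of the in-memory sample and adjust `delimiter` to match the file.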

The datasets included in the August 2020 release are:

  • International Countries. Each row is a country – country pair. Countries are denoted by their ISO2 code.
  • US Counties. Each row is a US county – US county pair. Counties are denoted by their 5-digit FIPS code.
  • US Counties to Country. Each row is a US county – country pair. Counties are denoted by their 5-digit FIPS code, and countries are denoted by their ISO2 code.
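
Because county identifiers are 5-digit FIPS codes, a tool that parses them as integers will drop any leading zero. The helper below is a hypothetical sketch for restoring the 5-digit form; whether a given file needs it depends on how it was read.

```python
def normalize_fips(code) -> str:
    """Return a county FIPS code as a zero-padded 5-character string."""
    text = str(code).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a valid county FIPS code: {code!r}")
    return text.zfill(5)


# A code parsed as the integer 1001 is restored to "01001".
restored = normalize_fips(1001)
unchanged = normalize_fips("06037")
```

Reading the location columns as strings in the first place avoids the problem altogether.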

There are two files built on the Database of Global Administrative Areas (GADM) and the European Nomenclature of Territorial Units for Statistics (NUTS) areas. These data use GADM version 2.8 and NUTS 2016.

  • GADM1_NUTS2: Countries outside of Europe are broken into their GADM level 1 boundaries (e.g. states in USA) if their population > 1 million. Otherwise the area is the full country. European countries are broken into their NUTS2 regions (e.g. 12 provinces in the Netherlands). Each row is a pair of these areas.
  • GADM1_NUTS3_Counties: Countries with population < 1 million are not broken up. European countries are broken into NUTS3 regions (e.g. 40 regions in the Netherlands). The US, Canada, and the countries of the Indian Subcontinent with population > 1 million (Bangladesh, India, Nepal, Pakistan, and Sri Lanka) are broken into their GADM2 regions (e.g. US counties). All other countries are broken into their GADM1 regions. Each row is a pair of these areas.