Spatial Data with Python - Operations!

Matias Mascioto
rmotr.com
Published in
6 min readMar 4, 2020

--

This is the second post in the Spatial Data with Python series. You can find the first post here.

We did it. We’ve taken the first step towards our data analysis project. We know the fundamental concepts of geographical data and how to load, inspect and visualize data in our Python project.

Let’s keep going. Remember the vector files we worked on in the first post? We have the limits of every country in the world (world_gdf) and the position of 1,509 volcanoes around the planet (volcanoes_gdf). In this post, we’ll be working with a shapefile containing 7,343 cities all over the world (cities_gdf), which you can download from Natural Earth (the simple version will suffice for this project).

Let’s load the shapefiles into GeoDataFrames and take a quick look at the first rows to remember what they were about.

display(world_gdf.head())
print(“world_gdf shape: {}”.format(world_gdf.shape))

world_gdf shape: (246, 12)

display(volcanoes_gdf.head())
print(“volcanoes_gdf shape: {}”.format(volcanoes_gdf.shape))

volcanoes_gdf shape: (1509, 10)

display(cities_gdf.head())
print(“cities_gdf shape: {}”.format(cities_gdf.shape))

cities_gdf shape: (7343, 4)

Let’s Save the World! 💪

Let’s briefly imagine the following situation:

Some of the volcanoes from our dataset are active. They could erupt at any moment! We’re responsible for reporting which countries and cities face serious risk.

(We’ll consider a volcano ‘active’ if its state value is ‘Historica’)

Good thing we’re only playing 😁. Decisions like this require a level of rigor that we’ll not be maintaining in this post (data validation, knowledge in the domain, and more…). But we’re still going to try!

There’s a lot of information we can extract from our dataset to make the best possible decisions. Answering these questions may help:

1) Which countries have the most active volcanoes in their territory? And which of these have the highest density of active volcanoes (volcanoes/km2)?

2) Which cities with a population >500,000 are closest to an active volcano?

We have the data and we know Python… let’s get to work! 💻

Active Volcanoes

First, we’ll filter volcanoes_gdf to work only with the active ones.

volcanoes_gdf = volcanoes_gdf.query(“STATUS == ‘Historica’”)
print(volcanoes_gdf.shape)

(541, 10)

Volcanoes by Country

We can apply different methods to calculate how many active volcanoes there are in each country.

Attribute Based Joins

Using the values from the columns NAME (world_gdf) and LOCATION (volcanoes_gdf) we can determine the number of active volcanoes in each country from world_gdf. Right?

For example, we can count how many times each country is repeated in the column LOCATION from volcanoes_gdf and then add that result to a new column in world_gdf using the merge operation.

Most values in ‘volcanoes_country_count’ are NaN

Do you think this is a good solution? I’m not totally convinced:

· The countries’ names are not normalized, so the same country could be written in different ways in each of the DataFrames.

· It is possible that a country present in volcanoes_gdf could not exist in world_gdf (and the other way around).

Countries in 'volcanoes_gdf' and 'world_gdf': 16
Countries in 'volcanoes_gdf' but not in 'world_gdf': 83
Countries in 'world_gdf' but not in 'volcanoes_gdf': 230

Yeah, we were right. Only 16 countries are in both datasets.

Spatial Joins

Knowing that the countries are represented with polygons in world_gdf and the volcanoes are points in volcanoes_gdf, a better option would be to use a function that takes advantage of the spatial relation between these objects, in order to calculate what we need. We can make use of the Spatial Join(sjoin) function from GeoPandas. In this way, we’ll create a new GeoDataFrame (world_volcanoes_gdf) that will have data for each volcano from the country it resides in.

Let’s see:

world_volcanoes_gdf = gpd.sjoin(volcanoes_gdf, world_gdf, how=”left”, op=”within”)
display(world_volcanoes_gdf.loc[:, [“NAME_”, “LOCATION”, “NAME”]].head(20))

It looks better now, doesn’t it?

We can now see which countries contain the most active volcanoes:

world_volcanoes_gdf[“NAME”].value_counts().head(5)

Indonesia 62
United States 50
Japan 47
Russia 44
Chile 31
Name: NAME, dtype: int64

Let’s plot this.

Volcanoes’ Density

We also want to know which countries have the highest density of active volcanoes. We recently added volcanoes_count to world_gdf. So we just have to divide this value by the area (AREA).

Let’s do it!

Highest density of active volcanoes

Distances between cities and volcanoes

Now it’s time to figure out the distances between the cities (represented by point in cities_gdf) and the volcanoes (also represented by points in volcanos_gdf). Calculating the geodesic distance (shortest distance between two points on a surface) on our planet’s surface is not trivial. First, we must understand that this calculation won’t be exact: it’s impossible to take into account all of our planet’s irregularities. For this reason, the Earth is usually modelled as a plane, a sphere or an ellipsoid. In turn, each of these models can vary in their parameters, generating different representations of the planet. For example, here we can see a table of historical ellipsoid models of Earth.

Source: http://physics.nmsu.edu/

We’ll be using the library GeoPy to calculate the distances. This library offers methods such as great_circle ( Earth as a sphere) and geodesic (Earth as an ellipsoid model). We are going to use the latter method, with its default model: WGS-84.

The first thing we need to do is make a calculation to corroborate the result’s precision. Let’s calculate the distance between La Plata (my city!) to New York, and compare the result against the 8571km that distance.to estimates.

New York coordinates: (40.75192492259464, -73.98196278740681)
La Plata coordinates: (-34.90961464563105, -57.959961183287135)
Distance: 8536km

8536km is our calculation’s result. Not bad! Let’s proceed.

We’ll save in big_cities_gdf the cities with a population higher than 500,000.

big_cities_gdf = cities_gdf.loc[cities_gdf[“pop_max”] > 500000]
print(big_cities_gdf.shape)

(988, 4)

Lastly, we calculate the distances. We’ll use the list distances to save the Series which contain the distances from each city to each one of the volcanoes. Later, we’ll convert this list to a DataFrame (distance_matrix_gdf).

Attention: this may take several minutes to calculate. Refill your coffee.

distance_matrix_df.shape = (988, 541)

Let’s see which 5 cities with a population >500,000 are closest to an active volcano:

top_5_indexes = distance_matrix_df.min(axis=1).sort_values(ascending=True).head(5).index
display(cities_gdf.loc[top_5_indexes])

Well, that’s it! I hope this has been helpful in saving the world… 😆

Resources

Distance calculator: https://www.distance.to

EarthWorks — volcanoes of the world: https://earthworks.stanford.edu/catalog/harvard-glb-volc

Geopandas Documentation: http://geopandas.org/

GeoPy Documentation: https://geopy.readthedocs.io/en/stable/#

Historical Earth ellipsoids — Wikipedia: https://en.wikipedia.org/wiki/Earth_ellipsoid#Historical_Earth_ellipsoids

Natural Earth — Populated Places: https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-populated-places/

Thematic Mapping — World Borders: http://thematicmapping.org/downloads/world_borders.php

--

--