Spatial Data with Python — Let’s Begin!

Matias Mascioto
rmotr.com
Published in
5 min readDec 16, 2019

--

Latitude and longitude. Points, lines, and polygons. GIS. CRS. EPSG. Vector or raster. Shapefile, TIF. At some point in your adventures with Python and/or data, you’ve probably come across some of these words and acronyms. We can easily deduce what some of them reference, but others aren’t as intuitive. In this post, I’ll explain these concepts, which will help you take the first steps into the world of spatial data with Python.

Work on this example interactively with a Jupyter Notebook

Raster vs Vector

There are two ways of storing geospatial data: raster or vector.

The data in a raster format is in a table (rows and columns) where each cell (also called pixels) contains information on the area it represents. Most of the time, the cells are square-shaped and regularly spaced. The information in each cell can be a color, an altitude, a temperature, or any other indicator that the file’s author wants to convey.

On the other hand, vector files represent geographical entities from three basic elements: points, lines, and polygons (areas). In turn, each geographical entity/object can store additional elements. For example, if each entity is a country, then the country’s name, population size, official language, etc. could be stored.

The same information can be represented with a raster or vector format

Let’s analyze a concrete example.

Raster File

NASA (through its Visible Earth project) offers tons of images and data of our planet. Thanks to them, we can download an image in raster format from this link (2MB — TIF). The downloaded file has a .tif extension. Keep in mind that there are many other raster files formats. Some of the more popular ones include Esri Grid, JPEG 2000, MrSID, and ECW.

As we’ve previously mentioned, a raster file is a table with information stored in each of its cells. In this case, since we’re using a color image of Earth, each cell must have the corresponding color to that portion of the planet.

We’ll use Python and rasterio to verify it.

count: 3
height: 1024
width: 2048

This tells us that the raster file has a height of 1024 (rows), a width of 2048 (columns), and three bands. Each band indicates a corresponding layer of information contained in each cell. In this case, this means that each cell contains three values (corresponding to the RGB values).

Let’s see what we can find in each of these bands:

<class ‘numpy.ndarray’>

(1024, 2048)

[[ 10 10 10 … 10 10 10]

[ 10 10 10 … 10 10 10]

[ 10 10 10 … 10 10 10]

[213 219 219 … 218 220 206]

[213 219 219 … 218 220 206]

[206 211 210 … 210 212 199]]

As you can see, the band is a 1024x2048 matrix with values between zero and 255, which indicate the intensity of the color they represent (red, green, and blue).

Finally, we can show the raster file as an image (and then save it in a .png file)

world_raster.png

Vector File

Now let’s see how the vector files can help us represent the information from all the countries in the world and the location of all active volcanoes from the last 10,000 years. For the countries, we’ll use the shapefile provided by Thematic Mapping in this link ( titled TM_WORLD_BORDERS-0.3.zip). The information for the volcanoes can be downloaded from this link from Stanford University ( titled harvard-glb-volc-shapefile.zip).

The shapefile format is the most popular GIS file format. However, it’s not the only one. You may see other file formats such as GeoJSON, KML, KMZ, and even CSV.

If we decompress the .zip files, you’ll notice they’re made up of various files. Regardless, we shouldn’t worry about looking at it as if it were only one file.

We’ll use the geopandas library from Python to read and analyze the vector files.

world_gdf shape (246, 12)

volcanoes_gdf shape (1509, 10)

This last part tells us that world_gdf (the variable where we read the vector file of the countries) has 246 rows, each of them with information from one country, as well as 12 columns representing the countries’ attributes. At the same time, volcanoes_gdf has 1509 volcanoes and 10 attributes.

display(world_gdf.head())

The first rows give us a better understanding of what this file is about. Columns 1–11 have data from the countries (name, codes, etc.) and the 12th column (geometry) has the vector information that represents the country. In this case, each country is represented by a polygon.

display(volcanoes_gdf.head())

For each volcano, we also have several attributes with information on them. Here, each volcano is represented by a point. (Note that the values inside each point coincide with the LAT and LON attributes).

Let’s see how this looks…

world_volcanoes_vector.png

One Last Piece of Advice…

If the raster or vector files refer to locations on Earth (like in this example), we must pay attention to the Coordinate Reference System (CRS) that the file is using. The CRS tells us how the locations indicated in the raster or vector file correspond with Earth.

Furthermore, it establishes which technique must be used to “flatten” or “project” the Earth into two dimensions. It’s a somewhat complex subject, but keep it in mind to avoid some headaches.

In this example, we were lucky. Both vector files use the same CRS (EPSG 4326). If we hadn’t been so lucky, we would’ve had to convert the CRS in one of the two layers (the to_crs() method from geopandas is useful for this!).

There you have it. By taking your first step into spatial data with Python, you’ve traveled around the world in minutes.

Resources

Visible Earth (NASA): https://visibleearth.nasa.gov/images/57752/blue-marble-land-surface-shallow-water-and-shaded-topography

Thematic Mapping: http://thematicmapping.org/downloads/world_borders.php

Rasterio Documentation: https://rasterio.readthedocs.io/en/stable/index.html

Geopandas Documentation: http://geopandas.org/

--

--