Get data

Now that we have a solid plan for the project, it is time to get the data. According to Table 4 of [HTL+21], the only ICESat-2 data file (a.k.a. granule) we need is ATL03_20190805232841_05940403_002_01.h5.

Traditionally, we would go to a data hosting service such as Earthdata Search and search for the exact file. While this approach works for this project, it has several drawbacks in the general data-search case:

  1. The data query process is hard to reproduce using these services. Researchers may be confused by the complexity of the website and struggle to find the right data.

  2. Not every data hosting site provides a bulk access/download service. Without one, retrieving a large volume of data is time-consuming and requires frequent manual checks.

  3. This step is often disconnected from the data processing that follows, which means extra effort for researchers to check and assimilate their data.

Tools in the Jupyter ecosystem (here, icepyx) can mitigate these issues because they run in a Jupyter Notebook, so all the data query and download steps can be scripted and documented. And since we continue to use the Notebook, data access connects smoothly to the rest of the research stages.

Goals

Download the ICESat-2 granule we need.

Steps

First we need icepyx, a Python package for obtaining and working with ICESat-2 data:

import icepyx

Query ICESat-2

icepyx has a Query class for searching available ICESat-2 granules. We use it to see if the file we are looking for (ATL03_20190805232841_05940403_002_01.h5) is available. From the paper we know that

  • The data set name is ATL03.

  • The location of Negribreen is 78.585°N, 18.809°E.

  • The track number (RGT in Table 4) is 594.

  • The date is August 5, 2019.

Using the information above, we can build a query like this:

spatial_extent = [18.3, 78.5, 19.3, 78.7]     # bounding box, [lon_LL, lat_LL, lon_UR, lat_UR]
date_range = ['2019-08-04','2019-08-06']
query = icepyx.Query(dataset='ATL03', spatial_extent=spatial_extent, tracks='0594', date_range=date_range)

Now let’s see the available granules:

query.avail_granules(ids=True)
[['ATL03_20190805232841_05940403_004_01.h5']]

Yes, this is the granule we are looking for. Note that this file comes from the latest release (version 004) instead of version 002 (the one used in the paper), but the content should not differ too much.
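As an aside, the release number is encoded in the granule file name itself: ICESat-2 granules follow the pattern [product]_[yyyymmddhhmmss]_[ttttccss]_[vvv]_[rr].h5, where tttt is the RGT, cc the cycle, ss the granule segment, vvv the release, and rr the revision. A minimal sketch in plain Python (independent of icepyx) decodes the fields we queried on:

granule = 'ATL03_20190805232841_05940403_004_01.h5'
product, datetime_str, orbit_str, release, revision = granule.split('.')[0].split('_')
print('date/time:', datetime_str[:8], datetime_str[8:])                        # 20190805, 232841 (UTC)
print('RGT:', orbit_str[:4], '| cycle:', orbit_str[4:6], '| segment:', orbit_str[6:8])
print('release:', release, '| revision:', revision)                            # 004, 01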

Download the granule

Now we can

  1. log in to the NASA Earthdata service,

  2. send an order to Earthdata, and

  3. download the granule.

earthdata_uid = 'username'        # Please change this to your Earthdata ID
email = 'user@berkeley.edu'       # Please change this to your Earthdata email
query.earthdata_login(earthdata_uid, email)
Earthdata Login password:  ···········
query.order_granules()
Total number of data order requests is  1  for  1  granules.
Data request  1  of  1  is submitting to NSIDC
order ID:  5000001135507
Initial status of your order request at NSIDC is:  processing
Your order status is still  processing  at NSIDC. Please continue waiting... this may take a few moments.
Your order status is still  processing  at NSIDC. Please continue waiting... this may take a few moments.
Your order is: complete
path = './download'     # Where the downloaded file(s) will be stored at
query.download_granules(path)
Beginning download of zipped output...
Data request 5000001135507 of  1  order(s) is downloaded.
Download complete

Reproducibility

These scripted query and download steps can be easily shared through the Notebook (if you are reading the Jupyter Book page, there is a button in the upper right to access the Notebook file). This workflow also ensures that the downloaded data are consistent from machine to machine. For example, if we check the downloaded data size:

!du -hs ./download/*.h5
82M	./download/processed_ATL03_20190805232841_05940403_004_01.h5

It should show 82M on your screen. You can see that the file name has a prefix, processed, indicating that it is actually a subset of the original ATL03_20190805232841_05940403_004_01.h5 file. Subsetting has several advantages; for example, it keeps the file size small. By sharing this workflow via the Notebook, the same subsetting process is preserved and can be rerun at any time in the future.
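If you want a check beyond the file size, you could also peek inside the subset file, for example with h5py (a quick sanity-check sketch, not part of the icepyx workflow). The top level of an ATL03 file should contain the six ground-track groups (gt1l through gt3r) plus ancillary groups:

import h5py

fname = './download/processed_ATL03_20190805232841_05940403_004_01.h5'
with h5py.File(fname, 'r') as f:
    print(sorted(f.keys()))    # expect gt1l, gt1r, gt2l, gt2r, gt3l, gt3r, plus metadata groups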

Summary

By scripting the data access steps in a Jupyter notebook and sharing it, we make obtaining the data easily reproducible, regardless of the data size.

Extra

Many Jupyter tools come with visualization utilities so you can double-check each step of the workflow. For example, icepyx.Query's visualize_spatial_extent method lets you visually examine the area of interest:

query.visualize_spatial_extent()

As you can see, the bounding box is located over Negribreen, Svalbard, suggesting that we are downloading the right file.
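If the interactive map does not render in your environment, you can run the same sanity check with a few lines of plain matplotlib (a sketch based on the coordinates above, not an icepyx feature):

import matplotlib.pyplot as plt

lon_ll, lat_ll, lon_ur, lat_ur = spatial_extent
# Draw the query bounding box and mark Negribreen using the coordinates from the paper.
plt.plot([lon_ll, lon_ur, lon_ur, lon_ll, lon_ll],
         [lat_ll, lat_ll, lat_ur, lat_ur, lat_ll], 'r-', label='spatial_extent')
plt.plot(18.809, 78.585, 'b*', label='Negribreen')
plt.xlabel('Longitude (°E)')
plt.ylabel('Latitude (°N)')
plt.legend()
plt.show()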

HTL+21

Ute C. Herzfeld, Thomas Trantow, Matthew Lawson, Jacob Hans, and Gavin Medley. Surface heights and crevasse morphologies of surging and fast-moving glaciers from ICESat-2 laser altimeter data - Application of the density-dimension algorithm (DDA-ice) and evaluation using airborne altimeter and Planet SkySat data. Science of Remote Sensing, 3:100013, 2021. doi:10.1016/j.srs.2020.100013.