
Cloud Infrastructure Overview

The key to everything here is AWS's Elastic File System (EFS). I wanted to avoid "reinventing the wheel" as much as possible, and there is already a great tool for extracting data from forecast files: WGRIB2! Moreover, NOAA data was already being stored on AWS. This meant that, from the 10,000 ft perspective, data could be downloaded and stored on a file system that could then be easily accessed by a serverless function, instead of trying to move it into a database. That is the "one-sentence" explanation of how this is set up, but for more details, read on!

Architecture

  1. EventBridge timers launch Step Function to trigger Fargate
  2. WGRIB2 Image pulled from repo
  3. Fargate Cluster launched
  4. Task to: Download, Merge, Chunk GRB files
  5. Data saved to EFS
  6. NWS alerts saved to EFS as GeoJSON
  7. Lambda reads forecast data, processes and interpolates, returns JSON
  8. Expose JSON forecast data via API Gateway
  9. Distribute API endpoint via CloudFront
  10. Monitor API key usage

Data Sources

Starting from the beginning, three NOAA models are used for the raw forecast data: HRRR, GFS, and the GEFS.

HRRR

The High Resolution Rapid Refresh (HRRR) provides forecasts over all of the continental US, as well as most of the Canadian population. Every hour, an 18-hour forecast is run with 15-minute time steps, and every 6 hours a 48-hour forecast is run, all at a 3 km resolution. This was perfect for this project, since Dark Sky provided a minute-by-minute forecast for 1 hour, which can be loosely approximated using the 15-minute HRRR forecasts. HRRR has almost all of the variables required for the API, with the exception of UV radiation and ozone. Personally, this is my favourite weather model, and the one that produced the best results during my thesis research on Hurricane Dorian: https://doi.org/10.1029/2020JC016489.

GFS

The Global Forecast System (GFS) is NOAA's global weather model. Running with a resolution of about 30 km (0.25 degrees), the GFS model provides hourly forecasts out to 120 hours, and 3-hour forecasts out to 240 hours. Here, GFS data is used for anywhere in the world not covered by the HRRR model, and for all results past 48 hours.

The GFS model also underpins the Global Ensemble Forecast System (GEFS), the 30-member ensemble version of the GFS (the website says 21, but there are 30 data files), which is described next.

GEFS

The Global Ensemble Forecast System (GEFS) is the ensemble version of NOAA's GFS model. By running 30 different versions of the model at the same time, each with slightly different parameters and inputs, it provides 3-hour forecasts out to 240 hours. The API uses the GEFS to get precipitation type, quantity, and probability, since it seemed like the most accurate way of determining this. I have no idea how Dark Sky did it, and I am very open to feedback about other ways it could be assigned, since getting the precipitation probability number turned out to be one of the most complex parts of the entire setup!

ERA5

To provide historic weather data, the European Reanalysis 5 (ERA5) dataset is used. This source is particularly interesting, since unlike the real-time NOAA models that I need to convert, it's provided in the "cloud native" Zarr file format. This lets the data be accessed directly and quickly in S3 from Lambda. There aren't nearly as many parameters available as with the GFS or HRRR models, but there are enough to cover the most important variables.

Others

There are a number of other models that I could have used as part of this API. The Canadian model (HRDPS) is even higher resolution (2.5 km), and seems to do particularly well with precipitation. Also, the European models are sometimes considered better global models than the GFS model is, which would make it a great addition. However, HRRR and GFS were enough to get things working, and since they are stored on AWS already, there were no data transfer costs!

As the rest of this document explains, the data pipeline here is fairly flexible, and given enough interest, it would be relatively straightforward to add additional model sources or historic forecasts.

Forecast data is provided by NOAA in GRIB2 format. This file type has a steep learning curve, but is brilliant once I realized how it worked. In short, it saves all the forecast parameters, and includes metadata on their names and units. GRIB files are compressed to save space, but are referenced in a way that lets individual parameters be quickly extracted. In order to see what is going on in a GRIB file, the NASA Panoply reader works incredibly well.
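
For a quick look from Python, the same pywgrib2_s wrapper used later in this pipeline can dump a file's inventory; the path below is just an illustrative example.

```python
import pywgrib2_s

# Print the short inventory of a GRIB2 file: one line per record, showing the
# variable, level, and forecast time. The file path here is illustrative.
pywgrib2_s.wgrib2(['/mnt/efs/hrrr/hrrr.t12z.wrfsfcf06.grib2', '-s'])
```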

Lambda, Fargate, and WGRIB2 Setup

AWS Lambda allows code to run without requiring any underlying server infrastructure (serverless). In my case, I used Python as the target language, since I was interested in learning it! Once triggered, a Lambda function will run with the configured memory. It can pull data from S3 or the Elastic File System (EFS), and can use information passed as part of the trigger. Lambda functions can also depend on layers, which hold supporting code packages. This API uses Lambda to retrieve and process the forecast data when a call is made. Since Lambda handles Python packages as layers, I created layers for NetCDF4, Astral, pytz, and timezonefinder. To get historic data, I added a Zarr layer; however, it is too large to be combined with the NetCDF4 layer in Lambda, which is why historic data is a separate API call from the forecast API.
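
As a rough illustration of the Lambda side (this is not the actual Pirate Weather handler; the event shape, file path, and variable name are assumptions), a handler reading a chunked file from the EFS mount might look something like this:

```python
import json
import netCDF4 as nc

def lambda_handler(event, context):
    # Location passed in via the API Gateway trigger (shape is illustrative).
    lat = float(event['lat'])
    lon = float(event['lon'])

    # EFS is mounted into the Lambda (e.g. at /mnt/efs), so the processed
    # NetCDF4 files written by the Fargate task can be opened directly.
    with nc.Dataset('/mnt/efs/hrrr/latest.nc', 'r') as ds:
        # 'TMP_2maboveground' follows the wgrib2 -netcdf naming style, but the
        # exact variable names depend on the ingest configuration.
        temperature = float(ds.variables['TMP_2maboveground'][0, 0, 0])

    return {
        'statusCode': 200,
        'body': json.dumps({'latitude': lat, 'longitude': lon,
                            'temperature': temperature}),
    }
```
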
For processing, I wanted to use the WGRIB2 application as much as I could, since it has been extensively optimized for this sort of work. Pywgrib2, the Python interface for working with WGRIB2 files, was recently released. I used the pywgrib2_s flavour, and then always called it using the .wgrib2 method. I initially had this set up using Lambda, but ran into the 500 MB temporary storage limit that Lambda has. Instead, processing is now done using the Elastic Container Service (ECS) and Fargate. This lets code run inside a Docker container, which I set up to compile the WGRIB2 code from source. This image is stored on AWS, and gets retrieved each time a processing task is run! The Dockerfile to generate this container is pretty simple, relying on Ubuntu as a base image and following the instructions from the WGRIB2 README. The only interesting part is a nifty "one-liner" to replace a specific line in the makefile: `RUN sed -i "s|MAKE_SHARED_LIB=0|MAKE_SHARED_LIB=1|g" makefile`.

Data Pipeline

Trigger

Forecasts are saved from NOAA onto the AWS public cloud into three buckets for the HRRR, GFS, and GEFS, and an EventBridge rule is scheduled for each model's ingest time:

|                    | GFS           | GEFS          | HRRR - 48h               | HRRR - 18h / SubHourly |
| ------------------ | ------------- | ------------- | ------------------------ | ---------------------- |
| Run Times (UTC)    | 0, 6, 12, 18  | 0, 6, 12, 18  | 0, 6, 12, 18             | Every hour (0-23)      |
| Delay              | 5:00          | 7:00          | 2:30                     | 1:45                   |
| Ingest Times (UTC) | 5, 11, 17, 23 | 7, 13, 19, 1  | 2:30, 8:30, 14:30, 20:30 | 1:45-00:45             |

Each rule calls a different AWS Step Function, which is the tool that oversees the data pipeline. The step function takes the current time from the trigger, adds several other environmental parameters (like which bucket the data is saved in and which processing script to use), and then finally starts a Fargate task using the WGRIB2/Python Docker image! Step functions have the added perk that they can retry the task if it fails for some reason. I spent some time optimizing the tasks to maximize network speed and minimize the RAM requirements, settling on 1 CPU and 5 GB of RAM. The Fargate task is set up to have access to the NOAA S3 buckets, as well as an EFS file system to save the processed files. The Python processing scripts are explained below, with the source code available in the repository.
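
The Fargate launch itself is handled by the Step Function's ECS integration, but expressed in Python with boto3 it does roughly the following (the cluster, task definition, network values, and environment variable names are placeholders, not the real configuration):

```python
import boto3

ecs = boto3.client('ecs')

# Launch the WGRIB2/Python container as a Fargate task, passing the same kind
# of environment variables the Step Function injects (all values illustrative).
ecs.run_task(
    cluster='forecast-processing',
    launchType='FARGATE',
    taskDefinition='wgrib2-ingest',
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-0123456789abcdef0'],
            'assignPublicIp': 'ENABLED',
        }
    },
    overrides={
        'containerOverrides': [{
            'name': 'wgrib2',
            'environment': [
                {'name': 'BUCKET', 'value': 'noaa-hrrr-bdp-pds'},
                {'name': 'RUN_TIME', 'value': '2024-01-01T12:00:00Z'},
                {'name': 'SCRIPT', 'value': 'hrrr_subhourly.py'},
            ],
        }]
    },
)
```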

Download, Filter, and Merge

For all of the models, the download process works in a similar way (a rough sketch follows the list):

  1. Get the bucket, time, and download path from the environmental variables set by the step function.
  2. Set up paths on the EFS filesystem.
  3. Step through the required files and download using boto3.
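
A minimal sketch of that download loop, assuming the HRRR sub-hourly files and illustrative environment variable names (the real script's paths and keys differ):

```python
import os
import boto3

# Values normally provided by the Step Function; the defaults here are examples.
bucket = os.environ.get('BUCKET', 'noaa-hrrr-bdp-pds')
run_date = os.environ.get('RUN_DATE', '20240101')
run_hour = os.environ.get('RUN_HOUR', '12')
efs_path = os.environ.get('DOWNLOAD_PATH', '/mnt/efs/hrrr')

s3 = boto3.client('s3')
os.makedirs(efs_path, exist_ok=True)

# Step through the forecast hours and download each GRIB2 file onto EFS.
for fhr in range(0, 19):
    key = f'hrrr.{run_date}/conus/hrrr.t{run_hour}z.wrfsubhf{fhr:02d}.grib2'
    s3.download_file(bucket, key, os.path.join(efs_path, os.path.basename(key)))
```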

For the HRRR model, the wind directions need to be converted from grid relative to earth relative, using the wgrib2 -new_grid_winds command. Separately, for the GFS/ GEFS models, there are two accumulated precipitation fields (APCP), one representing 3 hours of accumulation, and one representing 0 to the forecast hour. wgrib2 has a -ncep_norm command; however, it requires that all the time steps are in the same grib file, which isn't how they're saved to the buckets. Instead, I used tip #66 from the (ever handy) wgrib2 tricks site, and added the -quit command to stop wgrib2 from processing the second APCP record.

My complete pywgrib2_s command ended up looking like this:

```python
pywgrib2_s.wgrib2([download_path, '-new_grid_winds', 'earth', '-new_grid_interpolation', 'neighbor',
                   '-match', matchString, '-new_grid', HRRR_grid1, HRRR_grid2, HRRR_grid3, download_path_GB])
pywgrib2_s.wgrib2([download_path, '-rewind_init', download_path, '-new_grid_winds', 'earth',
                   '-new_grid_interpolation', 'neighbor', '-match', 'APCP', '-append',
                   '-new_grid', HRRR_grid1, HRRR_grid2, HRRR_grid3, download_path_GB, '-quit'])
```

Here, matchString is the list of parameters; HRRR_grid1, HRRR_grid2, and HRRR_grid3 are the HRRR grid parameters; and download_path_GB is the output file location. Once wgrib2 has run, the processed grib files are appended to a NetCDF file (via `pywgrib2_s.wgrib2([download_path_GB, '-append', '-netcdf', download_path_NC])`). This is a NetCDF 3 file, so there is no compression, but it is much easier to work with than GRIB files. After each step is added to the NetCDF, the original GRIBs are removed to save space. For most of the scripts, there are actually two different processes going on at the same time, downloading slightly different files. For the GFS model, this is the primary and secondary variable versions; for GEFS, this is the complete ensemble as well as the mean; and for HRRR, this is the hourly and subhourly forecasts. The process is the same as above, just replicated to reduce the number of scripts that need to be run.

Compress, Chunk, and Save

My initial plan was to simply save the grib files to EFS and access them via pywgrib2; however, despite EFS being very quick and wgrib2's optimizations, this was never fast enough to be realistic (~20 seconds). Eventually, I was pointed in the direction of a more structured file type, and since there was already a great NetCDF Python package, it seemed perfect! From the merged NetCDF 3 files, the next steps are pretty straightforward:

  1. Create a new in-memory NetCDF4 file.
  2. Copy variables over from NetCDF3 to NetCDF4, enabling compression and a significant-digit limit for each one.
  3. Chunk the NetCDF4 file by time to dramatically speed up access times and save it to EFS.
  4. Save a separate pickle file with the latitudes and longitudes of each grid node.
  5. Remove old model results from the EFS filesystem.

While the process sounds simple, the details are tricky. Chunking and compression are the key elements, since they allow for fast data retrieval, while using an in-memory dataset speeds things up a fair bit.
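
Below is a condensed sketch of steps 1-3 using the netCDF4 package, with illustrative file paths and an assumed (time, y, x) layout; attribute copying and error handling are omitted:

```python
import netCDF4 as nc

src = nc.Dataset('/mnt/efs/hrrr/merged.nc', 'r')                   # NetCDF3 from wgrib2
dst = nc.Dataset('/mnt/efs/hrrr/chunked.nc', 'w', format='NETCDF4')

# Copy the dimensions across unchanged.
for name, dim in src.dimensions.items():
    dst.createDimension(name, None if dim.isunlimited() else len(dim))

for name, var in src.variables.items():
    # One plausible chunking scheme: keep the full time series together and
    # split the spatial grid into small blocks, so a single point's forecast
    # can be read from one chunk.
    chunks = [len(src.dimensions[d]) if d == 'time' else min(16, len(src.dimensions[d]))
              for d in var.dimensions]
    kwargs = {'zlib': True, 'complevel': 1, 'chunksizes': chunks or None}
    if var.datatype.kind == 'f':
        # Limiting the significant digits improves compression considerably.
        kwargs['least_significant_digit'] = 1
    out = dst.createVariable(name, var.datatype, var.dimensions, **kwargs)
    out[:] = var[:]

src.close()
dst.close()
```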

Model Specific Notes

  1. In order to get UV data, a separate grib file is needed for the GFS model, as it is classified under "Least commonly used parameters." The data ingest steps are the same, but there is an extra step where the wgrib2 -append command is used to merge the two NetCDF3 files together.
  2. The ensemble data was by far the most difficult to deal with. There are several extra steps:
    • The 30 ensemble grib files for a given time step are merged and saved as a single grib file in /tmp/.
    • The wgrib2 -ens_processing command is then run on this merged grib file. This produces the probability of precipitation, the mean, and the spread (which is used for the precipitation intensity error) from the 30-member ensemble; however, it provides the probability of any (>0) precipitation. Since this is a little too sensitive, I used the excellent wgrib2 trick #65, which combines -rpn and -set_prob to allow arbitrary threshold values to be used.
    • These three values are then exported to NetCDF3 files with the -set_ext_name command set to 1.
    • The files are then converted to NetCDF4 and chunked in the same way.
  3. For most variables, the least significant digit parameter is set to 1, and the compression level is also set to 1. There is probably some room for further optimization here.

Retrieval

When a request comes in, a Lambda function is triggered and is passed the URL parameters (latitude/longitude/time/extended forecast/units) as a JSON payload. These are extracted, and then the nearest grid cell to the lat/long is found from the pickle files created from the model results. Using the time parameter (if available, otherwise the current time), the latest completed model runs are identified. Weather variables are then iteratively extracted from the NetCDF4 files and saved to a 2-dimensional numpy array. This is then repeated for each model, skipping the HRRR results if the requested location is outside of the HRRR domain. For the GFS model, precipitation accumulation is adjusted from the varying time step in the grib file to a standard 1-hour time step.

Once the data has been read in, arrays are created for the minutely and hourly forecasts, and the data series from the model results is interpolated into these new output arrays. This process worked incredibly well, since NetCDF files natively save timestamps, so the same method could be followed for each data source.
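
As a tiny illustration of that interpolation step (the arrays below are made-up stand-ins for one variable's extracted time series):

```python
import numpy as np

# Model output: valid times (Unix epoch seconds) and one variable's values.
model_times = np.array([0, 3600, 7200, 10800], dtype=float)
model_values = np.array([10.0, 12.0, 11.0, 9.0])

# Target hourly axis covering the same window, then linear interpolation.
hourly_times = np.arange(model_times[0], model_times[-1] + 1, 3600.0)
hourly_values = np.interp(hourly_times, model_times, model_values)
```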

Some precipitation parameters are true/false (will it rain, snow, hail, etc.), and for these, the same interpolation is done using 0 and 1, and then the precipitation category with the highest value is selected and saved. Currently a 10:1 snow to rain ratio is used (1 mm of rain is 10 mm of snow), but this could be improved. Where available, HRRR sub-hourly results are used for minutely precipitation (and all current results), and the GFS ensemble model is used for the hourly time series. Daily data is calculated by processing the hourly time series, calculating maximum, minimum, and mean values.

For the GFS and GEFS models, the returned value is a weighted average (by 1 over the distance) of the closest 9 grid cells. For variables where taking an average isn't realistic (true/false variables), the most common (mode) result is used. This approach isn't used for the HRRR model, since its cells are much closer together; the 9 closest points are found using the numpy np.argpartition function.
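
A rough sketch of that weighting, with illustrative inputs (flattened grid latitudes and longitudes from the pickle file, plus one time step of values):

```python
import numpy as np

def idw_nearest_9(lats, lons, values, lat0, lon0):
    # Squared distance in degrees is enough to rank nearby grid cells.
    dist2 = (lats - lat0) ** 2 + (lons - lon0) ** 2

    # np.argpartition pulls out the 9 smallest distances without a full sort.
    idx = np.argpartition(dist2, 9)[:9]

    # Weight by 1/distance, with a small floor so a request that lands exactly
    # on a grid node doesn't divide by zero.
    weights = 1.0 / np.maximum(np.sqrt(dist2[idx]), 1e-6)
    return float(np.sum(values[idx] * weights) / np.sum(weights))
```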

A few additional parameters are calculated without using the NOAA models. The incredibly handy timezonefinder python library is used to determine the local time zone for a request, which is required to determine when days start and end and which icon to use. Astral is used for sunrise, sunset, and moon phases. Apparent temperature is found by adjusting for either wind chill or humidex, and the UV Index is calculated from the modelled solar radiation. This variable has some uncertainty, since the official documentation suggests that these values should be multiplied by 40. I've found this produces values that are incorrect, and instead, the model results are multiplied by 0.4. Dark Sky provides both temperatureHigh and temperatureMax values, and since I am not sure what the difference between them is, the same value is currently used for both.

Icons are based on the categorical precipitation if it is expected, and on the total cloud cover percentage and visibility otherwise. For weather alerts, a GeoJSON is downloaded every 10 minutes from the NWS, and the requested point is iteratively checked to see if it is inside one of the alert polygons. If a point is inside an alert, the details are extracted from the GeoJSON and returned. Finally, the forecast is converted into the requested units (defaulting to US customary units for compatibility), and then into the returned JSON payload. The Lambda function takes between 1 and 3 seconds to run, depending on whether the point is inside the HRRR model domain and how many alerts are currently active in the US.
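
For the alert lookup, a point-in-polygon check over the cached GeoJSON could look something like the sketch below; shapely is used purely for illustration and may not be what the actual Lambda uses, and the file path is an assumption:

```python
import json
from shapely.geometry import Point, shape

def active_alerts(lat, lon, geojson_path='/mnt/efs/alerts/nws_alerts.json'):
    # The NWS alert GeoJSON refreshed every 10 minutes by the ingest task.
    with open(geojson_path) as f:
        alerts = json.load(f)

    point = Point(lon, lat)  # GeoJSON coordinates are (longitude, latitude)
    hits = []
    for feature in alerts['features']:
        geometry = feature.get('geometry')
        # Some alerts are zone-based and have no polygon attached.
        if geometry and shape(geometry).contains(point):
            hits.append(feature['properties'])
    return hits
```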

Historic Data

Historic data is saved in the AWS ERA5 bucket in Zarr format, which makes it incredibly easy to work with here! I mostly followed the process outlined here: https://github.com/zflamig/birthday-weather, with some minor tweaks to read one location instead of the entire domain and to process accumulation variables. This dataset didn't include cloud cover, which presented a significant issue, since that is what's used to determine the weather icons. To work around this, I used the provided shortwave radiation flux variable and compared it against the theoretical clear sky radiation. This isn't a perfect proxy, since it doesn't work at night, and there are factors other than cloud cover that can impact shortwave radiation (notably elevation), but it provides a reasonable approximation.
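
The proxy itself boils down to something like the following simplified sketch, where the clear-sky value is assumed to come from a separate theoretical calculation:

```python
def cloud_cover_proxy(shortwave_wm2, clear_sky_wm2):
    """Estimate cloud cover as the shortfall of measured vs. clear-sky flux."""
    if clear_sky_wm2 <= 0:
        # At night the proxy breaks down, so no estimate is possible.
        return None
    fraction = 1.0 - shortwave_wm2 / clear_sky_wm2
    # Clamp to the valid 0-1 range to absorb other effects (e.g. elevation).
    return min(max(fraction, 0.0), 1.0)
```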

AWS API

The final piece of this service relies on two other AWS products: the API Gateway and the developer portal. I found the API Gateway (using the REST protocol) fairly straightforward. In this implementation, there is one resource: a GET request to the custom domain name, which extracts the {api-key} and {location} from the URL as path parameters. It also checks for URL query parameters. This method then authenticates the request, passes it to the Lambda function, and returns the result.

The trickiest part of this setup was, by far, getting the API Gateway to use an API key from the URL. This is not officially supported (as opposed to URL query parameters). This makes sense, since passing API keys in a URL isn't a [great idea](https://security.stackexchange.com/questions/118975/is-it-safe-to-include-an-api-key-in-a-requests-url), but for compatibility, I needed to find a way.

After a few attempts, what ended up working was a custom Lambda Authorizer as described here. Essentially, the API Gateway passes the request to this short Lambda function, which converts the URL path parameter into the API key. This is then passed back to the API Gateway for validation. For this to work, the API Key Source needs to be set to AUTHORIZER under the settings panel.
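
A minimal sketch of such an authorizer (the fields follow the standard authorizer response format; this is not the exact production code):

```python
def lambda_handler(event, context):
    # Pull the API key out of the {api-key} URL path parameter.
    api_key = event['pathParameters']['api-key']

    return {
        'principalId': 'user',
        # With the API Key Source set to AUTHORIZER, API Gateway validates
        # this value against its usage plans.
        'usageIdentifierKey': api_key,
        'policyDocument': {
            'Version': '2012-10-17',
            'Statement': [{
                'Action': 'execute-api:Invoke',
                'Effect': 'Allow',
                'Resource': event['methodArn'],
            }],
        },
    }
```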

Next Steps

While this service currently covers almost everything that the Dark Sky API does, I have a few ideas for future improvements!

  1. Text Summaries. This is the largest missing piece. Dark Sky open-sourced their translation library, so my plan is to build off that to get this working. All the data is there, but it's a matter of writing the logic required to go from numerical forecasts to weather summaries.
  2. Additional sources. The method developed here is largely source agnostic. Any weather forecast service that delivers data using grib files that wgrib2 can understand (all the primary ones) could theoretically be added in. The NOAA North American Mesoscale (NAM) model would provide higher resolution forecasts out to 4 days (instead of the 2 days from HRRR). The Canadian HRDPS model is another tempting addition, since it provides data at a resolution even higher than HRRR (2.5 km vs. 3 km)! The European model would be fantastic to add in, since it often outperforms the GFS model; however, the data is not open, which would add a significant cost.

Other Notes and Assumptions

  1. While this API will give minutely forecasts for anywhere in the world, they are calculated using the HRRR sub-hourly forecasts, so they are only accurate to 15-minute periods. Outside of the HRRR domain, they are calculated using the GFS hourly forecasts, so they really don't add much value!
  2. Precipitation probabilities are a tricky problem to solve: weather models don't give a range of outcomes, just one answer. To get probabilities, this implementation relies on the Global Ensemble Forecast System (GEFS). This is a 30-member ensemble, so if 1 member predicts precipitation, the probability will be 1/30. GEFS data is also used to predict precipitation type and accumulation. A 10:1 snow-to-liquid ratio is assumed.
  3. Current conditions are based on model results (from HRRR sub-hourly), which assimilate observations, rather than on direct observations.
  4. Why "PirateWeather"? I've always thought that the HRRR model was pronounced the same way as the classic pirate "ARRR". Also, there is one company out there that thinks APIs can be copyrighted, which might apply here.