Serverless Web Scraping with AWS, Part II

February 21, 2021

How do I build an automated ETL pipeline from a simple web scraper? Over two posts, I document the resources I leverage to periodically extract solar irradiance and meteorological data from the online portal of NREL's Solar Radiation Research Laboratory (SRRL). This second post focuses on automating the ETL pipeline.

A taste of the measurements made by the NREL Baseline Measurement System in Golden, Colorado, equivalent to the solar resource available to fixed-tilt collectors optimized for year-round performance.

Last time out, I built an ETL pipeline on top of the Scrapy framework to gather data from NREL's station in Golden, CO. As much as I'd enjoy dedicating my remaining days to manually triggering this web scraper, I can't let it conflict with my Netflix schedule. Fortunately, I'm told some guy built a slew of somewhat profitable services for this exact reason. His name is Jeff.

Core Tools

Tools and process of web scraping NREL data

I've already been using one of Jeff's 'somewhat profitable' offerings - Amazon RDS - to host my Postgres instance. This time around, I'll be leveraging AWS Lambda, a compute service that allows me to farm out code that can be triggered in response to manual invocations, network requests, and scheduled events. We'll be running the scraper as a scheduled event, i.e., a cron job.

This is where Serverless comes in. As a framework, it handles much of the infrastructure required to build and deploy an application using serverless code patterns. My use case is fairly simple: deploy one service to AWS Lambda, composed of a single function that's triggered daily.

  1. If you haven't done so already, set up an AWS account.
  2. Install the Serverless framework with npm install -g serverless (requires Node.js and npm).
  3. Install Docker, which will be used to bundle Python dependencies for deployment.

Configuring Serverless Access to AWS

In the root of the nrel-scraper project folder (containing scrapy.cfg), we initialize and name a service from a template:

$ serverless create --template aws-python3 --name nrel-scraper

This hooks us up with boilerplate configuration in serverless.yml, plus a .gitignore and a handler.py script (more on that later).
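
After this step, the project root should look roughly like this (scrapy.cfg and the nrel_scraper/ package carry over from Part I):

nrel-scraper/
├── scrapy.cfg
├── serverless.yml   # generated by the template
├── handler.py       # generated by the template
├── .gitignore       # generated by the template
└── nrel_scraper/
    └── ...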

Next, you'll want to generate some AWS access keys for Serverless to latch on to. If you're feeling the need for speed, you can dump them into a gitignored dotenv file and export them as environment variables in your shell (source .env) before deployment. I prefer the more permanent solution of setting up AWS profiles under ~/.aws/credentials:

[serverless-admin]
aws_access_key_id = ********************
aws_secret_access_key = ********************
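
For reference, the speedier dotenv route mentioned above would just be a gitignored .env file along these lines (placeholder values), sourced into the shell before deploying; Serverless picks up the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables:

export AWS_ACCESS_KEY_ID=********************
export AWS_SECRET_ACCESS_KEY=********************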

There are several ways to invoke different profiles during deployment, but for my purposes it's sufficient to just edit the provider field in serverless.yml:

provider:
  name: aws
  runtime: python3.7
  lambdaHashingVersion: 20201221
  profile: serverless-admin

Packaging the Function

Up to this point, we've been manually invoking the scraper with scrapy crawl midc. Jeff wants us to package this into a neat function that handles the event and context objects passed to it; you can read more about these arguments in the AWS Lambda documentation.

Recall that initializing Serverless provided us with a handler.py. Since I'm not passing any event data for the Lambda function to process, my handler function is excruciatingly boring:

# ./handler.py

import json
import sys
from nrel_scraper.crawl import crawl

def scrape(event={}, context={}):
    crawl()

if __name__ == "__main__":
    scrape()
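
Thanks to the __main__ guard, the whole chain can also be exercised locally, outside of Lambda, from inside the project's virtual environment:

$ python handler.py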

The imported crawl method just replicates the manual invocation of the scraper using the Scrapy API (pardon the hacky sqlite3 hotfix I had to dirty my hands with):

# ./nrel_scraper/crawl.py

import scrapy
from scrapy.spiderloader import SpiderLoader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Start sqlite3 fix
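# (Presumably needed because sqlite3 isn't importable in the Lambda Python
# environment, which would otherwise trip up Scrapy's import chain.)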
import imp
import sys
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")
# End sqlite3 fix

def crawl(spider_name="midc", spider_kwargs={}):

    project_settings = get_project_settings()
    spider_loader = SpiderLoader(project_settings)
    spider_cls = spider_loader.load(spider_name)

    process = CrawlerProcess(project_settings)
    process.crawl(spider_cls, **spider_kwargs)
    process.start()
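
One caveat worth flagging: process.start() spins up Scrapy's Twisted reactor, and a Twisted reactor cannot be restarted within the same Python process. If Lambda reuses a warm container and invokes the handler a second time, another call to crawl() will raise ReactorNotRestartable. A hedged workaround, not part of the original project, is to run each crawl in a short-lived child process:

# ./nrel_scraper/crawl.py (hypothetical addition)
from multiprocessing import Process

def crawl_in_subprocess(spider_name="midc", spider_kwargs={}):
    # Fork a child process so every invocation gets a fresh Twisted reactor,
    # even when Lambda reuses a warm container between runs.
    child = Process(target=crawl, args=(spider_name, spider_kwargs))
    child.start()
    child.join()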

Managing Dependencies

Don't forget the never-ending joy of bundling dependencies so that they'll be provisioned 'in the cloud'. I'll use pipenv to spawn a virtual Python 3 environment:

$ pipenv --three
$ pipenv shell
$ pipenv install <for_each_package>
$ pipenv lock

The required packages for nrel-scraper are Scrapy, sqlalchemy, pandas, python-dotenv, and psycopg2-binary. If you ever needed proof of how over-engineered this simple scraper is, that list of bloated packages should do the trick.
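
For reference, the [packages] section of the resulting Pipfile should end up looking roughly like this (versions left unpinned for brevity):

[packages]
scrapy = "*"
sqlalchemy = "*"
pandas = "*"
python-dotenv = "*"
psycopg2-binary = "*"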

Serverless will need to bundle these dependencies specified in the newly generated Pipfile. Fortunately, there's a handy plugin for this purpose:

$ serverless plugin install -n serverless-python-requirements

The plugin will be automatically added to package.json and serverless.yml. We'll also need to add a custom field to the latter so that packages with compiled (non-pure-Python) components are built inside Docker to match the Lambda environment; with dockerizePip: non-linux, this only kicks in when deploying from a non-Linux machine:

plugins:
  - serverless-python-requirements
custom:
  pythonRequirements:
    slim: true # Omits tests, __pycache__, *.pyc etc
    usePipenv: true
    dockerizePip: non-linux

Serverless Deployment

AWS Lambda functions can be triggered on a schedule

All that's left is to overhaul the functions field in serverless.yml to schedule the cron job and point to the correct handler method. Note that AWS cron expressions use six fields (minutes, hours, day-of-month, month, day-of-week, year) and are evaluated in UTC, so 11:59 PM MST becomes 06:59 UTC:

functions:
  nrelScrape:
    # handler method for nrel-scraper
    handler: handler.scrape
    # Seconds until function times out
    timeout: 30 
    events:
      # Definition of the cron job that's passed to AWS CloudWatch
      - schedule:
          name: nrel-midc-daily-ingestion
          description: Scheduled daily scraping of NREL MIDC at 11:59PM MST (UTC-7:00)
          rate: cron(59 6 * * ? *)

This wraps up our serverless.yml. You can learn more from the example reference in the Serverless docs or check out the GitHub repo for this project. But first, deploy!

$ serverless deploy --verbose

If, by some miracle, the permissions across Amazon RDS, Lambda, and Serverless are all perfectly aligned, c'est fini. Personally, I had to do some troubleshooting to grease the wheels.
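
A common sticking point is network access between Lambda and Amazon RDS. If the Postgres instance isn't publicly accessible, one option is to attach the function to the database's VPC through the provider block (hypothetical IDs below, swap in your own), bearing in mind that a Lambda inside a VPC then needs a NAT gateway to reach the public internet where the NREL pages live:

provider:
  vpc:
    securityGroupIds:
      - sg-xxxxxxxxxxxx
    subnetIds:
      - subnet-xxxxxxxxxxxx

If the instance is publicly accessible instead, double-checking the inbound rules on its security group is the first thing to try.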

The function will now self-trigger according to the schedule dictated to CloudWatch. You can still manually trigger the job to run on Lambda:

$ serverless invoke -f nrelScrape --log
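
To confirm that the scheduled runs are actually firing, the CloudWatch logs can be tailed from the same CLI:

$ serverless logs -f nrelScrape --tail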

And we're done here! If you've somehow made it through this slog, I must both apologize and thank you for bearing with me. Now, we sit back and let the solar irradiance and meteorological data from the finely calibrated pyranometers and pyrheliometers of the Solar Radiation Research Laboratory's Baseline Measurement System roll in...
