Writing an Events Scraper

Events listings are one of the more compelling datasets that we are able to collect, since it allows for near real-time updating of upcoming events. Events include hearings, meetings, or bascially anything with a date and time listed by the organization you’re scraping.

Target Data

Event scrapers pull down information regarding upcoming (or past) Events and associated metadata, such as who was there, what was talked about, and any supporting material.

Some of the commonly scraped data includes:

  • Name of the event

  • When the event starts and ends

  • Items on the Agenda

    • Related entities (people, orgs, bills)
    • Subject of the agenda item
    • Related media
  • Where the event is to take place

    • Lat / lon (if it exists)
    • Description of location (such address or building)
    • Venue link
  • Associated documents

  • Associated people, orgs, participants

  • Any video or audio of the event

Creating a new Events scraper

Let’s take a look at a sample Pupa event scraper:

from pupa.scrape import Scraper
from pupa.scrape import Event
import datetime as dt
import pytz

class SeattleEventScraper(Scraper):
    def scrape(self):
        when = dt.datetime(1776,7,4,9,15)
        tz = pytz.timezone("US/Pacific") #set the timezone for this location
        when = tz.localize(when)
        e = Event(name="Hearing",  # Event Name
                      start_time=when,  # When the event will take place
                      timezone=tz.zone, #the local timezone for the event
                      location_name='Town Hall')  # Where the event will be
        e.add_source("http://example.com")
        yield e

The events scraper looks a lot like a person scraper - the same stuff is going on here - the magic scrape method, returns an iterable of objects. Unlike people, where we often found committees or other organizations, it’s not common to come across other objects while scraping events, so this scraper will usually just return Event objects.

The scraper above contains the minimum elements required to create an event. But there’s much more we might want to add. The following scraper adds particpants and documents that are relevant to the hearing:

from pupa.scrape import Scraper
from pupa.scrape import Event
import datetime as dt
import pytz

class SeattleEventScraper(Scraper):
    def scrape(self):
        when = dt.datetime(1776,7,4,9,15)
        tz = pytz.timezone("US/Pacific") #set the timezone for this location
        when = tz.localize(when)
        e = Event(name="Hearing",  # Event Name
                      start_time=when,  # When the event will take place
                      timezone=tz.zone, #the local timezone for the event
                      location_name='unknown')  # Where the event will be
        e.add_source("http://example.com")

        #add a committee
        e.add_participant(name="Transportation Committee",
                        type="committee")

        #add a person
        e.add_person(name="Joe Smith", note="Hearing Chair")

        #add an mpeg video
        e.add_media_link(note="Video of meeting",
                        url="http://example.com/hearing/video.mpg",
                        media_type="video/mpeg")

        #add a pdf of meeting minutes
        e.add_media_link(note="Meeting minutes",
                        url="http://example.com/hearing/minutes.pdf",
                        media_type="application/pdf")

        yield e

The event is now much more fleshed out. But we’re still missing the meat of an event: the agenda! Next we’ll add agenda items:

from pupa.scrape import Scraper
from pupa.scrape import Event
import datetime as dt
import pytz

class SeattleEventScraper(Scraper):
    def scrape(self):
        when = dt.datetime(1776,7,4,9,15)
        tz = pytz.timezone("US/Pacific") #set the timezone for this location
        when = tz.localize(when)
        e = Event(name="Hearing",  # Event Name
                      start_time=when,  # When the event will take place
                      timezone=tz.zone, #the local timezone for the event
                      location_name='unknown')  # Where the event will be
        e.add_source("http://example.com")

        #add a committee
        e.add_participant(name="Transportation Committee",
                        type="committee")

        #add a person
        e.add_person(name="Joe Smith", note="Hearing Chair")

        #add an mpeg video
        e.add_media_link(note="Video of meeting",
                        url="http://example.com/hearing/video.mpg",
                        media_type="video/mpeg")

        #add a pdf of meeting minutes
        e.add_media_link(note="Meeting minutes",
                        url="http://example.com/hearing/minutes.pdf",
                        media_type="application/pdf")


        #add an agenda item to this event
        a = e.add_agenda_item(description="Testimony from concerned citizens")

        #the testimony is about transportation and the environment
        a.add_subject("Transportation")
        a.add_subject("Environment")

        #and includes these two committees
        a.add_committee("Transportation")
        a.add_committee("Environment and Natural Resources")

        #these people will be present
        a.add_person("Jane Brown")
        a.add_person("Alicia Jones")
        a.add_person("Fred Green")

        #they'll be discussing this bill
        a.add_bill("HB101")

        #here's a document that is included
        a.add_media_link(note="Written version of testimony",
                        url="http://example.com/hearing/testimony.pdf",
                        media_type="application/pdf")

        yield e

This example shows how to use the events model exhaustively. However, we haven’t done any actual web-scraping. All of the details we added are hard-coded. It is quite difficult to show an example of a functioning web-scraper for an events page, as we have found that legislative events pages or calendars tend to change formats somewhat frequently. For an example of a scraper that hits an actual webpage to find information, see Writing a Person Scraper.

Read the Docs v: latest
Versions
latest
Downloads
pdf
htmlzip
epub
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.