Poll tracker assignment

The polling page here contains a list of polls from a presidential election in a hypothetical country. The election will be held on the 10th of October, 2024, so there’s a long way to go—lots could happen between now and then.

We would like you to write a scraper to pull the polls off the polling page, convert them to a CSV, and create a poll average based on those polls. The poll tracker should continue to work up until election day, and be robust to all the normal issues we see in aggregate polling lists:

Some days have no polls, some have multiple
Some pollsters do not include line items for all candidates
Some pollsters will conduct multiple polls with hypotheticals (e.g. what if this candidate dropped out?)
The order of candidates on the page may change
Formatting may be inconsistent
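
Several of the issues above can be absorbed at the parsing step. The following is a minimal sketch, not a reference solution: it assumes candidate cells hold percentages such as "45.2%", possibly with stray spaces or footnote markers, and that the fixed columns are named "Date", "Pollster" and "Sample". Those column names and both function names are hypothetical. Candidates are keyed by header text, so a change in column order does not matter, and a missing line item becomes None.

```python
import re
from typing import Dict, List, Optional, Sequence

# First run of digits (with optional decimal part) found in a cell.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_share(cell: str) -> Optional[float]:
    """Convert a cell like '45.2%' or ' 38 %*' to a 0-1 share; None if no number."""
    match = NUMBER.search(cell)
    if match is None:
        return None  # empty cell, dash, 'n/a', etc.
    share = float(match.group()) / 100
    if not 0.0 <= share <= 1.0:
        raise ValueError(f"share out of range: {cell!r}")
    return share


def parse_candidates(
    headers: List[str],
    cells: List[str],
    fixed: Sequence[str] = ("Date", "Pollster", "Sample"),  # assumed column names
) -> Dict[str, Optional[float]]:
    """Map candidate name to share, keyed by header rather than position."""
    return {
        header.strip(): parse_share(cell)
        for header, cell in zip(headers, cells)
        if header.strip() not in fixed
    }
```

Raising on an out-of-range value, rather than silently clamping it, leaves the decision about bad rows to the caller, which can log the row and skip it.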

As well as the normal things that happen during election campaigns:

Opinions can shift suddenly
A candidate might drop out
A candidate might join the race late
There may be big gaps in the polling record (for example, around Christmas or for this country’s two-week public holiday in June)
There may be significant data entry errors
Notes might be attached to specific polls or numbers
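
One way to catch significant data entry errors is a sanity check on each parsed poll before it reaches the average. The sketch below is an illustration under stated assumptions: the function name and the 0.02 tolerance are hypothetical, and shares are assumed to be already converted to the 0 to 1 scale, with None for a candidate who has no line item (for example, one who has dropped out or not yet joined).

```python
from typing import Dict, List, Optional


def find_problems(
    shares: Dict[str, Optional[float]],
    tolerance: float = 0.02,  # assumed allowance for rounding in published figures
) -> List[str]:
    """Return human-readable problems found in one poll; empty list if none."""
    problems: List[str] = []
    present = {name: value for name, value in shares.items() if value is not None}
    if not present:
        problems.append("no candidate figures")
    for name, value in present.items():
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name}: share {value} outside 0-1")
    total = sum(present.values())
    if total > 1.0 + tolerance:
        problems.append(f"shares sum to {total:.3f}, above 1")
    return problems
```

A caller might log each returned message and exclude the poll from the trend calculation while still deciding separately whether to keep it in the polls file.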

The poll tracker page will add more polls over time.

What we’re looking for

This is a test of your ability to write stable, production-ready code. All of the “tricks” to this assignment are listed above; there should be no surprises or gotchas here. The structure of the table will never change, and its design is pretty simple. (We are not testing your ability to scrape irritatingly written websites!) Your code should be clean, well-documented (or self-documenting), and easy to read. Ideally it will have tests or some sort of error monitoring, and will detect or alert us in the event of a major error. (Writing errors to a log file is a sufficient alert for this assignment.) Your entire program should be executable from a single command, but it probably shouldn’t be written as a single monolithic script.
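
As an illustration of the "single command, graceful failure, errors in a log file" requirement, here is a minimal sketch using Python's standard logging module. The logger name "poll_tracker", the function names, and the idea of passing the pipeline stages in as callables are all assumptions made for the example, not part of the assignment.

```python
import logging
from typing import Callable, Iterable


def make_logger(log_path: str) -> logging.Logger:
    """Create (once) a logger that appends timestamped records to log_path."""
    logger = logging.getLogger("poll_tracker")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_pipeline(steps: Iterable[Callable[[], None]], logger: logging.Logger) -> int:
    """Run steps in order; on the first failure, log the traceback and return 1."""
    for step in steps:
        try:
            step()
        except Exception:
            logger.exception("step %s failed", step.__name__)
            return 1
    logger.info("run completed")
    return 0
```

An entry-point module could then call sys.exit(run_pipeline([scrape, write_polls, write_trends], make_logger("poll_tracker.log"))), where the three step functions are placeholders living in their own modules, so a failure produces a non-zero exit code as well as a log entry.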

We would strongly prefer you write your poll tracker in R, Python, or JavaScript/TypeScript. You should submit your code as a GitHub repository. You’re welcome to use additional libraries so long as your code is easily runnable on another computer. (A provisioned Docker image is fine, as are things like npm or virtualenv.)

Your script should output two CSVs, called polls.csv and trends.csv (for the averages). The polls file should have columns for date, pollster, n (sample size), and each candidate (by name); the trends file should have just a date column and a column for each candidate. Candidate values in both files should be numbers from 0 to 1. The polls file should have a row for each poll. The trends file should have a row for each day, starting with October 11th, 2023. You can see example files for polls and trends (these files contain random numbers).
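
The trends layout described above, one row per calendar day from October 11th, 2023, can be sketched with the standard csv and datetime modules. This is an illustration only: the function names are hypothetical, the ISO date format is an assumption (the example files may use a different one), and the candidate names in the usage note are made up.

```python
import csv
from datetime import date, timedelta
from typing import Dict, Iterable, Iterator, List

TRENDS_START = date(2023, 10, 11)  # first row of trends.csv, per the brief


def daily_dates(start: date, end: date) -> Iterator[date]:
    """Yield every calendar day from start to end inclusive, with no gaps."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def write_rows(path: str, fieldnames: List[str], rows: Iterable[Dict[str, object]]) -> None:
    """Write dict rows under a fixed header; missing or None values become empty cells."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

For example, write_rows("trends.csv", ["date", "Alice", "Bob"], rows) with one dict per day from daily_dates(TRENDS_START, today) keeps the header stable even on days when a candidate has no value.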

How we’ll assess it

Once we’ve received the code, we will run it for each day leading up to the election. It should not crash, and it should produce a CSV file that accurately matches our reference CSV of polls. We’ll also check the averages to ensure they haven’t diverged too much from the actual polling data (which would imply the code has gone wrong). If your code does crash, that doesn’t disqualify you; stuff happens. But we’d expect it to fail gracefully and trigger error monitoring so we’d know straight away.

This is not a test of your statistical know-how. It does not matter whether you use a 7-day trailing average or a Bayesian-filtered Dirichlet model to generate your average. It does matter that your average stays relatively close to the trends revealed by the polls.
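
For concreteness, a 7-day trailing average for a single candidate might look like the sketch below. It is one possible approach, not the expected one. The function name and the fallback rule are assumptions: when the window contains no polls (a gap in the record), it reuses the average of the most recent polling day, and it returns None before the candidate's first poll. It makes no attempt to stop the trend after a candidate drops out.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple


def trailing_average(
    polls: List[Tuple[date, Optional[float]]],
    day: date,
    window_days: int = 7,
) -> Optional[float]:
    """Mean share over the window_days ending on day, for one candidate."""
    earliest = day - timedelta(days=window_days - 1)
    in_window = [s for d, s in polls if s is not None and earliest <= d <= day]
    if in_window:
        return sum(in_window) / len(in_window)
    # Gap in the record: fall back to the most recent earlier polling day.
    earlier = [(d, s) for d, s in polls if s is not None and d < earliest]
    if not earlier:
        return None  # candidate has not been polled yet
    latest_day = max(d for d, _ in earlier)
    latest = [s for d, s in earlier if d == latest_day]
    return sum(latest) / len(latest)
```

Calling this once per candidate per day yields the values for one row of the trends file.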