This page provides some technical background on the Bayesian poll aggregation models I use on this site for the 2024-25 Australian Federal election.
The data aggregation (or data fusion) models I use are best described as state space models. They are similar to hidden Markov models (HMMs); however, the hidden state variables in these models are continuous rather than discrete (as they are in HMMs). The models are also analogous to the Kalman filter, which operates on a linear-Gaussian state space model.
I model the national voting intention (which cannot be observed directly; it is "hidden") for each day of the period under analysis. The only time the national voting intention is not hidden is at an election. In some models (known as anchored models), we use the election result to anchor the model of day-to-day voting intention.
In the language of modelling, our estimates of the national voting intention for each day being modelled are known as states. These "states" link together to form a process in which each state depends directly on the previous day's state, through a probability distribution that links the two. In plain English, the model assumes that the national voting intention today is much like it was yesterday.
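This day-to-day linkage can be sketched as a Gaussian random walk. The toy simulation below (parameter values are illustrative only, not estimates from my model) shows how each day's hidden state is simply the previous day's state plus a small random step:

```python
import numpy as np

rng = np.random.default_rng(42)

n_days = 365       # length of the period under analysis
daily_sigma = 0.2  # illustrative daily innovation, in percentage points

# Each day's hidden state is yesterday's state plus a small Gaussian step.
innovations = rng.normal(loc=0.0, scale=daily_sigma, size=n_days)
voting_intention = 50.0 + np.cumsum(innovations)  # start near 50 per cent

print(voting_intention[:3])
```

Because each step is small, the simulated series wanders slowly, which is exactly the "today is much like yesterday" assumption.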
The model is informed by irregular and noisy data from the selected polling houses. The challenge for the model is to ignore the noise and find the underlying signal. In effect, the model is solved by finding the day-to-day pathway with the maximum likelihood given the known poll results.
To improve the robustness of the model, we make provision for the long-run tendency of each polling house to systematically favour either the Coalition or Labor. We call this small tendency to favour one side or the other a "house effect". The model assumes that the results from each pollster diverge (on average) from the real population voting intention by a small, constant number of percentage points. We use the calculated house effect to adjust the raw polling data from each polling house.
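As a toy illustration of the adjustment step (the pollster names and house-effect values here are made up, not estimates from the model):

```python
# Hypothetical house effects, in percentage points
# (positive means the pollster's raw figures run high for this party).
house_effects = {"Pollster A": 1.0, "Pollster B": -0.5}

def adjust_poll(raw: float, pollster: str) -> float:
    """Remove the pollster's long-run house effect from a raw poll figure."""
    return raw - house_effects[pollster]

print(adjust_poll(53.0, "Pollster A"))  # 52.0
```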
In estimating the house effects, we can take one of a number of approaches. We could:
- anchor the model to an election result on a particular day, and use that anchoring to establish the house effects;
- anchor the model to a particular polling house or houses; or
- assume that, collectively, the polling houses are unbiased and their house effects sum to zero.
Currently, I tend to favour the third approach in my analysis.
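The sum-to-zero idea itself is simple to sketch: given raw estimates of each house's bias, centring them on their mean forces them to cancel out collectively. The numbers below are hypothetical, and the real model imposes the constraint through its prior rather than by post-hoc centring:

```python
import numpy as np

raw_biases = np.array([1.2, -0.3, 0.6, -0.9])      # hypothetical raw house biases
zero_sum_effects = raw_biases - raw_biases.mean()  # centre so they sum to zero

print(zero_sum_effects.sum())  # effectively zero (up to floating-point error)
```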
The problem with anchoring the model to the previous election outcome (or to a particular polling house) is that pollsters are constantly reviewing and, from time to time, changing their polling practices. Over time these changes erode the reliability of anchored models. On the other hand, the sum-to-zero assumption is rarely exactly correct. Nonetheless, at some previous elections, models anchored to the previous election result fared worse than models that averaged the bias across all polling houses.
Solving such a model requires integrating over a series of complex, multidimensional probability distributions. The definite integral is typically impossible to solve algebraically, but it can be approximated using a numerical method based on Markov chains and random numbers, known as Markov chain Monte Carlo (MCMC) integration.
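Monte Carlo integration is easiest to see on a simple definite integral; for example, estimating the integral of x² over [0, 1] (which equals 1/3) by averaging random draws. MCMC extends this idea, using Markov chains to draw samples from complicated posterior distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)

# Monte Carlo estimate of the integral of x^2 on [0, 1]:
# the mean of f(x) over uniform draws approximates the integral.
estimate = (x ** 2).mean()
print(estimate)  # close to 1/3
```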
The specific model I use, as coded in PyMC, is set out in the following code block.
```python
import pandas as pd
import pymc as pm


def define_zs_model(  # zs = zero-sum (house effects)
    n_firms: int,
    n_days: int,
    poll_day: pd.Series,  # of int, length is number of polls
    poll_brand: pd.Series,  # of int, length is number of polls
    zero_centered_y: pd.Series,  # of float, length is number of polls
    measurement_error_sd: float,
) -> pm.Model:
    """PyMC model for pooling/aggregating voter opinion polls.
    Model assumes poll data (in percentage points) has been
    zero-centered (by subtracting the mean for the series).
    Model assumes that House Effects sum to zero."""

    model = pm.Model()
    with model:
        # --- Temporal voting-intention model
        # Guess a starting point for the random walk
        guess_first_n_polls = 5  # guess based on first n polls
        guess_sigma = 15  # allow SD flexibility on init guess
        educated_guess = zero_centered_y[
            : min(guess_first_n_polls, len(zero_centered_y))
        ].mean()
        start_dist = pm.Normal.dist(mu=educated_guess, sigma=guess_sigma)

        # Establish a Gaussian random walk ...
        daily_innovation = 0.20  # from experience ... daily change in VI
        voting_intention = pm.GaussianRandomWalk(
            "voting_intention",
            mu=0,  # no drift in model
            sigma=daily_innovation,
            init_dist=start_dist,
            steps=n_days,
        )

        # --- House effects model
        house_effect_sigma = 15  # assume big house effects possible
        house_effects = pm.ZeroSumNormal(
            "house_effects", sigma=house_effect_sigma, shape=n_firms
        )

        # --- Observational model (likelihood)
        polling_observations = pm.Normal(
            "polling_observations",
            mu=voting_intention[poll_day.values]
            + house_effects[poll_brand.values],
            sigma=measurement_error_sd,
            observed=zero_centered_y,
        )
    return model
```
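To make the structure of the model concrete, the following sketch simulates synthetic data with the same generative form in plain NumPy: a hidden random walk, zero-sum house effects, and noisy poll observations. All names and values here are illustrative assumptions, not outputs of the model above:

```python
import numpy as np

rng = np.random.default_rng(1)

n_days, n_firms, n_polls = 200, 3, 40

# Hidden daily voting intention: a zero-centred Gaussian random walk
voting_intention = np.cumsum(rng.normal(0.0, 0.2, size=n_days))

# House effects that sum to zero across pollsters
house_effects = rng.normal(0.0, 1.0, size=n_firms)
house_effects -= house_effects.mean()

# Each poll observes the hidden state on its day, shifted by that
# pollster's house effect, plus independent measurement noise
poll_day = rng.integers(0, n_days, size=n_polls)
poll_brand = rng.integers(0, n_firms, size=n_polls)
measurement_error_sd = 2.0
observed = (
    voting_intention[poll_day]
    + house_effects[poll_brand]
    + rng.normal(0.0, measurement_error_sd, size=n_polls)
)
```

With real data, the PyMC model above would then be sampled inside its context, typically with `pm.sample()`, which performs the MCMC integration.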
This modelling is based on the work of Simon Jackman in Bayesian Analysis for the Social Sciences (2009).
The complete code base is available on my GitHub site.