DEV Talk: Generalized Raking for Survey Weighting

In the world of surveys, it is very common that our acquired responses need to be weighted in order to achieve a sample that is representative of some target population. This process of weighting simply consists of assigning a weight (a.k.a. factor) to each respondent and calculating all survey results as a weighted sum of respondents.

For example, we might have surveyed 100 male respondents and 150 female respondents but we're targeting a male/female ratio of 48%/52%. In this simple case, we could achieve the target ratio by weighting the male responses by a factor of 0.48 / (100 / (100 + 150)) = 1.2 and weighting the female responses by 0.52 / (150 / (100 + 150) = 0.867. The technical term for this method of computing weights is Post-Stratification.

However, in a more complex scenario, where we have many different measurable demographic targets, how can we determine weights for all the survey respondents?

Raking

At Potloc, it is very common that our clients desire survey populations matching a lot of such targets. For example, we might have targets looking like this:

  • 42% male
  • 58% female
  • 20% students
  • 80% non-students
  • 15% dog owners

In this setting, weights cannot be calculated using a simple ratio as in the male/female example shown above. Here, we instead need to rely on more involved algorithms, notably a process called Raking.

Iterative Proportional Fitting

One common approach to solve the problem of finding good weights that will satisfy our demographic targets is Iterative Proportional Fitting. In this method, weights for each respondent are computed for a single target at a time using Post-Stratification. By iteratively computing this for each target and repeating it a few times, the weights end up converging to values that satisfy our targets.

Great! Problem solved!

...but what if we could do even better? 🤔

Generalized Raking

Beyond satisfying the demographic targets, the most desirable property for the weights is that they should be as close as possible to 1. Indeed, weights that are really large mean that those respondents' responses will count for a lot more than the "average" respondent in our survey results. For example, a respondent with a weight of 10 will count for 10 times more than the average respondent, and 100 times more than a respondent with a weight of 0.1. Similarly, small weights mean that some responses will have very little impact on the final results.

Unfortunately, Iterative Proportional Fitting does nothing to encourage weights to be close to 1, which leads to sub-optimal weights. This is where Generalized Raking, an algorithm introduced by Deville et al. (1992), comes into play.

Note: This is where we get into the more mathematical part of this blog post 🤓. Don't care about this part? No worries! Simply skip to the next section!

The authors of this paper formulated the weighting problem as a constrained optimization method where the objective is that the weights are as close to one as possible and where the constraint is that the targets are matched. Mathematically this looks like this:

 
arg min G(x) s.t. XTw = T

where G(x) is the raking function that encourages weights to be close to 1,  is the vector of weights, T is the vector of targets (in absolute numbers, not percentages) and X is the (numRespondents) x (numTargets) matrix of responses. The matrix X is binary where cells are filled with a '1' if the respondent belongs to the target category and '0' otherwise.

In other words, this is saying that we want to optimize the weights to be as close to 1 as possible while satisfying the target constraints. This is achieved by minimizing the function G(x) which looks like this (notice the global minimum at x = 1 !):
image

The Generalized Raking Algorithm

While it is possible to solve this optimization problem using general methods such as Sequential Least-Squares Programming, the authors of Generalized Raking have devised a more efficient and robust algorithm for this specific problem:

  1. Initialize variables
    • A (numRespondents x 1) vector w to ones
    • A ( vector  to zeros
    • A  square matrix  to the Identity matrix
  2. While the weights have not converged

Here,  is the inverse of the derivative of the raking function, i.e. ex.
 is its derivative, in this case also .

Implementation

While there are many implementations of this algorithm in R, we were not able to find one in Ruby that could play well with our codebase and be easily maintainable.
We, therefore, decided to make our own and to share it here for anyone looking for something similar. We started by making an implementation in python with the popular numpy library:

Ruby Implementation

After validating the algorithm in python, we then proceeded to replicate it in Ruby. For this, we had to find an equivalent to numpy which we found in Numo. Numo is an awesome library for vector and matrix operations, and its linalg sub-library was perfect for us as we needed to compute a matrix pseudo-inverse. This allowed us to translate the code to Ruby almost line by line:

 

You may have noticed that the code doesn't quite match exactly the algorithm described above, notably steps 2.1 and 2.3. This is because we have found it to be vastly faster with Numo to store the sparse matrix  as a flat vector h_matrix_diagonal since it only contains values on the diagonal. As a result, the step of taking the product XTcan be rewritten as X.Transpose * h_matrix_diagonal, making use of Numo's implicit broadcasting.

 

In practice, we optimize this code a bit further by exiting early whenever possible (for example if our loss becomes NaN) and by allowing to pass as input an initial value for the vector lambdas if we believe to have an initialisation value better than the default.

With these few lines of code, we are now able to support complex survey weighting scenarios while having all of our code in our beautiful Ruby monolith 🎉!