How do you all handle discrepant demographic responses?

Hi, all! Wondering if anyone can suggest resources on how to handle discrepant demographic responses in a longitudinal survey. For example, if a respondent identifies as LGBTQ+ in an initial screener but then not in the follow-up survey or if they identify as LGBTQ+ at baseline but then not at a follow-up wave. We’re trying to identify some best practices on how to treat these cases analytically.

Hi allie - In my work, we would use the most recent response. This goes for race, ethnicity, sexual orientation and gender identity, disability, etc. Our reasoning is that these categories (many of which were not created by the people who inhabit them!) are meant to be fluid, not static. I have found that with older populations, LGBTQ+ identity tends not to shift as much over time, but with younger people it is a different story, especially as new ways to identify regarding sexual orientation and gender identity evolve.

Analytically, you could try a few things: 1) use identity at baseline; 2) use identity at follow-up; and 3) look at the individuals who changed identities from baseline to follow-up (this assumes you are using LGBTQ+ identity as an independent variable) to see whether they differ from those who did not change during the study (look at age especially).
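
In case it helps to see the three options side by side, here is a minimal sketch in plain Python. The record layout and the names (`responses`, `summarize_identity`, the `lgbtq` field) are made up for illustration, and it assumes one yes/no answer per respondent per wave.

```python
# Hypothetical long-format records: one row per respondent per wave.
responses = [
    {"id": "r1", "wave": 0, "lgbtq": True},
    {"id": "r1", "wave": 1, "lgbtq": False},
    {"id": "r2", "wave": 0, "lgbtq": True},
    {"id": "r2", "wave": 1, "lgbtq": True},
    {"id": "r3", "wave": 0, "lgbtq": False},
    {"id": "r3", "wave": 1, "lgbtq": False},
]


def summarize_identity(rows):
    """Return baseline answer, latest answer, and a change flag per respondent."""
    answers_by_id = {}
    # Sort by respondent and wave so each list of answers is in wave order.
    for row in sorted(rows, key=lambda r: (r["id"], r["wave"])):
        answers_by_id.setdefault(row["id"], []).append(row["lgbtq"])

    summary = {}
    for rid, answers in answers_by_id.items():
        summary[rid] = {
            "baseline": answers[0],            # option 1: identity at baseline
            "follow_up": answers[-1],          # option 2: most recent identity
            "changed": len(set(answers)) > 1,  # option 3: flag anyone who changed
        }
    return summary


summary = summarize_identity(responses)
changers = [rid for rid, s in summary.items() if s["changed"]]
```

The `changers` list is the group you would then compare against everyone else (on age, for instance).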

Good luck!


Great thread on an essential conversation! Thanks for raising it @allie and great reply @AileenDuldulao.

For me, it really depends on what you’re trying to measure in asking the demographic question - and what role the demographic data plays in your analysis.

If you’re doing a descriptive analysis and trying to understand something like the different percentages of people experiencing an event or using a service in each category, then I’d consider using the data from the time of collection that most closely measures what you’re trying to measure with the demographic data. Another option in addition to the ones suggested by @AileenDuldulao is to include them in both categories they have selected.

If you’re doing a causal model and need to include the demographic variable as a moderator or confounder, it’s often a good idea to format the demographic variable as a time-variant variable and include both responses.
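
As a rough illustration of what "time-variant" could look like in the data, here is a small sketch that builds one row per respondent per wave, so each outcome is paired with the identity reported at that same wave. The function name `to_person_period` and the field names are hypothetical, and a missing demographic answer is simply left as `None` rather than imputed.

```python
# Hypothetical inputs keyed by (respondent id, wave).
demographics = {
    ("r1", 0): True,
    ("r1", 1): False,
    ("r2", 0): False,
    # ("r2", 1) is missing: the respondent skipped the question at wave 1.
}
outcomes = {
    ("r1", 0): 3.0,
    ("r1", 1): 4.5,
    ("r2", 0): 2.0,
    ("r2", 1): 2.5,
}


def to_person_period(demographics, outcomes):
    """Build one row per respondent-wave with the identity reported at that wave."""
    rows = []
    for (rid, wave), outcome in sorted(outcomes.items()):
        rows.append(
            {
                "id": rid,
                "wave": wave,
                # Time-varying covariate: the answer given at this wave, or None.
                "lgbtq_at_wave": demographics.get((rid, wave)),
                "outcome": outcome,
            }
        )
    return rows


person_period = to_person_period(demographics, outcomes)
```

A table in this shape can go into most longitudinal or mixed models, with `lgbtq_at_wave` entered as a covariate that is allowed to change between waves.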

If you tell us a little more about the question you’re trying to answer with the analysis, I’d be happy to include an example from one of our projects.

Thanks so much for these responses, @Heather and @AileenDuldulao! Helpful to hear how other folks have done it. For a bit more context, we’re doing some analyses to determine if a health intervention affects various population subgroups differently. The categories that seem to fluctuate the most over time are LGBTQIA+ status and mental health, which makes sense as both can be quite fluid; we’re just trying to determine if we should be more strict in our definition (e.g., include only respondents who identified as LGBTQIA+ in the current wave) or more lenient (e.g., include respondents who identified as LGBTQIA+ in any wave, even if they don’t identify as such in the current wave). We’re leaning toward the latter, as many factors can influence whether someone chooses to self-identify on a survey, but would love to hear any additional thoughts you all have!
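
To make the two definitions concrete, here is a minimal sketch of how the strict and lenient rules could be coded. The function `in_subgroup` and the `history` layout (wave number mapped to a yes/no answer) are assumptions for illustration, and "any wave" is read here as any wave recorded for that respondent.

```python
def in_subgroup(history, current_wave, rule):
    """Decide subgroup membership under a strict or lenient definition.

    history: dict mapping wave number -> True/False self-identification.
    """
    if rule == "strict":
        # Only counts if the respondent identified in the current wave.
        return history.get(current_wave, False)
    if rule == "lenient":
        # Counts if the respondent identified in any recorded wave.
        return any(history.values())
    raise ValueError(f"unknown rule: {rule!r}")


# Hypothetical respondent who identified at baseline but not at wave 1.
history = {0: True, 1: False}
strict_member = in_subgroup(history, current_wave=1, rule="strict")
lenient_member = in_subgroup(history, current_wave=1, rule="lenient")
```

For this respondent the strict rule excludes them at wave 1 and the lenient rule includes them, which is exactly the kind of case in question.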

In an ideal world, I’d do it both ways and see what types of differences this makes. Of course, this is probably too resource-intensive for most real-world situations. I would tend to agree with you that if I could only choose one, I’d choose the latter and ensure it was labelled clearly.
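
If running it both ways is feasible, the comparison itself can be quite small. Below is a sketch with invented numbers: each hypothetical respondent carries a `current` flag (identified in the current wave), an `ever` flag (identified in any wave), and an `outcome` value, and the subgroup mean is computed under each definition.

```python
from statistics import mean

# Hypothetical respondents; the values are made up purely for illustration.
respondents = [
    {"id": "r1", "current": True, "ever": True, "outcome": 2.0},
    {"id": "r2", "current": False, "ever": True, "outcome": 4.0},
    {"id": "r3", "current": False, "ever": False, "outcome": 6.0},
]


def subgroup_mean(rows, flag):
    """Mean outcome among respondents for whom the given flag is True."""
    return mean(r["outcome"] for r in rows if r[flag])


strict_mean = subgroup_mean(respondents, "current")  # current-wave definition
lenient_mean = subgroup_mean(respondents, "ever")    # any-wave definition

# Respondents whose subgroup membership depends on the definition chosen.
reclassified = [r["id"] for r in respondents if r["current"] != r["ever"]]
```

Reporting how many respondents land in `reclassified`, alongside both estimates, is one way to keep the labelling clear whichever definition ends up in the main analysis.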