Errors in Satisfaction

statistics
r
Author

Mark Jurries II

Published

February 22, 2024

I came across an article on American life satisfaction yesterday. The trend analysis was fine, but the local news story version of it decided to take it a step further and do some sub-group analysis. While some news stories are pretty obviously wrong - if a reporter were to issue a story about how much better the grocery stores in Moscow are, for instance, you wouldn’t take long to realize it’s nonsense* - there’s something about numbers that lulls us into a “looks right, I guess” sense.

*Scott Lincicome worked the numbers, BTW, and concluded the same amount of groceries would cost about $130 over here, vs. the $400 the segment estimated. The more you know.

I had several questions that I was hoping to resolve by getting complete data (more on that later), but the most I could find was a cross-tab with some (not all) of the demographics in the report. This isn’t to pick on this poll; it looks to be well done, and you can’t blame the pollsters for somebody else possibly reading too much into it. Still, having recently read Steve Wexler’s excellent take on checking uncertainty in data, this seemed like a good place to put it into practice.

library(gt)
library(gtExtras)
library(hrbrthemes)
library(tidyverse)

sat_data <- tibble(demo = character(),
       class = character(), 
       n = numeric(), 
       sat = numeric()) %>%
  add_row(demo = 'Total', class = 'Total', n = 1011, sat = 791) %>%
  add_row(demo = 'Gender', class = 'Male', n = 502, sat = 392) %>%
  add_row(demo = 'Gender', class = 'Female', n = 500, sat = 394) %>%
  add_row(demo = 'Race', class = 'White', n = 649, sat = 518) %>%
  add_row(demo = 'Race', class = 'Non-White', n = 334, sat = 253) %>%
  add_row(demo = 'Age', class = '18-34', n = 276, sat = 210) %>%
  add_row(demo = 'Age', class = '35-54', n = 310, sat = 256) %>%
  add_row(demo = 'Age', class = '55+', n = 396, sat = 301) %>%
  add_row(demo = 'Education', class = 'College Grad', n = 374, sat = 317) %>%
  add_row(demo = 'Education', class = 'Some College', n = 279, sat = 212) %>%
  add_row(demo = 'Education', class = 'HS Grad or Less', n = 355, sat = 260) %>%
  add_row(demo = 'Party ID', class = 'Republican', n = 255, sat = 197) %>%
  add_row(demo = 'Party ID', class = 'Independent', n = 461, sat = 346) %>%
  add_row(demo = 'Party ID', class = 'Democrat', n = 273, sat = 230) %>%
  add_row(demo = 'Household Income', class = 'Less Than $50,000', n = 313, sat = 219) %>%
  add_row(demo = 'Household Income', class = '$50,000 - 100,000', n = 322, sat = 247) %>%
  add_row(demo = 'Household Income', class = '$100,000+', n = 278, sat = 247) %>%
  mutate(sat_perc = sat / n)

sat_data %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = c('n', 'sat'), decimals = 0) %>%
  fmt_percent(columns = c('sat_perc'), decimals = 0)
demo              class              n      sat  sat_perc
Total             Total              1,011  791  78%
Gender            Male               502    392  78%
Gender            Female             500    394  79%
Race              White              649    518  80%
Race              Non-White          334    253  76%
Age               18-34              276    210  76%
Age               35-54              310    256  83%
Age               55+                396    301  76%
Education         College Grad       374    317  85%
Education         Some College       279    212  76%
Education         HS Grad or Less    355    260  73%
Party ID          Republican         255    197  77%
Party ID          Independent        461    346  75%
Party ID          Democrat           273    230  84%
Household Income  Less Than $50,000  313    219  70%
Household Income  $50,000 - 100,000  322    247  77%
Household Income  $100,000+          278    247  89%

The smallest group here has a sample size of 255. Surely that’s a big enough number, right?* Let’s calculate the standard error and turn it into a 95% confidence interval - roughly, the range between x and y where we’d expect the number to land if we drew another sample of the same size. The bars indicate this range; if two bars overlap, then we’re less sure that the difference is real rather than due to the sample we happened to get.
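As a worked example, here’s the arithmetic for the Total row (791 satisfied out of 1,011). This uses the same normal-approximation formula as the code, where the 1.96 multiplier corresponds to a 95% interval; the figures are rounded.

$$
\hat{p} = \frac{791}{1011} \approx 0.7824, \qquad
SE = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}} = \sqrt{\frac{0.7824 \times 0.2176}{1011}} \approx 0.0130
$$

$$
\hat{p} \pm 1.96 \times SE \approx 0.7824 \pm 0.0254 \approx [0.757,\ 0.808]
$$

So even with over a thousand respondents, the headline 78% is really "somewhere around 76% to 81%", and the sub-groups, being smaller, get wider ranges.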

sat_data %>%
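  # se is the standard error of each proportion; se_low and se_high are the
  # bounds of a 95% interval (1.96 standard errors either side)
  mutate(se = sqrt( (sat_perc * (1 - sat_perc)) / n),
         se_low = sat_perc - (1.96 * se),
         se_high = sat_perc + (1.96 * se)) %>%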
  filter(demo != 'Total') %>%
  ggplot(aes(x = sat_perc, y = class))+
  geom_point()+
  geom_errorbar(aes(xmin = se_low, xmax = se_high))+
  theme_ipsum()+
  facet_wrap(demo ~ ., scales = "free_y")+
  xlab('')+
  ylab('')+
  scale_x_continuous(breaks = seq(0.6, 0.96, by = 0.10),
                     labels = scales::percent)

We see that both college grads and those making over $100K report higher satisfaction*. This is where the cross-tab lets us down, though. It’s reasonable to assume that most (not all, but most) making over $100K are also college grads. If we were building a model on this, we’d test the relationship and either only keep one or use an interaction term, i.e. measure satisfaction for each combination of education and income. Political party shows clear separation between Independents and Democrats, but everything else has overlap.

*This of course assumes that we have a representative sample. Their methodology seems sound - random calls spread across the country - but you should always check how people were polled. If they opted in to an online poll on a partisan website, you can bet the results will be skewed.
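The overlap we eyeballed on the error bars can also be checked numerically. Here’s a minimal sketch using the Party ID counts from the cross-tab; `prop_ci` and `overlaps` are hypothetical helper names I made up for this example, not functions from any package.

```r
# Hypothetical helper: 95% normal-approximation interval for a proportion,
# using the same formula as the error bars.
prop_ci <- function(sat, n, z = 1.96) {
  p  <- sat / n
  se <- sqrt(p * (1 - p) / n)
  c(low = p - z * se, high = p + z * se)
}

# Counts taken from the cross-tab
independent <- prop_ci(346, 461)
democrat    <- prop_ci(230, 273)
republican  <- prop_ci(197, 255)

# Hypothetical helper: two intervals overlap when each one starts
# before the other ends
overlaps <- function(a, b) {
  unname(a["low"] <= b["high"] && b["low"] <= a["high"])
}

overlaps(independent, democrat)   # FALSE
overlaps(republican, democrat)    # TRUE
```

Keep in mind this is a rough screen: bars that overlap don’t prove there’s no difference, which is one reason to follow up with a proper test.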

Eyeballing stuff is fun, but we can be a bit more rigorous and run a chi-squared test in each group. If the p-value is less than 0.05, we’ll say the differences are likely real - though we should incorporate our own knowledge of how the data was generated and not just check the box and call it good.

sat_data %>%
  filter(demo != 'Total') %>%
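  # chisq.test() wants counts of satisfied vs. not satisfied, one row per class
  mutate(non_sat = n - sat) %>%
  select(-sat_perc, -class, -n) %>%
  # Nest so each demographic gets its own sat / non_sat table
  group_by(demo) %>%
  nest() %>% 
  # Run the test on each table and keep only the p-value
  mutate(chisq_p = map_dbl(data, ~chisq.test(.)$p.value)) %>%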
  select(-data) %>%
  ungroup() %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = c(chisq_p), decimals = 3)
demo              chisq_p
Gender            0.844
Race              0.166
Age               0.070
Education         0.000
Party ID          0.013
Household Income  0.000
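To see what the nested pipeline is doing for a single demographic, here’s the Party ID test written out by hand. It’s a sketch using only base R and the counts from the cross-tab; the `party` and `party_test` names are mine.

```r
# Party ID counts from the cross-tab. matrix() fills column by column,
# so the first three values become the "sat" column.
party <- matrix(
  c(197, 346, 230,                      # satisfied
    255 - 197, 461 - 346, 273 - 230),   # not satisfied (n - sat)
  ncol = 2,
  dimnames = list(c("Republican", "Independent", "Democrat"),
                  c("sat", "non_sat"))
)

party_test <- chisq.test(party)

party_test$expected            # counts we'd expect if party made no difference
round(party_test$p.value, 3)   # 0.013, matching the table
```

One thing to be aware of: for 2x2 tables such as Gender and Race, `chisq.test()` applies a continuity correction by default (`correct = TRUE`), so a hand calculation without it will give a slightly smaller p-value.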

There may be other lurking variables as well. For instance, age may be more important than ethnicity: if the survey contained a higher proportion of whites in the 35-54 group, that would skew the results by race, even though the two groups would be dead even once we adjusted for age. The news report also mentioned religious attendance*; that’s not in the crosstab but would certainly add flavor to a full-blown analysis.

*Ryan Burge delves into religious/demographic/political data quite regularly; his blog is well worth the follow.

Reporting this way is certainly better than a plain old crosstab, and the error bars should keep us somewhat in line. And there are situations where we have to take this route, either because of time constraints or because this is the only data we have. Ideally, though, we’d set up a regression model that would let us know how much each demographic contributes.*

*Note to aspiring data analysts: someday, you’ll get an emergency request for data, likely a crosstab. Providing even basic linear regression, t-tests, etc. can help avoid some of the sample size issues we’re trying to stay away from.
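For a sense of what that regression might look like, here’s a rough sketch. I’m assuming respondent-level data, which I don’t have, so the `respondents` data frame is simulated and its numbers are meaningless. Logistic regression is my choice here as a common fit for a yes/no outcome, not something the poll itself used.

```r
set.seed(42)

# Made-up respondent-level data, for illustration only. The real poll's
# microdata was not available, so satisfaction here is pure random noise.
n <- 500
respondents <- data.frame(
  education = factor(sample(c("HS or Less", "Some College", "College Grad"),
                            n, replace = TRUE)),
  income    = factor(sample(c("Under 50K", "50K-100K", "100K+"),
                            n, replace = TRUE)),
  age_group = factor(sample(c("18-34", "35-54", "55+"), n, replace = TRUE))
)
respondents$satisfied <- rbinom(n, size = 1, prob = 0.78)

# Logistic regression: each coefficient is that demographic's association
# with satisfaction, holding the others constant. education * income expands
# to both main effects plus their interaction.
fit <- glm(satisfied ~ education * income + age_group,
           data = respondents, family = binomial)

summary(fit)$coefficients
```

With real data, the interaction terms would tell us whether income matters differently for college grads than for everyone else, which is exactly the question the cross-tab can’t answer.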

But at a fundamental level, knowing the questions to ask will help keep you from being led astray by bad studies. I used R for this, but there’s nothing here that couldn’t be done in Excel or Google Sheets. I’m not saying you need to crunch the numbers yourself each time you see a study, but slowing down and thinking it through will benefit you in the long run.