I came across an article on American life satisfaction yesterday. The trend analysis was fine, but the local news story version of it decided to take things a step further and do some sub-group analysis. While some news stories are pretty obviously wrong - if a reporter were to run a story about how much better the grocery stores in Moscow are, for instance, you wouldn't take long to realize it's nonsense* - there's something about numbers that lulls us into a "looks right, I guess" sense.
*Scott Lincicome worked the numbers, BTW, and concluded the same amount of groceries would cost about $130 over here, vs. the $400 the segment estimated. The more you know.
I had several questions that I was hoping to resolve by getting the complete data (more on that later), but the most I could find was a cross-tab with some (not all) of the demographics in the report. This isn't to pick on this poll - it looks to be well done, and you can't blame the pollsters if somebody else reads too much into it. Still, having recently read Steve Wexler's excellent take on checking uncertainty in data, this seemed like a good place to put it into practice.
```r
library(gt)
library(gtExtras)
library(hrbrthemes)
library(tidyverse)

sat_data <- tibble(demo = character(), class = character(), n = numeric(), sat = numeric()) %>%
  add_row(demo = 'Total', class = 'Total', n = 1011, sat = 791) %>%
  add_row(demo = 'Gender', class = 'Male', n = 502, sat = 392) %>%
  add_row(demo = 'Gender', class = 'Female', n = 500, sat = 394) %>%
  add_row(demo = 'Race', class = 'White', n = 649, sat = 518) %>%
  add_row(demo = 'Race', class = 'Non-White', n = 334, sat = 253) %>%
  add_row(demo = 'Age', class = '18-34', n = 276, sat = 210) %>%
  add_row(demo = 'Age', class = '35-54', n = 310, sat = 256) %>%
  add_row(demo = 'Age', class = '55+', n = 396, sat = 301) %>%
  add_row(demo = 'Education', class = 'College Grad', n = 374, sat = 317) %>%
  add_row(demo = 'Education', class = 'Some College', n = 279, sat = 212) %>%
  add_row(demo = 'Education', class = 'HS Grad or Less', n = 355, sat = 260) %>%
  add_row(demo = 'Party ID', class = 'Republican', n = 255, sat = 197) %>%
  add_row(demo = 'Party ID', class = 'Independent', n = 461, sat = 346) %>%
  add_row(demo = 'Party ID', class = 'Democrat', n = 273, sat = 230) %>%
  add_row(demo = 'Household Income', class = 'Less Than $50,000', n = 313, sat = 219) %>%
  add_row(demo = 'Household Income', class = '$50,000 - 100,000', n = 322, sat = 247) %>%
  add_row(demo = 'Household Income', class = '$100,000+', n = 278, sat = 247) %>%
  mutate(sat_perc = sat / n)

sat_data %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = c('n', 'sat'), decimals = 0) %>%
  fmt_percent(columns = c('sat_perc'), decimals = 0)
```
| demo | class | n | sat | sat_perc |
|---|---|---|---|---|
| Total | Total | 1,011 | 791 | 78% |
| Gender | Male | 502 | 392 | 78% |
| Gender | Female | 500 | 394 | 79% |
| Race | White | 649 | 518 | 80% |
| Race | Non-White | 334 | 253 | 76% |
| Age | 18-34 | 276 | 210 | 76% |
| Age | 35-54 | 310 | 256 | 83% |
| Age | 55+ | 396 | 301 | 76% |
| Education | College Grad | 374 | 317 | 85% |
| Education | Some College | 279 | 212 | 76% |
| Education | HS Grad or Less | 355 | 260 | 73% |
| Party ID | Republican | 255 | 197 | 77% |
| Party ID | Independent | 461 | 346 | 75% |
| Party ID | Democrat | 273 | 230 | 84% |
| Household Income | Less Than $50,000 | 313 | 219 | 70% |
| Household Income | $50,000 - 100,000 | 322 | 247 | 77% |
| Household Income | $100,000+ | 278 | 247 | 89% |
The smallest group here has a sample size of 255. Surely that's a big enough number, right?* Let's calculate the standard error - that is, the range we'd expect the number to land in if we drew another sample of the same size. The bars indicate this range; if they overlap, we're less sure that our difference is real rather than an artifact of the sample we happened to get.
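Here's a minimal sketch of those error bars, reusing sat_data from above; it assumes the usual normal approximation for a proportion, with 95% intervals:

```r
# Minimal sketch: 95% intervals via the normal approximation,
# SE = sqrt(p * (1 - p) / n), then horizontal error bars per class.
sat_data %>%
  filter(demo != 'Total') %>%
  mutate(
    se = sqrt(sat_perc * (1 - sat_perc) / n),
    lower = sat_perc - 1.96 * se,
    upper = sat_perc + 1.96 * se
  ) %>%
  ggplot(aes(x = sat_perc, y = class)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.3) +
  facet_wrap(~demo, scales = 'free_y') +
  scale_x_percent() +
  labs(x = 'Satisfied', y = NULL) +
  theme_ipsum()
```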
We see that both college grads and those making over $100K report higher satisfaction*. This is where the cross-tab lets us down, though: it's reasonable to assume that most (not all, but most) of those making over $100K are also college grads. If we were building a model on this, we'd test that relationship and either keep just one variable or use an interaction term, i.e. measure satisfaction for each combination of education and income (see the sketch below). Political party shows clear separation between Independents and Democrats, but everything else has overlap.
*This of course assumes that we have a representative sample. Their methodology seems sound - random calls spread across the country - but you should always check how people were polled. If they opted in to an online poll on a partisan website, you can bet the results will be skewed.
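For reference, here's what that interaction term might look like. This is purely hypothetical - satisfied, education, income, and the respondents data frame are all assumed names, since the cross-tab doesn't include respondent-level records:

```r
# Hypothetical sketch: education * income fits a separate effect for each
# education/income combination instead of assuming the two effects just add.
# `respondents` (one row per person, with a binary `satisfied` column) is an
# assumed data frame, not something we can build from the cross-tab.
glm(satisfied ~ education * income, family = binomial, data = respondents)
```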
Eyeballing stuff is fun, but we can be a bit more rigorous and run a chi-squared test within each group. If the p-value is less than 0.05, we'll say the differences are likely real - though we should incorporate our own knowledge of how the data was generated, not just check the box and call it good.
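A sketch of that test, again reusing sat_data: within each demographic, we compare the satisfied and not-satisfied counts across classes.

```r
# Chi-squared test of homogeneity within each demographic group:
# rows are classes, columns are satisfied / not-satisfied counts.
sat_data %>%
  filter(demo != 'Total') %>%
  group_by(demo) %>%
  summarize(p_value = chisq.test(cbind(sat, n - sat))$p.value)
```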
There may be other lurking variables as well. For instance, age may be more important than race, but if the survey contained a higher proportion of White respondents in the 35-54 group, that would skew the White results upward, even if the two groups were dead even after adjusting for age. The news report also mentioned religious attendance; that's not in the cross-tab, but it would certainly add flavor to a full-blown analysis.
*Ryan Burge delves into religious/demographic/political data quite regularly; his blog is well worth the follow.
Reporting this way is certainly better than a plain old cross-tab, and the error bars should keep us somewhat in line. There are situations where we have to take this route, either because of time constraints or because this is the only data we have. Ideally, we'd set up a regression model that would tell us how much each demographic contributes (a sketch follows the footnote below).
*Note to aspiring data analysts: someday, you'll get an emergency request for data, likely a cross-tab. Providing even a basic linear regression, t-test, etc. alongside it can help avoid some of the sample-size issues we're trying to stay away from.
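Again as a sketch only: if we had the respondent-level file (the assumed respondents data frame from the earlier snippet), a logistic regression would estimate each demographic's contribution while holding the others constant. Every column name here is illustrative, not from the actual poll:

```r
# Hypothetical: logistic regression on assumed respondent-level data.
# Each coefficient reflects a demographic's association with satisfaction
# after accounting for the other predictors in the model.
model <- glm(
  satisfied ~ gender + race + age_group + education + income + party,
  family = binomial,
  data = respondents
)
summary(model)
```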
But at a fundamental level, knowing the questions to ask will keep you from being led astray by bad studies. I used R for this, but there's nothing here that couldn't be done in Excel or Google Sheets. I'm not saying you need to crunch the numbers yourself every time you see a study, but slowing down and thinking it through will benefit you in the long run.