Fun with statistics, American-style

As a former teacher of a section of an applied probability class at MIT and, more recently, a tutor for AP Statistics, I offer my favorite recent news stories…

“Facebook says high-frequency posters often share fake news” (Engadget):

Adam Mosseri, the Facebook VP in charge of News Feed, said that the company’s research shows that people who post more than 50 times per day are often sharing low quality content.

“Harassed, Propositioned and Silenced in Silicon Valley” (New York Times front-page headline; the story itself carries “Women in Tech Speak Frankly on Culture of Harassment”):

More than two dozen women in the technology start-up industry spoke to The Times in recent days about being sexually harassed. … The women’s experiences help explain why the venture capital and start-up ecosystem — which underpins the tech industry and has spawned companies such as Google, Facebook and Amazon — has been so lopsided in terms of gender.

[What’s the denominator for those 24 women who complained to the Times? This report shows total Silicon Valley employment at 1.5 million jobs in 2015. The EEOC says that 36 percent of “high tech workers” in 2014 were women (Google says that 31 percent of its employees in 2016 identified as “women”). If all Silicon Valley jobs were tech jobs, that would be roughly 500,000 women, so the denominator is at most about 500,000, and almost certainly smaller because not all Silicon Valley jobs are tech jobs.]
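Back-of-the-envelope for that ceiling, in Python, using just the two figures cited above:

    sv_jobs = 1_500_000       # total Silicon Valley jobs, 2015
    share_women = 0.36        # EEOC share of "high tech workers" who were women, 2014
    print(int(sv_jobs * share_women))   # 540,000 -- call it roughly 500,000, and lower
                                        # in reality because not every SV job is a tech job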

15 thoughts on “Fun with statistics, American-style”

  1. >What’s the denominator for those
    >24 women who complained to the Times?

    If one were trying to estimate a rate, the denominator would be the sample size, not the population size. Of course, any such calculation is of little value unless the data were collected using random sampling, and it appears these data were not so collected. Unlike this posting (which attempts such a calculation anyway), the NY Times article did not seem to include any incidence estimate; did I miss it?

    If I understood the article correctly, then the population under discussion is women attempting to raise SV seed/venture capital. If, for some other reason, one is trying to ballpark the size of that population, I don’t see how the fraction of women in Google’s workforce and the total number of SV jobs are of any value. Compared to the estimate given in the posting, 500,000, the actual value is not just “almost certainly smaller”, but certainly smaller by orders of magnitude.

  2. Is there some sort of calculation that would indicate how many women the Times should have interviewed before publishing that article? Should their reporters have spoken to 100,000 women?

  3. If we found out how long each of these 24 women had been working in the field, and also how many independent incidents of harassment each had encountered, would it be possible to estimate an incidence rate for the larger population of women in the SV angel/venture game? Granted, it would require making some assumptions (e.g., that all women are at the same risk of being harassed), but even if those assumptions were not valid, they might not be so invalid as to make the estimate completely uninteresting.

  4. If you guys have faith in a sampling approach that starts with “Email or call us only if you feel that you’ve been sexually harassed,” I don’t think that we need to argue about the downstream statistical methods employed.

  5. Actually, the more I think about it, the more significant this kind of stuff is, albeit still not statistically significant…

    If you’re in Silicon Valley having trouble competing, pay a few women to call up the New York Times and say that they were harassed by employees of your competitor. I’m not suggesting that Lyft took this approach, but just imagine what https://www.susanjfowler.com/blog/2017/2/19/reflecting-on-one-very-strange-year-at-uber was worth to Lyft.

  6. @philg: What I’m pondering is whether it would be possible to tease out some meaningful quantitative results even with the obvious limitation of the sampling approach by collecting some additional data (number of times harassed and time period exposed to harassment).

    Assuming this sample is representative of women seeking SV seed/venture capital is clearly problematic, but let’s assume they are representative of some subset of that population. Then let’s say (the data indicate) that each woman was harassed once. We should be able to compute the highest incidence rate (per unit time) that is compatible with pulling a sample of n=24 (from our imaginary population) and finding that every woman sampled had been harassed once. This would be much lower than the average incidence rate for the sample, because if the population rate matched the sample rate, then with a sample of that size we would expect some (probably many) of the women to have been harassed multiple times.

    Now we are interested in the population of women who are seeking SV seed/venture capital, not some imaginary subset of that population. Our biased sampling method would tend to produce a higher-than-actual estimate for that population. However, that’s ok if our computed incidence rate is “low” (i.e., all or most of the women were harassed once). That just means that the actual rate is even lower than our already low computed rate. We have learned something. If our computed rate is high (i.e., many of the women have been harassed multiple times), we can’t really say too much about the rate in our population of concern, because we don’t know how much of that high rate is due to sampling bias. In this case, oh well, perhaps the result is suggestive but more research is needed with a better sampling technique.
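    To make the “we would expect multiples” step concrete, here is a rough back-of-the-envelope in Python, assuming Poisson counts with an expected value of about 1 incident per woman, i.e. roughly the sample average (my assumptions, not anything from the article):

      import math

      mu = 1.0                                    # assumed expected incidents per woman
      p_any = 1.0 - math.exp(-mu)                 # P(at least one incident)
      p_one = mu * math.exp(-mu)                  # P(exactly one incident)
      p_repeat_given_any = (p_any - p_one) / p_any
      print(p_repeat_given_any)                   # ~0.42, so roughly 10 of 24 respondents
                                                  # would be expected to report repeat incidents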

  7. >If you’re in Silicon Valley having trouble
    >competing, pay a few women to call up
    >the New York Times and say that they
    >were harassed by employees of your
    >competitor.

    Or you can just steal their IP by hiring away an engineer who takes it for you.

  8. >I don’t think that we need to argue
    >about the downstream statistical methods employed.

    I wasn’t looking to argue; I was looking for technical feedback on #6.

  9. Neal: In #6 you’re using a lot of terms that could be found in a probability textbook, but I still don’t see how they can add up to analysis.

    Let’s take this out of the politically and emotionally charged area of sex and gender. Suppose that the NYT canvasses a city of 1.5 million for people who’ve eaten in a particular restaurant. 24 people come forward and talk about their experience in this restaurant. You learn that some of them have been to the restaurant multiple times. What do you think that you can say about the dining preferences of the rest of the folks who live in that city of 1.5 million (assume that the same percentage of these 1.5 million never go to any restaurant as the percentage of SV workers who aren’t in the tech industry)?

  10. In addition to sample selection bias, model bias is also huge. Using comparable definitions of sexual harassment, is there also a culture of sexually harassing men in Silicon Valley?

    If you work in Silicon Valley, and a co-worker has ever worn a dress that you found sexually stimulating, call the NYT and complain about a culture of women sexually harassing men!

  11. @philg: I agree that my analysis won’t work in your hypothetical, but I’m not sure the hypothetical is analogous. How about this one:

    A city is getting complaints from people being hit by bird crap at a very popular park. After the city places an ad in the newspaper, 24 people who have been hit in the park respond and provide the total time they have spent in the park and the number of times they have been splatted. Clearly, these data do not enable us to estimate the risk (per unit time) of being hit by bird crap for park visitors. We have no way of knowing if these respondents are representative of the visitors. My claim is that we can use these data to compute an upper bound for the risk to park visitors, and furthermore, that this computed upper bound will be much lower than the sampled risk (sum[splats] / sum[minutes in park]) for the 24x(splats=1) case.
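    A minimal sketch of the computation I have in mind, in Python, assuming splat counts are Poisson in exposure time and using made-up exposure times (the real ones would come from the respondents):

      import math

      # Hypothetical minutes-in-park for the 24 respondents, each reporting exactly one splat.
      exposures = [30, 45, 60, 90, 120, 150, 180, 200, 240, 300,
                   320, 360, 400, 420, 480, 500, 540, 600, 660, 700,
                   720, 800, 900, 1000]

      def p_single_given_hit(rate, minutes):
          """P(X = 1 | X >= 1) when splats X ~ Poisson(rate * minutes)."""
          mu = rate * minutes
          return mu * math.exp(-mu) / (1.0 - math.exp(-mu))

      def prob_all_singles(rate):
          """Chance that every self-selected respondent reports exactly one splat."""
          p = 1.0
          for t in exposures:
              p *= p_single_given_hit(rate, t)
          return p

      def upper_bound(alpha=0.05):
          """Largest per-minute rate still compatible (at level alpha) with 24 single-splat reports."""
          lo, hi = 0.0, 1.0
          for _ in range(60):              # bisection; prob_all_singles falls as the rate rises
              mid = (lo + hi) / 2.0
              if prob_all_singles(mid) >= alpha:
                  lo = mid
              else:
                  hi = mid
          return lo

      naive_rate = len(exposures) / sum(exposures)   # taking the biased sample at face value
      print("naive sample rate (splats/minute):", naive_rate)
      print("upper bound on the population rate:", upper_bound())

    The test statistic is just the number of respondents reporting more than one splat (observed: zero); if anything, response bias should favor the most-splatted visitors, which would only make zero multiples harder to square with a high rate.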

  12. Neal: I think I understand your point now. Even though the 24 anecdotes are a ridiculously small sample size for drawing conclusions regarding a large industry, we could still get some information out of the providers of those anecdotes. For example, if Susan Wu, whom the Times quotes as complaining about having her face touched, had interacted with 10 VCs and one touched her face, that is a little different from if she’d interacted with 1,000 VCs and one touched her face. (If the latter, though, I’m not going to hold my breath for the NY Times running a story headlined “We estimate that no more than 0.1% of VCs are face-touchers.”)

  13. Here’s a better formulation: If I ask people to contact me if they love Michael Bolton as much as I do, and 24 people from Silicon Valley respond that they enjoyed listening to “When a Man Loves a Woman” while relaxing with VC friends, will you be convinced that part of the Silicon Valley “culture” is being a huge Michael Bolton fan?

  14. Yes, my point was that there is nothing inherently wrong with anecdotal evidence, or even non-randomly sampled data, provided the analysis used and the conclusions drawn take the limitations of the sampling method into account. Often this kind of flawed data is all that is available, but some creative analysis (done carefully) can still unlock genuine insights. I completely agree that samples collected by seeking sample points with a particular characteristic cannot be used to infer the frequency of that characteristic in the population.
