Unrepresentative samples in stats problems

  • #2314
    pstinchcombe
    Member

    I’ve seen, in lots of stats curriculum fora including the Common Core progressions and other Common Core materials, problems that ask students to make inferences about animals based on data on some list of about 30 animals.

    In every case that I’ve seen, that list of animals is wildly unrepresentative: for example, the one shown in the progressions is vast-majority (>80%) mammals, and vast-majority over 100 grams. In the progressions doc, this is targeted at a 6th-grade standard, so the students aren’t supposed to know about representative samples yet — but often these questions are targeted at higher-level standards (my guess is the 8th-grade question on life span vs. speed is based on a similarly size-biased sample, although I can’t tell for sure).

    It feels like we ask students to learn about representative samples in 7th grade, but at every point before and after this, we expect them to be ignoring it and drawing inferences based on unrepresentative samples. The progressions document seems to be implicitly telling me I shouldn’t be worried about this — is that the intention? If so, I’d like to hear why this isn’t a concern.

    #2321
    Bill McCallum
    Keymaster

    I’m going to ask a statistician to take a look at this question, but here is my take on it. When you talk about a representative sample, you are talking about a situation where there is a more or less homogeneous population with a number of different subtypes (e.g. ethnic groups in the population of a country), and you want to make sure your sample has the same proportions of those subtypes.

    But I don’t think you can say that animals form a homogeneous group. For each species you might measure average height and weight, and you have probably already chosen a representative sample within each species in order to get those averages. And then you want to see if there is any relationship between these variables across species. You could just try to get all the species, but I’m not sure what it would mean to make a representative selection of species. Would you try to make sure the percentage of mammalian species represented the true proportion in the world? Or would you go for some sort of geographic representation? I can see that it’s worth discussing these things, but it seems too complicated an example for introducing 6th graders to the idea of a representative sample (or even high schoolers, for that matter).

    #2322
    pstinchcombe
    Member

    Just to clarify, I certainly don’t think this problem is a good way to introduce students to the idea of representative sampling. The question isn’t whether we should use such datasets to teach students about representative sampling: it’s whether we should use such datasets at all.

    My impression has been that representativeness of a sample is still relevant — although harder to assess — when the population in question isn’t as easy to understand. If we don’t believe our sample is representative at least in the key regards, what justification could we have for drawing conclusions about the relationship between characteristics of animals from that sample?

    Part of the problem, for me, is that the standards do address this, briefly, when they say “random sampling tends to produce representative samples and valid inferences.” The probability that any random selection of animals — whether you weighted by geographic distribution, or by population, or biomass, or any other reasonable criterion except familiarity to us — would produce a sample which is this weighted towards large animals and mammals is vanishingly small.
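To put a rough number on "vanishingly small", here is a minimal sketch in Python. It assumes each of 30 animals is drawn independently and is a mammal with probability p, and it reads "more than 80% of about 30" as at least 25 out of 30. The values of p tried below are hypothetical placeholders, not measured shares of species, population, or biomass.

```python
import math

def binomial_upper_tail(sample_size, min_successes, success_probability):
    """P(X >= min_successes) for X ~ Binomial(sample_size, success_probability)."""
    return sum(
        math.comb(sample_size, k)
        * success_probability ** k
        * (1 - success_probability) ** (sample_size - k)
        for k in range(min_successes, sample_size + 1)
    )

# Assumption: 30 animals in the list, at least 25 of them mammals (over 80%).
SAMPLE_SIZE = 30
MIN_MAMMALS = 25

# Hypothetical mammal shares under different weighting schemes (illustrative only).
for assumed_share in (0.01, 0.10, 0.30):
    probability = binomial_upper_tail(SAMPLE_SIZE, MIN_MAMMALS, assumed_share)
    print(f"assumed mammal share {assumed_share:.2f}: "
          f"P(at least {MIN_MAMMALS} of {SAMPLE_SIZE} are mammals) = {probability:.3e}")
```

Even the most generous of these placeholder shares gives a probability well below one in a million, which is the sense in which a random selection would be very unlikely to look like the list in the progressions.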

    #2323
    Bill McCallum
    Keymaster

    Here is the comment from the statistician I asked, Roxy Peck:

    I think that at the grade 6 level, the idea is that kids are working with census data and that they are not interested in generalizing beyond the group that they have data on. The idea of sampling is introduced in Grade 7 and from there on it is reasonable to have students think about the sampling process and whether or not that process is likely to result in a representative sample whenever they are generalizing from a sample to a larger population.

    However, in many situations, like the ones that look at relationships between variables in Grade 8, regression and other model fitting are viewed as descriptive rather than inferential. So I think that these kinds of examples are OK, as long as students are trying to describe the relationship in a given set of data and are not generalizing beyond the data set. It is related to the distinction between descriptive statistics and inferential statistics. My take on the Common Core standards is that all of the modeling of relationships between two numerical variables falls in the descriptive statistics realm. But I think it is appropriate to ask students to think about the sampling issues if they are asked to make predictions based on regression models. The assumption that is being made is that the data are representative of the relationship between the variables, something that would follow if the data are from a random sample but which also might be reasonable even when the data are not from a random sample.

    I would add that the example in the progression about animal speeds is linked to the Grade 6 standard about describing data sets (6.SP.5), so falls squarely in the domain of descriptive statistics.

    #2337

    I teach 6th grade math at a charter school in the Bay Area and have been working with my students on statistics these past few weeks. I think I’ve been overthinking my unit – there’s too much going on and too little time to teach it all. I wrote up my unit plan before being told about this website and decided to teach my students about representative samples. I’ve enjoyed teaching them this – although they are no doubt a little confused – but it seems like a logical place to go. I interpreted the standards as a way not just to analyze data but a way for them to create their own statistical questions, collect data, create visual representations of that data, and analyze it. Although a representative sample is a challenging concept to understand, I’m wondering why this is not part of the standard. I’m not going into too much depth with it, but it should be intuitive that you usually cannot collect data from an entire population – you have to ask a smaller group to get information about the larger population. You can then use the data from that smaller sample to make inferences about the larger population. They also learn about inferring in ELA class and this provides a great opportunity to extend that skill to another subject.

    I would love some feedback on this. Thanks.

    #2344
    pstinchcombe
    Member

    OK, thanks — so basically the conclusion is that we can do descriptive statistics on a non-representative sample just fine, and it’s only when we want to make inferences about the whole population that we need to worry about representativeness.

    It seems to me that questions about whether there *is* a linear trend (or, equivalently, a nonlinear trend, or a relationship between categorical variables) are irrevocably inferential — unless the correlation is exactly zero, which almost never happens. Do you agree with that, or is there a purely descriptive way to make sense of such a question?

    The progressions doc does indicate that we should make such judgments. (“if the two proportions… are about the same…”) If one of two proportions is a little bit higher than the other, the only way I know how to make sense of differentiating between sample proportions being about the same vs. one being higher is to make an (informal) inferential distinction: if I flip a coin 100 times, and 51% of them are heads, that’s about the same as I’d expect for a fair coin, but if I flip a coin 100,000,000 times and 51% are heads, that’s not about the same as I’d expect for a fair coin.
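A minimal sketch of that coin-flip comparison, assuming the usual normal approximation to the binomial (the function name and the choice of a z-score are mine, not something from the progressions): the same 51% sits about 0.2 standard errors from a fair coin at 100 flips, and about 200 standard errors away at 100,000,000 flips.

```python
import math

def z_score_for_heads(n_flips, observed_proportion, fair_proportion=0.5):
    """Standard errors between the observed share of heads and a fair coin.

    Uses the standard error sqrt(p * (1 - p) / n) of a sample proportion
    under the assumption that the coin really is fair.
    """
    standard_error = math.sqrt(fair_proportion * (1 - fair_proportion) / n_flips)
    return (observed_proportion - fair_proportion) / standard_error

z_small = z_score_for_heads(100, 0.51)
z_large = z_score_for_heads(100_000_000, 0.51)

print(f"100 flips, 51% heads: z = {z_small:.1f}")
print(f"100,000,000 flips, 51% heads: z = {z_large:.1f}")
```

The observed proportion is identical in both cases; only the sample size changes, and that is what turns "about the same" into "clearly different".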

    Does this seem right to you? That questions about whether there is a relationship are inferential by their very nature? It seems like in this case we’ve got to be careful about using the data set in the example on speeds and lifespans to make any conclusions; it seems like saying “the linear trend is due to a couple of outliers” is okay, as is “the line that best fits the data is…” but asking if there is a linear trend is a bad idea since this is an inferential question (or rather, either it’s an inferential question or the answer is “yes” for almost every data set including this one) and the data are unrepresentative.
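As an illustration of the purely descriptive statement "the linear trend is due to a couple of outliers", here is a small sketch in Python. The numbers are synthetic and made up for this example; they are not the speed and lifespan data from the progressions. The code only describes the data it is given and makes no claim about any wider population.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Synthetic main cluster: constructed so that it has no linear trend at all.
cluster_x = [1, 2, 3, 4, 5]
cluster_y = [3, 1, 2, 1, 3]

# Two synthetic outliers, far from the cluster in both variables.
outlier_x = [20, 25]
outlier_y = [20, 24]

r_without_outliers = pearson_correlation(cluster_x, cluster_y)
r_with_outliers = pearson_correlation(cluster_x + outlier_x, cluster_y + outlier_y)

print(f"r without the two outliers: {r_without_outliers:.3f}")
print(f"r with the two outliers:    {r_with_outliers:.3f}")
```

Comparing the two coefficients is a description of this particular data set; deciding whether the trend would hold for animals in general is the inferential step that depends on how the list was chosen.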
