Untitled - posted by guest on 30th May 2020 03:27:07 PM
Stat notes
*Sample Size Argument*
If we assume the sample is random (and if it's not, then we're in trouble no matter what we do), then only the sample size, not the population size, matters for the standard error (the uncertainty in the estimate): the standard error of the mean is roughly s / sqrt(n), and the population size N appears nowhere in that formula. This is a counterintuitive but fundamental result in statistics.
For example: suppose I wanted to estimate the height of people in the US (population ~330M) and in Micronesia (population ~100,000), and I had a random group of 1000 people from each country. You might think that the confidence interval (the uncertainty) would be smaller for Micronesia because we have a larger share of the population (1% vs .0003%). But this isn't the case! The uncertainty in the estimates would be the same.
There is a slight exception to this, the finite population correction sqrt((N - n) / (N - 1)), which kicks in when you're sampling more than about 5% of the population, but even in this case the difference is pretty small, and sample_size / population_size would not be the right correction.
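A quick R sketch of the point above (the height standard deviation of 10 cm is an assumed, illustrative value, not from the discussion):

```r
n  <- 1000
s  <- 10               # assumed sd of heights in cm (illustrative)
se <- s / sqrt(n)      # standard error of the mean: ~0.32 cm
                       # note: no population size anywhere in the formula

# Finite population correction, for comparison
fpc <- function(N, n) sqrt((N - n) / (N - 1))
se * fpc(330e6, n)     # US: essentially unchanged
se * fpc(1e5,  n)      # Micronesia: only ~0.5% smaller, even with 1% sampled
```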
*lm vs loess*
Local regression (loess) makes no sense for data this sparse. For every point along the fitted line, loess essentially takes the nearest known points and fits a low-order polynomial curve to the data in that neighborhood. With so few data points, a localized nonlinear regression is going to severely overfit the data.
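A minimal R sketch of the comparison, using invented data shaped like the set described in this thread (two points at x = 3, nothing at 4, 8, or 9, and a point at (11, 14)):

```r
# Invented data -- the actual values are not from the original question
df <- data.frame(x = c(1, 2, 3, 3, 5, 6, 7, 10, 11),
                 y = c(3, 4, 5, 2, 6, 7, 8, 11, 14))

fit_lm    <- lm(y ~ x, data = df)               # one global line
fit_loess <- loess(y ~ x, data = df, span = 1)  # local polynomial fits

# With only 9 observations, the loess curve bends to chase individual
# points (classic overfitting), while lm at least keeps a single,
# interpretable slope.
```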
That being said, I'm not sure even a linear fit is appropriate here, for a few reasons.
First of all, what is the goal here? The point of a linear model is to have a mathematical function that predicts the value ŷ (y-hat) for any future value x. However, I'm struggling to picture a scenario where it makes sense to predict the frequency of a new value x given a modeled distribution f(x). Could you give more specifics about the goal this model is meant to achieve?
Second, data quality: it would be very difficult to confirm that your data satisfies the assumptions of linear models (linearity, independence, homoscedasticity, normally distributed residuals). You just don't have enough data to draw any reasonable conclusions. Can you even make conclusions about outliers? Is the point (11, 14) an outlier, or a valid data point? It would be difficult to convince me that you can be sure either way.
On a side note, if the data is frequency of occurrence vs each possible x-value, why do you have two different values for x=3, but no value for 4, 8, or 9?
*Lumping too many color/fill arguments*
When you want to map a variable to the fill/color aesthetic but it has too many levels, use fct_lump() on that variable to collapse the rare levels into an "Other" category:

df %>%
  mutate(fill_column = fct_lump(fill_column, n = n_count)) %>%
  ggplot(aes(x, y, fill = fill_column)) +
  geom_col()
Let's pretend the cut column from diamonds has many distinct values:

library(tidyverse)  # diamonds, fct_lump(), and ggplot() all live here
diamonds %>%
  mutate(cut = fct_lump(cut, n = 4)) %>%
  ggplot(aes(x, y, fill = cut)) +
  geom_col()