Thursday, 17 November 2016

r - Sample n random rows per group in a dataframe



From these questions - Random sample of rows from subset of an R dataframe & Sample random rows in dataframe I can easily see how to randomly sample (select) 'n' rows from a df, or 'n' rows that originate from a specific level of a factor within a df.




Here are some sample data:



df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each=10)

df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.


To e.g. just sample 3 random rows from 'pink' color - using library(kimisc):




library(kimisc)
sample.rows(subset(df, color == "pink"), 3)


or writing custom function:



sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)



However, I want to sample 3 (or n) random rows from each level of the factor. I.e. the new df would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create newdfs for each color, and then bind them together, but I am looking for a simpler solution.


Answer



You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.



rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]


This has the advantage of preserving the original row order and row names if that's something you are interested in. Plus you can re-use the rndid vector to create subset of different lengths fairly easily.


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...