r - Calculating percentages of a factor variable with dplyr

Friday 9 December 2016

I am trying to calculate the percentage/count of each level of a factor variable in a data frame with dplyr, much like table does. I can do this manually, but that becomes tedious when there are many factor variables or when a factor has many levels.

Example:

library(dplyr)

set.seed(100)
# groupbyvar and var1 (length 4) are recycled to match var2 (length 12)
data <- data.frame(groupbyvar = LETTERS[1:4],
                   var1 = letters[1:4],
                   var2 = as.factor(sample(1:4, 12, TRUE)))


data %>%
  group_by(groupbyvar) %>%
  summarise(
    # proportion of each var1 level within the group
    var1_a = mean(var1 == 'a', na.rm = TRUE),
    var1_b = mean(var1 == 'b', na.rm = TRUE),
    var1_c = mean(var1 == 'c', na.rm = TRUE),
    var1_d = mean(var1 == 'd', na.rm = TRUE),
    # proportion of each var2 level within the group
    var1_1 = mean(var2 == 1, na.rm = TRUE),
    var1_2 = mean(var2 == 2, na.rm = TRUE),
    var1_3 = mean(var2 == 3, na.rm = TRUE),
    var1_4 = mean(var2 == 4, na.rm = TRUE)
  )

I thought about using table, but it doesn't produce output that dplyr can work with directly. I also thought about using something like model.matrix to generate indicator columns for the factor variables before passing in the data frame, but this increases the memory footprint unnecessarily (especially for a large data set). Is there an easy way to automate this?
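For reference, here is a minimal base R sketch of the table idea for a single variable, assuming the goal is row-wise proportions of var2 within each groupbyvar. The object names tab and wide, and the "var2_" prefix, are illustrative choices and not part of the original example. It shows that a table can be converted to an ordinary data frame, although it has to be repeated for every factor variable.

```r
set.seed(100)
data <- data.frame(groupbyvar = LETTERS[1:4],
                   var1 = letters[1:4],
                   var2 = as.factor(sample(1:4, 12, TRUE)))

# Cross-tabulate group by level, then convert counts to row-wise proportions
tab <- prop.table(table(data$groupbyvar, data$var2), margin = 1)

# Dropping the "table" class leaves a plain matrix, which converts to a
# wide data frame with one column per observed level
wide <- as.data.frame(unclass(tab))
names(wide) <- paste0("var2_", colnames(tab))

# Move the group labels from the row names into a regular column
wide <- data.frame(groupbyvar = rownames(tab), wide,
                   row.names = NULL, check.names = FALSE)
```

The resulting wide data frame can then be passed on to dplyr verbs like any other data frame.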

The result should be a new data frame with percentages/counts:

  groupbyvar var1_a var1_b var1_c var1_d    var1_1    var1_2    var1_3    var1_4
1          A      1      0      0      0 0.0000000 0.6666667 0.3333333 0.0000000
2          B      0      1      0      0 0.3333333 0.6666667 0.0000000 0.0000000
3          C      0      0      1      0 0.0000000 0.0000000 0.6666667 0.3333333
4          D      0      0      0      1 0.3333333 0.3333333 0.0000000 0.3333333

I want the suffix on each column name to be generated automatically, similar to what model.matrix does with factor variables.
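One possible way to automate this, sketched under the assumption that tidyr (1.1.0 or later) and dplyr (1.0.0 or later) are available, is to reshape to long format, compute proportions per group and variable, and reshape back to wide. The intermediate column names variable, level, and pct are illustrative. Note that this names the var2 columns var2_1, var2_2, and so on, rather than var1_1 as in the manual example, and that only levels actually observed in the data get a column.

```r
library(dplyr)
library(tidyr)

set.seed(100)
data <- data.frame(groupbyvar = LETTERS[1:4],
                   var1 = letters[1:4],
                   var2 = as.factor(sample(1:4, 12, TRUE)))

result <- data %>%
  # Convert to character so both variables can share one "level" column
  mutate(across(c(var1, var2), as.character)) %>%
  # Long format: one row per observation and variable; NAs are dropped,
  # mirroring na.rm = TRUE in the manual version
  pivot_longer(c(var1, var2), names_to = "variable", values_to = "level",
               values_drop_na = TRUE) %>%
  # Count each level within each group and variable
  count(groupbyvar, variable, level) %>%
  group_by(groupbyvar, variable) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # Wide format with <variable>_<level> column names; combinations that
  # never occur in a group are filled with 0
  pivot_wider(names_from = c(variable, level), values_from = pct,
              names_sep = "_", names_sort = TRUE, values_fill = 0)
```

To get counts instead of percentages, the same pipeline should work with the mutate(pct = ...) step removed and values_from = n. Adding more factor variables only requires listing them in the across() and pivot_longer() calls.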
