Thursday 25 August 2016

r - data.table join then add columns to existing data.frame without re-copy




I have two data.tables: X (3 million rows by ~500 columns) and Y (100 rows by two columns).



library(data.table)
set.seed(1)
X <- data.table( a=letters, b=letters, c=letters, g=sample(c(1:5,7),length(letters),replace=TRUE), key="g" )
Y <- data.table( z=runif(6), g=1:6, key="g" )


I want to do a left outer join on X, which I can do by Y[X] thanks to:




Why does X[Y] join of data.tables not allow a full outer join, or a left join?



But I want to add the new column to X without copying X (since it's huge).



Obviously, something like X <- Y[X] works, but unless data.table is far cleverer than I give it credit for (and I give it credit for quite a lot of deviousness!), I believe this copies the whole of X.



X[ , z:= Y[X,z]$z ] works, but is kludgy and doesn't scale well to more than one column.



How do I store the results of a merge back into the retained data.table in an efficient (both in terms of copies and in terms of programmer time) way?


Answer




This is easy to do:



X[Y, z := i.z]


This works because Y[X] and X[Y] differ here only for rows of X whose key value is missing from Y. For those rows you presumably want z to be NA, which is exactly what the assignment above produces. Since := assigns by reference, X is not copied.
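As a rough check of the claim above, the sketch below rebuilds the X and Y from the question, records what Y[X] would give for z, and then runs the update join. The variable names expected and addr_before are introduced here only for illustration. address() is data.table's helper for inspecting an object's memory address, used as an informal sign that X was not reallocated.

```r
library(data.table)

set.seed(1)
X <- data.table(a = letters, b = letters, c = letters,
                g = sample(c(1:5, 7), length(letters), replace = TRUE),
                key = "g")
Y <- data.table(z = runif(6), g = 1:6, key = "g")

# What the left join Y[X] would produce for z (this builds a new table).
expected <- Y[X]$z

# Remember where X lives before the update (illustrative check only).
addr_before <- address(X)

# Update join: add z to X by reference, matching on the shared key g.
X[Y, z := i.z]

# Rows of X with g == 7 have no match in Y, so their z is NA.
same_values  <- identical(X$z, expected)
same_address <- identical(address(X), addr_before)
```

If both same_values and same_address come out TRUE, the update join gave the same z column as Y[X] without replacing X.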



It would also work just as well for many variables:



X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]
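A small made-up example of the multi-column form follows. The table Y2 and its columns z1 and z2 are hypothetical and exist only to illustrate the pattern, as does the id column in this toy X.

```r
library(data.table)

# Toy stand-ins for the real tables; both are keyed on g.
X  <- data.table(id = c("a", "b", "c", "d"),
                 g  = c(1L, 2L, 2L, 7L),
                 key = "g")
Y2 <- data.table(g  = 1:3,
                 z1 = c(0.1, 0.2, 0.3),
                 z2 = c("lo", "mid", "hi"),
                 key = "g")

# Add both columns to X by reference in a single join.
X[Y2, `:=`(z1 = i.z1, z2 = i.z2)]

# The row with g == 7 has no match in Y2, so z1 and z2 are NA there.
print(X)
```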






Since you want the result of Y[X], you can also add the argument nomatch=0 (as @mnel points out) so that key values in Y that do not occur in X do not produce NAs. That is:



X[Y, z := i.z, nomatch=0]






From the NEWS for data.table




**********************************************
**                                          **
**  CHANGES IN DATA.TABLE VERSION 1.7.10    **
**                                          **
**********************************************

NEW FEATURES

o   The prefix i. can now be used in j to refer to join inherited
    columns of i that are otherwise masked by columns in x with
    the same name.

