Using tapply, ave functions for ff vectors in R
I have been trying to use tapply, ave and ddply to create statistics by group of variables (age, sex). I haven't been able to use the above-mentioned R commands successfully.
library("ff")
df <- as.ffdf(data.frame(a=c(1,1,1:3,1:5), b=c(10:1), c=(1:10)))
tapply(df$a, df$b, length)
The error message is:
Error in as.vmode(value, vmode) : argument "value" is missing, with no default
or
Error in byMean(df$b, df$a) : object 'index' not found
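For reference, here is a small sketch of what these calls return on an ordinary in-RAM data frame with the same columns. It uses base R only and does not touch ff; the names `ram_df`, `n_by_a` and `mean_b_by_a` are made up for illustration, and the grouping is done on `a` rather than `b` so that groups hold more than one row.

```r
# Plain in-RAM data frame with the same columns as the ffdf in the question.
ram_df <- data.frame(a = c(1, 1, 1:3, 1:5), b = 10:1, c = 1:10)

# tapply: one value per group (here, the number of rows for each value of a).
n_by_a <- tapply(ram_df$b, ram_df$a, length)

# ave: the group statistic repeated on every row of that group.
mean_b_by_a <- ave(ram_df$b, ram_df$a, FUN = mean)
```

The per-group result from `tapply` and the per-row result from `ave` are what any ff-based replacement would need to reproduce.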
There is currently no tapply or ave for ff_vectors implemented in package ff. What you can do is use the functionality in ffbase. Let's elaborate on a bigger dataset:
require(ffbase)
a <- ffrep.int(ff(1:100000), times=500) ## 50mio records on disk - not in RAM
b <- ffrandom(n=length(a), rfun = runif)
c <- ffseq_len(length(a))
df <- ffdf(a = a, b = b, c = c) ## on disk
dim(df)
For a simple aggregation method, you can use binned_sum, from which you can extract the length as follows. Mind that binned_sum needs an ff factor object in bin, which can be obtained by calling as.character.ff as shown.
df$groupbyfactor <- as.character(df$a)
agg <- binned_sum(x=df$b, bin=df$groupbyfactor, nbins = length(levels(df$groupbyfactor)))
head(agg)
agg[, "count"]
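To make the idea of a per-bin count and sum concrete, here is a base R stand-in on a tiny in-RAM vector. It is not ffbase code: `grp`, `val` and `per_bin` are illustrative names, and only the "count" column is confirmed by the snippet above, so the "sum" column name here is an assumption.

```r
# Tiny in-RAM example: a grouping factor and the values to aggregate.
grp <- factor(rep(1:3, times = c(4, 2, 2)))
val <- c(10, 9, 8, 5, 7, 4, 6, 3)

# One row per bin holding the number of values and their sum.
per_bin <- cbind(
  count = as.vector(table(grp)),
  sum   = as.vector(tapply(val, grp, sum))
)
rownames(per_bin) <- levels(grp)

# A group mean follows from the two columns without another pass over the data.
group_mean <- per_bin[, "sum"] / per_bin[, "count"]
```

Keeping counts and sums per bin is what makes this kind of aggregation cheap: both can be accumulated chunk by chunk and combined at the end.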
For more complex aggregations, you can use ffdfdply in ffbase. You can combine it with data.table statements like this:
require(data.table)
agg <- ffdfdply(df, split=df$groupbyfactor, FUN=function(x){
  x <- as.data.table(x)
  result <- x[, list(b.mean = mean(b), b.median = median(b), b.length = length(b), whatever = b[c == max(c)][1]), by = list(a)]
  result <- as.data.frame(result)
  result
})
class(agg)
agg <- as.data.frame(agg) ## puts the data in RAM!
This puts your data in RAM in chunks of groups of split elements, on which you can apply a function, such as data.table statements, that requires the data to be in RAM. The results of all the chunks on which the function was applied are then combined into a new ffdf, so that you can use it further, or put it into RAM if your RAM allows that size.
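The chunking idea can be sketched in base R on an in-RAM data frame. This is only an illustration of the split-apply-combine pattern, not how ffbase implements it: `chunked_apply` and its `max_rows` argument are hypothetical, and whole groups are assigned to chunks so that no group is split across two chunks.

```r
# Hypothetical helper, not part of ffbase: group-aware chunked aggregation.
chunked_apply <- function(data, split_by, fun, max_rows) {
  sizes <- table(split_by)  # rows per group
  # Assign whole groups to chunks so each chunk holds about max_rows rows.
  chunk_id <- ceiling(cumsum(as.vector(sizes)) / max_rows)
  group_chunks <- split(names(sizes), chunk_id)
  pieces <- lapply(group_chunks, function(groups) {
    in_chunk <- data[as.character(split_by) %in% groups, , drop = FALSE]
    fun(in_chunk)  # fun only ever sees one chunk at a time
  })
  do.call(rbind, pieces)  # combine the per-chunk results
}

ram_df <- data.frame(a = c(1, 1, 1:3, 1:5), b = 10:1, c = 1:10)
agg <- chunked_apply(
  ram_df, split_by = ram_df$a, max_rows = 4,
  fun = function(x) {
    m <- tapply(x$b, x$a, mean)
    n <- tapply(x$b, x$a, length)
    data.frame(a = as.numeric(names(m)),
               b.mean = as.vector(m),
               b.length = as.vector(n))
  }
)
```

Because each group lands in exactly one chunk, any per-group statistic computed inside `fun` is complete, which is why functions such as median work in this setup.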
The sizes of the chunks are controlled by getOption("ffbatchbytes"). So the more RAM you have, the better, as it allows more data in each chunk in RAM.
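A minimal sketch of raising that option and estimating what it means for the example data. The 512 MB figure is only illustrative, and the rows-per-chunk number is a rough estimate that assumes 4-byte integers for a and c and 8-byte doubles for b, ignoring the extra factor column and any overhead.

```r
# Allow roughly 512 MB per chunk (an illustrative figure, not a recommendation).
# With ff in use, set this after the package has been loaded.
old <- options(ffbatchbytes = 512 * 1024^2)
getOption("ffbatchbytes")

# Rough rows-per-chunk estimate: 4 + 8 + 4 = 16 bytes per row (assumed sizes).
bytes_per_row <- 4 + 8 + 4
rows_per_chunk <- floor(getOption("ffbatchbytes") / bytes_per_row)
```

The previous setting is kept in `old`, so it can be restored with `options(old)` once the aggregation is done.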