r - How to calculate Euclidean distance (and save only summaries) for large data frames
I've written a short 'for' loop to find the minimum Euclidean distance between each row in a data frame and all other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, while nothing large is held in memory, it is very slow for big matrices (my use case of ~150k rows is still running).

I'm wondering whether anyone can advise me, or point me in the right direction, in terms of vectorising my function, using 'apply' or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.

Thanks in advance (and for your patience).
require(proxy)

df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))

min.dist <- function(df) {
    # data frame to hold the results
    all.min.dist <- data.frame()
    # set up the loop
    for (k in 1:nrow(df)) {
        # calculate the dissimilarity between each row and all other rows
        df.dist <- dist(df[k, ], df[-k, ])
        # find the minimum distance
        min.dist <- min(df.dist)
        # row name at the minimum distance (the id of the nearest point)
        closest.row <- row.names(df)[-k][which.min(df.dist)]
        # combine the outputs
        all.min.dist <- rbind(all.min.dist,
                              data.frame(orig_row    = row.names(df)[k],
                                         dist        = min.dist,
                                         closest_row = closest.row))
    }
    # return the results
    return(all.min.dist)
}

# example
min.dist(df)
This should get you started. It uses fast matrix operations and avoids growing the output object inside the loop, both as suggested in the comments.
min.dist <- function(df) {
    # dft is the transposed data matrix (one column per site), so each
    # distance computation is a fast column-wise operation; because the
    # matrix is transposed, site names are its column names
    which.closest <- function(k, dft) {
        # squared Euclidean distances from site k to every other site
        d <- colSums((dft[, -k] - dft[, k]) ^ 2)
        m <- which.min(d)
        data.frame(orig_row    = colnames(dft)[k],
                   dist        = sqrt(d[m]),
                   closest_row = colnames(dft)[-k][m])
    }
    do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}
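For example, on the 10-row data frame from the question (the seed value below is arbitrary, chosen only to make the run reproducible):

set.seed(42)  # arbitrary seed, for reproducibility only
df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))
min.dist(df)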
If this is still too slow, a suggested improvement is to compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage; a sketch of that idea follows.
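Here is a minimal sketch of that blocked approach (the function name min.dist.blocked and the block.size argument are illustrative, not part of the original answer). It uses the identity |x - y|^2 = |x|^2 + |y|^2 - 2*x.y to compute a whole block of distances with one matrix product, so each pass materialises a block.size-by-nrow(df) matrix and the block size directly controls the memory footprint:

# Sketch only: blocked nearest-neighbour search, assuming a numeric df.
min.dist.blocked <- function(df, block.size = 1000) {
    m  <- as.matrix(df)
    n  <- nrow(m)
    rs <- rowSums(m ^ 2)  # precomputed squared norms |x|^2
    out <- vector("list", ceiling(n / block.size))
    for (i in seq_along(out)) {
        idx <- ((i - 1) * block.size + 1):min(i * block.size, n)
        # squared distances for the whole block: |x|^2 + |y|^2 - 2*x.y
        d2 <- outer(rs[idx], rs, "+") -
              2 * tcrossprod(m[idx, , drop = FALSE], m)
        d2[cbind(seq_along(idx), idx)] <- Inf  # exclude self-distances
        j <- max.col(-d2)                      # column index of each row minimum
        out[[i]] <- data.frame(
            orig_row    = row.names(df)[idx],
            dist        = sqrt(pmax(d2[cbind(seq_along(idx), j)], 0)),
            closest_row = row.names(df)[j])
    }
    do.call(rbind, out)
}

The pmax(..., 0) guards against tiny negative squared distances caused by floating-point rounding in the matrix-product formulation.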
Edit: also read https://stackoverflow.com/a/16670220/1201032