r - How to calculate Euclidean distance (and save only summaries) for large data frames
I've written a short 'for' loop to find the minimum Euclidean distance between each row in a data frame and all other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, while nothing large is held in memory, it is very slow for big matrices (my use case of ~150k rows is still running).

I'm wondering whether anyone can advise me, or point me in the right direction, in terms of vectorising my function, using 'apply' or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.

Thanks in advance (and for your patience).
require(proxy)

df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))

min.dist <- function(df) {
    # data frame to hold the results
    all.min.dist <- data.frame()
    # set up the loop
    for (k in 1:nrow(df)) {
        # calculate the dissimilarity between each row and all other rows
        df.dist <- dist(df[k, ], df[-k, ])
        # find the minimum distance
        min.dist <- min(df.dist)
        # row name at the minimum distance (the id of the nearest point)
        closest.row <- row.names(df)[-k][which.min(df.dist)]
        # combine the outputs
        all.min.dist <- rbind(all.min.dist,
                              data.frame(orig_row    = row.names(df)[k],
                                         dist        = min.dist,
                                         closest_row = closest.row))
    }
    # return the results
    return(all.min.dist)
}

# example
min.dist(df)
This should get you started. It uses fast matrix operations and avoids growing the output object inside the loop, both as suggested in the comments.
min.dist <- function(df) {
    # dft is the transposed data matrix (one column per site), so each
    # distance computation is a fast column-wise operation; because the
    # matrix is transposed, site names are its column names
    which.closest <- function(k, dft) {
        # squared Euclidean distances from site k to every other site
        d <- colSums((dft[, -k] - dft[, k]) ^ 2)
        m <- which.min(d)
        data.frame(orig_row    = colnames(dft)[k],
                   dist        = sqrt(d[m]),
                   closest_row = colnames(dft)[-k][m])
    }
    do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}
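For example, on the 10-row data frame from the question (the seed value below is arbitrary, chosen only to make the run reproducible):

set.seed(42)  # arbitrary seed, for reproducibility only
df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))
min.dist(df)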
If this is still too slow, a suggested improvement is to compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage; a sketch of that idea follows.
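Here is a minimal sketch of that blocked approach (the function name min.dist.blocked and the block.size argument are illustrative, not part of the original answer). It uses the identity |x - y|^2 = |x|^2 + |y|^2 - 2*x.y to compute a whole block of distances with one matrix product, so each pass materialises a block.size-by-nrow(df) matrix and the block size directly controls the memory footprint:

# Sketch only: blocked nearest-neighbour search, assuming a numeric df.
min.dist.blocked <- function(df, block.size = 1000) {
    m  <- as.matrix(df)
    n  <- nrow(m)
    rs <- rowSums(m ^ 2)  # precomputed squared norms |x|^2
    out <- vector("list", ceiling(n / block.size))
    for (i in seq_along(out)) {
        idx <- ((i - 1) * block.size + 1):min(i * block.size, n)
        # squared distances for the whole block: |x|^2 + |y|^2 - 2*x.y
        d2 <- outer(rs[idx], rs, "+") -
              2 * tcrossprod(m[idx, , drop = FALSE], m)
        d2[cbind(seq_along(idx), idx)] <- Inf  # exclude self-distances
        j <- max.col(-d2)                      # column index of each row minimum
        out[[i]] <- data.frame(
            orig_row    = row.names(df)[idx],
            dist        = sqrt(pmax(d2[cbind(seq_along(idx), j)], 0)),
            closest_row = row.names(df)[j])
    }
    do.call(rbind, out)
}

The pmax(..., 0) guards against tiny negative squared distances caused by floating-point rounding in the matrix-product formulation.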
Edit: also read https://stackoverflow.com/a/16670220/1201032