regex - Return original search terms for grep in R
I have a list of items and a list of search terms, and I am trying to do two things:
- Search through the items for matches of any of the search terms, and return TRUE iff a match is found.
- For the items where TRUE is returned (i.e., there is a match), return the original search term that was matched in step 1.
So, given the following data frame:
            items
    1        alex
    2 alex person
    3        test
    4       false
    5       cathy
And the following list of search terms:
    "alex" "bob" "cathy" "derrick" "erica" "ferdinand"
I would like to create the following output:
            items matches original
    1        alex    TRUE     alex
    2 alex person    TRUE     alex
    3        test   FALSE     <NA>
    4       false   FALSE     <NA>
    5       cathy    TRUE    cathy
Step 1 is straightforward, but I am having trouble with step 2. To create the 'matches' column, I use grepl() to create a variable that is TRUE if the row in d$items matches one of the search terms, and FALSE otherwise.
For step 2, I thought I should be able to use grep() while specifying value = TRUE, as shown in the code below. However, this returns the wrong value: rather than returning the original search term that grep matched, it returns the value of the item that was matched. So I get the following output:
            items matches    original
    1        alex    TRUE        alex
    2 alex person    TRUE alex person
    3        test   FALSE        <NA>
    4       false   FALSE        <NA>
    5       cathy    TRUE       cathy
This is the code I am using right now. Any thoughts would be much appreciated!
    # Dummy data and search terms
    d = data.frame(items = c("alex", "alex person", "this test", "false", "this cathy"))
    searchterms = c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

    # One alternation over all search terms; a term must not sit between letters
    pattern = paste("(^| |[^a-zA-Z])", searchterms, "($| |[^a-zA-Z])",
                    sep = "", collapse = "|")

    # Return TRUE iff a search term is found in the items column
    d$matches = grepl(pattern, d[, 1], ignore.case = TRUE)

    # Subset the data to the rows that matched
    dmatched = d[d$matches == TRUE, ]

    # This is where the problem is: value = TRUE returns the matched item,
    # not the search term that matched it
    dmatched$original = grep(pattern, dmatched[, 1], ignore.case = TRUE, value = TRUE)

    d$original[d$matches == TRUE] = dmatched$original
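For reference, the behaviour described in the question can be reproduced in isolation. The following is a minimal base R sketch, not part of the original post; the vectors `items` and `terms` are illustrative stand-ins for the data above, and the `\\b` word-boundary pattern is a simplification of the question's pattern. It shows that grep() with value = TRUE returns the matching elements of the searched vector, whereas regexpr() combined with regmatches() returns the substring that actually matched.

```r
# Illustrative stand-ins for the data in the question (assumed names)
items <- c("alex", "alex person", "this test", "false", "this cathy")
terms <- c("alex", "bob", "cathy")

# \\b marks a word boundary, so a term is not matched between letters
pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")

# value = TRUE returns the elements of `items` that match, not the terms
matched_items <- grep(pattern, items, ignore.case = TRUE, perl = TRUE, value = TRUE)

# regexpr() locates the first match in each element (-1 when there is none);
# regmatches() extracts the matched text for the elements that did match
hits <- regexpr(pattern, items, ignore.case = TRUE, perl = TRUE)
original <- rep(NA_character_, length(items))
original[hits != -1] <- regmatches(items, hits)
```

Note that with ignore.case = TRUE the extracted text keeps the spelling used in the item, so it may differ in case from the search term, and only the first match per item is returned.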
This may not be exactly what you want, but you can use qdap's termco function for this. It also covers the case where you have 2 names in the same sentence:
    library(qdap)
    termco(d$items, 1:nrow(d), searchterms)

    ## > termco(d$items, 1:nrow(d), searchterms)
    ##   nrow(d word.count       alex bob     cathy derrick erica ferdinand
    ## 1      1          1 1(100.00%)   0         0       0     0         0
    ## 2      2          4  1(25.00%)   0         0       0     0         0
    ## 3      3          4          0   0         0       0     0         0
    ## 4      4          1          0   0         0       0     0         0
    ## 5      5          3          0   0 1(33.33%)       0     0         0
To get what you're after with qdap you can use:
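    # Raw counts: one row per item, one column per search term
    dat <- termco(d$items, 1:nrow(d), searchterms)$raw

    # For each item, paste together the names of the terms that occur in it
    terms <- character()
    for (i in 3:ncol(dat)) {
        terms <- paste(terms, ifelse(dat[, i] == 1, colnames(dat)[i], ""))
    }

    # TRUE when at least one term was counted for the item
    d$matches <- as.logical(rowSums(dat[, -c(1:2)]))

    # Tidy the whitespace, separate multiple terms with commas, and use NA for no match
    x <- gsub(" ", ", ", clean(Trim(terms)))
    d$original <- replacer(x, "", NA)

    ## > d
    ##         items matches original
    ## 1        alex    TRUE     alex
    ## 2 alex person    TRUE     alex
    ## 3        test   FALSE     <NA>
    ## 4       false   FALSE     <NA>
    ## 5       cathy    TRUE    cathy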
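If installing qdap is not an option, the same idea (test each term separately, then collect the terms that hit) can be sketched in base R. This is only an illustration under stated assumptions, not the answer's method: the names `hit_matrix` and `result` are made up for the example, the extra item "alex and bob" is added to show the two-names case, and the search terms are assumed to contain no regex metacharacters.

```r
# Illustrative data; the last item is added to show two names in one item
items <- c("alex", "alex person", "this test", "false", "this cathy", "alex and bob")
terms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# One column per term: TRUE where that term occurs as a whole word in the item
hit_matrix <- sapply(terms, function(term) {
  grepl(paste0("\\b", term, "\\b"), items, ignore.case = TRUE, perl = TRUE)
})

# An item matches when at least one term was found in it
matches <- rowSums(hit_matrix) > 0

# Collapse the matched terms for each item; NA when nothing matched
original <- apply(hit_matrix, 1, function(row) {
  if (any(row)) paste(terms[row], collapse = ", ") else NA_character_
})

result <- data.frame(items, matches, original, stringsAsFactors = FALSE)
```

Because each term is tested on its own, the value stored in `original` is the search term exactly as written in `terms`, regardless of how it is capitalised in the item.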