python - Creating a Term Document Matrix from Text File


I'm trying to read a text file and create a term-document matrix using the textmining package. I can create the term-document matrix, but I have to add each line one by one. The problem is that I want to include the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.

import textmining

def term_document_matrix_roy_1():
    with open("data_set.txt") as f:
        # readlines() returns the entire content as a list of lines
        reading_file_line = f.readlines()
        print(reading_file_line)

        # strip the trailing newline from every line
        reading_file_info = [item.rstrip('\n') for item in reading_file_line]
        print(reading_file_info)
        print(reading_file_info[1])  # second line
        print(reading_file_info[2])  # third line

        tdm = textmining.TermDocumentMatrix()
        #tdm.add_doc(reading_file_info)  # gives an error because readlines() returns a list

        # each line is currently added as a separate document
        tdm.add_doc(reading_file_info[0])
        tdm.add_doc(reading_file_info[1])
        tdm.add_doc(reading_file_info[2])

        for row in tdm.rows(cutoff=1):
            print(row)

Sample text file: "data_set.txt" contains the following:

lets write python code

thus far, book has discussed process of ad hoc retrieval.

along way study important machine learning techniques.

The output is the term-document matrix, i.e. how many times a specific word appears in each document. Output image: http://postimg.org/image/eidddlkld/
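To make the shape of that output concrete, here is a minimal stdlib-only sketch of a term-document matrix for the three sample lines. It does not use the textmining package: `tokenize` and `build_rows` are hypothetical helpers, and the tokenization rule (lowercase, letters only) and the sorted word order are assumptions, so the real package may order or count terms differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simple rule)."""
    return re.findall(r"[a-z]+", text.lower())


def build_rows(documents):
    """Return a header row of words, then one row of counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [vocabulary]
    for count in counts:
        # Counter returns 0 for words that do not occur in this document.
        rows.append([count[word] for word in vocabulary])
    return rows


# The three sample lines from data_set.txt, one document per line.
documents = [
    "lets write python code",
    "thus far, book has discussed process of ad hoc retrieval.",
    "along way study important machine learning techniques.",
]

for row in build_rows(documents):
    print(row)
```

The first row lists the vocabulary, and each following row holds the counts for one document, so three documents give three count rows.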


If I'm understanding correctly, you're currently adding each line of your file as a separate document. To add the whole file, you can concatenate the lines and add them all at once.

tdm = textmining.TermDocumentMatrix()
#tdm.add_doc(reading_file_info)  # gives an error because readlines() returns a list
tdm.add_doc(' '.join(reading_file_info))
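As a rough illustration of why the join helps, the sketch below shows that reading line by line produces a list while joining produces a single string. It assumes, as the error in the question suggests, that `add_doc` wants one string per document. The temporary file stands in for data_set.txt, and the `f.read()` variant is offered only as an alternative, not as something from the original answer.

```python
import os
import tempfile

sample = (
    "lets write python code\n"
    "thus far, book has discussed process of ad hoc retrieval.\n"
    "along way study important machine learning techniques.\n"
)

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path) as f:
        # One string per line: this is a list, not a single document.
        lines = [line.rstrip("\n") for line in f]
    # Concatenate the lines into one string for the whole file.
    joined = " ".join(lines)

    with open(path) as f:
        # Alternative: read everything at once and normalize the whitespace.
        whole = " ".join(f.read().split())
finally:
    os.remove(path)

print(type(lines).__name__, type(joined).__name__)
print(joined == whole)
```

Either string could then be passed to a single `add_doc` call, which would give one document row for the entire file.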

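If you are looking for multiple matrices, you'll end up getting only one row in each, since there is only one document per matrix, unless you have some other way of splitting a line into separate documents. You may want to re-think whether that is what you want. Nevertheless, I think this code will do it for you: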

with open("txt_files/input_data_set.txt") as f:
    tdms = []
    # build one matrix per line of the file
    for line in f:
        tdm = textmining.TermDocumentMatrix()
        tdm.add_doc(line.strip())
        tdms.append(tdm)

    for tdm in tdms:
        for row in tdm.rows(cutoff=1):
            print(row)

I haven't been able to test this code, so the output might not be quite right. Hopefully it will get you on your way.

