python - Creating a Term Document Matrix from Text File
I'm trying to read one text file and create a term-document matrix using the textmining package. I can create the term-document matrix, but I have to add each line one at a time. The problem is that I want to include the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.
    import textmining

    def term_document_matrix_roy_1():
        # -----------------------------------------
        with open("data_set.txt") as f:
            reading_file_line = f.readlines()  # entire content, returned as a list
            print(reading_file_line)  # list of raw lines
            # strip the trailing newline from every line
            reading_file_info = [item.rstrip('\n') for item in reading_file_line]
            print(reading_file_info)
            print(reading_file_info[1])  # line at index 1
            print(reading_file_info[2])  # line at index 2
        # -----------------------------------------
        tdm = textmining.TermDocumentMatrix()
        # tdm.add_doc(reading_file_info)  # gives an error because readlines() returns a list
        tdm.add_doc(reading_file_info[0])
        tdm.add_doc(reading_file_info[1])
        tdm.add_doc(reading_file_info[2])
        for row in tdm.rows(cutoff=1):
            print(row)

Sample text file: "data_set.txt" contains the following information:
lets write python code
thus far, book has discussed process of ad hoc retrieval.
along way study important machine learning techniques.
The output should be a term-document matrix, i.e. how many times each specific word appears. Output image: http://postimg.org/image/eidddlkld/

If I'm understanding you correctly, you're currently adding each line of your file as a separate document. To add the whole file, you can just concatenate the lines and add them all at once.
    tdm = textmining.TermDocumentMatrix()
    # tdm.add_doc(reading_file_info)  # gives an error because readlines() returns a list
    tdm.add_doc(' '.join(reading_file_info))

If you are looking for multiple matrices, you'll end up getting just one row in each, since there will be only one document, unless you have another way of splitting a line into separate documents. You may want to re-think whether that is what you want. Nevertheless, I think this code will do it for you:
    with open("txt_files/input_data_set.txt") as f:
        tdms = []
        for line in f:
            # one matrix per line, each holding a single document
            tdm = textmining.TermDocumentMatrix()
            tdm.add_doc(line.strip())
            tdms.append(tdm)

    for tdm in tdms:
        for row in tdm.rows(cutoff=1):
            print(row)

I haven't been able to test this code, so the output might not be right. Hopefully it will get you on your way.
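To see the difference between "one document per line" and "whole file as one document" without needing textmining installed, here is a minimal standard-library sketch. The helpers `tokenize` and `term_document_matrix` are hypothetical stand-ins I made up for illustration, not textmining's API. The `cutoff` handling is only my assumption about what `rows(cutoff=...)` does (keep terms that appear in at least that many documents), and the tokenizer is a deliberate simplification.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out alphabetic words (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_matrix(documents, cutoff=1):
    """Hypothetical stand-in: header row of terms, then one count row per document.

    Assumption: a term is kept only if it occurs in at least `cutoff` documents.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    # number of documents each term occurs in
    doc_frequency = Counter(term for c in counts for term in c)
    terms = sorted(t for t, n in doc_frequency.items() if n >= cutoff)
    return [terms] + [[c[t] for t in terms] for c in counts]

# The three sample lines from data_set.txt
lines = [
    "lets write python code",
    "thus far, book has discussed process of ad hoc retrieval.",
    "along way study important machine learning techniques.",
]

per_line = term_document_matrix(lines)                 # three documents
whole_file = term_document_matrix([" ".join(lines)])   # one joined document

for row in whole_file:
    print(row)
```

With the joined text you get a header row plus a single count row, whereas the per-line version gives one count row per line. Summing the per-line rows column by column gives the same counts as the single whole-file row, so which one you want depends on whether a "document" is a line or the entire file.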