What is the way to represent factor variables in scikit-learn while using Random Forests?
I am solving a classification problem using Random Forests, and I have decided to use the Python library scikit-learn. I am new to both the Random Forest algorithm and this tool. My data contains many factor variables. I googled and found out that it's not right to give numerical values to factor variables, as in linear regression, because the model will treat them as continuous variables and give a wrong result. I could not find how to deal with factor variables in scikit-learn. Please tell me which options to use, or point me to a document where I can find this.
If you are using a pandas data frame, you can use the get_dummies function to accomplish this. Here's an example:
    import pandas as pd

    my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
    df = pd.DataFrame(my_data, columns=['var1', 'var2'])

    # One indicator column per distinct value of var1
    dummy_ranks = pd.get_dummies(df['var1'], prefix='var1_')
    print(dummy_ranks)

       var1__a  var1__b  var1__c  var1__d
    0        1        0        0        0
    1        0        1        0        0
    2        0        0        1        0
    3        0        0        0        1
    4        1        0        0        0

    [5 rows x 4 columns]
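To connect this back to the question, here is a rough sketch of how the dummy columns might be fed to a random forest. It reuses the toy data from the example above; the class labels in `y`, the new row in `new_df`, and the `n_estimators` and `random_state` settings are made up for illustration and do not come from the original question. It also assumes a pandas version whose get_dummies accepts the `dtype` argument.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same toy factor columns as in the example above.
my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])
y = [0, 1, 0, 1, 0]  # hypothetical class labels, for illustration only

# One-hot encode both factor columns at once; dtype=int gives 0/1 columns.
X = pd.get_dummies(df, columns=['var1', 'var2'], dtype=int)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# New rows have to be encoded the same way and then aligned to the
# training columns, because levels missing from the new data would
# otherwise produce a frame with fewer columns.
new_df = pd.DataFrame([['b', 'c']], columns=['var1', 'var2'])
X_new = pd.get_dummies(new_df, columns=['var1', 'var2'], dtype=int)
X_new = X_new.reindex(columns=X.columns, fill_value=0)

pred = clf.predict(X_new)
print(pred)
```

The reindex step is the part that is easy to forget: get_dummies only creates columns for the levels it sees, so the training and prediction frames need to be lined up explicitly.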