text mining - What is the way to represent factor variables in scikit-learn while using Random Forests?


I am solving a classification problem using random forests, and I have decided to use the Python library scikit-learn. I am new to both the random forest algorithm and this tool. My data contains many factor variables. I googled and found out that it's not right to give numerical values to factor variables in linear regression, because the model will treat them as continuous variables and give a wrong result. However, I could not find how to deal with factor variables in scikit-learn. Please tell me which options to use, or point me to a document where I can find this.

If you are using a pandas DataFrame, you can use the get_dummies function to accomplish this. Here's an example:

    import pandas as pd

    my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
    df = pd.DataFrame(my_data, columns=['var1', 'var2'])

    # One indicator column per level of var1.
    # dtype=int keeps the 0/1 output on recent pandas versions,
    # which otherwise return boolean columns.
    dummy_ranks = pd.get_dummies(df['var1'], prefix='var1_', dtype=int)
    print(dummy_ranks)

       var1__a  var1__b  var1__c  var1__d
    0        1        0        0        0
    1        0        1        0        0
    2        0        0        1        0
    3        0        0        0        1
    4        1        0        0        0
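To connect this back to the original question, here is a minimal sketch of feeding the dummy-encoded columns into a scikit-learn random forest. The labels in `y`, the `new_rows` frame, and the hyperparameters are made up for illustration and are not part of the original answer. The snippet assumes a pandas version in which get_dummies accepts the `columns` and `dtype` arguments.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two factor columns and a made-up binary label.
df = pd.DataFrame(
    [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']],
    columns=['var1', 'var2'],
)
y = [0, 1, 0, 1, 0]

# One-hot encode every factor column at once; dtype=int gives 0/1 columns.
X = pd.get_dummies(df, columns=['var1', 'var2'], dtype=int)

# Illustrative settings only; tune n_estimators for real data.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# New rows must be encoded the same way and aligned to the training columns.
# Levels that do not occur in the new data are filled with 0.
new_rows = pd.DataFrame([['a', 'c'], ['d', 'a']], columns=['var1', 'var2'])
X_new = pd.get_dummies(new_rows, columns=['var1', 'var2'], dtype=int)
X_new = X_new.reindex(columns=X.columns, fill_value=0)

predictions = clf.predict(X_new)
print(predictions)
```

The reindex step matters because get_dummies only creates columns for levels it actually sees, so a new batch of rows can otherwise end up with fewer or differently ordered columns than the training matrix. scikit-learn's own OneHotEncoder is another option when the encoding needs to live inside a pipeline.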
