python - Efficiently creating a dictionary of dictionaries from very large csv data
I have data for different locations, split by date and time, in a comma-separated file. A sample for location 201682 is shown below:
    location  date       time     data
    201682    3/15/2011  1:00:00  10
    201682    3/16/2011  1:00:00  12
    201682    3/15/2011  2:00:00  32
    201682    3/16/2011  2:00:00  31
    201682    3/15/2011  3:00:00  21
    201682    3/16/2011  3:00:00  20
    201682    3/15/2011  4:00:00  45
    201682    3/16/2011  4:00:00  56
    201682    3/15/2011  5:00:00  211
    201682    3/16/2011  5:00:00  198
    201682    3/15/2011  6:00:00  512
    201682    3/16/2011  6:00:00  324
The file runs to millions of lines of data. In order to process the data, I am trying to create a dictionary object in Python, using the location as the key and storing the rest of the data in a list. Here is my (futile) attempt at this:
    import csv

    headers = None
    records = {}
    reader = csv.reader(open(csvfile))
    for row in reader:
        if reader.line_num == 1:
            headers = row[1:]
        else:
            # each new row for a location overwrites the previous one,
            # so only the last row per location survives
            records[row[0]] = dict(zip(headers, row[1:]))

    print records['201682']
The output is shown below:
    {'date':'3/16/2011', 'time':'6:00:00 am', 'data':'324'}
But this is what I wanted the data to look like:
    {['date':'3/15/2011', 'time':'1:00:00 am', 'data':'10'],
     ['date':'3/16/2011', 'time':'1:00:00 am', 'data':'12'],
     ['date':'3/15/2011', 'time':'2:00:00 am', 'data':'32'],
     ['date':'3/16/2011', 'time':'2:00:00 am', 'data':'31'],
     ['date':'3/15/2011', 'time':'3:00:00 am', 'data':'21'],
     ['date':'3/16/2011', 'time':'3:00:00 am', 'data':'20'],
     ['date':'3/15/2011', 'time':'4:00:00 am', 'data':'45'],
     ['date':'3/16/2011', 'time':'4:00:00 am', 'data':'56'],
     ['date':'3/15/2011', 'time':'5:00:00 am', 'data':'211'],
     ['date':'3/16/2011', 'time':'5:00:00 am', 'data':'198'],
     ['date':'3/15/2011', 'time':'6:00:00 am', 'data':'512'],
     ['date':'3/16/2011', 'time':'6:00:00 am', 'data':'324']}
The intention is to store the date, time, and data information for every record in a dictionary, lump all the data for a particular location within a list, and finally create a dictionary of such lists with the location as the key.
How can I code this? Also, is there a more efficient way to do this? The data file can be close to 24 GB in size. [Is there a map-reduce approach in Python with multiple threads? I am very new to the map-reduce paradigm...] Any help is appreciated!
The goal you've described is an end data structure. However, data structures are meant to service queries -- what information are you trying to extract? Without knowing that, it's hard to say what is efficient, or whether map-reduce would be helpful.
That said, it seems the easiest thing is to build the dictionary you've described, but have it contain row ids rather than the row data themselves. That will save space and still allow you to answer queries. If, however, the data set is 24 GB on disk, you'll need more than that to keep it all in RAM. Supposing that, for a given query, getting the row ids is sufficient, I suggest:
    import csv
    # so we can have lists as entries by default
    from collections import defaultdict

    headers = None
    index = {}
    reader = csv.reader(open(csvfile))
    for row in reader:
        if reader.line_num == 1:
            headers = row
            # we'll set up the index as a dictionary with one defaultdict
            # for each of the headers, mapping unique values to the
            # rows that match
            index = dict((header, defaultdict(list)) for header in headers)
        else:
            for header, value in zip(headers, row):
                index[header][value].append(reader.line_num)

    # now, you can find out which rows have, say, 'location' set to a given value
    index['location']['201682']
    # or which rows have 'time' set to '1:00:00 am'
    index['time']['1:00:00 am']
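One nice property of this index is that compound queries reduce to intersecting row-id lists; for example:

    # rows for location 201682 on 3/15/2011
    matching = set(index['location']['201682']) & set(index['date']['3/15/2011'])

And if you do want the exact structure described in the question (a list of row dicts per location), here is a minimal sketch along the same lines -- the only change from your attempt is appending to a defaultdict(list) instead of assigning, which is what caused the overwriting. Keep in mind the caveat above: for a 24 GB file, this in-memory grouping will not fit in RAM.

    import csv
    from collections import defaultdict

    records = defaultdict(list)    # location -> list of row dicts
    headers = None
    reader = csv.reader(open(csvfile))
    for row in reader:
        if reader.line_num == 1:
            headers = row[1:]      # 'date', 'time', 'data'
        else:
            records[row[0]].append(dict(zip(headers, row[1:])))

    print records['201682']       # every record for that location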
That said, you are essentially using Python dictionaries to build an index, and there are tools better suited to this. Off hand, MySQL comes to mind, especially if you're going to be doing a lot of ad-hoc queries. It supports better indexing than a dictionary can offer, and it doesn't suffer the constraint of having to fit in memory.
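MySQL setup is beyond the scope of a quick answer, but purely as an illustration of the same idea, Python's built-in sqlite3 module gives you an on-disk, indexable table with no server at all. (This is a stand-in sketch, not part of the MySQL suggestion itself; the table and file names here are made up.)

    import csv
    import sqlite3

    conn = sqlite3.connect('readings.db')   # stored on disk, not in RAM
    conn.execute('CREATE TABLE IF NOT EXISTS readings '
                 '(location TEXT, date TEXT, time TEXT, data TEXT)')

    reader = csv.reader(open(csvfile))
    next(reader)                             # skip the header row
    conn.executemany('INSERT INTO readings VALUES (?, ?, ?, ?)', reader)

    # a real index makes per-location lookups fast without a full scan
    conn.execute('CREATE INDEX IF NOT EXISTS idx_location ON readings (location)')
    conn.commit()

    rows = conn.execute('SELECT date, time, data FROM readings '
                        'WHERE location = ?', ('201682',)).fetchall()

The database lives on disk, so the 24 GB constraint is handled by the storage engine rather than by Python's memory.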