python - Efficiently creating a dictionary of dictionaries from very large csv data
I have data for different locations, split by date and time, in a comma-separated file. A sample for location 201682 is shown below:
    location  date       time     data
    201682    3/15/2011  1:00:00  10
    201682    3/16/2011  1:00:00  12
    201682    3/15/2011  2:00:00  32
    201682    3/16/2011  2:00:00  31
    201682    3/15/2011  3:00:00  21
    201682    3/16/2011  3:00:00  20
    201682    3/15/2011  4:00:00  45
    201682    3/16/2011  4:00:00  56
    201682    3/15/2011  5:00:00  211
    201682    3/16/2011  5:00:00  198
    201682    3/15/2011  6:00:00  512
    201682    3/16/2011  6:00:00  324
The file runs to millions of lines of data. In order to process the data, I am trying to create a dictionary object in Python, using the location as the key and storing the rest of the data in a list. Here is my (futile) attempt at this:
    import csv

    headers = None
    records = {}
    reader = csv.reader(open(csvfile))
    for row in reader:
        if reader.line_num == 1:
            headers = row[1:]
        else:
            # each new row for a location overwrites the previous one,
            # so only the last row per location survives
            records[row[0]] = dict(zip(headers, row[1:]))

    print records['201682']
The output is shown below:
    {'date':'3/16/2011', 'time':'6:00:00 am', 'data':'324'}
But this is what I wanted the data to look like:
    {['date':'3/15/2011', 'time':'1:00:00 am', 'data':'10'],
     ['date':'3/16/2011', 'time':'1:00:00 am', 'data':'12'],
     ['date':'3/15/2011', 'time':'2:00:00 am', 'data':'32'],
     ['date':'3/16/2011', 'time':'2:00:00 am', 'data':'31'],
     ['date':'3/15/2011', 'time':'3:00:00 am', 'data':'21'],
     ['date':'3/16/2011', 'time':'3:00:00 am', 'data':'20'],
     ['date':'3/15/2011', 'time':'4:00:00 am', 'data':'45'],
     ['date':'3/16/2011', 'time':'4:00:00 am', 'data':'56'],
     ['date':'3/15/2011', 'time':'5:00:00 am', 'data':'211'],
     ['date':'3/16/2011', 'time':'5:00:00 am', 'data':'198'],
     ['date':'3/15/2011', 'time':'6:00:00 am', 'data':'512'],
     ['date':'3/16/2011', 'time':'6:00:00 am', 'data':'324']}
The intention is to store the date, time, and data information for every record in a dictionary, lump all the data for a particular location within a list, and finally create a dictionary of such lists with the location as the key.
How can I code this? Also, is there a more efficient way to do this? The data file can be close to 24 GB in size. [Is there a map-reduce approach in Python with multiple threads? I am very new to the map-reduce paradigm...] Any help is appreciated!
The goal you've described is an end data structure. However, data structures are meant to service queries -- what information are you trying to extract? Without knowing that, it's hard to say what is efficient, or whether map-reduce would be helpful.
That said, it seems the easiest thing is to build the dictionary you've described, but have it contain row ids rather than the row data themselves. That will save space and still allow you to answer queries. If, however, the data set is 24 GB on disk, you'll need more than that to keep it all in RAM. Supposing that, for a given query, getting the row ids is sufficient, I suggest:
    import csv
    # so we can have lists as entries by default
    from collections import defaultdict

    headers = None
    index = {}
    reader = csv.reader(open(csvfile))
    for row in reader:
        if reader.line_num == 1:
            headers = row
            # we'll set up the index as a dictionary with one defaultdict
            # for each of the headers, mapping unique values to the
            # rows that match
            index = dict((header, defaultdict(list)) for header in headers)
        else:
            for header, value in zip(headers, row):
                index[header][value].append(reader.line_num)

    # now, you can find out which rows have, say, 'location' set to a given value
    index['location']['201682']
    # or which rows have 'time' set to '1:00:00 am'
    index['time']['1:00:00 am']
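One nice property of this index is that compound queries reduce to intersecting row-id lists; for example:

    # rows for location 201682 on 3/15/2011
    matching = set(index['location']['201682']) & set(index['date']['3/15/2011'])

And if you do want the exact structure described in the question (a list of row dicts per location), here is a minimal sketch along the same lines -- the only change from your attempt is appending to a defaultdict(list) instead of assigning, which is what caused the overwriting. Keep in mind the caveat above: for a 24 GB file, this in-memory grouping will not fit in RAM.

    import csv
    from collections import defaultdict

    records = defaultdict(list)    # location -> list of row dicts
    headers = None
    reader = csv.reader(open(csvfile))
    for row in reader:
        if reader.line_num == 1:
            headers = row[1:]      # 'date', 'time', 'data'
        else:
            records[row[0]].append(dict(zip(headers, row[1:])))

    print records['201682']       # every record for that location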
That said, you are essentially using Python dictionaries to build an index, and there are tools better suited to this. Off hand, MySQL comes to mind, especially if you're going to be doing a lot of ad-hoc queries. It supports better indexing than a dictionary can offer, and it doesn't suffer the constraint of having to fit in memory.
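MySQL setup is beyond the scope of a quick answer, but purely as an illustration of the same idea, Python's built-in sqlite3 module gives you an on-disk, indexable table with no server at all. (This is a stand-in sketch, not part of the MySQL suggestion itself; the table and file names here are made up.)

    import csv
    import sqlite3

    conn = sqlite3.connect('readings.db')   # stored on disk, not in RAM
    conn.execute('CREATE TABLE IF NOT EXISTS readings '
                 '(location TEXT, date TEXT, time TEXT, data TEXT)')

    reader = csv.reader(open(csvfile))
    next(reader)                             # skip the header row
    conn.executemany('INSERT INTO readings VALUES (?, ?, ?, ?)', reader)

    # a real index makes per-location lookups fast without a full scan
    conn.execute('CREATE INDEX IF NOT EXISTS idx_location ON readings (location)')
    conn.commit()

    rows = conn.execute('SELECT date, time, data FROM readings '
                        'WHERE location = ?', ('201682',)).fetchall()

The database lives on disk, so the 24 GB constraint is handled by the storage engine rather than by Python's memory.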