Python - lxml and fast_iter eating all the memory
I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had read about potential issues with memory consumption, I already use fast_iter in it, but after the main loop it eats about 8 GB of RAM, even though it doesn't keep any data from the actual XML file.
    from lxml import etree

    def fast_iter(context, func, *args, **kwargs):
        # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        # Author: Liza Daly
        for event, elem in context:
            func(elem, *args, **kwargs)
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    def process_element(elem):
        pass

    context = etree.iterparse("sachsen-latest.osm", tag="node", events=("end", ))
    fast_iter(context, process_element)

I don't get why there is such massive leakage, because the element and the whole context are being deleted in fast_iter(), and at the moment I don't even process the XML data.
Any ideas?
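The problem is the behavior of etree.iterparse(). You would think it only uses memory for each node element, but it turns out it still keeps all the other elements in memory: the tag argument filters which events you receive, not which elements get built. Since you don't clear them, memory ends up blowing up later on, especially when parsing .osm (OpenStreetMap) files and looking for nodes, but more on that later.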
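To make this visible on a small scale, here is a minimal sketch that runs the same cleanup as fast_iter over a tiny in-memory document and counts what is still attached to the root afterwards. The sample data and the helper name leftover_children are made up for illustration; they are not part of the question's code.

```python
from io import BytesIO
from lxml import etree

# Tiny stand-in for an .osm file (hypothetical data): nodes first, then ways.
SAMPLE = (b"<osm>"
          b"<node id='1'/><node id='2'/>"
          b"<way id='10'><nd ref='1'/></way>"
          b"<way id='11'><nd ref='2'/></way>"
          b"</osm>")

def leftover_children(tag=None):
    """Run the fast_iter cleanup and count the elements still under the root."""
    context = etree.iterparse(BytesIO(SAMPLE), events=("end",), tag=tag)
    for event, elem in context:
        elem.clear()
        # Drop already-processed siblings that precede this element.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    # context.root is available once parsing has finished.
    return len(context.root)

print(leftover_children("node"))  # only "node" events: the ways are never cleared
print(leftover_children())        # events for every tag: everything gets cleared
```

With the tag filter, the last node and both ways remain attached to the root (3 children); without it, nothing is left (0). On a real extract the "ways" part is what fills up the RAM.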
The solution I found was not to catch only the node tags but to catch all tags:
    context = etree.iterparse(open(filename,'r'),events=('end',))

and then clear all tags, but only parse the ones you are interested in:
    for (event,elem) in progress.bar(context):
        if elem.tag == 'node':
            pass  # do things here
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

Do keep in mind that this may delete other elements that you are interested in, so make sure to add more ifs where needed. For example (and this is .osm specific), the tags nested inside nodes:
    if elem.tag == 'tag':
        continue
    if elem.tag == 'node':
        for tag in elem.iterchildren():
            pass  # do stuff

The reason why the memory was blowing up later is pretty interesting: .osm files are organized in a way that the nodes come first, then the ways, then the relations. So your code does fine with the nodes at the beginning, and then the memory gets filled as etree goes through the rest of the elements.
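Putting the pieces together, the approach could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name collect_node_tags and the sample document are invented here, and the k/v attributes follow the usual .osm layout for tag elements. Nested tag elements are skipped on their own end event so they are still intact when the enclosing node ends; every other element is cleared as soon as it is finished.

```python
from io import BytesIO
from lxml import etree

def collect_node_tags(source):
    """Return {node id: {k: v}} while freeing every element once it has ended."""
    result = {}
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "tag":
            # Keep nested <tag> elements alive; the parent reads and frees them.
            continue
        if elem.tag == "node":
            result[elem.get("id")] = dict(
                (t.get("k"), t.get("v")) for t in elem.iterchildren("tag")
            )
        # Clear every element (nodes, ways, relations, ...), not just nodes.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return result

# Hypothetical miniature .osm document for trying it out.
sample = BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='name' v='A'/></node>"
    b"<node id='2'/>"
    b"<way id='10'><nd ref='1'/><tag k='highway' v='x'/></way>"
    b"</osm>"
)
print(collect_node_tags(sample))
```

For a real file you would pass the filename (or an open file object) instead of the BytesIO buffer, and probably process each node right away instead of collecting everything in a dict, since the dict itself grows with the number of nodes.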