python - lxml and fast_iter eating all the memory


I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had read about potential issues with memory consumption, I already use fast_iter, but after the main loop it eats about 8 GB of RAM, even though it doesn't keep any data from the actual XML file.

    from lxml import etree

    def fast_iter(context, func, *args, **kwargs):
        # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        # Author: Liza Daly
        for event, elem in context:
            func(elem, *args, **kwargs)
            # Drop the element's own content ...
            elem.clear()
            # ... and remove the already-processed siblings before it.
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    def process_element(elem):
        pass

    context = etree.iterparse("sachsen-latest.osm", tag="node", events=("end",))
    fast_iter(context, process_element)

I don't get why there is such a massive leak, because the element and the whole context are being deleted in fast_iter(), and at the moment I don't even process the XML data.

Any ideas?

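The problem is the behavior of etree.iterparse(). You would think it only uses memory for each node element, but it turns out it still keeps all the other elements in memory. The tag="node" argument only filters which elements are reported as events; the parser still builds every other element into the tree. Since you never clear them, memory ends up blowing up later on, especially when parsing .osm (OpenStreetMap) files and looking only for nodes (more on that later).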
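To make that behavior visible on a small scale, here is a minimal sketch. The tiny XML document and the helper name leftover_children are made up for illustration; only etree.iterparse(), its tag argument, and the root property of the iterator come from lxml. It runs the same cleanup as fast_iter and then counts how many children are still attached to the root element.

```python
from io import BytesIO
from lxml import etree

# Hypothetical miniature .osm-like document: nodes first, then ways.
XML = (b"<osm><node id='1'/><node id='2'/>"
       b"<way id='10'/><way id='11'/><way id='12'/></osm>")

def leftover_children(tag_filter):
    """Apply the fast_iter cleanup, then count children left on the root."""
    context = etree.iterparse(BytesIO(XML), events=("end",), tag=tag_filter)
    for event, elem in context:
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    # After iteration has finished, the parsed tree is available as context.root.
    return len(context.root)

only_nodes = leftover_children("node")  # events for <node> only
all_tags = leftover_children(None)      # events for every element

print("left over with tag='node':", only_nodes)
print("left over with all tags:  ", all_tags)
```

With the tag filter, the way elements are never handed to the loop, so nothing ever clears them and they stay attached to the root. On a real extract with millions of ways and relations, that is where the memory goes.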

The solution I found was not to catch only node tags, but to catch all tags:

    # Binary mode lets lxml detect the encoding from the XML declaration.
    context = etree.iterparse(open(filename, 'rb'), events=('end',))

And then clear all tags, but parse only the ones you are interested in:

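    # progress.bar() is only a progress indicator; iterate over context directly if you don't use one.
    for (event, elem) in progress.bar(context):
        if elem.tag == 'node':
            # do things here
            pass

        # Clear every element, not just the nodes.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context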

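Do keep in mind that this may delete other elements that you are interested in, so make sure to add more ifs where needed. For example (and this is .osm specific), the tags nested inside nodes end before their parent node does, so they must be skipped rather than cleared: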

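    # Inside the loop over context:
    if elem.tag == 'tag':
        # Leave nested tags alone until their parent node has ended.
        continue
    if elem.tag == 'node':
        for tag in elem.iterchildren():
            # do stuff
            pass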

The reason why memory was blowing up only later is pretty interesting: .osm files are organized in a way that nodes come first, then ways, then relations. So your code does fine with the nodes at the beginning, and then memory gets filled as etree goes through the rest of the elements.
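Putting the pieces together, a self-contained sketch might look like the following. The generator name iter_nodes and the sample document are hypothetical and only there so the example runs on its own; the loop itself is the same clear-everything approach described above, with nested tag elements kept until their node has been read.

```python
from io import BytesIO
from lxml import etree

def iter_nodes(source):
    """Yield (node id, {key: value}) pairs while keeping the tree small."""
    context = etree.iterparse(source, events=("end",))
    for event, elem in context:
        if elem.tag == "tag":
            # Keep nested <tag> elements until their parent has ended.
            continue
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.iterchildren("tag")}
            yield elem.get("id"), tags
        # Clear every element (nodes, ways, relations, ...), then drop
        # the siblings that were already processed.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

# Made-up sample in the .osm layout: nodes first, then ways.
SAMPLE = b"""<osm>
  <node id="1" lat="51.0" lon="13.7"><tag k="name" v="Dresden"/></node>
  <node id="2" lat="51.3" lon="12.4"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="primary"/></way>
</osm>"""

result = list(iter_nodes(BytesIO(SAMPLE)))
print(result)
```

For a real file you would pass the file name or an open binary file instead of the BytesIO object. The skipped tag elements inside ways are still released, because elem.clear() on the way removes its children once the way itself has ended.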

