
This page will contain the activity log of the pyFF+ experiments and endeavours.

Memory profiling

The snippet below shows the minimal imports and calls needed to print heap information with heapy while Python code is running.

https://pkgcore.readthedocs.io/en/latest/dev-notes/heapy.html

from guppy import hpy
import code
hp=hpy()
...
# reset the heap counters
hp.setrelheap()
...

# just print the heap somewhere:
h = hp.heap()
log.debug(f"\nheapy: {h}")

# or possibly interrupt the code execution and inspect the hp object:
code.interact(local=dict(globals(), **locals()))

A typical dump in pyFF's main loop then looks like this:

Partition of a set of 117936 objects. Total size = 75634733 bytes.
Index Count  %    Size  % Cumulative % Kind (class / dict of class)
    0  1771  2 59385607 79 59385607 79 bytes
    1 25917 22 8466168 11 67851775 90 dict (no owner)
    2 25388 22 2640352  3 70492127 93 dict of pyff.samlmd.EntitySet
    3 23656 20 2232132  3 72724259 96 str
    4 25388 22 1218624  2 73942883 98 pyff.samlmd.EntitySet
    5  6795  6  380520  1 74323403 98 lxml.etree._Element
    6  2501  2  199024  0 74522427 99 tuple
    7   870  1  154828  0 74677255 99 types.CodeType
    8  1024  1  139264  0 74816519 99 function
    9    45  0  127872  0 74944391 99 dict of module
<196 more rows. Type e.g. '_.more' to view.>

Another way to profile pyFF's memory usage is simply to follow the RES column in top or htop for a long-running pyFF process with a 60 s refresh interval. I normally use this pipeline:



- when update: 
 - load: 
   - edugain.xml 
- when request: 
 - select: 
 - pipe: 
   - when accept application/samlmetadata+xml application/xml: 
     - first 
     - finalize: 
         cacheDuration: PT12H 
         validUntil: P10D 
     - sign: 
         key: cert/sign.key 
         cert: cert/sign.crt 
     - emit application/samlmetadata+xml 
     - break 
   - when accept application/json: 
     - discojson 
     - emit application/json 
     - break


The pipeline loads the eduGAIN feed, which was downloaded using:

$ curl http://mds.edugain.org/ -o edugain.xml
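Instead of watching top by hand, the resident set size can also be read from inside a Linux process. The following is a minimal sketch, not part of pyFF: the function names `parse_vm_rss_kib` and `read_rss_kib` are hypothetical, and it assumes a Linux `/proc` filesystem, where the `VmRSS` line of `/proc/<pid>/status` roughly corresponds to the RES column in top.

```python
import os
import re


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid=None):
    """Read the current resident set size of a process (Linux only)."""
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status", "r", encoding="ascii") as status:
        return parse_vm_rss_kib(status.read())


# Parsing a hand-written sample, so the example does not depend on /proc.
sample_status = "Name:\tpython3\nVmPeak:\t  200000 kB\nVmRSS:\t   81234 kB\nThreads:\t1\n"
sample_rss = parse_vm_rss_kib(sample_status)
```

Calling `read_rss_kib()` once per refresh and logging the value would give a time series that can be compared between the experiments on this page.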

Un/Pickling etree.ElementTree object

Here we demonstrate that externally parsed etree.ElementTree objects can be pickled (serialized) and consumed later in pyFF, without pyFF having to parse the XML itself.

from lxml import etree, objectify
import pickle
# Create pickled datafile
source = open("edugain.xml", "r", encoding="utf-8")
sink = open("edugain.pkl", "w")

t = objectify.parse(source)
p = pickle.dumps(t).decode('latin1')
sink.write(p)

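# Read pickled object back in pyFF
# (io is the pickled text produced above; latin1 maps it back to the original bytes)
def parse_xml(io, base_url=None):
    return pickle.loads(io.encode('latin1'))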

In metadata parser:
t = parse_xml(content) #Instead of parse_xml(unicode_stream(content))

With un/pickling, pyFF starts out at ~800 MB of RES, which slowly grows to a steady 1.2-1.5 GB.
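The latin1 decode/encode pair used in the pickling snippet works because latin1 maps every byte value 0-255 to exactly one code point, so arbitrary pickle bytes survive a round trip through a str. The sketch below illustrates this with the standard library only; the helper names and the sample payload are made up for the illustration and do not come from pyFF.

```python
import pickle


def pickle_to_text(obj):
    """Pickle obj and represent the resulting bytes as a str via latin1."""
    return pickle.dumps(obj).decode("latin1")


def unpickle_from_text(text):
    """Reverse pickle_to_text: recover the bytes with latin1, then unpickle."""
    return pickle.loads(text.encode("latin1"))


# Hypothetical payload containing every possible byte value.
payload = {"entityID": "https://idp.example.org/idp", "blob": bytes(range(256))}
text = pickle_to_text(payload)
restored = unpickle_from_text(text)
```

One caveat when the text goes through a file: it has to be written and read with the same file encoding, and Python's default text mode translates `\r` and `\r\n` to `\n` on reading, which would corrupt a pickle containing those bytes. Opening the files with `newline=""`, or in binary mode where the consumer allows it, avoids that.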

xml.sax etree.ElementTree parser

This code uses the event-based xml.sax parser to create an etree.ElementTree object for pyFF, inside pyFF. At the time of writing, pyFF refuses to validate the result, although the metadata it produces appears to be correct.
The parsing could be moved outside of pyFF to create a dictionary-like object. pyFF would then read that object as a metadata representation and build the ElementTree from it instead of parsing XML.

https://docs.python.org/3/library/xml.sax.reader.html

import xml.sax
from lxml import etree
class XML(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.current = etree.Element("root")
    self.nsmap = {}

  def startElement(self, name, attrs):
    attributes = {}
    for key, value in attrs.items():
        key = key.split(':')
        if len(key) == 2 and key[0] == 'xmlns':
            self.nsmap[key[-1]] = value
        else:
            attributes[key[-1]] = value
    name = name.split(':')
    if len(name) == 2:
        name = f"{{{ self.nsmap.get(name[0], name[0]) }}}{ name[-1] }"
    else:
        name = name[-1]
    self.current = etree.SubElement(self.current, name, attributes, nsmap=self.nsmap)

  def endElement(self, name):
    self.current = self.current.getparent()

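  def characters(self, data):
    # Note: SAX may deliver one text node in several chunks;
    # each non-blank chunk overwrites the element's text here.
    d = data.strip()
    if d:
      self.current.text = d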

def parse_xml(io, base_url=None):
    parser = xml.sax.make_parser()
    handler = XML()
    parser.setContentHandler(handler)
    parser.parse(io)
    return etree.ElementTree(handler.current[0])

With the xml.sax parser, pyFF starts out at ~800 MB of RES, which slowly grows to a steady 1.2-1.5 GB.
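The idea of parsing outside pyFF into a dictionary-like object could look roughly like the sketch below. It is an assumption-laden illustration, not pyFF code: the class `DictBuilder`, the function `parse_to_dict`, the node layout (`tag`, `attrib`, `text`, `children`) and the sample document are all hypothetical. It uses only the standard library and switches xml.sax into namespace mode, so prefixes are resolved by the parser rather than by hand.

```python
import io
import xml.sax
import xml.sax.handler


class DictBuilder(xml.sax.handler.ContentHandler):
    """Build nested dicts with the keys tag, attrib, text and children."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        node = {
            "tag": f"{{{uri}}}{local}" if uri else local,
            "attrib": {
                (f"{{{a_uri}}}{a_local}" if a_uri else a_local): value
                for (a_uri, a_local), value in attrs.items()
            },
            "text": "",
            "children": [],
        }
        if self._stack:
            self._stack[-1]["children"].append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElementNS(self, name, qname):
        node = self._stack.pop()
        node["text"] = node["text"].strip()

    def characters(self, content):
        # May be called several times per text node, so accumulate.
        # Text following a child element is folded into the parent's text.
        if self._stack:
            self._stack[-1]["text"] += content


def parse_to_dict(source):
    """Parse a file-like object or filename into nested dicts."""
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, True)
    handler = DictBuilder()
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.root


# Hypothetical miniature metadata document.
sample = (
    b'<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata" '
    b'entityID="https://idp.example.org/idp"><md:Organization>'
    b'<md:OrganizationName xml:lang="en">Example</md:OrganizationName>'
    b"</md:Organization></md:EntityDescriptor>"
)
tree = parse_to_dict(io.BytesIO(sample))
```

The resulting structure contains only plain dicts, lists and strings, so it could be serialized with pickle or JSON and turned into an ElementTree inside pyFF. Whether that actually lowers RES compared with the two experiments above has not been measured here.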



