Logging microformats  

By Bill de hÓra under Python, Ruby, microformats on 13. December 2005

Here’s a slightly elided log line from a system I work with:

WARN 2005-09-28 14:38:38 StandardContext[/pmr]  {"mid":"123", "type":"XYZ", "msg":"foo", "T":"2005-09-28 14:38", "L":"warn"}

It identifies an XML document passing through the system using some keys from its envelope. It may seem a strange-looking way to log an event, but between the curly braces lies a useful side effect - it’s a Python dictionary. Hence this (vastly oversimplified) example log-processing code, which runs eval() on each log line to load the dictionaries into memory:

Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> dicts = []
>>> import re
>>> p = re.compile(".*({.*?}).*")
>>> file = open("log.txt").readlines()
>>> for line in file:
...     m = p.match(line)
...     if m:
...             d = eval(m.group(1))
...             dicts.append(d)
>>> for d in dicts:
...     print d
{'L': 'warn', 'mid': '123', 'type': 'XYZ', 'T': '2005-09-28 14:38', 'msg': 'frobnizer found'}
{'L': 'warn', 'mid': '124', 'type': 'XYZ', 'T': '2005-09-28 14:38', 'msg': 'frobnizer found'}
{'L': 'warn', 'mid': '125', 'type': 'XYZ', 'T': '2005-09-28 14:38', 'msg': 'frobnizer found'}
{'L': 'warn', 'mid': '126', 'type': 'XYZ', 'T': '2005-09-28 14:38', 'msg': 'frobnizer found'}

What’s nice about doing this is that it’s cheap to write the log line in a way that maximises usefulness later on - logs are often an afterthought, but good logging is critical to operating production systems. And it’s one example where creating metadata is cheap.

As well as programming data structures (changing the above to a Ruby hash is straightforward), you can get log lines to output something like Turtle, which is a greppable RDF format. What’s good about Turtle in particular is that you can dump or stream a log file straight into an RDF query engine, such as ARQ, without any transform step and start writing queries against the data. Or how about writing down inferencing rules that help indicate the business consequences of a software error? The log lines themselves don’t carry enough information, but combined with rules and some context as to where the log event came from, you can do interesting and useful things - the biggest issue with any monitoring software is gathering up all those machine-level events and contextualising them so that data (noise) can be turned into information (signal). Think of a machine-readable data structure inside logfiles as just another microformat.
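A rough sketch of the Turtle idea - the "ex:" prefix and predicate names here are invented for illustration, not drawn from any real vocabulary, and a complete Turtle file would also need a @prefix declaration up front:

```python
def to_turtle(event):
    """Render one log event as a Turtle statement block.

    The subject and predicates ("ex:msg...", "ex:type", etc.) are
    made-up names standing in for whatever vocabulary you'd choose.
    """
    subject = "ex:msg%s" % event["mid"]
    lines = ['%s ex:type "%s" ;' % (subject, event["type"]),
             '    ex:level "%s" ;' % event["L"],
             '    ex:text "%s" .' % event["msg"]]
    return "\n".join(lines)

event = {"mid": "123", "type": "XYZ", "msg": "foo", "L": "warn"}
print(to_turtle(event))
```

Each event becomes a small cluster of triples about one subject, so a log file is also a Turtle document you can stream into a query engine.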

10 Responses to “Logging microformats”

  1. l.m.orchard:

    Of course… if you’re going to eval log lines, make sure no end-user is able to drop a payload of their own in there. But other than that, neat trick!

    comment at 14. December 2005

  2. Bill de hOra:

    Les, oh sure; v2 will have a 'strict' mode ;)

    comment at 14. December 2005

  3. Paul:

    I was going to post the same thing — q.v. August’s AWStats hole, which involved Perl code injected into log lines (via referrer) to be subsequently ingested, and inadvertently executed, by an eval() call in AWStats.

    comment at 15. December 2005
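    [Ed. - the injection risk the commenters raise has a clean fix in later Python versions (2.6 onwards, so after this post): ast.literal_eval parses literal dicts like these but refuses anything executable. A sketch of the same log-reading loop using it:]

```python
import ast
import re

pattern = re.compile(r".*({.*?}).*")

def parse_log(lines):
    """Pull the dict literal out of each log line.

    ast.literal_eval accepts only plain literals (strings, numbers,
    dicts, lists...) and raises on function calls or attribute access,
    so an injected payload is skipped instead of executed.
    """
    dicts = []
    for line in lines:
        m = pattern.match(line)
        if m:
            try:
                dicts.append(ast.literal_eval(m.group(1)))
            except (ValueError, SyntaxError):
                pass  # not a clean literal; skip rather than execute
    return dicts

log = ['WARN 2005-09-28 StandardContext[/pmr] {"mid": "123", "L": "warn"}']
print(parse_log(log))  # -> [{'mid': '123', 'L': 'warn'}]
```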

  4. Jochem:

    Why not directly log to XML? And when the file is closed, write out the closing root element. Then use (c)ElementTree or any other XML parser to parse the log lines later.

    comment at 20. December 2005

  5. Bill de hÓra:

    “Why not directly log to XML? “

    Jochem: grep.

    comment at 20. December 2005

  6. Wade Leftwich:

    How about JSON for an eval’able microformat that’s a bit safer? I’ve been doing something similar to the O.P. for a while, saving a web request as str(dict) (and storing it in a db text field). Recently switched over to reading & writing it with json-py.

    comment at 20. December 2005
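    [Ed. - the JSON variant Wade describes might look like the sketch below, written against the json module that later landed in the standard library; in 2005 json-py or simplejson filled that role:]

```python
import io
import json

def write_event(logfile, event):
    # json.dumps emits one line per record, and json.loads reads it
    # back without eval's code-execution problem
    logfile.write(json.dumps(event) + "\n")

def read_events(logfile):
    return [json.loads(line) for line in logfile if line.strip()]

buf = io.StringIO()  # stands in for a real log file
write_event(buf, {"mid": "123", "type": "XYZ", "L": "warn"})
buf.seek(0)
print(read_events(buf))  # -> [{'mid': '123', 'type': 'XYZ', 'L': 'warn'}]
```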

  7. Mike:

    Good to see someone else doing this… serialized dicts have been my logging format for years! It’s just so easy to add lines in your code like:

    if not parseData(data):
      print >> logfile, {'state' : 'failed-to-parse', 'strange-data' : data,
                         'time' : time.ctime() }

    Python’s serializing code ensures that I’ll only get one record per line, regardless of what binary garbage is in the ‘data’. And processing existing logs is as easy as grep or:

    for line in file('logfile'):
       rec = eval( line.rstrip() )
       if rec['state'] == 'failed-to-parse':
          lookOver( rec['strange-data'] )

    Granted, there are security problems if you eval someone else’s data, but for logs you generate, this is as easy & powerful as it gets.

    comment at 20. December 2005

  8. Maciej:

    wrap logging into function:

    def log(**kw):
        print >> logfile, kw

    and use it like this:

    log(state='failed-to-parse', strange_data=data, time=time.ctime())

    comment at 22. December 2005
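    [Ed. - Maciej's wrapper, made self-contained for the record; writing repr(kw) gives exactly the dict-literal lines the post's eval()-based reader expects, with the same injection caveats the earlier comments raise:]

```python
import io

def log(logfile, **kw):
    # repr() of a dict of simple values is the literal form that
    # eval() (or ast.literal_eval) can load back in
    logfile.write(repr(kw) + "\n")

sink = io.StringIO()  # stands in for a real log file
log(sink, state="failed-to-parse", strange_data="x01", time="2005-12-22")
sink.seek(0)
records = [eval(line.rstrip()) for line in sink]
print(records[0]["state"])  # -> failed-to-parse
```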

  9. Masklinn:

    Jochem wrote
    Why not directly log to XML?

    Because it doesn’t have any advantage over using dicts, is completely unreadable to most human beings, is extremely verbose, and requires specific packages/programs to be parsed?

    comment at 20. February 2006

  10. The mass amateurisation of everything, in code at Suttree, Elixir for Immortal Baboon:

    […] Turning log files into JSON formatted code to be eval’ed by a Python script (how insecure is that!). Again, do it in PHP and you be laughed at, deservedly. […]

    pingback at 07. March 2006