19 March 2009
Enron dataset in CouchDB
I imported the Enron e-mail dataset into CouchDB.
$ time ./enr.py /dev/shm/maildir

real    491m4.552s
user    9m1.030s
sys     1m36.954s

Size: 3.0 GB
Number of documents: 517341
A simplistic, sequentially processing Python script (below) was used, so the import time is probably far from optimal; a bulk-insert variant is sketched after the script.
Also, no views were defined during the import.
import os, sys

from couchdb import Server

_server = Server('...')
_db = _server['enron']
_queue = []

def doDir(curr_dir):
    # Walk the maildir tree and queue every file for import.
    for f in os.listdir(curr_dir):
        path = os.path.join(curr_dir, f)
        if os.path.isfile(path):
            _queue.append(path)
        elif os.path.isdir(path):
            doDir(path)

def doFile(file_path):
    # Headers run until the first (near-)blank line; everything after is the body.
    f = open(file_path, 'r')
    lines = f.readlines()
    f.close()
    msg = {}
    body = ""
    headers = True
    for l in lines:
        if len(l) < 3:
            headers = False
            continue
        elif headers:
            s = l.split(' ')
            msg[s[0].replace(':', '')] = ' '.join(s[1:]).replace('\n', '').replace('\r', '')
        else:
            body = body + l
    msg['body'] = body
    # One HTTP PUT per message, keyed on its Message-ID header.
    _db[msg['Message-ID']] = msg
    print msg['Message-ID']

doDir(sys.argv[1])
for f in _queue:
    try:
        doFile(f)
    except Exception:
        pass
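Since the script issues one PUT per message, batching the writes through CouchDB's _bulk_docs API would likely have cut the wall-clock time substantially. A minimal sketch of the idea, assuming couchdb-python's Database.update (which wraps POST /enron/_bulk_docs) and a made-up batch size of 1000:

_batch = []

def queueMsg(msg):
    # With _bulk_docs the document carries its own ID in the _id field.
    msg['_id'] = msg['Message-ID']
    _batch.append(msg)
    if len(_batch) >= 1000:
        flushBatch()

def flushBatch():
    # One POST to /enron/_bulk_docs per 1000 messages.
    if _batch:
        _db.update(_batch)
        del _batch[:]

doFile would then call queueMsg(msg) instead of assigning to _db directly, with one final flushBatch() after the queue is drained.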
Compaction
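Compaction is triggered with a POST to the database's _compact resource; CouchDB answers 202 Accepted and compacts in the background. For example, from Python (host and port are the stock defaults, adjust as needed):

import httplib

conn = httplib.HTTPConnection('localhost', 5984)
# Newer CouchDB versions require the Content-Type header on this POST.
conn.request('POST', '/enron/_compact',
             headers={'Content-Type': 'application/json'})
print conn.getresponse().status  # 202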
[Thu, 19 Mar 2009 07:02:16 GMT] [info] [<0.72.0>] Starting compaction for db "enron"
[Thu, 19 Mar 2009 07:28:46 GMT] [info] [<0.72.0>] Compaction for db "enron" completed.

Size after compaction: 1.8 GB
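With the data imported and compacted, views can now be defined for querying. A minimal sketch (the design document name, view name, and queried address are made up for illustration) that indexes messages by the From header the import script stores:

from couchdb import Server

_db = Server('...')['enron']

# A persistent index over senders; CouchDB builds it on first query.
_db['_design/mail'] = {
    'views': {
        'by_sender': {
            'map': "function(doc) { if (doc.From) emit(doc.From, null); }"
        }
    }
}

# All messages from one sender:
for row in _db.view('mail/by_sender', key='some.sender@enron.com'):
    print row.id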