19 March 2009
Enron dataset in CouchDB
I imported the Enron e-mail dataset into CouchDB.
$ time ./enr.py /dev/shm/maildir

real    491m4.552s
user    9m1.030s
sys     1m36.954s

Size: 3.0 GB
Number of documents: 517341
A simplistic, sequentially processing Python script (below) was used, so the import time is probably far from optimal; a bulk-insert variant is sketched after the script.
Also, no views were defined during the import.
import os, sys

from couchdb import Server

_server = Server('...')
_db = _server['enron']
_queue = []

def doDir(curr_dir):
    # Walk the maildir tree and queue every file for import.
    for f in os.listdir(curr_dir):
        path = os.path.join(curr_dir, f)
        if os.path.isfile(path):
            _queue.append(path)
        elif os.path.isdir(path):
            doDir(path)

def doFile(file_path):
    # Headers run until the first (near-)blank line; everything after is the body.
    f = open(file_path, 'r')
    lines = f.readlines()
    f.close()
    msg = {}
    body = ""
    headers = True
    for l in lines:
        if len(l) < 3:
            headers = False
            continue
        elif headers:
            s = l.split(' ')
            msg[s[0].replace(':', '')] = ' '.join(s[1:]).replace('\n', '').replace('\r', '')
        else:
            body = body + l
    msg['body'] = body
    # One HTTP PUT per message, keyed on its Message-ID header.
    _db[msg['Message-ID']] = msg
    print msg['Message-ID']

doDir(sys.argv[1])
for f in _queue:
    try:
        doFile(f)
    except Exception:
        pass
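Since the script issues one PUT per message, batching the writes through CouchDB's _bulk_docs API would likely have cut the wall-clock time substantially. A minimal sketch of the idea, assuming couchdb-python's Database.update (which wraps POST /enron/_bulk_docs) and a made-up batch size of 1000:

_batch = []

def queueMsg(msg):
    # With _bulk_docs the document carries its own ID in the _id field.
    msg['_id'] = msg['Message-ID']
    _batch.append(msg)
    if len(_batch) >= 1000:
        flushBatch()

def flushBatch():
    # One POST to /enron/_bulk_docs per 1000 messages.
    if _batch:
        _db.update(_batch)
        del _batch[:]

doFile would then call queueMsg(msg) instead of assigning to _db directly, with one final flushBatch() after the queue is drained.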
Compaction
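Compaction is triggered with a POST to the database's _compact resource; CouchDB answers 202 Accepted and compacts in the background. For example, from Python (host and port are the stock defaults, adjust as needed):

import httplib

conn = httplib.HTTPConnection('localhost', 5984)
# Newer CouchDB versions require the Content-Type header on this POST.
conn.request('POST', '/enron/_compact',
             headers={'Content-Type': 'application/json'})
print conn.getresponse().status  # 202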
[Thu, 19 Mar 2009 07:02:16 GMT] [info] [<0.72.0>] Starting compaction for db "enron"
[Thu, 19 Mar 2009 07:28:46 GMT] [info] [<0.72.0>] Compaction for db "enron" completed.

Size after compaction: 1.8 GB
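With the data imported and compacted, views can now be defined for querying. A minimal sketch (the design document name, view name, and queried address are made up for illustration) that indexes messages by the From header the import script stores:

from couchdb import Server

_db = Server('...')['enron']

# A persistent index over senders; CouchDB builds it on first query.
_db['_design/mail'] = {
    'views': {
        'by_sender': {
            'map': "function(doc) { if (doc.From) emit(doc.From, null); }"
        }
    }
}

# All messages from one sender:
for row in _db.view('mail/by_sender', key='some.sender@enron.com'):
    print row.id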