Multiprocessing the PolyScraper

Warning: This post was written at 4AM and contains a technical account of what I have been doing attempting to parallelize CIVX's internals. If you are looking for a more general overview of what I have been doing recently, you are better off looking to another one of my posts.

Recently, I have been working on CIVX's PolyScraper, a neat little piece of code designed to be able to read and understand structured text without knowing how the text is structured beforehand. On Thursday, I let a test scrape of a rather large dataset start, figuring it would finish sometime over the weekend and I'd be able to pick it up on Monday. Then Monday rolled around, and the scrape was still running.

Worse, it didn't seem to be taking full advantage of the available resources. Running as it was on the boat, it had four cores at its disposal, however it was steadfastly using only one core. Now PolyScraping is not an inherently parallelizable task, but on large datasets it should benefit from some kind of parallelization, particularly when using large numbers of small files (less so with small numbers of large files, those types of tasks are usually sequential as you have a smaller number of resources blocking on read IO). Clearly there were other things to look into, but if I could do this, this would mean a huge win for offline scraping, which was one of the things I had enabled with my addition of local file support to the PolyScraper.

This all led me to multiprocessing, a library I'd been wanting to try out in python for some time now. Without going into too much detail, multiprocessing attempts to get around the Global Interpreter Lock by spawning subprocesses instead of threads.

The first attempt was written pretty quickly, as I still remember a lot from my Parallel Computing class from a while ago. Indeed the main problem turned out to be SQLAlchemy, or, more specifically our use of sqlite for a backend db. Sqlite is not the most robust of databases and can't really handle multiple processes attempting to write to the db at once. Luke suggested (and I would love to try) moving over to Postgres as we will eventually be doing on the boat, but unfortunately the boat has been 'stuck ashore' for some time now due to an extended outage in CSH's network.

In the meantime I have been whittling the process down to what I think is the essentials. In the process I have made a complete mess of the code concerning the PolyScraper, but I should be able to make things at least look like the way they were before too long.

At this point, though, I have been working on this project for close to 20 hours today now. Luke is in town and we're all posted up in Remy's new place all hacking on our projects together. With any luck a good night's sleep will clear my head and give me new ideas for tomorrow.

Similar posts to this one