We’ve been working on Pydra for about six months now, and we’ve come a long way. So where is the project? Pydra is moving closer to a stable release, but there’s a lot left to finish. We still expect to have something usable by most people by the end of the summer.
There are no releases yet but we’re getting ready to use it in production:
- We’re close to deploying Pydra with the Protein Geometry Project. I’ve been working bugs out of Pydra while implementing their protein data import tool. The main issues are related to the protein parser rather than Pydra itself.
- I’ve also deployed a small cluster of a few blades and other random desktop machines we had lying around. We’re throwing it at the Engine Yard Contest just to try out Pydra. The trial attempts have been good, but this was more about kicking the tires on Pydra than winning the contest.
You’re using it, but why are there no releases yet?
I’m a big fan of “release early, release often,” but the project still needs to have a certain degree of usability. There are still missing features that will dramatically affect how easy Pydra is to deploy and use. Sure, you can get by without them, but it may require a bit of hacking and manual setup. The type of person who can make use of Pydra right now isn’t fazed by checking out directly from a git repository.
What’s being worked on right now?
We’re updating the project on a regular basis to get it ready for an initial release. Here are the highlights:
I’m refactoring the core of Pydra to use a module system. Modules will be loosely coupled using the observer pattern. This will give us greater flexibility with our components and make it easier to implement features such as replication. It will also allow us to make functionality modular or optional.
The module system is intended to allow long term growth of the project. It wasn’t in the original plan because the core isn’t overly complicated right now. I’m implementing modules now, to prevent a jumbled mess of code a year or more down the road.
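To make the idea concrete, here is a minimal sketch of loosely coupled modules wired together with the observer pattern. The names (`ModuleManager`, `Module`, `register`, `emit`) are illustrative assumptions, not Pydra’s actual API:

```python
class ModuleManager:
    """Dispatches signals to whichever modules subscribed to them."""

    def __init__(self):
        self._listeners = {}

    def register(self, module):
        # A module declares which signals it cares about; the manager
        # wires it up without any module knowing about another.
        for signal in module.listens_for:
            self._listeners.setdefault(signal, []).append(module)

    def emit(self, signal, *args):
        # Modules never call each other directly; they only see signals.
        for module in self._listeners.get(signal, []):
            module.receive(signal, *args)


class Module:
    listens_for = ()

    def receive(self, signal, *args):
        raise NotImplementedError


class EventLog(Module):
    """Example observer: records every signal it subscribed to."""
    listens_for = ("task_started", "task_finished")

    def __init__(self):
        self.log = []

    def receive(self, signal, *args):
        self.log.append((signal, args))


manager = ModuleManager()
logger = EventLog()
manager.register(logger)
manager.emit("task_started", "render_task")
manager.emit("task_finished", "render_task")
```

Because the `EventLog` module could be unregistered without touching any other component, the same mechanism lets functionality be optional, which is the point of the refactor.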
The task scheduler handles assigning work to workers in the cluster. The new task scheduler will fix quite a few bugs in our current scheduler and make it easier to extend.
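The core job of a scheduler like this can be sketched in a few lines: keep a queue of waiting tasks and a pool of idle workers, and pair them up whenever either side changes. This is a hypothetical illustration of the concept, not Pydra’s real scheduler:

```python
from collections import deque


class Scheduler:
    """Toy scheduler: assigns queued tasks to idle workers, FIFO."""

    def __init__(self, workers):
        self.idle = deque(workers)   # workers with no current task
        self.queue = deque()         # tasks waiting for a worker
        self.assignments = {}        # worker -> task currently running

    def submit(self, task):
        self.queue.append(task)
        self._dispatch()

    def worker_done(self, worker):
        # A worker finished its task and becomes available again.
        self.assignments.pop(worker, None)
        self.idle.append(worker)
        self._dispatch()

    def _dispatch(self):
        # Pair idle workers with waiting tasks until one side runs out.
        while self.idle and self.queue:
            worker = self.idle.popleft()
            self.assignments[worker] = self.queue.popleft()


cluster = Scheduler(["node1", "node2"])
cluster.submit("import_proteins")
cluster.submit("parse_geometry")
cluster.submit("index_results")   # queues: both workers are busy
```

A real scheduler also has to handle worker failures, priorities, and tasks that spawn subtasks, which is where the extensibility of the rewrite matters.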
We’re designing an abstract datasource API for interactions with storage systems. Datasources will bundle connection information with your tasks, reducing the amount of setup you have to do on a Node before you can run a task on it. They will also make it easier to slice your data into pieces that can be distributed amongst Pydra Nodes.
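As a rough sketch of what a datasource might look like, the object below bundles connection details with a way to slice the input into distributable chunks. Everything here (`FileDatasource`, the method names, the example path) is a hypothetical illustration of the idea, not the API under design:

```python
class FileDatasource:
    """Toy datasource: a file plus a rule for splitting work."""

    def __init__(self, path, chunk_size):
        self.path = path
        self.chunk_size = chunk_size

    def connection_info(self):
        # Shipped to a Node along with the task, so the Node needs
        # no manual setup to find its input.
        return {"type": "file", "path": self.path}

    def slices(self, total_records):
        # Split the record range into chunks, one per work unit,
        # so each Pydra Node can process its own piece.
        return [
            (start, min(start + self.chunk_size, total_records))
            for start in range(0, total_records, self.chunk_size)
        ]


# Hypothetical usage: 10 records split into chunks of 4.
source = FileDatasource("/data/proteins.dat", chunk_size=4)
chunks = source.slices(10)
```

Each `(start, end)` pair would become one work unit, and the final chunk is simply shorter when the data doesn’t divide evenly.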
Want to know more, or better yet hack on Pydra?
I’ll be at OSCON all week. Follow me on Twitter, here.