Pydra is a distributed (or cluster) computing framework for Python. Pydra seeks to provide a solution that is easier to deploy, manage, and use than existing projects, on top of providing standard features such as fault tolerance.
Pydra was born out of necessity. Other projects being developed by the Open Source Lab required a large amount of processing. Rather than implementing parallelism specific to our application, we chose to build a generic distributed computing framework with the features missing in other solutions. We see Pydra as a useful tool for future projects at the lab.
Management and Security First
A cluster is an inherently unstable and insecure application. As more hardware is added, the chance of failure increases. A distributed computing cluster, by definition, provides remote code execution. For these reasons, management and security were top priorities from the very beginning. There are several solutions for parallel computing within Python, but none include these features.
Pydra will provide a management interface to control every aspect of the cluster. It will allow you to configure the cluster, monitor its health, and in some cases recover from failures. It will also allow you to manage the task queue and track task progress. Some of these features have already been implemented; the rest are high up on the roadmap.
Security within Pydra extends beyond a simple password. Even when hashed, a password offers little security: an attacker who sniffs the hashed password can simply resend it to bypass authentication. Instead, Pydra uses encryption key pairs within a handshake to secure connections within the cluster. Even if an intruder is able to view traffic within the cluster, they will be unable to obtain login credentials.
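To make the replay problem concrete, here is a minimal sketch; it is not Pydra's actual handshake. It contrasts a static hashed password with a challenge-response exchange that uses a fresh random challenge for every login. Python's standard library has no public-key primitives, so an HMAC over a secret key stands in for the key-pair signature. That substitution, along with every name below (NODE_KEY, respond, verify and so on), is an assumption made purely for illustration.

```python
import hashlib
import hmac
import secrets

# --- Static hashed password: whatever crosses the wire can be replayed ---
PASSWORD_HASH = hashlib.sha256(b"cluster-password").hexdigest()


def password_login(sent_hash):
    """Accept the login if the hash sent over the wire matches the stored one."""
    return hmac.compare_digest(sent_hash, PASSWORD_HASH)


# An eavesdropper records the hash and simply sends it again.
sniffed_hash = PASSWORD_HASH
replayed_password_ok = password_login(sniffed_hash)

# --- Challenge-response: every login signs a fresh random challenge ---
# Hypothetical stand-in for a node's private key (a real system would use
# an asymmetric key pair; HMAC is used here only to stay in the stdlib).
NODE_KEY = secrets.token_bytes(32)


def new_challenge():
    """Return a random challenge that is never reused."""
    return secrets.token_bytes(16)


def respond(key, challenge):
    """Prove possession of the key without ever sending the key itself."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(key, challenge, response):
    """Check that the response matches this particular challenge."""
    return hmac.compare_digest(respond(key, challenge), response)


# Honest login: the eavesdropper sees the challenge and the response.
first_challenge = new_challenge()
sniffed_response = respond(NODE_KEY, first_challenge)
honest_login_ok = verify(NODE_KEY, first_challenge, sniffed_response)

# Replay attempt: the next login uses a new challenge, so the old
# response no longer verifies.
second_challenge = new_challenge()
replayed_response_ok = verify(NODE_KEY, second_challenge, sniffed_response)
```

The recorded password hash logs the attacker in, while the recorded challenge response is useless once the challenge changes.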
Job Oriented Programming
Pydra uses a job- or task-oriented programming model. Users can write tasks that logically encompass any work they need to perform, using a predefined set of base classes and containers. The base classes include the ability to write both sequential and parallel processes as Tasks, and any combination thereof.
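As a rough illustration of this model, the sketch below composes a sequential container and a parallel container. The class names (Task, FunctionTask, SequentialTask, ParallelTask) and their methods are hypothetical and are not Pydra's actual API; local threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


class Task:
    """Hypothetical base class: a single unit of work."""

    def run(self, value):
        raise NotImplementedError


class FunctionTask(Task):
    """Wraps a plain function so it can be used as a task."""

    def __init__(self, func):
        self.func = func

    def run(self, value):
        return self.func(value)


class SequentialTask(Task):
    """Container that runs subtasks in order, passing each result to the next."""

    def __init__(self, *subtasks):
        self.subtasks = subtasks

    def run(self, value):
        for subtask in self.subtasks:
            value = subtask.run(value)
        return value


class ParallelTask(Task):
    """Container that applies one subtask to every item of an iterable input.

    Local threads stand in for the worker nodes of a cluster.
    """

    def __init__(self, subtask, workers=4):
        self.subtask = subtask
        self.workers = workers

    def run(self, values):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map preserves the order of the inputs.
            return list(pool.map(self.subtask.run, values))


# Combine both styles: square each number in parallel, then sum the results.
pipeline = SequentialTask(
    ParallelTask(FunctionTask(lambda x: x * x)),
    FunctionTask(sum),
)
result = pipeline.run([1, 2, 3, 4])
```

Because containers are themselves tasks, a parallel step can sit inside a sequential one (or the reverse) without any special handling.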
A MapReduce base class will also be included by 1.0. The task framework is a more generic form of MapReduce, so this style of task can already be written today. The MapReduce base class will simply make the process easier.
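For readers unfamiliar with the pattern, here is a minimal single-process sketch of the three MapReduce phases (map, group by key, reduce) applied to a word count. The function names are illustrative assumptions and say nothing about how Pydra's planned base class will look.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run the map, group-by-key, and reduce phases in a single process."""
    # Map phase: each input yields zero or more (key, value) pairs.
    pairs = [pair for item in inputs for pair in mapper(item)]

    # Shuffle phase: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def total(word, counts):
    """Reducer: add up the occurrences of one word."""
    return sum(counts)


counts = map_reduce(["a cluster is a cluster", "a task"], count_words, total)
```

In a cluster, the map calls and the per-key reduce calls are independent of one another, which is what makes them natural candidates for parallel tasks.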
Technologies
Pydra makes use of other APIs where possible. I’d like development to be abstracted from the underlying technologies as much as possible. We’d like to use the best tool for the job, and that may change over time. RPC and distributed computing are hot topics right now, and there are some relatively new projects, such as the multiprocessing module recently added in Python 2.6.
- Python 2.5 is used for now, until dependencies are upgraded to Python 2.6/3.0. We are watching the timelines closely.
- Twisted is used for all of the Remote Procedure Calls (RPC) and networking. Twisted was chosen because it is a mature, well-written API that has good cross-platform support. It also works well for fault tolerance and security.
- Django is used for the management interface as well as for its Object-Relational Mapping (ORM) tool.
Roadmap
Pydra is still in its infancy, though much of the application has at least basic functionality. Clusters can be configured and tasks run on them, but there is still a lot of work to do. The task tracker contains a more detailed roadmap and version release plan.
I would like to have a usable beta by the end of the summer and a 1.0 release in late fall or early December. It’s ambitious, but we’ll hopefully have a Summer of Code student working on the project, potentially a student intern, and myself. We’d love to have anyone from the community help out. Contact peter(at)osuosl.org if you are interested in the project.