Software

I do much of my work in Python because of its awesome support for scientific computing (numpy, matplotlib, scipy). To maximize performance, I also implement critical components in C/C++. This is, in my opinion, the best of both worlds: I get the raw power of C where it matters while working 95% in pure Python, which increases my productivity.
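
To give a flavor of the split, here is a minimal sketch using ctypes, one of several ways to wire C into Python. fastsum and libfastsum.so are hypothetical stand-ins for a real hot loop, not code from any of my repos:

```python
# Hypothetical C hot loop, fastsum.c:
#
#     double fastsum(const double *x, long n) {
#         double s = 0.0;
#         for (long i = 0; i < n; i++) s += x[i];
#         return s;
#     }
#
# compiled with: gcc -O2 -shared -fPIC fastsum.c -o libfastsum.so
import ctypes
import numpy as np

lib = ctypes.CDLL("./libfastsum.so")
lib.fastsum.restype = ctypes.c_double
lib.fastsum.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_long]

x = np.random.rand(1_000_000)
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
print(lib.fastsum(ptr, x.size))  # the loop runs at C speed, the rest is Python
```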

I use GitHub as a backup system for my projects, and because I'm cheap, they are all publicly available here. Some of the more usable ones are:

My neural networks library. It started as a pure Python implementation with an object-oriented approach, as opposed to, say, a vector/matrix-style approach. As the work progressed it became clear that I needed more speed, so I reimplemented the network as a C module. Another nifty feature is multi-core support through processes rather than threads, which bypasses the GIL. This makes training a committee of networks roughly N times faster, where N is the number of cores in your CPU.
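
The multi-core part is plain multiprocessing; here is a rough sketch of the pattern (not the library's actual API, and train_one is a hypothetical stand-in for the real training routine):

```python
from multiprocessing import Pool

def train_one(seed):
    """Train one committee member. Each call runs in its own process,
    so the GIL never serializes the work."""
    import random
    random.seed(seed)
    # ... build and train a network here, return its weights/error ...
    return sum(random.random() for _ in range(10**6))  # dummy workload

if __name__ == "__main__":
    with Pool() as pool:  # defaults to one worker per CPU core
        committee = pool.map(train_one, range(8))  # 8 committee members
    print(committee)
```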

Since I train lots of networks, I'm really interested in doing it fast. To that end I created the Jobserver. It's a simple module, but it delivers great power: running slave.py on all the department's machines, I can utilize 25-35 machines at once. That's ~100 cores and ~200 GB of RAM. There is no limit (that I know of) to the number of machines you can add, and the only firewall requirement is on the machine running server.py, so you can attach slaves to your server from anywhere on the Internet.
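
The general idea can be sketched with Python's multiprocessing.managers (a simplified stand-in for the real server.py/slave.py; the host name, port, and authkey are placeholders): the server exposes job and result queues over TCP, and slaves pull jobs and push results back.

```python
from multiprocessing.managers import BaseManager
from queue import Queue

# --- server side (the only machine that needs an open port) ---
job_q, result_q = Queue(), Queue()

class ServerManager(BaseManager):
    pass

ServerManager.register("jobs", callable=lambda: job_q)
ServerManager.register("results", callable=lambda: result_q)

if __name__ == "__main__":
    mgr = ServerManager(address=("", 50000), authkey=b"secret")
    mgr.get_server().serve_forever()

# --- slave side (run on any machine that can reach the server) ---
# class SlaveManager(BaseManager):
#     pass
# SlaveManager.register("jobs")
# SlaveManager.register("results")
# mgr = SlaveManager(address=("my.server.org", 50000), authkey=b"secret")
# mgr.connect()
# jobs, results = mgr.jobs(), mgr.results()
# while True:
#     func, args = jobs.get()       # block until a job arrives
#     results.put(func(*args))      # jobs must be picklable
```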

DataMaker
A simple collection of programs that are aimed at creating fake censored survival data sets. This can be useful if one wants to test an algorithm. Then it is important to know the amount of error and noise present in the data sets so one can assess the performance of the algorithm. Real datasets are very noisy and it is impossible to know how noisy they are. Hence it becomes very difficult to judge the performance of the algorithm without also judging the quality of the data set. Using fake data sets with known errors makes it possible to separate the performance benchmark from the data quality.