Distributed Groundwater Computing Infrastructure at TU Dresden
Note: This article was written retrospectively, years after the project took place in 2017. While it captures my experiences and challenges from that time, it's enriched with insights and understanding I've gained since then.
TL;DR
- Technologies: RabbitMQ, Ansible, Docker, Traefik, Consul, Grafana, Prometheus, NFS clustering, distributed computing
- Role: Infrastructure engineer building scalable cloud infrastructure for distributed groundwater calculations
- Key learning: Designing task-specific node optimization and implementing scalable worker patterns for scientific computing workloads
The INOWAS research project at TU Dresden presented a unique challenge: building a proof-of-concept for cloud infrastructure capable of handling complex groundwater calculations at scale. As part of a service delivery project, we designed and implemented a distributed computing system that could efficiently process real research datasets while demonstrating that the architecture was viable for the research team's computational needs.
Orchestrating scientific computation at scale
At the heart of the system sat RabbitMQ, serving as the nervous system for task distribution across the cluster. This message broker elegantly solved the problem of coordinating work across multiple nodes, ensuring that computational tasks found their way to available workers without overwhelming any single resource. The asynchronous nature of message queuing proved ideal for scientific workloads where calculation times could vary dramatically based on the complexity of groundwater models.
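To make the pattern concrete, here is a minimal sketch of what a worker's consume loop could look like with the pika Python client; the queue name, broker host, and `run_calculation` handler are illustrative assumptions, not the project's actual code.

```python
# Minimal RabbitMQ worker sketch using pika (queue and host names are assumed).
import pika

def run_calculation(payload: bytes) -> None:
    ...  # placeholder: run the groundwater model for this task

def on_message(channel, method, properties, body):
    run_calculation(body)
    # Acknowledge only after the work finished, so an interrupted task
    # is redelivered to another worker instead of being lost.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="calculation", durable=True)
# Fair dispatch: give each worker one unacknowledged task at a time,
# so long-running models never pile up on a single node.
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="calculation", on_message_callback=on_message)
channel.start_consuming()
```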
The infrastructure divided work into three distinct task types: optimization, data reading, and calculation. Each category had different computational requirements and resource consumption patterns. Optimization tasks required significant CPU resources for iterative algorithms. Data reading operations needed fast I/O and efficient memory management. Calculation tasks demanded a balance of both, often running for extended periods on complex mathematical models.
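As a sketch of how tasks might be routed to those per-type queues (the queue names and payload shape are assumptions for illustration), a producer could publish through RabbitMQ's default exchange like this:

```python
# Sketch: route tasks to per-type queues via the default exchange.
import json
import pika

TASK_QUEUES = {"optimization", "data_reading", "calculation"}  # assumed names

def submit_task(channel, task_type: str, payload: dict) -> None:
    if task_type not in TASK_QUEUES:
        raise ValueError(f"unknown task type: {task_type}")
    channel.queue_declare(queue=task_type, durable=True)
    channel.basic_publish(
        exchange="",              # default exchange routes by queue name
        routing_key=task_type,
        body=json.dumps(payload).encode(),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
submit_task(connection.channel(), "calculation", {"model_id": "example-model"})
```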
Shared storage through NFS clustering
Data exchange between nodes presented its own challenges. Scientific computing often involves large datasets that need to be accessible from multiple locations. We implemented an NFS cluster that provided shared storage across all compute nodes, allowing calculated results to be immediately available to other parts of the system without complex data transfer mechanisms.
NFS was chosen for its simplicity and proven reliability. Every node could read input data and write results to a common location, simplifying the application logic and reducing the potential for data consistency issues. For this proof-of-concept, the straightforward nature of NFS made it the ideal choice.
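A worker writing its output to the shared mount might look roughly like this sketch; the mount point, file layout, and the write-then-rename step are assumptions for illustration, not the project's actual code.

```python
# Sketch: persist a result on the shared NFS mount so any node can read it.
import json
import os
import tempfile

NFS_RESULTS_DIR = "/mnt/nfs/results"  # assumed shared mount, identical on every node

def write_result(task_id: str, result: dict) -> str:
    os.makedirs(NFS_RESULTS_DIR, exist_ok=True)
    target = os.path.join(NFS_RESULTS_DIR, f"{task_id}.json")
    # Write to a temporary file and rename, so readers on other nodes
    # never observe a half-written result file.
    fd, tmp_path = tempfile.mkstemp(dir=NFS_RESULTS_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(result, handle)
    os.replace(tmp_path, target)
    return target
```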
Infrastructure as code with Ansible
Building scalable infrastructure meant embracing automation from the start. Ansible became our tool of choice for provisioning and managing the entire cluster. The beauty of this approach was its declarative nature - we could define the desired state of our infrastructure and let Ansible handle the implementation details.
Scaling became remarkably simple through Ansible variables. Need more optimization workers? Change a single number in the configuration. Require additional calculation nodes for a particularly complex simulation? Adjust the variable and run the playbook.
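To illustrate the idea (the variable names, file paths, and container module are assumptions, not the project's actual playbook), the worker counts could live in group variables and drive a simple loop:

```yaml
# group_vars/all.yml - illustrative scaling knobs
optimization_worker_count: 4
data_reading_worker_count: 2
calculation_worker_count: 8
---
# roles/workers/tasks/main.yml - sketch of a task consuming those variables
- name: Start calculation worker containers
  community.docker.docker_container:
    name: "calculation-worker-{{ item }}"
    image: "{{ calculation_worker_image }}"
    restart_policy: unless-stopped
  loop: "{{ range(1, calculation_worker_count + 1) | list }}"
```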
Modern service delivery stack
Supporting the core computation infrastructure required a modern service delivery stack. Traefik was chosen for its straightforward service exposure capabilities, providing HTTPS and basic authentication with minimal configuration. This simplicity was crucial for the proof-of-concept, allowing us to secure services quickly without complex setup procedures. Consul provided service discovery and configuration management, ensuring that components could find each other regardless of where they ran in the cluster.
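As a hedged sketch of how a component could make itself discoverable, a worker might register with its local Consul agent over Consul's HTTP API; the service name, port, and health-check endpoint here are assumptions for illustration.

```python
# Sketch: register this worker with the local Consul agent so other
# components can discover it by name (service details are assumed).
import socket
import requests

CONSUL_AGENT = "http://127.0.0.1:8500"  # local agent on every node

def register_service(name: str, port: int) -> None:
    service = {
        "Name": name,
        "ID": f"{name}-{socket.gethostname()}",
        "Port": port,
        "Check": {
            "HTTP": f"http://{socket.gethostname()}:{port}/health",  # assumed endpoint
            "Interval": "10s",
        },
    }
    response = requests.put(f"{CONSUL_AGENT}/v1/agent/service/register", json=service)
    response.raise_for_status()

register_service("calculation-worker", 8080)
```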
For the POC environment, we deployed Prometheus and Grafana primarily as demonstration tools. The visualization of load distribution across nodes provided immediate, tangible feedback about how the worker approach balanced computational tasks. Watching the real-time graphs as calculations spread across the cluster offered compelling visual proof that the architecture worked as designed. While these tools could certainly serve as the foundation for production monitoring, their immediate value was in showcasing the effectiveness of our distributed approach.
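A worker could expose its own metrics for Prometheus to scrape using the prometheus_client library, roughly as in this sketch; the metric names, labels, and port are illustrative assumptions.

```python
# Sketch: expose per-worker task metrics on /metrics for Prometheus to scrape.
import time
from prometheus_client import Counter, Gauge, start_http_server

TASKS_PROCESSED = Counter("worker_tasks_processed_total",
                          "Tasks completed by this worker", ["task_type"])
LAST_TASK_DURATION = Gauge("worker_last_task_duration_seconds",
                           "Wall-clock time of the most recent task", ["task_type"])

def process(task_type: str, task: dict) -> None:
    started = time.time()
    ...  # placeholder: run the actual optimization / data reading / calculation
    LAST_TASK_DURATION.labels(task_type).set(time.time() - started)
    TASKS_PROCESSED.labels(task_type).inc()

# Serve the metrics endpoint on port 8000 so Prometheus can scrape this worker.
start_http_server(8000)
```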
Containerization for consistency
Each worker type ran in carefully crafted Docker containers provided by the research team. This containerization ensured consistency across deployments and simplified updates. When researchers improved their algorithms, they could provide new container images that we'd deploy without worrying about dependency conflicts or environmental differences.
The container approach also simplified scaling. Spinning up additional workers meant launching new containers rather than provisioning entire virtual machines. This lightweight approach allowed us to be more responsive to changing computational demands while keeping infrastructure costs manageable.
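As a sketch of that scaling step (the image name, environment variables, and naming scheme are assumptions for illustration), extra workers could be launched with the Docker SDK for Python:

```python
# Sketch: launch additional worker containers for a given task type.
import docker

client = docker.from_env()

def scale_workers(image: str, task_type: str, count: int) -> None:
    for i in range(count):
        client.containers.run(
            image,
            name=f"{task_type}-worker-extra-{i}",
            environment={"TASK_QUEUE": task_type, "RABBITMQ_HOST": "rabbitmq"},
            detach=True,
            restart_policy={"Name": "unless-stopped"},
        )

scale_workers("registry.example.org/calculation-worker:latest", "calculation", 2)
```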
Lessons from the worker pattern in practice
This project provided an incredible opportunity to see the RabbitMQ worker pattern applied to a real-world scientific computing problem. Watching actual groundwater calculations distribute across nodes, seeing the queue depths fluctuate with workload, and observing how workers picked up tasks autonomously was deeply satisfying. Theory became practice as the system processed real research data.
Running the proof-of-concept against a real-world example with actual computational tasks validated our architectural choices. The infrastructure wasn't just theoretically sound - it successfully processed the complex groundwater models that researchers needed to analyze. Seeing the worker approach handle genuine scientific workloads confirmed that the patterns we'd chosen could scale from proof-of-concept to production when needed.