Big Data Infrastructure for Measurement Sample Analysis

Note: This article was written retrospectively, years after the project took place in 2016. While it captures my experiences and challenges from that time, it's enriched with insights and understanding I've gained since then.

TL;DR

  • Technologies: HBase, Docker, Ansible, Terraform, Kubernetes, NoSQL database modeling
  • Role: Team member focused on HBase cluster setup and database modeling for supposedly terabyte-scale sensor data
  • Key learning: Start simple and scale when needed - not every "big data" problem requires big data solutions from day one

Another DevBoost project promised to immerse us in the world of big data. The task was to build infrastructure for collecting and analyzing sensor data from "Wirbelstrommessung" (eddy current measurements). From the project kickoff, we were prepared for massive scale - terabytes of data were projected, numbers that seemed almost mythical at the time.

The big data that wasn't

As the internship progressed, an interesting reality emerged: the data wasn't nearly as "big" as initially anticipated. The volumes that eventually materialized could have been handled by conventional database solutions. This mismatch between expectation and reality became one of the project's most valuable lessons. We had built a Formula 1 race car for a trip to the grocery store.

First contact with cloud-native technologies

Despite the scale mismatch, the project became an invaluable playground for emerging technologies. Working with Ansible, Docker, Kubernetes concepts, Terraform, HBase, and ZooKeeper in 2016 felt like getting early access to the future. Several of these tools were still young, and we were among the early adopters trying to understand their potential and limitations.

An analogy about Docker from my internship colleague (and future co-founder) still resonates: it was like being able to return to a fresh Windows installation simply by stopping and deleting all containers. That comparison captured the elegance of containerization instantly, and the technology continues to fascinate me today.

Building clusters and learning hard lessons

While we didn't deploy Kubernetes in production, we built an HBase cluster from scratch. This manual process taught us exactly what orchestration platforms abstract away - the complexity of connecting multiple VMs into a coherent cluster where resources can migrate between machines. My primary responsibility was setting up the HBase cluster and modeling the database structure, which seemed like preparing for a data tsunami that never quite arrived.

The hands-on experience was invaluable, but it also highlighted the cost of premature optimization. We spent considerable time solving distributed systems problems for data volumes that didn't require distributed solutions.

The NoSQL query pattern trap

The project exposed a fundamental challenge with NoSQL databases, one that became painfully clear during implementation. Because we didn't know all query patterns upfront, our powerful HBase cluster turned out to be surprisingly inefficient. Every unanticipated access pattern meant a full table scan across the entire dataset, making our "big data" solution no faster than a traditional database - sometimes even slower.

This problem compounded the scale mismatch. Not only was our data smaller than expected, but our inability to predict all analysis patterns meant we couldn't even leverage the NoSQL advantages we did have. The sophisticated infrastructure became a complexity burden rather than a performance enabler.
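The trap is easiest to see in miniature. The sketch below models HBase's core storage property - rows sorted lexicographically by row key - with a plain sorted list. The composite key layout, sensor names, and field names are my own illustrative assumptions, not the project's actual schema; the point is only the asymmetry between queries the key design anticipates and queries it doesn't.

```python
from bisect import bisect_left

# Toy model of an HBase table: rows kept sorted by row key, as HBase
# stores them. Assumed row key design: "<sensor_id>#<zero-padded ts>".
rows = sorted(
    (f"{sensor}#{ts:010d}", {"sensor": sensor, "ts": ts, "value": ts * 0.1})
    for sensor in ("probe-a", "probe-b", "probe-c")
    for ts in range(0, 1000, 50)
)
keys = [k for k, _ in rows]

def prefix_scan(prefix):
    """Anticipated query: binary-search to the first matching key and
    read until the prefix stops matching - analogous to an HBase Scan
    bounded by start/stop row."""
    i = bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(rows[i][1])
        i += 1
    return out

def full_scan(predicate):
    """Unanticipated query: any field not encoded in the row key forces
    a pass over every row - the trap we fell into."""
    return [v for _, v in rows if predicate(v)]

# Query matching the key design: cheap, touches only one sensor's rows.
probe_a = prefix_scan("probe-a#")
# Query nobody predicted ("all readings above 50"): touches everything.
hot = full_scan(lambda v: v["value"] > 50)
```

With real HBase the same asymmetry appears as a bounded `Scan` versus a filter over the whole table; no amount of cluster hardware changes which of the two a given query gets.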

The test data paradox

By the project's end, we delivered a complete system: an HBase cluster with a backend API, a visualization frontend, and a framework for parallelized test data generation. Ironically, we probably generated more test data than we ever saw in real measurement samples. The test data generation framework itself became a significant engineering effort, teaching us about parallel processing and resource management.
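The chunked pattern behind that generation framework can be sketched in a few lines. Everything below is a hypothetical reconstruction - the record layout, seeding scheme, and chunk sizes are assumptions, and a thread pool stands in (to keep the sketch self-contained) for what was really fanned out across processes: each worker produces one independent, reproducibly seeded chunk of synthetic readings, and the chunks are merged at the end.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_chunk(seed, sensor_id, n):
    """Generate n synthetic (sensor_id, timestamp, amplitude) readings.
    A per-chunk seed makes each chunk reproducible independently of
    how many workers run or in which order chunks finish."""
    rng = random.Random(seed)
    return [(sensor_id, t, rng.gauss(0.0, 1.0)) for t in range(n)]

# Fan out: one job per sensor/chunk, merged after all workers finish.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(generate_chunk, seed, f"sensor-{seed:02d}", 1000)
        for seed in range(8)
    ]
    readings = [row for f in futures for row in f.result()]
```

Making chunks independent and individually seeded is what makes the approach parallelize cleanly: no worker needs to coordinate with another, so the same code scales from threads to processes to machines.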

A manifesto for pragmatic solutions

This experience fundamentally shaped my approach to technology selection. While it was exciting to experiment with cutting-edge technologies, the project reinforced a crucial principle: focus on the actual problem first. Build a simple solution that works, even if it won't scale to terabytes immediately. Use that initial implementation to understand real usage patterns, document actual query requirements, and gather concrete metrics.

Only then, armed with real-world data about your data, should you architect for scale. The query patterns discovered during the "simple" phase become the foundation for the scaling phase. This approach would have saved us from building complex infrastructure for simple problems and from the NoSQL query pattern trap we fell into.
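The "document actual query requirements" step need not be heavyweight. A sketch of the idea, with entirely illustrative names: wrap the prototype's query functions so every call is tallied by logical pattern, and let those tallies drive the later row-key or index design.

```python
from collections import Counter

query_stats = Counter()  # pattern name -> number of invocations

def tracked(pattern):
    """Decorator that counts calls per logical query pattern."""
    def wrap(fn):
        def inner(*args, **kwargs):
            query_stats[pattern] += 1
            return fn(*args, **kwargs)
        return inner
    return wrap

@tracked("readings_by_sensor")
def readings_by_sensor(db, sensor_id):
    return [r for r in db if r["sensor"] == sensor_id]

@tracked("readings_in_window")
def readings_in_window(db, start, end):
    return [r for r in db if start <= r["ts"] < end]

# Simulated usage of the simple prototype.
db = [{"sensor": f"s{i % 3}", "ts": i} for i in range(30)]
for _ in range(5):
    readings_by_sensor(db, "s1")
readings_in_window(db, 10, 20)
# query_stats now shows which patterns dominate - concrete input for
# deciding what a scaled-up schema must make fast.
```

Had we run something like this against a plain relational prototype first, the HBase row-key design would have been grounded in observed access patterns instead of guesses.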

Lasting impact

Looking back, the DevBoost big data project was a masterclass in both technical exploration and architectural restraint. Yes, we learned Docker, dealt with distributed systems, and wrestled with NoSQL databases. But more importantly, we learned that "big data" technologies aren't always the answer to "big data" problems - sometimes the data isn't that big, and sometimes a simple solution reveals the patterns you need for a complex one.

The project taught me to question scale assumptions, to prototype before optimizing, and to let actual usage drive architectural decisions. These lessons have saved countless hours and significant complexity in every project since. Sometimes the best big data solution is admitting you don't have big data yet.