Sense of proportion for data volumes: Why "a lot" is not a size

We think of data volumes in adjectives, not numbers. This leads to architectures with phantom dimensions and blocks the simplest solutions.

By Golo Roden

A few days ago, two articles were published that are still on my mind. One is an interview I conducted with Hannes Mühleisen, the co-creator of DuckDB. The other is an article I wrote shortly after for our company blog, arguing that a read model in an event sourcing system often doesn't need a database at all, but can easily fit in memory. Both texts stand side by side, concern different architectural levels, and yet they have the same core argument: The standard architecture we reflexively choose is far more often oversized than we think.

What occupies me with both texts, however, is less their respective statements than the reaction that such statements typically trigger. When I suggest keeping a read model simply in RAM, or when someone argues with Hannes's data points that a single machine is sufficient for the vast majority of analytical loads, the most common reaction is not disagreement, but disbelief. And this disbelief, as far as I can tell, is not a logical problem but a perception problem. It has to do with the fact that developers today have hardly any feel for data volumes.

Golo Roden is the founder and CTO of the native web GmbH. He works on the design and development of web and cloud applications and APIs, with a focus on event-driven and service-based distributed architectures. His guiding principle is that software development is not an end in itself, but must always serve an underlying domain purpose.

Two observations, one common punchline

For years, Hannes Mühleisen has advocated the position that distributed systems are simply oversized for most analytical workloads. His argument rests on three pillars: hardware has become enormously more powerful, the architectures of modern database systems have matured, and above all, an empirical analysis of Snowflake and Redshift by Fivetran shows that even the 99.9th percentile of queries on these distributed systems scans only about 300 GByte. In other words, more than 99 percent of all queries running on a full-fledged distributed cloud database would run without problems on a single node. This is not an ideological position, but a hard number.

My own argument operates on a wholly different level but reaches the same conclusion. In an event sourcing system, the events are the single source of truth. A read model is a derived view of these events, a projection that takes exactly the form required by a specific query. This property has a remarkable consequence: A read model is disposable. It can be discarded at any time and rebuilt from the events. The read model itself therefore does not need to be persistent, because persistence is already provided by the event log. With that, the main reason for using a database on the read side disappears.

Instead, the read model can simply be kept in memory as a suitable data structure, in exactly the form that suits the queries. Upon restart, it is reconstructed from the events. Multiple instances behind a load balancer each maintain their own state in RAM and do not need to coordinate with each other. The observation at this point is the same as with Hannes: The usual answer, namely a dedicated, ideally specialized database, is not necessary for a large proportion of practical cases. The standard architecture is therefore not wrong; it is merely too large for the actual problem.
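The idea of rebuilding a read model from events can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the event names, the `Event` fields, and the `project_open_tasks` function are hypothetical and not taken from any particular event store.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str           # hypothetical event names, e.g. "task-created"
    task_id: str
    assignee: str = ""


def project_open_tasks(events):
    """Fold the event log into a read model: open task ids per assignee."""
    assignee_of = {}    # task_id -> assignee, open tasks only
    for event in events:
        if event.type == "task-created":
            assignee_of[event.task_id] = event.assignee
        elif event.type == "task-reassigned":
            if event.task_id in assignee_of:
                assignee_of[event.task_id] = event.assignee
        elif event.type == "task-completed":
            assignee_of.pop(event.task_id, None)

    # Reshape into exactly the form the query needs.
    open_tasks = defaultdict(set)
    for task_id, assignee in assignee_of.items():
        open_tasks[assignee].add(task_id)
    return dict(open_tasks)


# The event log is the source of truth; the read model is rebuilt on start.
event_log = [
    Event("task-created", "t1", "alice"),
    Event("task-created", "t2", "bob"),
    Event("task-reassigned", "t2", "alice"),
    Event("task-completed", "t1"),
]
read_model = project_open_tasks(event_log)
```

After a restart, calling `project_open_tasks` on the same log yields the same structure again, which is why nothing in the read model itself has to be stored durably.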

What actually blocks the suggestion

When I suggest to customers in conversations that a read model be kept in RAM, I am met with skepticism almost every time. This is understandable, as an unfamiliar suggestion must be justifiable. However, it is noticeable that this skepticism rarely crystallizes into a concrete counter-argument. It remains diffuse. It is expressed in sentences like “but it doesn't scale” or “it doesn't fit in memory” or “and what happens if the system restarts?” These sentences are not objections, they are reflexes.

Behind these reflexes, as far as I can observe, are three different prerequisites that are often missing in practice. The first is the understanding of the duality of working memory and persistent storage. Data can exist simultaneously in RAM and on SSD, with different tasks: the RAM part is for fast reading, the disk part for persistence. Upon restart, the RAM part is rebuilt from the disk part or from the event log. This is not exotic, it is just unusual.

The second prerequisite is the idea that suitable read structures can actually be built in working memory. Those who are used to directing every query to a database often do not realize that a simple hash map or a sorted array is completely sufficient for most queries and is orders of magnitude faster than any database query over a network.
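As a rough sketch of what such read structures can look like, the following uses a plain dictionary for lookups by key and a sorted list with binary search for range queries. The order data and the `count_between` helper are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical read model: orders as (order_id, amount_in_cents) pairs.
orders = [("o-17", 4200), ("o-03", 990), ("o-42", 15000), ("o-08", 4200)]

# Hash map for point lookups by key.
amount_by_id = dict(orders)

# Sorted array for range queries, searched with binary search.
amounts_sorted = sorted(amount for _, amount in orders)


def count_between(low, high):
    """Number of orders with low <= amount <= high."""
    return bisect_right(amounts_sorted, high) - bisect_left(amounts_sorted, low)
```

A lookup such as `amount_by_id["o-42"]` or a call such as `count_between(1000, 5000)` involves no network round trip and no query parsing, only a few memory accesses.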

The third prerequisite is the one that is actually my concern here. It concerns not knowledge, but perception. It is the feeling for how much data actually fits into today's working memory. The first two prerequisites can be read up on; they are explained in any decent textbook on distributed systems. The third is harder to convey because it does not arise from a concept, but from a scale. And precisely this scale is missing.

What a magazine page really costs

To get a feel for data volumes, familiar units are helpful. A magazine page of pure text, for example in an issue of iX or c't, contains roughly between 4000 and 5000 characters. That is approximately 4 to 5 KByte. Thus, 1 MByte can hold about 200 to 250 pages of pure text. A 3.5-inch floppy disk from the early 1990s with its 1.44 MByte could have carried a complete magazine issue without images and would not even have been three-quarters full.

At first glance, this calculation looks like a nostalgic anecdote. But it is not: it has a direct consequence for current architectural decisions. When I calculated, for the article on our company blog, how much memory the internal task management of the native web actually occupies after 18 months of continuous use, I arrived at 5 MByte. That's 8610 events, several team members, real data. 5 MByte. Four floppy disks. So much for the dreaded size explosion in event sourcing systems.

On the other side of the scale is the machine on which this text is currently being created. It is a MacBook Pro from 2022, so four years old by now. It has an M2 processor, 24 GByte RAM, and a 1 TByte SSD. This is not a server, it is not a special device, it is a normal notebook from four years ago. And now consider for a moment how many 5 MByte data sets would fit into the 24 GByte of RAM of this device and how often this RAM could be stored in its entirety on the 1 TByte SSD. The answer in both cases is not a lack, but considerable leeway.
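The numbers from the last two paragraphs can be put next to each other in a few lines. This is only a back-of-the-envelope sketch; it assumes decimal units (1 MByte = 10^6 bytes) and ignores any overhead of the operating system or the in-memory representation.

```python
# Decimal units, an assumption made for simplicity.
MB, GB, TB = 10**6, 10**9, 10**12

dataset = 5 * MB        # task management data after 18 months
events = 8610           # number of events in that data set
ram = 24 * GB           # RAM of the notebook
ssd = 1 * TB            # SSD of the notebook

bytes_per_event = dataset / events   # average size of one event in bytes
datasets_in_ram = ram // dataset     # how many such data sets fit into RAM
ram_copies_on_ssd = ssd // ram       # how often the full RAM fits on the SSD

print(f"{bytes_per_event:.0f} bytes per event")
print(f"{datasets_in_ram} data sets fit into RAM")
print(f"{ram_copies_on_ssd} full RAM images fit on the SSD")
```

Under these assumptions, an event averages a little under 600 bytes, the data set fits into RAM 4800 times, and the complete RAM fits onto the SSD 41 times.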

The other half of the asymmetry

Up to this point, I have only spoken about one side of the perception problem, namely overestimation. We believe that the data our application produces will be enormous, and in most cases, it is not. However, perception shifts when we switch sides and are no longer producing, but consuming. Then sentences like “oh, those few GByte” suddenly appear, and no one is surprised anymore.

Anyone who has ever tried to calculate the size of the node_modules directory in a JavaScript project knows the pattern. A medium-sized project easily pulls in several hundred MByte of dependencies, occasionally over a GByte. Docker images, which essentially contain only a single application, regularly reach 1 to 2 GByte because they carry half an Ubuntu with them. With every build, every deployment, in every continuous integration pipeline, this is transferred back and forth over the network.

These two perceptual errors do not point in opposite directions by coincidence. They stem from the same source. We don't think of data volumes in numbers, but in evaluations. What we produce ourselves feels like “a lot” because it is important to us. What we consume on the side feels like “little” because it happens in the background. Both have little to do with the actual sizes.

Data as an adjective, not a number

This brings us to the actual diagnosis. The problem is not that we estimate data volumes too high or too low. The problem is that we don't think of data volumes in numbers at all, but in adjectives. “A lot,” “little,” “huge,” “a ton,” “hardly worth mentioning”: These are not sizes, they are moods. They say something about how the data feels to the speaker, and nothing about how large it actually is.

Adjectives are a poor basis for technical decisions. Anyone who bases an architecture on the assumption that “a lot of data” will be generated cannot assess whether the chosen database is oversized, whether the planned sharding is justified, or whether a distributed system will really be necessary. The scale against which the decision could be measured is missing. What remains is habit, reflex, and the vague feeling of planning too large rather than too small.

It is precisely this reflex that makes the suggestions from the two articles mentioned at the beginning so hard to digest. Those who perceive data volumes only as a mood have no use for the suggestion to keep a read model in RAM or to run analytics on a single machine. It sounds too small, too insignificant, too much like tinkering. The feeling says: This won't be enough. The calculation would say: It's enough with considerable leeway. But the calculation is not done.

Architecture with phantom dimensions

The consequences are visible in many systems that I have encountered in recent years or have spoken about with others. Sharding is introduced before the data volume even approaches an order of magnitude where it would be justified. Distributed databases are chosen because they sound “big,” although PostgreSQL on a single machine would handle the expected load for years. Separate clusters are set up for read models whose data volume would have fit on a 3.5-inch floppy disk.

A recurring example from my own field of activity is the almost ritualistic question about event sourcing systems: “Won't the database get too big if you never delete anything?” The question is understandable and follows the same logic. It assumes that many events inevitably mean a lot of data. In reality, the answer is almost always sobering: A typical business event is 200 to 500 bytes in size, an average business process generates only a few events per transaction, and a 1 TByte hard drive can hold at least two billion events, even at the upper end of that size range. Those who lack a sense of scale hear these numbers as if from another world. Those who take them seriously arrive at different architectures.
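The capacity figure follows directly from the event sizes. The sketch below assumes decimal units and deliberately ignores indexes, metadata, and file system overhead, so the real numbers will be somewhat lower; `events_per_drive` is a hypothetical helper.

```python
TB = 10**12  # decimal TByte, an assumption


def events_per_drive(drive_bytes, bytes_per_event):
    """Raw number of events that fit, ignoring indexes and other overhead."""
    return drive_bytes // bytes_per_event


large_events = events_per_drive(1 * TB, 500)   # 2_000_000_000
small_events = events_per_drive(1 * TB, 200)   # 5_000_000_000
```

Even with the larger event size, the drive holds two billion events; with the smaller one, five billion.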

It's not about avoiding every database or forcing everything into RAM across the board. It's about the order of decisions. Those who never ask the question “do I even need a database?” can never answer it with no. There are cases where the standard answer is the right one. But there are also very many cases where it is not, and they are consistently missed because the question is not asked.

Sense of proportion as an engineering virtue

Sense of scale is not nostalgia. It is not a homage to the floppy disk, not a plea for ascetic software, not a competition to get by with as few resources as possible. It is a prerequisite for being able to make technical decisions at all, instead of making them reflexively. Those who calculate before building arrive at different answers than those who follow their first feeling. This is not an exhaustive method, it is just a minimum requirement.

What unites the two articles mentioned at the beginning is not a common theme, but a common approach. Hannes Mühleisen refers to a concrete analysis of query loads. In my own text, I refer to concrete numbers on read model sizes, supplemented by a volume estimate on a real production application. In both cases, the reflex is replaced by a calculation. This is precisely what makes the suggestions viable.

If I could wish for anything for the next architecture discussion, it would not be that everyone puts their data into RAM or dismantles their clusters. Rather, it would be that before someone says “a lot” or “little,” a brief calculation is inserted: an estimate in bytes per event, an estimate in events per day, a multiplication over the planned operating period. This is not advanced mathematics; it's three numbers. But it makes the difference between an architectural decision and a reflex. And in this difference lies what distinguishes an engineering achievement from tinkering: a sense of proportion.
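Such a three-number estimate might look like the following sketch. The input values (500 bytes per event, 10,000 events per day, five years of operation) are purely illustrative assumptions, as are the helper names `estimate_bytes` and `human`; the point is only that the multiplication takes seconds.

```python
def estimate_bytes(bytes_per_event, events_per_day, years):
    """Total volume: bytes per event x events per day x operating period."""
    return bytes_per_event * events_per_day * 365 * years


def human(n):
    """Format a byte count using decimal units."""
    for unit in ("Byte", "KByte", "MByte", "GByte", "TByte"):
        if n < 1000 or unit == "TByte":
            return f"{n:.1f} {unit}"
        n /= 1000


# Illustrative assumptions, not measurements.
total = estimate_bytes(bytes_per_event=500, events_per_day=10_000, years=5)
print(human(total))
```

With these assumed inputs, the total is 9,125,000,000 bytes, roughly 9 GByte over five years, a number that can then be compared directly against the RAM and disk sizes of a single machine.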

(mro)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.