Sharing data

First published 25th Feb 2023

Data delivers its value when you move it. Skipping the tedious details, it really delivers value when it excites someone’s sensory organs. Or makes a robot do something. Or both.

How do we move data? In broad brush strokes, you ask, I tell, or I leave it somewhere. Seems simple enough, but there are subtle disadvantages lurking in the shadows.

The example

This article is about sharing data between software engineering teams within an organisation. The organisation’s mission is to educate the world by taking the dreary corpus of documents we call science and rewriting it so that it’s a pleasure to read^[1]. Users can sign up to the service and request an article be re-written. Don't sweat the details of who stores what, they're only there to add flavour to the example. There are three engineering teams:

The User team
The Data team
The Re-writing team

The User team is responsible for handling information about a user:

Id
Name
Email address
Subscription information

The Re-writing team takes requests for translations and distributes the work to a team of re-writers. They store the following:

Translation Requests:

User id
Document doi
Status (pending, translated, rejected)

Document Library:

Document doi
Translation (LATEX file + images, etc)

The Data team centralises the company’s data for analysis. They store:

a copy of everything

The teams want to move independently of each other, so they’ve made the tradeoff to split their technology into separate services — i.e. not a monolith^[2]. If they want data, they have to get it from the team that owns it.

The company’s architecture

The organisation is staffed by people who have been around the block a few times; they understand that:

a given entity, such as a user, should have a single source of truth
you only write changes to the source of truth^[3]
copying data and keeping it in a local cache is allowed, but it comes at a price (more on this later)

Sharing is caring

As we said earlier, there are three ways to share data: you ask, I tell, I leave it somewhere. Details vary, and in your context those details might matter, but the general shape of the solutions will be the same.

Examples of different data sharing methods:

	Ask	Tell	Leave
Sync	HTTP GET	HTTP POST	n/a
Async	Polling	n/a	Pub/Sub

Most of the examples are self explanatory, but one of the categories is worth further subdividing. Asynchronous ways of leaving data somewhere:

	Ephemeral^[4]	Semi-Persistent^[5]	Persistent^[6]
Async	RabbitMQ, Google Pub/Sub, AWS SNS	Amazon Kineses	Apache Kafka, Data Warehouse table, Tuple spaces^[7]

These methods are common enough that I'll spare you the explanation of how they work and their advantages. The less commonly understood method is using a persistent log, check out this video of Martin Kleppmann explaining “turning the database inside out” for a brilliant explanation of the challenges involved in moving data around.

Tradeoff time

The tradeoffs come creeping out of the woodwork when you start sharing data — usually after you’ve built everything and it’s too late to turn back. We’ll explore:

Undesirable coupling
Eventual consistency
- Reading your own writes
The problem of when
Schema evolution

Coupling

Data is like communicable disease, the more coupled your systems are, the more ways there are to be infected.

Who, why, what, where, when and how. Coupling can occur by knowing about other systems (who), caring what they’re trying to achieve (why), the format of the data (what), where the other systems live — i.e. URL’s (where), the time the request takes place (when), and the specific technology/protocol/version (how).

The more of these couplings you have to consider for your given method of data sharing, the more complex your system is. The more complex it is, the more failure modes it has and the harder it is to understand.

Let’s explore this coupling in an example. The Data team need to know about all the users. Let’s imagine the Data team asked the User team to send them an HTTP POST whenever a new user is created. They are coupled by:

Who: the User team knows about the Data team
What: the format of the data, because the POST body must be the format that the Data team will accept
Where: we have to POST to a specific URL, just for this purpose ...data-team.org-name.com/...
When: the Data team will have to accept the information when the User team provide it
How: they will have to agree on protocol (e.g. HTTP POST).

That’s a lot of coupling. Why does it matter? Imagine a new user signs up, here’s the happy path:

Start transaction
Write user record to user table
POST the Data team the user info
Commit the transaction

What if step 3 fails^[8]? The choices are:

Reject the user request; roll back the transaction. Sorry, no sign ups today, the Data team have an outage.
Write the new user to the database and:
- forget about the Data team, or
- retry this request later

All those choices suck, and the last one is the only one I’d entertain in most circumstances.

Imagine you had a similar arrangement with the Re-writing team. You have all the downsides x2.

The final twist of the knife is that the User team needs to spend precious cognitive energy caring about the Data team. Caring isn’t cool. Sadly, they have to care because they need to know if it’s ok to skip the POST to the Data team on failure. The why we didn’t list above is now added to the list. Eugh. They split the teams up so they could move independently, not distract each other with their concerns in endless meetings.

There is another pernicious form of coupling, this one occurs when using HTTP GET to share data. The provider of the data now has the uptime and robustness requirements of their neediest user. If one single consumer needs to be five nines resilient, then you do too. If you have GET requests flying around between many teams then failure is likely.

If one consumer uses your GET API as part of a batch job, and that consumer is too trigger happy with the concurrency, you’re database is going to blow a gasket. How much control do you have over this? Not much.

Using HTTP to share data is a common pattern. This means there are many solutions to these problems, but take a step back and make sure you’re implementing solutions to a problem you need to have. Consider decoupling your teams in time and place; use a queue to share data instead.

Consistency

Imagine you’re a user of this fantastic re-writing service and the engineering team at this company took the coupling concerns seriously; they decided to use a pub/sub system to share data instead of HTTP GET APIs. You go to the user profile page to update your email address, you then request a document translation and the confirmation goes to your old email address. What the…?

If the Re-writing team store a local copy of the user information, then it is guaranteed, by nothing less than the laws of physics^[9], that there will be a delay between when an update is written and when the copy is refreshed. This is known as eventual consistency. It’s why nothing ever works properly. The desirable property, which this type of caching sorely lacks, is known as being able to read your own writes.

When

Time is like one of those friends who never misses an opportunity to get one over on you. Here are the ways it conspires to get you.

The way of the new team.

You’ve recently put the finishing touches on your pub/sub system. The dead letter queues are in place, you’ve eliminated the race conditions, the system is working well and the Customer Service team are recording an all time low in the number of complaints about missing data.

Things are going a little too smoothly and management are nervous about what the engineers will do if they have too much time on their hands. They decide to create a new team. The team will re-write overly long and boring technical blog posts by second rate engineers who fancy themselves as writers, with a focus on websites with questionable background colours.

This team comes along and — you wouldn’t believe the gall — they’ve hooked themselves up to your pub/sub topic, but they want users who signed up before the team existed. Before you know it, you’re in another meeting explaining the complexities of catching up with a live pub/sub feed. You also have to find a way to dump the existing database so the new team can read it.

The way of the buggy integration.

The Re-writing team are getting reports of strange behaviour. The re-writing browser page is crashing when certain users try to load it. They investigate and realise there are users who exist in the User database but not in their local copy. The pub/sub queues aren’t backing up so there must have been a bug at some point.

They find the bug, fix it, and realise they need to reconcile their copy against the user database. Another meeting later, the User team explain the same thing to the Re-writing team as they did to the new team and a back-fill takes place. This causes tremendous problems because the service has become rather popular and syncing all the users takes a long time.

In summary: if you use a pub/sub system, you ~~should~~ must not assume that you’ll only need to share the data on creation and update. If you use a persistent log^[10], you’re no longer beholden to this issue because it can be read from beginning to end by anyone at any time.

Time is a sophisticated adversary; when has another way to sneak up and bite. If you send a pub/sub message or write to a data warehouse table, you have no idea when someone will read that data. It could be milliseconds later, but it’s possible the read happens days or years after the write. If your payload/row/message has an attribute like { … “hasAccessToProduct”: true }, it could be dangerous, meaningless, or both.

You could add context about when that attribute applies or avoid structuring the data that way. If you rely on someone reading the message soon after it’s been written, you’ve allowed yourself to become re-coupled in time. Be careful.

The final way that time sticks its oar in and stirs the pot is around message ordering. Most pub/sub systems don’t have strong ordering guarantees. Even if they do, if you have multiple nodes of your service writing to a topic, it’s hard for you to guarantee what order you’re writing messages. Use something that’s capable of providing a monotonic number (a PostgresSQL sequence will suffice) and put the sequence number on your messages. Consumers will need to be aware of out of order messages and how to handle them

Schema evolution

The problem of schema evolution applies to all methods of sharing data. Here are some tips:

Version your schema, put that version in the message/topic/payload/route.
Wrap things in an object, even if you don’t think you need to. Adding keys to an associative data structure is a backwards compatible change. { "data": [...] } not [...].
Pick the attributes you care about from shared data, don’t assume new keys won’t be added. Never (read “extremely rarely”) reject a request/response because it has extra keys you don’t know/care about
Tell the truth. Call things what they are, not what you wish they were or what they’re going to be used for. This way you’re less likely to need to evolve your schema. For example, if you capture an event every time the user gives feedback, call it { "starsGiven": 1} not {"askCustomerServicesToContactUser": true}.

Summary of tradeoffs

	Coupling	Consistency	Schema evolution	Out of order messages	New consumers
HTTP GET	high	good	ok	no	ok
HTTP POST	high	good	ok	no	ok
Pub/Sub	low	eventual	harder	yes	hard
Persistent log	lowest	eventual	hardest	yes	trivial

Conclusions

Sharing data is a matter of balancing tradeoffs. There’s no single right answer, nor a sensible default way of doing it. It depends on context. How many teams do you have? What service are you offering? How resilient is your product to eventual consistency? What are the consequences of failure? How quickly do you need to move? What kinds of change are likely?

The companies and teams that make up the bulk of my professional experience benefit from lower coupling. If those teams can focus on the work that justified making them a team in the first place, they can move faster and add value more effectively. This can be achieved for the price of learning how to mitigate the downsides of sharing data using methods like pub/sub or persistent logs. This allows the teams to do their work with no concern for what the other teams are doing because they’ve shared their data with anyone who needs it in a lowly coupled way.

It’s hard to know at what level to pitch an article, and this one is pitched pretty high. If you didn’t understand everything or don’t have experience with all the systems I’ve referred to then that’s on me, don’t be disheartened. I recommend Designing Data Intensive Applications - Martin Kleppmann if you want to learn about this topic in more detail.

Too many science papers use a complex, narcissistic style of writing that borders on illiteracy. It’s so normal to write in this style that respectable journals publish auto-generated articles by mistake! ↩︎
A single deployable artefact (like a NodeJS web service) and database that everyone in the company shares ↩︎
…lest you spend eternity burning in the firey pits of Hell. ↩︎
Once you’ve read the message once, it’s gone. ↩︎
You can read the message more than once, but it’s only stored for a limited amount of time. ↩︎
You can read the message forever. ↩︎
I barely know anything about these, but I think they fit into this category and sound very cool. ↩︎
This is impossible to ascertain with certainty, the response to the successful POST may have been lost ↩︎
Ignoring all that stuff about spooky action at a distance and assuming you haven't slipped into a state of psychosis and decided to implement a distributed transaction. ↩︎
Like Apache Kafka ↩︎