All software is either maintained or abandoned
Assumed audience: Engineers and leaders who work with complex systems and are curious about team dynamics.
Some years back I had the privilege to work at the health company Lifesum, first as an engineer, then leading their data team, then leading the whole engineering org.
When I look back at that time, it's clear that snowballing engineering challenges held back the business and prevented it from thriving. The biggest challenge was mobile sync.
Unsuccessful efforts to solve our sync issues were a substantial drain on engineering for multiple years, preventing us from shipping essential features and taking on important opportunities. At their core, the problems lay in a codebase that was abandoned and yet essential to the business.
To understand how this happened, we need to go back to the earliest days of iOS.
The earliest days of iOS
Lifesum was created in the early days of the iPhone and the App Store that followed.
At that time, mobile apps were truly mobile-first, and their data was often stored locally on the device. For example, the Things app that won an Apple Design Award in 2009 stored all its data on the device.
As another year or two passed, it became urgent for people to be able to migrate data to new devices. A kind of backup and sync method was retrofitted to apps like Lifesum and Things.
It usually worked great for the case of transitioning from one device to another, but these systems were not designed or tested for smooth multi-device usage -- you'd end up with consistency issues and data conflicts all the time.
The problem being solved
Unlike Things, Lifesum was also integrating with other services like Runkeeper and Fitbit, and doing a server-side sync against their APIs. This kind of health data can have much more frequent updates than a todo list, which in practice means that both the phone and the server are masters that can accept writes.
The solution came as part of the master's thesis of early engineers Olle Lind and Joakim Hammer, who went on to do very strong iOS and Android development for Lifesum in the years that followed.
Essentially, the phone kept a full copy of the user's data, and a system based on tables and timestamps could efficiently indicate when data was changed and needed to be synced. Then only the delta would be sent each way.
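To make the idea concrete, here is a minimal sketch of timestamp-based delta sync between two replicas that both accept writes. It is not the Lifesum implementation: the names (Row, Replica, sync), the integer timestamps and the "newest write wins" merge rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str
    value: str
    updated_at: int  # timestamp of the last write (hypothetical field)


class Replica:
    """One side of the sync: either the phone or the server."""

    def __init__(self) -> None:
        self.rows = {}  # maps key -> Row

    def write(self, key: str, value: str, now: int) -> None:
        self.rows[key] = Row(key, value, now)

    def changes_since(self, since: int) -> list:
        """Return only the rows modified after the last successful sync."""
        return [r for r in self.rows.values() if r.updated_at > since]

    def apply(self, delta: list) -> None:
        """Merge incoming rows; the newer timestamp wins (an assumption)."""
        for incoming in delta:
            current = self.rows.get(incoming.key)
            if current is None or incoming.updated_at > current.updated_at:
                self.rows[incoming.key] = incoming


def sync(phone: Replica, server: Replica, last_sync: int, now: int) -> int:
    """Exchange deltas in both directions and return the new sync marker."""
    to_server = phone.changes_since(last_sync)
    to_phone = server.changes_since(last_sync)
    server.apply(to_server)
    phone.apply(to_phone)
    return now
```

Each side only needs to remember the marker returned by the last sync. Everything written after that marker is the delta, so an unchanged row is never sent again.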
The problem that did not get solved
Lifesum was a thriving project, so demands on it grew, and it was natural that the database schema grew and changed with them.
This led to a delicate dance. On the mobile side, the app could more or less progress, but on the backend side, sync needed to support years of old app versions with data schemas that could not be updated.
Predictably, the main sync loop became a snowballing mess of if-else statements and special cases, growing to thousands of lines of code. It was a nightmare to maintain and very difficult to test. Senior engineers abandoned the "scout rule" of leaving the code cleaner than you found it, and just got in and got out as quickly as possible.
One senior complained bitterly when I pushed him to do a basic refactor to make the code he was touching more readable, because he had not mentally planned for the "risk" of changing it.
The code got so bad that the team mentally moved on from it, and started the process of replacing it. However, the replacement would turn out to take years, and the unspoken decision to stop leaning into it, stop mastering it, would turn out to have big consequences.
Limiting support for app versions
One solution to all the branching special cases was to begin cutting off old app versions, and only supporting a sliding window of app versions. After all, if an old version was no longer used, we wouldn't need the code to contort our data model to its old schema.
We did, eventually, limit support for app versions. It was a long, controversial process that got pushback from users and sometimes from the team. We developed a system of "soft nudge", "hard nudge" and "force update" that could encourage users to update their app to a newer version.
At the moment we shipped it to app stores, all our existing users were running old app versions. That meant that they did not even have support for these nudges yet!
We shipped them into recent releases, then waited months before trying them out. But when we tried them, we had not QA'd them enough: they were buggy and did not work as expected. It would take a year before we were in the habit of routinely deprecating old app versions, and before apps in the wild had reliable support for update nudges.
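As an illustration of the sliding window, here is a minimal sketch of how a backend might pick a nudge level for a client. The three levels are the ones named above, but what each level did and when it triggered is not described here, so the thresholds, the use of a plain release counter, and the function name nudge_for are all hypothetical.

```python
from enum import Enum


class Nudge(Enum):
    NONE = "none"
    SOFT = "soft nudge"
    HARD = "hard nudge"
    FORCE = "force update"


# Hypothetical thresholds: how many releases behind the latest a client may be.
SOFT_AFTER = 3
HARD_AFTER = 6
FORCE_AFTER = 10


def nudge_for(client_release: int, latest_release: int) -> Nudge:
    """Pick the nudge level from how far the client lags behind."""
    behind = latest_release - client_release
    if behind >= FORCE_AFTER:
        return Nudge.FORCE
    if behind >= HARD_AFTER:
        return Nudge.HARD
    if behind >= SOFT_AFTER:
        return Nudge.SOFT
    return Nudge.NONE
```

With these assumed thresholds, a client on release 44 when the latest is 50 would get a hard nudge. Keeping the thresholds on the server matters for the reason described above: a client can only act on a nudge if the release it is running already knows how to display one.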
The refactor that never happened
In the meantime, we did a major refactor of the sync on the backend, coming up with a new data model that we thought was more future-proof. However, shipping it would require multiple teams to prioritise implementing it, and a plan for simultaneous release! Our two mobile teams were desperately pushing for new features, and it was a constant struggle to get the resources to do the change.
In the end, the sync problem outlived my tenure at Lifesum, and the new data model was never fully implemented. Years later I enquired, and was told that a different, related, partial solution had been picked up instead.
Essentially, the entire exercise was a colossal cost to the org. At that time, Lifesum was close to profitable but not quite there, and faced fierce competition from other companies, so the opportunities lost were very real for everyone involved.
Maintained or abandoned
Lessons about software architecture are often learned painfully, by following choices over a long period and understanding their ultimate consequences.
In this case, I learned that you cannot allow the code for any core part of your business to be mentally abandoned. Unless you can replace it very rapidly, you must eat the pain of mastering it and incrementally improving it.