The New CTO's Survival Guide: Rescuing a System at Its Breaking Point
Newly hired as CTO, you inherit a revenue-generating system that's gasping for air. 3,000 daily active users are pushing the infrastructure to its — or the team's — limits. What's your next move?
As a newly hired CTO, you've inherited a system that is functioning well enough to generate meaningful monthly revenue, serving up to 3,000 daily active users.
With recent growth, many functions have been slowing down, and some features even suffer occasional outages. The system appears to be running at maximum capacity, and all the engineering team can recommend is to "increase resources", "increase the DB instance size", and so on.
Basically, the development team cannot squeeze any more capacity from the system. This effectively means that the business cannot grow.
You got hired to rescue the company from this situation. The CEO and the other execs expect you to have all the answers. The clock is ticking ...
How would you approach this problem?
A. Rewrite the system from scratch
The current implementation is clearly and severely flawed from a scalability, reliability and resilience perspective. Despite this, rewriting the system from scratch is the wrong answer — in most situations.
First, business growth and product growth will be frozen during the rewrite, making it an expensive option. Functioning businesses do not typically have the patience for such things.
Second, the current solution has been troubleshot enough to iron out the most egregious bugs and is functioning well enough to generate revenue. Rewriting it will create a whole new set of bugs and quirks that will have to be ironed out in turn.
Third, you'll need to migrate data from the old platform to the new one, which can be fraught with pain.
In summary: business growth will be frozen during the rewrite, after which there will be a turbulent period during the shakedown of the replacement platform.
B. Buy time and show some quick wins by easing bottlenecks
Analyse known problems, re-examine prior assumptions, and use your knowledge and experience to guide the team to implement quick (and sometimes dirty) fixes or "plasters". This way, you will very likely squeeze more capacity out of the system while getting more intimately familiar with it.
Showing some quick wins will also build confidence with the executive team, which you may need to rely on later.
These are examples of quick wins I have used successfully for significant results:
Database Optimisation:
Ensure suitable indexes are in place for common retrievals. (But not too many: excess indexes slow down inserts and updates.)
Identify and optimise slow queries.
Consider using read replicas for reporting / OLAP.
Where suitable, cache expensive query results using an in-memory cache such as Redis.
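For the caching item above, here is a minimal cache-aside sketch in Python, assuming a Redis instance on localhost:6379 and the redis-py client; fetch_dashboard_stats() is a hypothetical stand-in for one of your expensive reporting queries.

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def fetch_dashboard_stats(account_id: int) -> dict:
    # Placeholder for the real, expensive SQL query against the primary DB.
    return {"account_id": account_id, "orders": 0, "revenue": 0}


def get_dashboard_stats(account_id: int) -> dict:
    key = f"dashboard_stats:{account_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                 # cache hit: skip the slow query
    stats = fetch_dashboard_stats(account_id)     # cache miss: run the slow query once
    cache.setex(key, 300, json.dumps(stats))      # keep the result for 5 minutes
    return stats
```

The pattern is simple: try the cache first, fall back to the database on a miss, and store the result with a short TTL so stale data ages out on its own.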
Traffic Management:
Use a content delivery network (CDN) such as Cloudflare to cache static files, e.g. images and scripts (see the cache-header sketch after this list).
Use a web application firewall (WAF) such as Cloudflare to filter out malicious bot traffic, including distributed denial-of-service (DDoS) attacks.
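To make the CDN item concrete, here is a minimal sketch of serving a static asset with long-lived cache headers so an edge cache such as Cloudflare (and the browser) can keep copies. Flask, the /static route and the one-year max-age are illustrative assumptions, not details of the original system.

```python
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/static/<path:filename>")
def static_asset(filename):
    response = send_from_directory("static", filename)
    # Long-lived and publicly cacheable: the CDN and browsers keep copies,
    # so repeat requests never reach the origin servers.
    response.headers["Cache-Control"] = "public, max-age=31536000, immutable"
    return response
```

In practice you would pair long max-age values with fingerprinted filenames (e.g. app.3f2a1c.js) so a new deploy naturally busts the cache.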
Networking Optimisation:
Ensure that requests from one service to another are not routed through your cloud's public network interface.
Ensure that your service-to-service communications use connection pools with keep-alive.
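As one way to implement the keep-alive item above, here is a minimal sketch using Python's requests library; the pool sizes and the user-service.internal URL are illustrative assumptions.

```python
import requests
from requests.adapters import HTTPAdapter

# One long-lived session per process: it reuses TCP connections (keep-alive)
# instead of opening a fresh one for every request.
session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=50)
session.mount("http://", adapter)
session.mount("https://", adapter)

def get_user_profile(user_id: int) -> dict:
    # Each call reuses a pooled connection, avoiding a new TCP (and TLS)
    # handshake on every service-to-service request.
    resp = session.get(f"http://user-service.internal/users/{user_id}", timeout=5)
    resp.raise_for_status()
    return resp.json()
```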
You may also need to stoop to these "dirty" (but apparently effective) solutions:
The dreaded "Increase resources" (RAM in DB machines, ...)
Scheduled rebooting of resources that tend to "get stuck," restarting them before they crash or drop into "zombie mode."
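As an illustration of the scheduled-reboot fix, here is a minimal watchdog sketch meant to run from cron or a systemd timer; the health endpoint and the report-worker service name are hypothetical.

```python
import subprocess
import requests

HEALTH_URL = "http://localhost:8080/healthz"   # assumed health-check endpoint
SERVICE_NAME = "report-worker"                 # assumed systemd unit name

def restart_if_stuck() -> None:
    try:
        healthy = requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False
    if not healthy:
        # The "dirty" fix: bounce the service before it drags everything down.
        subprocess.run(["systemctl", "restart", SERVICE_NAME], check=True)

if __name__ == "__main__":
    restart_if_stuck()
```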
What other quick wins have you used? Please comment below.
Remember to measure the effect of the improvements, and their impact on core business metrics!
This option is not a solution on its own. It supports incremental growth and buys you time to make a more informed choice about your next move.
By the way, there is also a non-technical bottleneck you may decide to address at this stage: the team's skill level. They built the current system and have been failing to scale it. You will likely want to upgrade the team's abilities by bringing in some more capable and experienced people.
C. Refactor or rewrite pieces of it gradually in stages
The system is functioning and bringing in revenue. You'll have the luxury of being able to get immediate feedback on any modification you make.
You'll need to identify components that can be replaced without affecting the rest of the system. Even parts of existing components can be re-routed and replaced using an approach along the lines of Martin Fowler's strangler fig pattern.
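To illustrate the routing side of the strangler fig pattern, here is a minimal sketch of a thin proxy that sends already-migrated paths to a new service and everything else to the legacy app. Flask, the internal hostnames and the /invoices prefix are assumptions for the example, not details of any real system.

```python
import requests
from flask import Flask, Response, request

app = Flask(__name__)

LEGACY_BACKEND = "http://legacy-app.internal:8080"    # assumed legacy monolith
NEW_BACKEND = "http://invoicing-svc.internal:8081"    # assumed extracted service

# Path prefixes that have already been "strangled" out of the monolith.
MIGRATED_PREFIXES = ("/invoices",)

# Headers that must not be copied back verbatim from the upstream response.
EXCLUDED_HEADERS = {"content-encoding", "content-length", "transfer-encoding", "connection"}

@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def proxy(path):
    backend = NEW_BACKEND if request.path.startswith(MIGRATED_PREFIXES) else LEGACY_BACKEND
    upstream = requests.request(
        method=request.method,
        url=f"{backend}{request.full_path}",
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        data=request.get_data(),
        timeout=10,
    )
    headers = [(k, v) for k, v in upstream.headers.items() if k.lower() not in EXCLUDED_HEADERS]
    return Response(upstream.content, status=upstream.status_code, headers=headers)
```

In practice this role is often played by an existing reverse proxy or API gateway rule rather than application code; the sketch just makes the routing decision explicit. As more functionality moves out of the monolith, you grow MIGRATED_PREFIXES until the legacy backend can be retired.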
This approach seems slower than a complete rewrite, but it's far less risky and has the added advantage of delivering improvements continually. It's certainly messier than a rebuild, but to a true engineer there should be nothing more beautiful than a system that actually delivers, and this one will keep delivering at every step of the way.
Conclusion
Freezing growth in favour of a complete rewrite is almost always the wrong answer from a technical viewpoint, and is a complete non-option from a business perspective.
As you've probably guessed by now, my default approach, in situations akin to the idealised dilemma I presented, has been as follows:
First: Buy time, and show some quick wins by easing bottlenecks.
Next: Refactor or rewrite pieces of it gradually in stages.
Have you been in a similar situation? How did you handle it? Please comment below.
Starting with B (quick wins) and then moving to C (gradual refactoring) is definitely the way to go. As you said, starting with B builds trust with the company that hired you and lets you get familiar with the system.
You asked for extra ideas for option B. Perhaps you could duplicate the cloud infrastructure and split the user base across the two copies of your system.
You write that refactoring is messier than a rebuild. Refactoring carries its own risks, but rewrites are almost always vastly underestimated. You can make the refactoring less risky by adding plenty of unit and integration tests first.