How Clojure's concurrency model helped us feed 12000 people
On March 4, 2020, I saw that my friend Anand was building an open platform to help an NGO. This group had designed a Food and Health Kit and had established safety protocols. I called him up and found myself at a conference the next day.
15-odd people, whom I had never met before, were trying to feed the marginalised. They had already raised Rs. 700,000 (~$9,000) and had partnered with an NGO to handle on-ground distribution.
I was told that our kit was designed to sustain a small family for 1 month.
The optimist in me saw a set of volunteers with good intentions, trying to do their best to support the community. The pessimist in me saw the restrictions due to lockdown and the general fear in the air.
The realist in me saw a 35-page Excel sheet horror, which was supposed to be our central control panel for managing this operation. We had to raise funds, procure raw materials, pack them, transport them to the place of supply, and distribute them while collecting photographic proof of distribution.
We also had to convert these pictures into social media posts and update our donors about the distributions.
We had to send tax receipts to new and existing donors. We had to train volunteers to handle these processes. We needed to handle requests that would come from various sources, validate the list of beneficiaries, and arrange on-ground support.
Karuna 2020 Runtime - v1
A simplified view of the organization - The arrows represent the flow of information
There was a central coordinator - aka the "main thread". The main thread would call "async functions". These functions were actual volunteers who'd do some work in the real world and return a response. For example, we might call the Godown Manager to get inventory status.
Most functions were tightly coupled: if one failed, the "main thread" had to abort. Excel was our global scope, STDOUT, and STDIN.
The process was operating in-memory - literally inside the central coordinator's brain. This made it hard to resume in case of errors, or transfer to other machines.
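The coupling described above can be sketched in a few lines. This is only an illustration using Python's asyncio, not something the team ran; the three volunteer functions, their return values, and the error type are hypothetical.

```python
import asyncio


class VolunteerError(Exception):
    """Raised when a (hypothetical) volunteer task cannot be completed."""


async def godown_manager():
    # Stands in for a phone call to the Godown Manager.
    await asyncio.sleep(0)
    return {"kits_in_stock": 100}


async def validate_beneficiaries():
    # Stands in for a validation step that fails.
    await asyncio.sleep(0)
    raise VolunteerError("beneficiary list incomplete")


async def arrange_transport():
    await asyncio.sleep(0)
    return "truck booked"


async def main_thread():
    # All intermediate state lives in this local dict, i.e. only
    # "inside the coordinator's brain". If any step raises, it is lost.
    state = {}
    for task in (godown_manager, validate_beneficiaries, arrange_transport):
        state[task.__name__] = await task()
    return state


def run_v1():
    try:
        return asyncio.run(main_thread())
    except VolunteerError as err:
        # One failing volunteer aborts the whole run.
        return f"aborted: {err}"


outcome = run_v1()
print(outcome)
```

The inventory status fetched in the first step is gone once the second step fails, and the transport step never starts. Nothing can be resumed or handed over to another coordinator.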
Our interrupts were actions from the outside world, for example, "a donor donating". Interrupts too had to be handled on the main thread. This clogged the system even more.
On top of that, we didn't have any form of garbage collector. All comments and status updates, from both successful and unsuccessful executions, would stay in Excel until the end of time, or the end of Google, since it was a Google Sheet.
Our main thread was running hot, and unlike a DevOps Engineer, we didn't have the option to run multiple nodes behind a load-balancer.
We had already done the most natural thing. We divided responsibilities and synchronized with the main thread. Volunteers had well-defined roles and tasks, just like Classes in an OOP Language. They encapsulated data and had public methods to communicate.
The problem with our system, just like with every computer program, was side effects and state.
The central coordinator became the God Object. And due to the async nature of the system, it became hard to diagnose problems.
For example, suppose we had inventory for 100 families and three requests, A, B, and C, for 50 kits each. If request A failed validation, should we fulfill B and C, or wait for A to pass validation? And how would we decide this without the God Object? After all, God has umpteen other things to look after.
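To make the dilemma concrete, here is a sketch of one possible rule: serve validated requests in arrival order and park the rest. The article does not say which rule the team chose, so the policy and the `allocate` function are assumptions for illustration only.

```python
from collections import deque


def allocate(inventory, requests):
    """Serve validated requests first come, first served; park the rest.

    `requests` is a list of (name, kits_needed, is_valid) tuples.
    Returns (fulfilled_names, parked_names, remaining_inventory).
    """
    fulfilled, parked = [], []
    pending = deque(requests)
    while pending:
        name, kits, is_valid = pending.popleft()
        if is_valid and kits <= inventory:
            inventory -= kits
            fulfilled.append(name)
        else:
            # Failed validation or not enough stock: leave it for later.
            parked.append(name)
    return fulfilled, parked, inventory


# The scenario from the text: 100 kits, three requests of 50 each,
# and request A failing validation.
requests = [("A", 50, False), ("B", 50, True), ("C", 50, True)]
fulfilled, parked, remaining = allocate(100, requests)
print(fulfilled, parked, remaining)
```

Under this rule B and C go out and A waits for the next batch of stock. The point is that the rule is written down once and needs no coordinator to re-decide it for every request.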
This led to poor utilization of available volunteers and attrition. Since everything was happening on the main thread, it became hard to untangle and divide work.
When the opportunity presented itself, I volunteered to be the coordinator, the "main thread". I didn't fully understand the nature of the work or the effort it would need.
Within 48 hours, I called the previous coordinator and told him that I didn't understand anything. He smiled and said, "Give it some time, you'll get the hang of it."
From JS Async to Clojure Concurrency
Clojure supports Go-like channels through its core.async library. I use the phrases "Clojure Concurrency" and "CSP model" interchangeably.
After being the central coordinator for a while, I wanted to:
- Reduce or completely remove dependency on myself (the "main thread") for running tasks
- Allow for fast contribution, i.e. a volunteer should be able to take over a task without knowing the complete mechanics
- Get a real-time bird's-eye view of all parameters
- Streamline all tasks and distribution requests that came over phone and WhatsApp groups
What we ended up with was similar to Go-style channel-based concurrency, but I realized this only in retrospect.
Everyone is just a function
So far, everything was inside the coordinator's head and in hard-to-consume Excel dumps. My first task was to isolate all processes and move them to a Kanban board. We chose Airtable because of its flexibility and generous free tier. They also offered us the pro plan for free.
I started by isolating the Donor Lifecycle Management process. My idea was inspired by the project management workflow at my last company. A request comes in and moves across stages. Each progression is handled by the same or different volunteers.
I also wrote a guide for each stage and made sure that the guides were like pure function definitions, i.e., they held no state.
For example, the donor management process broadly had 4 pure steps:
- Get notified about a donation (happens via an SMS from the bank)
- Identify the donor: could be someone from a personal reference or someone who donated via the website
- If identified, contact them and fetch details like the address, to generate a receipt. After getting details, mark as "Ready for generating receipt"
- If not identified, add an empty record and mark as "Receipt not needed"
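The identification step above can be sketched as a pure function over a card. The `DonationCard` fields, the `donor_directory` lookup, and the sample data are hypothetical; only the two stage names come from the process described here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DonationCard:
    """Everything a volunteer needs for one stage lives on the card."""
    bank_reference: str          # hypothetical: reference from the bank SMS
    amount_inr: int
    stage: str = "Notified"
    donor_name: Optional[str] = None
    address: Optional[str] = None


def process_donation(card, donor_directory):
    """Return a NEW card in its next stage; the input card is never mutated."""
    donor = donor_directory.get(card.bank_reference)
    if donor is None:
        # Donor could not be identified.
        return replace(card, stage="Receipt not needed")
    return replace(
        card,
        stage="Ready for generating receipt",
        donor_name=donor["name"],
        address=donor["address"],
    )


directory = {"REF-001": {"name": "Asha", "address": "Pune"}}
incoming = DonationCard("REF-001", 5000)
known = process_donation(incoming, directory)
unknown = process_donation(DonationCard("REF-002", 1000), directory)
print(known.stage, "|", unknown.stage)
```

Because the function depends only on its arguments, any volunteer following the same guide with the same card reaches the same result. Contacting the donor is the side effect, and it stays outside the function.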
A contrived version of Donor Lifecycle management
In reality, the Airtable board had more columns to handle edge cases. But the idea was the same.
A volunteer needs to process only one stage at a time. All "state" needed to process that stage is available on the card.
We essentially pushed side effects to the edges of the system and made each volunteer a pure function. We also published guides about handling these states and moving cards from one channel to another.
Move over channels or queues, not through the main thread
We were able to achieve Clojure-like channels, which solved the problem of a hot-running "main thread". The system was now horizontally scalable: we could onboard volunteers and assign each of them a task. We also got a bird's-eye overview for free.
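The board-as-channels idea can be sketched with Python's standard `queue` and `threading` modules standing in for Airtable columns and volunteers. This is an analogy, not the team's tooling: the stage functions, the stage names other than "Notified", and the `STOP` sentinel are assumptions.

```python
import queue
import threading

STOP = object()  # sentinel: tells a worker its input channel is closed


def volunteer(step, inbox, outbox):
    """Take cards from one column, apply a pure step, push to the next."""
    while True:
        card = inbox.get()
        if card is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            break
        outbox.put(step(card))


def identify_donor(card):
    return {**card, "stage": "Identified"}


def generate_receipt(card):
    return {**card, "stage": "Receipt sent"}


# Each queue plays the role of one column on the Kanban board.
notified, identified, done = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=volunteer, args=(identify_donor, notified, identified)),
    threading.Thread(target=volunteer, args=(generate_receipt, identified, done)),
]
for worker in workers:
    worker.start()

# Donations arrive from the outside world; no coordinator is involved.
for donor in ("donor-1", "donor-2", "donor-3"):
    notified.put({"donor": donor, "stage": "Notified"})
notified.put(STOP)

for worker in workers:
    worker.join()

results = []
while True:
    card = done.get()
    if card is STOP:
        break
    results.append(card)
print(results)
```

No worker knows about any other worker, only about the queue it reads from and the queue it writes to. Putting more than one volunteer on a stage would mean starting more threads on the same inbox, with one `STOP` sent per worker.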
Over time, we wrote similar guides for Volunteer Onboarding, Social Media Outreach and Distribution Management. We also had processes for Procurement and Accounting, but we have not been able to generalize them yet.
Our Airtable base as of today
This system took multiple iterations and feedback from the entire team.
I was happy to see other volunteers pointing out mistakes and suggesting improvements to the guides. I think this truly worked because 2 of the volunteers I onboarded used the guides to onboard 4 more people.
Since all our data was open source, we could easily issue pull requests to update the guides. Anand had set up GitHub Actions to rebuild the site periodically, which meant all updates would go live within 24 hours.
I was pleasantly surprised at Anand's velocity. The man built this website update mechanism in less than 48 hours. Another 24 hours for integration with the static website. Perhaps less than that.
The intellectual lesson
Use queues everywhere
Queues make our lives easy. I recalled Rich Hickey's talk about queues and core.async and was amazed at his first-principles approach. We as a team literally applied the high-level concepts of CSP to a real, async, and multi-threaded system.
The system in action
I considered it my responsibility to test the system and make improvements. So the developer in me picked up the phone and called a beneficiary who had requested a dry ration kit.
We had received a request from a woman who took tuitions for underprivileged children. She told us that some of her students' families were facing issues due to lack of supplies.
These kids were not more than 12-15 years old. Yet, they took the initiative to contact us and to provide us with a list of beneficiaries and other required docs.
I called the kid up and asked if her mom or dad was nearby.
She connected me to her father, who told me that they had run out of food two days ago. He was now going out to buy vegetables. He told me that his children were hungry.
It was overwhelming to hear that. I had tears in my eyes and wanted to help them as soon as possible.
Until then, I had just been building a system like any other. After realizing the situation on the ground, I wanted it to succeed even more. We delivered kits to these families within 72 hours of receiving the request. The delays were mainly due to the restrictions on movement.
Our process was to capture pictures of beneficiaries and post them publicly. We made sure to blur faces. This transparency led to more donations.
We raised almost 2,000,000 INR ($27,000) in less than 35 days, in cash and kind. We estimate that we have helped over 10,000 people with food and health supplies for 1 month and another 20,000 with masks.
We are happy with the impact we made, but also realize that our relief work is peanuts compared to the damage caused by the virus.
When all hope is lost, our only option is to stay hopeful.
We are working to continue our efforts and have decided to focus all attention on a community of 500 families.
We are also working with two other NGOs that deal with mental health issues to streamline their efforts using our processes.
Feel free to send any questions or thoughts on this article on Twitter @shivek_khurana.