Thought Machine's core product is called Vault. It is a cloud-native core banking platform that allows clients to define, create and run any type of banking product using a flexible set of APIs on top of a composable, Python-based smart contract schema.
A key selling point of Vault is that the platform enables far greater operational efficiency than traditional legacy cores, thanks to its lighter-weight, flexible and easily scalable architecture. Among the ways the platform helps clients is through the automation of business processing and reporting.
Traditionally our target clients would need to keep a large roster of analysts and other knowledge workers to pull data, create reports and perform reconciliations, to name a few tasks. With Thought Machine's platform, by comparison, a smaller team is enough to monitor a more automated system, release new code and fix issues as they arise.
The business case made to client executives is compelling and Thought Machine continues to see excellent growth in client deployments.
There is, however, an important caveat: a deployment that relies on automation must also provide a capable set of tools for system monitoring, issue alerting, problem diagnosis and, ultimately, a path to problem resolution.
Schedule management problem space
Our team were initially tasked with determining where our existing suite of APIs was underserving client use cases in the areas of observability and operational maintenance. In essence, where could we create the most value by extending our self-service capabilities?
Following a short study, combining qualitative interviews with in-production clients and analysis of quantitative client incident data, the direction was clear.
We needed to help our client and colleague teams better monitor, diagnose and remediate issues within automated account-level code executions, known as scheduled jobs.
Analysis of incident ticket tracking data showed that Thought Machine's engineers had spent in excess of 600 person-days over the prior 12 months assisting clients to resolve scheduler-related incidents.
Within our first review workshop we took our incident data together with the interview anecdotes and crafted a vision for a new schedule management toolset:
A first-party application that gives clients oversight of account schedules and self-service remediation would create significant value for both clients and colleague engineering teams.
We next took time to consider the user types within our client organisations touched by this problem area. Here we settled on two to take to client validation, both of which held up during subsequent client interviews:
1. Production system reliability engineers
- Look after the day-to-day working of the system. Need to understand if there are errors, if the system is running slowly, and assess the impact of any issues to decide how to remediate them.
2. Pre-production contract writers/testers
- Our smart contracts are code, and scheduled events are part of that code. Because of this, contracts and their resultant schedules need to be robustly tested to ensure reliability.
We also wanted to expand our vision into a set of more relatable problem statements, again for validation purposes. Here we settled on a two-tier structure: the first level sets the scene for why we want to create the solution in the first place, and the second considers the user-level problems:
Getting the project green-lit
The team were confident that we had a bona fide problem set to solve, but we also knew it would be complex to implement for two principal reasons. Firstly, we knew that on the engineering side there would be complexity in assembling the kind of data we would need to create a good experience (more to come on this). Secondly, we knew that to get the project green-lit we would need to make the messaging about this complex subject matter understandable to our client stakeholders, in order to validate our overall direction before committing to a build.
We decided to tackle the communication complexity first. Our basic strategy was to create a set of low-fidelity concept artefacts that we would use to roadshow our early solution ideas internally and with clients. If we got good feedback from these, we were confident we could make a compelling business case and get the project onto the roadmap.
To do some initial motivational work for our team, and to get some early internal validation of our direction, we created an Amazon-style press release for our ideas and an illustration of the before and after within our target user teams:
High level solution
The team got to work thinking about what would be needed in terms of a high-level solution. What we came up with was the need for a simplifying abstraction that would allow a more humanistic lens to be applied to the potentially millions of scheduled jobs that could run on any given day.
Our thinking was that if we could conceive a useful abstraction to explain what was happening under the hood, then this same abstraction could also form the cornerstone of the solution experience.
When thinking about a sensible abstraction, our starting place was to consider the genesis of scheduled jobs within our data architecture:
The data model above would perhaps not be an issue were it not for the fact that, as a consequence of a microservice architecture, the "account schedule", or instance of a schedule, is not aware of which instruction (template) it was created from. As such you could have a million accounts of the same type, all running the same scheduled job, but no way to aggregate these carbon-copy schedules together. This in turn means that monitoring can only be done by looking at a firehose of data - which is onerous to say the least.
Our proposed innovation was to create a new type of synthetic data resource which we called a schedule set. Put simply, this was a way to bucket schedules from separate accounts together based on shared parameters. Our concept was that if we could deliver this kind of aggregation abstraction, we would be able to build a much more useful visual experience on top of this intuitive data wrapper.
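To make the bucketing idea concrete, here is a minimal sketch of the logic; the field names, statuses and grouping key are illustrative assumptions rather than Vault's actual schema:

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative only: field names and statuses are assumptions, not Vault's real schema.
@dataclass(frozen=True)
class ScheduleJob:
    account_id: str
    product: str       # e.g. "easy_access_saver"
    event_name: str    # e.g. "ACCRUE_INTEREST"
    frequency: str     # e.g. "daily"
    status: str        # e.g. "succeeded", "failed", "pending"

def group_into_schedule_sets(jobs):
    """Bucket per-account schedule jobs into synthetic 'schedule sets',
    keyed on the parameters the jobs share (product, event, frequency)."""
    sets = defaultdict(list)
    for job in jobs:
        sets[(job.product, job.event_name, job.frequency)].append(job)
    return sets

jobs = [
    ScheduleJob("acc-001", "easy_access_saver", "ACCRUE_INTEREST", "daily", "succeeded"),
    ScheduleJob("acc-002", "easy_access_saver", "ACCRUE_INTEREST", "daily", "failed"),
    ScheduleJob("acc-003", "fixed_term_loan", "APPLY_REPAYMENT", "monthly", "succeeded"),
]

for key, members in group_into_schedule_sets(jobs).items():
    failed = sum(1 for j in members if j.status == "failed")
    print(key, f"{len(members)} schedules, {failed} failed")
```

Aggregating carbon-copy schedules this way turns millions of individual jobs into a handful of sets that a person can actually reason about.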
To convey the potential here I mocked up a crude set of visuals to illustrate possible combinations of parameters and statuses within schedule sets. Showing these early concepts to our clients yielded a strongly positive reaction, which gave us the confidence to progress with our abstraction direction.
In addition to the concept of sets we also did some early thinking about a high-level information architecture (IA) and application journey. The idea was that the user starts by looking at the health of all schedule sets, via a dashboard, and then progressively drills down to find more detail - for example when investigating how widespread an issue may be, or the cause(s) of a shared issue.
This crude IA, when used as a workshop prop, helped our clients imagine practical use of the proposed solution and gave us insights into how we could refine the direction further.
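In code terms, the two levels of that journey can be sketched as a simple rollup over the schedule sets described above; the health thresholds and record shapes below are assumptions made purely for illustration:

```python
# Illustrative dashboard rollup; thresholds and field names are assumptions.
def set_health(members):
    """First level of the journey: an at-a-glance health level for one schedule set."""
    failed = sum(1 for job in members if job["status"] == "failed")
    if failed == 0:
        return "healthy"
    return "degraded" if failed / len(members) < 0.05 else "failing"

def drill_down(members):
    """Second level of the journey: which accounts within the set are affected?"""
    return [job["account_id"] for job in members if job["status"] == "failed"]

schedule_set = [
    {"account_id": "acc-001", "status": "succeeded"},
    {"account_id": "acc-002", "status": "failed"},
    {"account_id": "acc-003", "status": "succeeded"},
]

print(set_health(schedule_set))   # "failing" here, as more than 5% of members failed
print(drill_down(schedule_set))   # ['acc-002']
```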
At this stage we had enough client signal to put together a business case to secure a position on the product roadmap. Following creation of a case paper and a successful pitch to the CTO and engineering directors, we got the green light to continue discovery and design work. We also got a tech lead assigned to join our discovery efforts and represent the implementation feasibility side.
Prototyping & feasibility investigations
With the green light secured, the next stage was twofold. First and foremost, we needed to ensure that the aggregation abstraction and subsequent drill-down were feasible to implement. Second, assuming they were feasible, we needed to start thinking about how to craft the UX and, ultimately, the UI for the application.
System of record vs. system of engagement
Early in our feasibility work we hit difficulty. The trouble was that the API we had hoped to leverage for our solution was heavily optimised for database writes, with reads being of secondary concern. This was because, as a company, we were chasing a dramatic improvement in the number of transactions per second supported - a key metric for product success.
We realised quickly that the new types of data associations we were looking to create, along with the sort of complex calls that would be core to the experience, would create significant performance drag. Simply put, they would not be viable in the current architecture.
What we needed was a different method of accessing the data, where performance was less of a concern and where complex read support could be baked in from the start. The solution we came up with was to propose a separate, eventually consistent database, populated by listening to a select collection of Kafka topics (a form of push messaging feed). We would then create a new API on top of it, optimised for flexible reads at the cost of some raw performance.
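As a rough sketch of the pattern (not the actual implementation - the topic name, message shape and the choice of SQLite as the read store are assumptions for illustration), the listener side might look like this:

```python
import json
import sqlite3

from kafka import KafkaConsumer  # kafka-python, standing in for whichever client is used

# Read-optimised store that the new flexible-read API would query.
read_store = sqlite3.connect("schedule_read_model.db")
read_store.execute(
    """CREATE TABLE IF NOT EXISTS schedule_jobs (
           job_id TEXT PRIMARY KEY,
           account_id TEXT,
           event_name TEXT,
           status TEXT,
           updated_at TEXT
       )"""
)

consumer = KafkaConsumer(
    "vault.scheduler.job_updates",          # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Each message upserts the latest known state of a schedule job, so the read
# model converges on the source of truth (eventual consistency).
for message in consumer:
    job = message.value
    read_store.execute(
        """INSERT INTO schedule_jobs (job_id, account_id, event_name, status, updated_at)
           VALUES (:job_id, :account_id, :event_name, :status, :updated_at)
           ON CONFLICT(job_id) DO UPDATE SET
               status = excluded.status,
               updated_at = excluded.updated_at""",
        job,
    )
    read_store.commit()
```

A read-optimised API on top of such a store can then serve the aggregations the experience needs - counts per set, status over time - without adding any load to the write-optimised core.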
We knew this would take some effort to build, so we made a second set of internal pitches to gauge senior stakeholder support. The response we got was again positive; however, it was clear we would need to get creative so as not to invest too much in the build before first putting something into the target users' hands for usage feedback.
Low fidelity prototypes
We knew that we wanted to prototype in order to get more client validation. While our ultimate goal was to deliver a series of implementation prototypes, the team felt that, given the complexity and steep cost of full implementation, the right place to start was with low-fidelity prototypes.
The team had had good prior success with hand-drawn artefacts combined with client demo workshops and felt a similar approach could work well here.
Our first step was to divide the work and individually consider different patterns for how we might present the journey, starting with our aggregation abstraction, onto which we layered various forms of drill-down interaction.
We were also interested in how we might visualise the overall health of schedules in order to provide an "at a glance" system check.
Following a few rounds of low-commitment design investigations, the team decided to return to our problem statements and carry out a user story mapping session. We did this primarily because we knew the exercise would surface more features than we could realistically deliver in an MVP, so we needed to prioritise quite aggressively which features to showcase in the first prototype we would show to clients.
With an initial internal consensus on priority, I took on the task of quickly putting together an initial wire-flow to illustrate to our clients the principal user journey we were looking to satisfy.
I included references to a number of possible features in order to gauge support.
Feedback we received from clients was mixed. They greatly appreciated the direction we were heading in with the aggregation and visualisation, but felt that the depth of the journey could be reduced. There was also not enough at-a-glance representation of progress over time.
We asked each of the four clients who received the demo to rank the proposed features based on perceived business value. The results were clear:
1. We should initially prioritise visualisations that include a reference to time;
2. We should also focus on interactions that allow the user to quickly narrow their field of interest and to have these choices persist.
Iterating the prototype
With the feedback received, the team knew we needed to increase user controls and, at the same time, reduce the number of interactions required to drill down to the lower-level detail. Clients also gave us a clear message to prioritise overall status visuals within our MVP, even if that meant not providing the lowest level of data drill-down - as this data was also available by other means.
While we were considering how best to iterate on the experience side, we also needed to make progress on the technical implementation. To support that effort I put together a more detailed IA to help our tech lead plan the data model we would need.
The team took time to discuss all feedback, both from clients and from internal client engineering teams. Having agreed a set of changes, I set about producing a set of revised artefacts.
The result was a v2 wire-flow which we took back to our clients for a further round of demos/feedback.
Next steps and retrospective
Feedback received on the v2 wire-flow was much more positive, both internally and with clients. We knew, however, that the feature set being presented was still too rich to be feasible for an MVP implementation.
Our next planned step was to move to a stripped-down, high-fidelity prototype, which would also give our engineers a clear direction for the UI build. Our intention was to put an interactive design prototype into the hands of a set of target users we had connected with inside our client organisations.
However, just as this next phase was about to get going, the team received the news that, following a board-level strategic review, our engineering resources were to be reassigned to core API performance work for at least the next six months. Our project was recognised as an important value-add, but would have to wait two quarters before progressing.
What could we have done differently?
At the point of my departure from Thought Machine this project had not progressed further. Looking back, I feel we were perhaps too ambitious with the scope, and with our idea of creating a new system of engagement to support a first-party application build such as this.
Although the case for strategic investment was very strong, the reality was that our core proposition was simply not mature enough to confidently assign resources to a project that, in some respects, would further increase the subsequent cost of change within our data architecture.
Perhaps instead we should have considered how we could have leveraged our network of partners to build the system-of-engagement layer on our behalf. I feel that a partnership here would have allowed us to deliver capability to clients much faster, which in turn would have freed up internal engineering resources to concentrate on strategic performance gains.