Blockchain solutions are being considered for a host of challenges. However, many of the high-potential use cases come with a complex set of requirements. In order to navigate the resulting trade-offs across different technology and design choices, in-depth evaluation is necessary. In this post, we shall consider evaluating candidate solutions with regard to the risk of violating requirements. Risk is hard to quantify, as it requires determining the system's vulnerability to various threats and quantifying the respective potential loss. In most cases, this requires measuring the system's behaviour, as complexity renders estimation infeasible. Here, simulation offers an efficient alternative to real deployments. Simulation makes it possible to establish a comprehensive risk profile efficiently and to explore mitigation strategies for severe risks, strengthening the system's resilience. We will demonstrate this in an example. Furthermore, simulation can be leveraged from an early stage in the design process – long before PoCs and pilots.
Blockchain solutions are being engineered for a variety of use-cases, leading both to innovation in legacy systems and to novel solutions for long-standing challenges. The ecosystem is evolving rapidly, as our understanding of this novel technology matures. However, blockchains are still on the cutting edge and rely heavily on sophisticated technology, like consensus algorithms [1, 2, 3, 4]. As a consequence, blockchains are complex. They display emergent behaviour, which is furthermore sensitive to small design changes. Therefore, their behaviour is hard to predict and experience from specific systems is hard to generalise. The scarcity of experts and efficient tools, coupled with the high demand for new systems, further intensifies this problem. Thus, designing blockchain solutions is hard, especially when it comes to risk management: identifying threats and quantifying potential losses is time- and work-intensive. It is our belief, however, that this challenge can be met with the right methods and tools.
A tool that has proven transformative in many fields of engineering is simulation. Simulation makes it possible to predict the emergent behaviour of many complex systems, be it to quantify the stress on machine parts under load, the efficiency of thermodynamic processes or drag coefficients of arbitrary geometries. This has revolutionised innovation processes, allowing engineers to scan large areas of the respective design space at a fraction of the time and cost of experiments. Also, it allows comprehensive analyses of sensitivity and risk, without the need to recreate the corresponding conditions and therefore without endangering systems or people. Bringing simulation to the field of blockchains holds great promise. Through simulation, the process of evaluating solutions can be streamlined and moved forward in the innovation cycle. Thus, risk assessments can be rendered rigorous, affordable and timely. Also, identifying risk mitigation strategies early on not only safeguards eventual solutions, it allows for a more informed comparison of competing solutions.
Currently, major central banks are investigating the introduction of central bank digital currencies (CBDCs) [5, 6] and pilots are being announced [7]. The challenges associated with CBDCs are formidable, as a range of different, often conflicting requirements needs to be considered [8, 9, 10]. There are fiscal and monetary issues, societal needs and benefits, and performance considerations. At the same time, a range of technological solutions is available. Blockchains/DLTs are considered a relevant candidate for CBDCs, as they allow for decentralisation and thereby provide resilience and availability. They compete alongside more established, centralised solutions. However, while some requirements can and should be considered qualitatively, a few need to be established on a firm quantitative footing for serious evaluation of the trade-offs involved. This is hard to achieve, as candidate solutions need to be designed and optimised to yield representative results. Relying on deployments renders this process very time- and cost-intensive, therefore allowing only a small number of viable candidate solutions to be covered, in coarse detail. Here, simulation can make significant contributions, as we shall demonstrate in an example.
As a first step, we will consider classes of threats in blockchain deployments. Speaking very broadly, threats can originate from the usage, the hardware, rule-breaking or from false assumptions. We will establish a set of high-level classes of threats, which we expect to cover most, if not all, relevant threats one may need to consider.
The load on most blockchain systems will vary over time. A sudden and unexpectedly heavy load could result in a severe degradation of system performance or even an overload. This can manifest as a sudden spike in overall usage, as well as spikes from individual participating organisations. Clearly, the magnitude and distribution of the load is relevant, both in terms of likelihood and of potential loss. Therefore, a reliable analysis must cover a range of different patterns and different magnitudes of load. Also, external heavy load can impact the system, for instance congestion in the network from third-party applications or an internet exchange point coming under DDoS attack.
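To make this concrete, a simulation would sweep over load patterns like the following sketch, which generates a per-second load profile with a sudden spike (all numbers and the Gaussian approximation of per-second Poisson counts are illustrative, not from the original text):

```python
import random

def load_profile(base_tps, spike_tps, spike_start, spike_end, duration, seed=42):
    """Per-second transaction counts: a constant base load with a sudden spike.
    Counts are drawn from a Gaussian approximation of a Poisson distribution."""
    rng = random.Random(seed)
    profile = []
    for t in range(duration):
        rate = spike_tps if spike_start <= t < spike_end else base_tps
        profile.append(max(0, round(rng.gauss(rate, rate ** 0.5))))
    return profile

# Five minutes at 300 tps with a one-minute spike to 1500 tps:
profile = load_profile(300, 1500, spike_start=120, spike_end=180, duration=300)
assert max(profile[120:180]) > max(profile[:120])  # the spike dominates
```

A reliable analysis would repeat such runs while varying the spike magnitude, duration and which organisations it originates from.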
Hardware issues include a wide range of possible events. They can directly impact the hardware running the blockchain system, e.g. servers crash, compute cores or memory chips are damaged, hard disks fail or are corrupted. Also, the network can be impacted, leading to increased latencies and decreased bandwidth, e.g. due to a fibre-optic cable unexpectedly going out of commission or due to routers crashing. Blockchain solutions are specifically designed to tolerate a certain number of specific failure types, and for now we assume we will remain within these bounds. This usually implies safety, i.e. participating non-faulty nodes will always remain in agreement, and to some degree liveness, i.e. the system will eventually make progress. Of course, it does not guarantee the system will remain within business requirements.
Malicious attackers gaining control over a part of a blockchain deployment can wreak all sorts of havoc. They can attempt to disrupt availability, e.g. by spamming the system with transactions or delaying deliveries. Or they can attempt to break consensus, e.g. by sending fake messages, dropping messages or voting for conflicting updates. While the likelihood of a successful attack depends heavily on the security of the system, the possible exploits and their consequences depend on its design and usage.
Organisations join consortia to benefit from collaboration. Consensus and tamper-evident records on the blockchain are key to ensuring mutual trust, as rule-breaking can be easily spotted and proven. However, this only holds true for what happens on the blockchain. Importantly, the code running the blockchain is itself not on chain – except, of course, smart contracts – so tampering with it is not evident. Thus, it is relevant to assess the kind of self-serving behaviour possible without leaving evidence on chain. For example, in Raft, the current leader can drop or front-run transactions. Front-running can be observed e.g. in Ethereum for viable, net-profit transactions [11]. One must assume this can occur also in permissioned blockchains, if opportunities arise.
Consensus algorithms are fault tolerant. The maximum number and nature of faults that can be tolerated depend on the algorithm, as do further hard requirements on the system, e.g. messages between peers being delivered in order, bounds on communication delays, etc. As long as their specific requirements are met, consensus algorithms give certain guarantees. Most importantly, they usually guarantee safety and some degree of liveness. However, which guarantees are broken if requirements are violated, and the potential consequences, can vary dramatically.
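As a rough illustration of how these bounds depend on the algorithm family, the standard quorum formulas can be computed in a few lines (a sketch; the exact bounds of a given implementation should be taken from its documentation):

```python
def crash_fault_tolerance(n: int) -> int:
    """Max crash faults tolerated by majority-quorum algorithms such as Raft."""
    return (n - 1) // 2

def byzantine_fault_tolerance(n: int) -> int:
    """Max Byzantine faults tolerated by classic BFT algorithms such as PBFT."""
    return (n - 1) // 3

# Four consensus nodes tolerate one crash fault under Raft...
assert crash_fault_tolerance(4) == 1
# ...but a BFT algorithm would need at least seven nodes to tolerate two faults:
assert byzantine_fault_tolerance(7) == 2
```

Exceeding these bounds voids the algorithm's guarantees, which is precisely the regime where consequences become hard to predict.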
Blockchains ensure agreement across multiple copies of the ledger. This is achieved by total order of the updates, i.e. the same updates are applied to every copy in the same order. Therefore, conflicts between updates get resolved in real-time by applying the earlier update and rejecting the later. However, this implies a degradation of performance: first, clients having their update transactions rejected experience higher latencies, as they need to resubmit their update transactions, and second, the system’s maximum goodput (valid throughput) is reduced, as conflicting transactions are processed, only to get discarded. The frequency of conflicts occurring is influenced by multiple factors, mainly by the likelihood of updates touching the same data and by the volume of potentially conflicting data in flight, i.e. by the frequency, size and latency of update transactions. Thus, e.g. an unexpected spike in the number of conflicts can lead to a severe performance degradation or even system overload.
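The dependence of the conflict rate on load and latency can be sketched with a simple analytical model (an assumption for illustration: updates pick keys independently and uniformly from a fixed set of hot keys):

```python
def conflict_probability(rate_tps: float, latency_s: float, hot_keys: int) -> float:
    """Probability that a new update touches the same key as an update already
    in flight, assuming keys are drawn uniformly from `hot_keys` keys."""
    in_flight = rate_tps * latency_s  # expected number of concurrent updates
    return 1.0 - (1.0 - 1.0 / hot_keys) ** in_flight

# Doubling latency (e.g. during congestion) roughly doubles the conflict rate,
# which in turn adds load from resubmissions - a potential feedback loop:
p_normal = conflict_probability(300, 0.5, 10_000)
p_slow = conflict_probability(300, 1.0, 10_000)
assert p_slow > p_normal
```

Real workloads are rarely uniform, which is exactly why simulation with realistic access patterns is preferable to such back-of-the-envelope estimates.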
Please note that Hyperledger Fabric’s execute-order-validate architecture behaves very differently from the more common order-execute architectures.
After possible threats have been established, it is necessary to determine their likelihood and their potential for loss. Both are, of course, necessary to quantify risk. Here, we are interested in the latter and will not dwell on determining likelihoods, except to note that in many cases the threats pre-date blockchains and that there is a rich understanding to draw upon, e.g. for server failures [12]. Based on risk quantification, relevant risks can be identified and mitigation strategies investigated.
Risk is the product of loss and likelihood. We will assume the likelihood has already been assessed. To quantify loss, we need to establish the specific system behaviour – in response to an actualised threat – and map this behaviour onto loss. Defining a reasonable mapping is highly domain and use-case specific. It requires careful consideration by subject matter experts and benefits greatly from collaboration with consensus/blockchain experts. The latter can establish an abstraction of the system behaviour, while the former quantify loss as a function of this abstract representation.
However, for blockchain solutions the gap in this approach is in establishing the system behaviour resulting from actualised threats. Blockchain systems are complex and hard to predict. Also, the experience accumulated to date is scarce, due to the novelty of these systems. Finally, the behaviour is sensitive to small changes, thereby limiting the generalisability of observed behaviours. Data is especially scarce for threat scenarios, as they are intrinsically rare. Hence, specific behaviour in response to actualised threats needs to be analysed in targeted studies. This can be achieved either with real deployments or with simulations.
Let’s first consider real deployments. When studying a production system, prudence requires halting production, as the consequences of the threats have not been established yet. The alternative is studying a second deployment. This is viable if generic and scalable hardware is being used. Otherwise, key infrastructure needs to be duplicated, driving up cost and effort. However, studying threats in a real deployment – be it the production system or a second deployment – is difficult. It requires methods of provoking the threats to actualise. This involves careful crafting, manipulating algorithmic, low-level behaviour and exercising a high degree of control over the hardware, which may not even be feasible. Also, measuring and gathering the relevant data needs to be well orchestrated and executed, preferably using dedicated tools like Hyperledger Caliper [13]. Finally, studying a real deployment requires a fully operational system.
Simulation has some key advantages over real deployments: it is efficient, yields full control and transparency, enables straightforward migration of scenarios between blockchain architectures and is possible even before the system has been fully developed and deployed. Note that the major challenge with leveraging simulation is faithfully modelling and representing real-world systems; the combination of model crafting and validation is therefore key to unlocking simulation's full potential. Let's consider these advantages of simulation in more detail.
First, the full control afforded enables straightforward, arbitrary changes to simulation scenarios, even if those changes are hard to actualise in a real deployment. For example, the threat of a whole AWS availability zone going offline is easy to simulate, but hard to actualise in reality.
Second, full transparency allows accessing and logging any relevant data at any level of detail throughout the simulation. In real deployments, some data are difficult to attain.
Third, because simulation is efficient, evaluating all relevant scenarios becomes affordable – even to the extent of gathering robust statistics for scenarios with high variability of outcome. For example, a crashed node in the context of Hyperledger Fabric [14] will have a very different effect depending on its role within the system: a crashed client leads to a different outcome than a crashed peer, which in turn is different from a crashed ordering node. Even among ordering nodes, a crashed leader has a much more significant impact than a crashed follower. Coupled with, e.g., different system loads or load distributions, a large variety of outcomes can manifest. With simulation, it is affordable to simulate a representative set of those outcomes, evaluate each one and create statistics.
Fourth, straightforward migration allows a modularised approach to investigating threats. Each threat can be formalised as a parametrisable scenario, down to the set of required simulations and the type of statistical data derived. Each such threat can then be simulated for any new system with minimal effort.
And finally, simulation can bring quantitative evaluation forward within the design process, without requiring expensive and time-consuming PoCs.
In consequence, simulation makes it possible to develop and constantly update a comprehensive set of threats, with corresponding statistical evaluation methods, and to deploy the full set efficiently against new system architectures. Also, it enables fast design iteration and subsequent evaluation, e.g. to mitigate risk.
Let us consider a simple CBDC as an example. Clearly, the following is not a realistic set-up by any stretch of the imagination, but it will serve to illustrate the challenges involved in designing a resilient system and the opportunities afforded by simulation. We assume a decentralised blockchain-based CBDC, jointly run by four central banks across Europe, implemented on Hyperledger Fabric and deployed on AWS. Each central bank, henceforth referred to as organisation, will run an orderer, two peers and four clients (Note: Orderer is the name for consensus nodes in Hyperledger Fabric; peers store the ledger, answer queries and execute/validate transactions; clients process transactions, i.e. send them to peers for execution and to orderers for inclusion in the blockchain). Each organisation runs in a different AWS datacenter: in Dublin, Stockholm, Frankfurt and Milan. We will assume transactions only change a few entries in the ledger and are therefore compact. Under ordinary circumstances, this set-up can process around 1000 transactions per second (tps) before congestion starts setting in, eventually leading to a system overload.
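The set-up above can be captured as a small declarative description, which is roughly what a simulation scenario would take as input (the structure and field names here are illustrative, not an actual simulator's configuration format):

```python
# Declarative sketch of the example deployment (all names illustrative).
REGIONS = ["Dublin", "Stockholm", "Frankfurt", "Milan"]

topology = {
    region: {
        "orderers": 1,  # Raft consensus node
        "peers": 2,     # store the ledger, execute/validate transactions
        "clients": 4,   # submit transactions on behalf of users
    }
    for region in REGIONS
}

# Any three of the four orderers suffice for Raft to make progress:
total_orderers = sum(org["orderers"] for org in topology.values())
assert total_orderers - 1 >= total_orderers // 2 + 1
```

Expressing the deployment declaratively is what makes scenario migration cheap: the same threat definitions can be replayed against a modified topology.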
As we have seen, there are a number of threats, which – if actualised – can lead to a deterioration of performance. For demonstration, we will focus on a single, straight-forward threat: one datacenter experiences technical difficulties, leading to higher network latencies to and from the other datacenters (Note that the orderer in the affected datacenter is not the Raft leader). We will assume these difficulties have a duration of 60s and simulate with 2x, 3x, 4x and 5x the original latencies. To determine what level of performance deterioration incurs what loss, we will assume a simple metric: Transactions that settle within one second do not incur any loss, every latency beyond one second shall incur a linear loss of one Euro per second (per transaction).
loss per second = sum(max(latency − 1 s, 0)) / duration [€/s]
One could imagine a response time service level agreement along these lines. Finally, we will run at 300 tps, a load well within the capacity of the set-up. Arriving at a reasonable estimate of the loss ad hoc, based on expertise and experience, is very challenging, even though the example system and metric are both simple. With simulation, determining the loss is straightforward: the scenario is run, the latency data gathered and the loss equation (above) evaluated, see Figure 1. Please note that the results (loss per second) are averages over multiple runs and the error bars indicate the corresponding standard deviations. Also, note that these standard deviations are surprisingly high, the reasons for which will become clear later.
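The loss metric itself is only a few lines of code once per-transaction latencies are available (a sketch; the threshold and rate parameters simply restate the assumptions above):

```python
def loss_per_second(latencies_s, duration_s, threshold_s=1.0, rate_eur_per_s=1.0):
    """Loss metric from the example: transactions settling within `threshold_s`
    are free; beyond that, each transaction incurs `rate_eur_per_s` per second
    of excess latency. Returns the average loss in EUR per second."""
    total = sum(max(lat - threshold_s, 0.0) for lat in latencies_s) * rate_eur_per_s
    return total / duration_s

# 60 s window at 300 tps: if every transaction settles in 0.8 s, there is no loss.
assert loss_per_second([0.8] * (300 * 60), 60.0) == 0.0
# If every transaction instead takes 1.5 s, each incurs 0.5 EUR:
assert loss_per_second([1.5] * (300 * 60), 60.0) == 150.0
```

The hard part is not this evaluation but obtaining realistic latency distributions under the threat scenario, which is exactly what the simulation provides.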
If a risk turns out to be quantitatively relevant, mitigation strategies can be investigated using simulation. Note that the mitigation strategies need to be determined through expertise, while simulation allows us to quantify the achieved effect. In the example above, the obvious approach to mitigation is circumventing the affected datacenter. Also, please note that any three out of the four datacenters are sufficient to make progress with Raft, therefore consensus performance should be robust to the network degradation.
First, let us consider the origin of transactions: users connect to clients to submit their transactions. Users connected to clients in the affected datacenter will realise in real time when transactions are delayed. As a first mitigation, let us therefore consider the following: allowing users to submit their transactions to a different organisation. It is reasonable to assume this will have a large impact, as affected users will quickly migrate their transactions away from the affected datacenter. Of course, decision heuristics need to be formulated, along with strategies to retest the viability of the original organisation. Let's assume this transaction migration is executed perfectly, i.e. users switch immediately. In Figure 2 we can see the effect of this mitigation strategy on loss and, as it turns out, it is far less impactful than anticipated. This is surprising, as zero transactions are being submitted to the affected datacenter during the technical difficulties.
However, a subtle behaviour of Hyperledger Fabric comes into play: every peer receives new blocks to add to its copy of the blockchain from a single orderer, chosen at random. Therefore, the affected datacenter remains on the critical path for all transactions processed by peers subscribing to the orderer residing at that datacenter. These peers can reside in any datacenter. A mitigation consists of reconfiguring this behaviour: all peers receive new blocks from the orderer of their own organisation (in the scenario studied here, every organisation has an orderer). Please note that in the current release of Hyperledger Fabric (version 2.2.1), this requires a patch of the peer software, as subscribing to a random orderer is the only option available. We hope this alternative will be added in a future update. Figure 3 shows the resulting loss per second when deploying only this second mitigation (i.e. without migrating the transactions). As can be seen, reconfiguring the peers reduces the loss per second somewhat. More importantly, it significantly reduces the deviations between the runs (error bars), indicating that subscribing to random orderers for new blocks was the source of this uncertainty.
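The two block-delivery policies can be contrasted in a few lines (a sketch with hypothetical names; this is not Fabric's actual peer code, merely the selection logic it embodies):

```python
import random

# Hypothetical mapping from organisation to its orderer endpoint.
ORDERERS = {"Dublin": "orderer.dublin", "Stockholm": "orderer.stockholm",
            "Frankfurt": "orderer.frankfurt", "Milan": "orderer.milan"}

def random_orderer(orderers):
    """Default behaviour: a peer pulls new blocks from a randomly chosen orderer,
    so any peer may end up depending on a degraded remote datacenter."""
    return orderers[random.choice(list(orderers))]

def own_org_orderer(orderers, peer_org):
    """Mitigation: a peer pulls new blocks from its own organisation's orderer."""
    return orderers[peer_org]

# With the mitigation, a Frankfurt peer never leaves its own datacenter:
assert own_org_orderer(ORDERERS, "Frankfurt") == "orderer.frankfurt"
```

The randomness of the default policy also explains the large run-to-run deviations: whether a given run suffers depends on which peers happen to subscribe to the affected orderer.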
As can be seen in Figure 4, combining both mitigation strategies – transaction migration and peer reconfiguration – leads to a nearly complete mitigation of the potential loss and corresponding risk. This indicates that the affected datacenter was successfully removed from the critical path for all transactions. Please note that both mitigation strategies are minimally invasive.
Blockchains are highly complex systems. They display emergent behaviour, making them hard to predict, and sensitivity to small changes, making available data hard to generalise. This renders potential loss from threats, and by extension the risk associated with them, challenging to evaluate. Yet, when investigating competing designs, for example in the context of CBDCs, resilience belongs to the core requirements put forward [8, 9, 10]. In order to evaluate resilience reliably, not only must the behaviour of the system be mapped onto loss, simple mitigation strategies also need to be explored, lest promising candidates get discarded due to suboptimal design details that could easily have been avoided. Simulation makes it possible to achieve this level of insight efficiently, at an early stage of the design process and for a large set of competing candidate designs. Leveraging simulation, quantitative evaluation and iterative design optimisations – e.g. with regard to risk mitigation – can be integrated into the design process without requiring lengthy pilot studies, as we have demonstrated with a simple example. This allows for a far more informed comparison of competing candidate solutions, leading to a more realistic navigation of trade-offs and, ultimately, to a more resilient and better performing production system.
[1] L. Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems 16, 1998.
[2] D. Ongaro and J. Ousterhout. In Search of an Understandable Consensus Algorithm. In Proc. USENIX Annual Technical Conference, ATC'14, 2014.
[3] M. Poke, T. Hoefler, and C.W. Glass. AllConcur: Leaderless Concurrent Atomic Broadcast. In Proc. 26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC'17, 2017.
[4] M. Poke and C.W. Glass. A Dual Digraph Approach for Leaderless Atomic Broadcast. In Proc. 38th International Symposium on Reliable Distributed Systems, SRDS'19, 2019.
[5] Bank of Canada, Bank of England, Bank of Japan, European Central Bank, Federal Reserve, Sveriges Riksbank, Swiss National Bank and BIS. Central bank digital currencies: foundational principles and core features. 2020.
[6] C. Boar, H. Holden, and A. Wadsworth. Impending arrival – a sequel to the survey on central bank digital currency. BIS, 2020.
[7] Digital Dollar Project. Exploring a United States Central Bank Digital Currency, Proposed Pilot Programs. 2020.
[8] Bank of England. Central Bank Digital Currency: Opportunities, challenges and design. 2020.
[9] S. Scorer. Beyond blockchain: what are the technology requirements for a Central Bank Digital Currency? https://bankunderground.co.uk/2017/09/13/beyond-blockchain-what-are-the-technology-requirements-for-a-central-bank-digital-currency/, 2017. Visited October 2020.
[10] R. Auer and R. Böhme. The technology of retail central bank digital currency. BIS Quarterly Review, 2020.
[11] D. Robinson and G. Konstantopoulos. Ethereum is a Dark Forest. https://medium.com/@danrobinson/ethereum-is-a-dark-forest-ecc5f0505dff, 2020. Visited October 2020.
[12] K. Sato, N. Maruyama, K. Mohror, A. Moody, T. Gamblin, B. R. de Supinski, and S. Matsuoka. Design and Modeling of a Non-blocking Checkpointing System. In Proc. International Conference on High Performance Computing, Networking, Storage and Analysis, SC'12, 2012.
[13] Hyperledger Caliper. https://www.hyperledger.org/use/caliper. Visited October 2020.
[14] E. Androulaki, A. Barger, V. Bortnikov, C. Cachin, K. Christidis, A. De Caro, D. Enyeart, C. Ferris, G. Laventman, Y. Manevich, S. Muralidharan, C. Murthy, B. Nguyen, M. Sethi, G. Singh, K. Smith, A. Sorniotti, C. Stathakopoulou, M. Vukolic, S.W. Cocco, and J. Yellick. Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains. In Proc. 13th EuroSys Conference, EuroSys'18, 2018.