Using Reinforcement Learning to Perform Qubit Routing in Quantum Compilers

MATTEO G. POZZI, University of Cambridge Computer Laboratory
STEVEN J. HERBERT, University of Cambridge Computer Laboratory and Cambridge Quantum Computing
AKASH SENGUPTA, Department of Engineering, University of Cambridge
ROBERT D. MULLINS, University of Cambridge Computer Laboratory

“Qubit routing” refers to the task of modifying quantum circuits so that they satisfy the connectivity constraints of a target quantum computer. This involves inserting SWAP gates into the circuit so that the logical gates only ever occur between adjacent physical qubits. The goal is to minimise the circuit depth added by the SWAP gates. In this article, we propose a qubit routing procedure that uses a modified version of the deep Q-learning paradigm. The system is able to outperform the qubit routing procedures from two of the most advanced quantum compilers currently available (Qiskit and t|ket), on both random and realistic circuits, across a range of near-term architecture sizes (with up to 50 qubits).

CCS Concepts: • Computing methodologies → Reinforcement learning; • Hardware → Quantum computation; • Software and its engineering → Compilers;

Additional Key Words and Phrases: qubit routing, qubit mapping, Q-learning, simulated annealing, qiskit, tket, quantum circuits, machine learning, deep learning, neural networks

ACM Reference format: Matteo G. Pozzi, Steven J. Herbert, Akash Sengupta, and Robert D. Mullins. 2022. Using Reinforcement Learning to Perform Qubit Routing in Quantum Compilers. ACM Trans. Quantum Comput. 3, 2, Article 10 (May 2022), 25 pages.
https://doi.org/10.1145/3520434

1 INTRODUCTION

In his highly influential 2018 paper, John Preskill coined the term Noisy Intermediate-Scale Quantum (NISQ) technology [33] and suggested that the so-called “NISQ era” would arrive in the near future; that is, we would soon have quantum computers with 50–100 qubits that would be able to solve problems that are intractable for even the best classical computers (a phenomenon known as quantum supremacy [32]). Since then, we have seen Google and more recently IBM [10] produce quantum computers with over 50 qubits.

Authors’ addresses: M. G. Pozzi and R. D. Mullins, University of Cambridge Computer Laboratory, 15 JJ Thomson Ave, Cambridge CB3 0FD; emails: mgp35@cantab.ac.uk, rdm34@cam.ac.uk; S. J. Herbert, University of Cambridge Computer Laboratory, 15 JJ Thomson Ave, Cambridge CB3 0FD; email: sjh227@cam.ac.uk; A. Sengupta, Department of Engineering, University of Cambridge, Trumpington St, Cambridge CB2 1PZ; email: as2562@cam.ac.uk.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. 2643-6817/2022/05-ART10 $15.00 https://doi.org/10.1145/3520434

ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022.
In October 2019, Google announced they had achieved quantum supremacy with their 53-qubit Sycamore processor [8] (although IBM were quick to dispute this claim). Such near-term quantum computers support a series of one- and two-qubit operations, or “gates,” which can be assembled into quantum “circuits,” in a similar spirit to how logic gates that act on classical bits can be assembled into sequential logic circuits. A high-level circuit description (in a language such as OpenQASM [13]) must be compiled before it can be executed on a target quantum architecture—this process includes passes to satisfy the constraints of the target hardware. Specifically, each quantum architecture has an associated “topology,” or connectivity graph, consisting of a set of physical nodes and links between them. Qubits inhabit the nodes and can only interact with qubits on adjacent nodes—SWAP gates can swap the nodes they inhabit. To make an arbitrary quantum circuit executable on a given target architecture, a quantum compiler has to insert SWAP gates so that gates in the original circuit only ever occur between qubits located at adjacent nodes, a process known as “routing.” This will produce a new circuit, possibly with a greater depth, that implements the same unitary function as the original circuit while respecting the topological constraints.

The quantum architectures of today are extremely resource-constrained devices, with relatively low numbers and fidelities of qubits. Minimising the added circuit depth is a key goal in maximising the amount of useful work that can be done by today’s systems before decoherence, so much so that in 2018, IBM offered a prize for the best qubit routing algorithm. Tan and Cong [38] recently compared the performance of various state-of-the-art routing algorithms on benchmarks with known optimal depth and concluded that even the most advanced algorithms are significantly lacking—scope therefore exists to improve upon them.
In this article, we frame the qubit routing problem as a reinforcement learning (RL) problem, employing a modified deep Q-learning approach to route qubits. We consider actions to be sets of parallelisable two-qubit quantum gates, namely Controlled-NOT (CNOT) and SWAP gates—the system uses simulated annealing, a combinatorial optimisation technique, to select actions (sets of gates) at each timestep to be scheduled in the routed circuit. We consider qubit routing as an abstracted problem and choose to minimise the added circuit depth rather than alternative metrics such as gate count. We believe minimising circuit depth to be a more significant goal, for several reasons: First, even idling qubits can decohere, and, second, one can readily envisage applications where long-running sparse circuits with low total gate count would not be favourable. It is also worth noting that minimising gate count is a trivial problem to solve with Q-learning, via an off-the-shelf formulation (by considering single gates to be actions, rather than sets/layers thereof)—however, part of this article’s contribution lies in solving the interesting problem of Q-learning with a combinatorial action space, which is necessary to truly minimise added circuit depth. With these ideas in mind, we then benchmark our system against the routing passes of state-of-the-art quantum compilers and demonstrate that our system is able to outperform its competitors on the most pertinent benchmarks.

2 BACKGROUND

In this section, we begin by formalising and defining the terms used throughout the article.

Google AI Blog, Quantum Supremacy Using a Programmable Superconducting Processor (blog post); https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html.
IBM Research Blog, On “Quantum Supremacy”; https://www.ibm.com/blogs/research/2019/10/on-quantum-supremacy/.
IBM Research Blog, We Have Winners! ...
of the IBM Qiskit Developer Challenge; https://www.ibm.com/blogs/research/2018/08/winners-qiskit-developer-challenge/.

Fig. 1. An example of a quantum circuit and its decomposition into layers.

2.1 Quantum Circuits

A quantum circuit is composed of a series of operations, or gates, which transform the state of one or more logical qubits. Figure 1(a) shows an example of a quantum circuit with four qubits—this circuit contains two Hadamard gates and five CNOT gates in various orientations. Quantum circuits can be decomposed into a universal set of one- and two-qubit gates—on many architectures, the two-qubit gate of choice is the CNOT. In this article, we therefore only consider CNOT gates—single-qubit gates are not relevant in qubit routing, since they can occur at any node, and in any case, they are much quicker than two-qubit gates on all real quantum computers.

An important concept is the notion of circuit depth. For an ordered set of two-qubit gates G, indexed from i = 1, with each gate g_i = {q_j, q_k} acting on qubits q_j and q_k:

d_0(q) = 0 for all qubits q in the circuit, (1)

d_{t+1}(q) = max(d_t(q_j), d_t(q_k)) + 1 if q ∈ g_t (for g_t = {q_j, q_k}), and d_t(q) otherwise, (2)

Circuit depth d = max_q d_{|G|}(q). (3)

This can be visualised as slicing the circuit into timesteps of gates that can be performed in parallel—the depth of a circuit is then the minimum number of timesteps it can be decomposed into, without any qubit performing more than one interaction in any given timestep (see Figure 1(b)).

2.2 Quantum Architectures

For our purposes, a quantum architecture is a connectivity graph, composed of a set of physical qubits, or nodes, and a set of links between them. Figure 2 provides an example of one of IBM’s quantum architectures with 20 qubits.
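The layer decomposition of Equations (1)–(3) can be sketched directly in code; a minimal illustration, assuming gates are given as ordered pairs of qubit indices:

```python
def circuit_depth(gates):
    """Depth per Equations (1)-(3): each two-qubit gate lands one timestep
    after the later of its two qubits' current frontiers."""
    frontier = {}  # d_t(q): deepest timestep occupied so far by qubit q
    for q_j, q_k in gates:
        t = max(frontier.get(q_j, 0), frontier.get(q_k, 0)) + 1
        frontier[q_j] = frontier[q_k] = t
    return max(frontier.values(), default=0)
```

Two gates sharing a qubit are forced into consecutive timesteps, while gates on disjoint qubit pairs may share a timestep.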
In this article, we only consider undirected qubit connectivity graphs—that is, CNOT gates can be performed in either direction. In practice, the direction of CNOT gates can be inverted by using Hadamard gates if necessary, so this simplification makes little difference in our domain.

2.3 Placements and SWAP Gates

At any time t during circuit execution, qubits Q are mapped onto nodes N according to some placement p_t : Q → N. Gates may only occur between two qubits if they lie on adjacent nodes—that is, gate g = {q_0, q_1} may only occur at time t if (p_t(q_0), p_t(q_1)) ∈ E, where E is the set of edges in the architecture’s connectivity graph.

Fig. 2. The IBM Q20 Tokyo [1].

Fig. 3. A SWAP gate and its decomposition into three CNOT gates.

SWAP gates allow two qubits on adjacent nodes to switch positions. Formally, for SWAP gate s = {q_0, q_1} at time t, p_{t+1}(q_0) = p_t(q_1) and vice versa. Figure 3(a) shows the circuit symbol for a SWAP gate, which can be decomposed into three CNOT gates in sequence, as shown in Figure 3(b).

2.4 Qubit Routing

“Routing” denotes the task of inserting SWAP gates into quantum circuits so that every gate in the original circuit can be performed on a given target architecture. Qubit routing passes in quantum compilers generally accept a circuit together with a connectivity graph and initial layout, and output a new circuit that respects these architectural constraints. The routing process can thus be represented as a function R : (c, g, l) → c′, with input circuit c, output circuit c′, connectivity graph g, and initial layout l. The goal is to minimise the added depth of the output circuit versus that of the original circuit.
This goal is important because we want to maximise the amount of useful work performed by qubits before decoherence and thus maximise the Quantum Volume (QV) [10] our current systems can achieve. Added depth can be formalised as two metrics, circuit depth overhead (CDO) and circuit depth ratio (CDR):

CDO = d(c′) − d(c),   CDR = d(c′) / d(c), (4)

where d denotes circuit depth, c is the original circuit, and c′ is the routed circuit. Figure 4 shows a quantum circuit of depth 6 before and after routing on the topology and initial layout in Figure 4(a). The routed circuit has depth 7 and therefore a CDO of 1 and a CDR of 7/6. Notice how two of the SWAP gates do not add any extra depth to the circuit—the routing procedure is able to perform these while CNOT gates are happening, thus minimising the depth overhead.

Fig. 4. An example of a quantum circuit before and after the routing procedure.

The process of routing qubits is inherently linked to their initial placement p_0. In this work, we mostly consider random initial placements for fair comparisons of the routing algorithms themselves, although many quantum compilers do provide strategies to optimise initial placement and these are also important (for some simple circuits, inserting swaps might not be necessary at all).

3 RELATED WORK

3.1 Challenges of the NISQ Era

Quantum computing in the NISQ era presents several challenges. Gyongyosi and Imre [16] provide a recent summary of the space. Martonosi and Roetteler [25] give a broad overview of the importance of computer science principles in the development of quantum computing. Möller and Vuik [28] describe the lessons that can be learned from how classical architectures have evolved to do scientific computing. Almudéver et al. [7] provide a summary from an engineering perspective.
The general theme among these papers is that quantum computing is currently facing many of the challenges that classical computing faced during its development. Franke et al. [14] highlight an important concern that came about in the late 1950s, called “the tyranny of numbers,” which refers to the massive increase in the number of components and interconnections required as architectures were scaled up, before the invention of the integrated circuit. The authors note that current quantum computing architectures will see a similar lack of scalability—they adapt Rent’s rule [23] and define a quantum Rent exponent p to quantify the progress made in this aspect of optimisation. Another key challenge is the lack of qubit fidelity, causing limits on achievable QV—this is a hardware-agnostic metric coined by IBM [10], which quantifies the limits of executable circuit size on a given quantum device. Minimising circuit depth provides an effective way of maximising QV, which is why the problem of routing qubits is such a key piece of the puzzle when it comes to maximising the usefulness of NISQ-era machines.

3.2 Qubit Routing

Herbert [17] provides some theoretical bounds on the depth overhead incurred from routing, while Tan and Cong [38] provide insight into the performance of various state-of-the-art routing systems on benchmarks with known optimal depth.

IBM’s Qiskit [5] is considered to be the most advanced and complete open-source quantum compiler. In 2018, IBM offered a prize for whoever could write the best routing algorithm for their quantum architectures. The winner was Alwin Zulehner, with an algorithm based on A*

IBM Research Blog, Now Open: Get quantum ready with new scientific prizes for professors, students and developers; https://www.ibm.com/blogs/research/2018/01/quantum-prizes/.
IBM Research Blog, We Have Winners! ... of the IBM Qiskit Developer Challenge; https://www.ibm.com/blogs/research/2018/08/winners-qiskit-developer-challenge/.
search [42]. Second place was tied between Sven Jandura and Eddie Schoute [9]. Zulehner claims that his background in Computer-Aided Design helped guide his prize-winning approach, which further supports the theme of Section 3.1 (above). Zulehner has gone on to publish an approach targeting SU(4) quantum circuits [43] and an approach for arbitrary quantum circuits based on Boolean satisfiability [40], which is only tractable for circuits with small numbers of qubits. IBM have currently integrated two approaches from the literature into Qiskit [6, 24], although open-source code exists for the others [36, 41].

Fig. 5. An illustration of the RL process [37].

Cambridge Quantum Computing’s (CQC) t|ket [4] is a proprietary compiler with state-of-the-art performance. Its routing procedure is detailed by Cowtan et al. [12]—it effectively adds SWAPs to minimise some cost function based on proprietary heuristics. The t|ket documentation suggests that the system now uses BRIDGE gates—Itoko et al. [19] provide some insight into how the use of BRIDGE gates can improve performance of SWAP-only routing algorithms.

Many other methods exist in addition to the ones above, such as that used by Google’s Cirq [3] or that proposed by Li et al. [24], who also propose a technique for finding initial qubit placements. Research also exists regarding approaches that consider alternative factors when routing, such as differing qubit error rates [29], and even approaches that use quantum computers for quantum compilation [21].

3.3 Reinforcement Learning: Deep Q-learning

Reinforcement learning is a sub-field of Machine Learning that offers a powerful paradigm for learning to perform tasks in contexts with very little a priori knowledge of what the optimal strategy might be.
The paradigm has proven to be very useful in complex situations with lots of input data, such as robotics [22] and video games [20, 27]. Under the paradigm, agents learn to achieve some goal in an environment by freely interacting with it and observing rewards for performing actions in different states (see Figure 5).

The Deep Q-learning (DQN) paradigm proposed by Mnih et al. [27] uses a convolutional neural network to learn a function Q(s, a) that represents the quality of being in state s and taking action a. The function is defined as follows:

Q(s, a) = r(s, a) + γ max_{a′} Q(s′, a′), (5)

where r is the reward conveyed to the agent for taking action a in state s, s′ is the state resulting from taking that action, and γ is a discount factor for discounting the value of future rewards. When using a neural network to update the Q function in response to new experiences, the learning equation can be written as follows:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})). (6)

Sven Jandura, Improving a Quantum Compiler; https://medium.com/qiskit/improving-a-quantum-compiler-48410d7a7
Eddie Schoute, Constraints on Quantum Circuits and getting around them; https://medium.com/qiskit/constraints-on-quantum-circuits-and-getting-around-them-7de973bd1a18.
Alwin Zulehner, How Computer-Aided Design helped me winning the QISKit Developer Challenge; https://medium.com/qiskit/how-computer-aided-design-helped-me-winning-the-qiskit-developer-challenge-4b1b60c8930f.

In general, a neural network for DQN will be designed to take a state vector as input (e.g., an image bitmap) and will have one output for each possible action (e.g., buttons on a games console)—this allows the system to easily select the maximum Q-value from the neural network’s outputs, rather than have to re-run a forward pass of the neural network for each action individually. State, action, and reward formulations depend on the problem type, but a common feature of DQN is to do some feature pre-processing via a feature selection function ϕ—for example, one may wish to convert an image to greyscale and down-sample it if dealing with large RGB images as input (as in Reference [27]). Such feature pre-processing may help the neural network to learn more effectively.

3.3.1 Improvements to DQN. There exist two common improvements to deep Q-learning, Double Deep Q-learning (DDQN) [39] and Prioritised Experience Replay (PER) [35]. The former helps improve the stability of the learning process by using two neural networks instead of one, while the latter enhances the learning process by replaying more useful experiences with a higher priority.

3.3.2 Action Selection Strategies. DQN methods often use an ϵ-greedy exploration policy during training (and perhaps also during testing): This means that a random action is taken with probability ϵ, and the action with the highest Q-value is taken with probability 1 − ϵ. The value of ϵ begins at 1 and gradually decays after each learning batch—this means that, initially, the agent often chooses random actions, while as training goes on, it increasingly chooses actions that maximise the Q-value.

4 QUBIT ROUTING WITH Q-LEARNING

In this section, we propose a new DQN paradigm to tackle the problem of routing qubits.
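For reference, update rule (6) together with the ϵ-greedy selection of Section 3.3.2 can be sketched in tabular form; the dictionary-based Q-table and the default α here are illustrative choices of ours, not values from the paper:

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.6):
    """Update rule (6): move Q(s_t, a_t) toward the observed reward plus
    the discounted best Q-value of the successor state."""
    best_next = max(Q.get((s_next, a_n), 0.0) for a_n in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def epsilon_greedy(Q, s, actions, eps):
    """Random action with probability eps, else the argmax-Q action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

The modified formulation below replaces the action argument of Q with a successor state, which is the key change needed for a combinatorial action space.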
The problem fits quite naturally into an RL-based interpretation: The end goal is to schedule a set of CNOT gates, given an initial mapping of logical qubits onto physical nodes, by inserting SWAP gates as necessary. The environment thus consists of a partially scheduled circuit, and the agent can decide to schedule CNOT and SWAP gates as necessary, where physically possible.

4.1 RL Formulation for Qubit Routing

Consider a mapping of logical qubits onto physical nodes, such that each logical qubit effectively “inhabits” a given node at a given timestep. Given an initial such mapping, an RL agent is given the ability to “schedule” gates from the original logical circuit, i.e., CNOTs, but only if the hardware constraints permit—that is, only if the two qubits involved in the logical gate inhabit adjacent physical nodes in the architecture’s topology. The agent is also able to swap the nodes that logical qubits inhabit, with the goal of perhaps resolving such hardware constraints—this corresponds to “scheduling” a SWAP gate. In this situation, “scheduling” a gate means adding it to an ordered list of gates—at the end of the process, we will be left with a complete list of gates that, when parallelised into layers, represent a routed version of the original logical circuit.

Fig. 6. A circuit and target topology, and two possible routed circuits.

Under this formulation, the state of the environment at any given timestep consists of a layout of logical qubits on physical nodes, and a partially-routed circuit—that is, a set of gates that have already been scheduled by the agent, and a set of gates that have yet to be routed. This state can be represented by a tuple containing the following elements, for example:

• Qubit locations: a mapping l : N → Q denoting which architectural nodes are currently holding which logical qubits.
• Qubit targets: a partial mapping t : Q ⇀ Q, with q_2 = t(q_1) iff q_1’s next interaction in the circuit is with q_2, or undefined if q_1 has performed all of its interactions.
• Circuit progress: a mapping p : Q → N_{d+1}, for a circuit with depth d, with n = p(q) iff q has completed n interactions so far.

A reward signal can be issued to the agent for each CNOT it schedules. Actions can be formulated either as scheduling individual gates/swaps, or scheduling a layer of gates/swaps—in both cases, we assume that CNOTs and SWAPs both take one timestep to complete. We argue that for the problem at hand, the latter action formulation is preferable. Consider the following example.

Figure 6 shows a circuit to be routed on a target topology (with given initial layout) and two possible solutions. Figure 6(c) and (d) represent two different solutions that both require two “actions” under the former action formulation, and therefore appear equivalent, since they both schedule two gates. However, in reality, c_1 can occur in two timesteps, since the first CNOT and the SWAP can occur in parallel, while c_2 must take three timesteps. c_1 is therefore the optimal solution, in terms of added circuit depth, but the former RL formulation provides no way of telling that this is the optimal choice.

As this example demonstrates, an RL system could never hope to be optimal if it relies merely on implicit execution of single quantum gates, since it has no concrete way of minimising circuit depth. It is therefore important for actions to be formulated as sets of parallelisable quantum gates/swaps—in other words, an action is a layer of gates, potentially mixing logical CNOTs from the original circuit together with SWAPs.

4.2 Combinatorial Action Spaces

A key challenge of this formulation lies in the fact that the action space is combinatorial—that is, for a connectivity graph with n edges, there are O(2^n) possible parallelisable sets of gates to choose from.
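To make this blow-up concrete: the parallelisable SWAP layers of a topology are exactly the matchings of its connectivity graph, which can be counted by brute force on small examples (an illustrative sketch only; real architectures make this enumeration infeasible):

```python
from itertools import combinations

def parallel_swap_layers(edges):
    """Count subsets of pairwise-disjoint edges (matchings); each subset
    is one candidate layer of simultaneous SWAPs."""
    count = 0
    for r in range(len(edges) + 1):
        for subset in combinations(edges, r):
            nodes = [n for edge in subset for n in edge]
            if len(nodes) == len(set(nodes)):  # no qubit swapped twice at once
                count += 1
    return count
```

On a 4-node ring there are already 7 such layers (the empty layer, four single SWAPs, and two disjoint pairs), and the count grows exponentially with the number of edges.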
It is therefore intractable to have a neural network learn a quality function based on state-action pairs—a one-hot encoding of actions on the output layer of the neural network would be infeasible. To mitigate this issue, we propose a modified learning equation that learns the quality of state transitions and uses a combinatorial optimisation technique (simulated annealing in this case) to search for the action leading to the highest-quality state transition after that. The quality function can therefore be expressed as follows:

Q(s_t, s_{t+1}) = r_t + γ max_{a_{t+1}} Q(s_{t+1}, env(s_{t+1}, a_{t+1})), (7)

where the env function yields the resulting state when applying a given action to a given state (the resulting state would be s_{t+2} in this particular case). The DQN model presented in this article is trained to represent the above equation—its (single) output value is thus the “quality” of being in one state and transitioning to another under a given action. Like traditional Q-learning, this method attempts to capture the idea of quality being assigned to a state and an action, but by using the next state instead of the action itself and passing both state vectors as inputs to the neural network, the system is able to overcome the issue of the combinatorial action space.

4.3 Scheduling Gates

In our system, CNOT gates are automatically scheduled when their logical qubits inhabit adjacent nodes. Actions can therefore be considered as a mandatory set of CNOT gates, together with a set of SWAP gates that can be chosen by the agent to be performed in the same timestep (as long as they do not involve qubits already locked in a CNOT gate in the current timestep).

4.3.1 Performing the Actions.
In standard RL style, the environment takes a state s_t and action a_t, and outputs a tuple (s_{t+1}, r_t), as follows:

(1) Schedule any CNOT gates between adjacent logical qubits—consider these nodes to be “protected,” so that no SWAPs can be scheduled involving them in the current timestep.
(2) Calculate the total distance between each pair of mutually-targeting nodes (i.e., gates), which are not in the protected set—call this the total pre-swap distance d_pre.
(3) Perform the swaps in a_t, and calculate the total post-swap distance d_post.
(4) Compute reward r_t.

Effectively, this amounts to scheduling some gates, scheduling some swaps that don’t conflict with them, and then updating the state in response to this action—this means that gates and swaps are mixed into the same actions, but the gates are mandatory, i.e., gates are performed as soon as their two qubits land next to each other.

A fixed gate reward is issued for each pair of mutually targeting qubits that are brought next to each other (i.e., when a gate is made possible). In the absence of other reward signals, the reward for a gate whose qubits are very distant would be discounted to such an extent that it would essentially be lost in the noise. To ensure that the system scales well to quantum architectures with a higher diameter, we also introduce a distance reduction reward (when d_post < d_pre)—this is just a fixed constant for each qubit that is brought closer to (but not next to) its target.

4.3.2 Simulated Annealing. This is the combinatorial optimisation process used to find actions to perform. The process searches for higher-quality states by first swapping a random edge in the architecture and then probing neighbouring solutions (i.e., actions that are one further edge swap away) and accepting them based on some acceptance probability.
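The four steps of Section 4.3.1 can be sketched as a self-contained transition function; the reward constants, the pair-level (rather than per-qubit) distance bookkeeping, and all data-structure choices here are our own simplifying assumptions rather than the paper's implementation:

```python
from collections import deque

def bfs_dist(adj, a, b):
    """Shortest-path distance between nodes a and b in the topology."""
    dist, frontier = {a: 0}, deque([a])
    while frontier:
        u = frontier.popleft()
        if u == b:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)

def env_step(adj, placement, targets, swaps, gate_reward=10.0, dist_reward=1.0):
    """One transition following steps (1)-(4) of Section 4.3.1."""
    # Step 1: mutually-targeting qubits that are already adjacent get their
    # CNOT scheduled; their nodes become "protected" for this timestep.
    protected = set()
    for a, b in targets:
        if bfs_dist(adj, placement[a], placement[b]) == 1:
            protected |= {placement[a], placement[b]}
    open_pairs = [(a, b) for a, b in targets
                  if placement[a] not in protected and placement[b] not in protected]
    dists = lambda: {p: bfs_dist(adj, placement[p[0]], placement[p[1]])
                     for p in open_pairs}
    d_pre = dists()                      # Step 2: pre-swap distances
    for n1, n2 in swaps:                 # Step 3: perform the chosen SWAPs
        occupant = {node: q for q, node in placement.items()}
        if n1 in occupant:
            placement[occupant[n1]] = n2
        if n2 in occupant:
            placement[occupant[n2]] = n1
    d_post = dists()
    # Step 4: reward newly-possible gates, plus a smaller bonus per pair
    # brought closer to (but not next to) its target.
    reward = sum(gate_reward if d_post[p] == 1
                 else dist_reward if d_post[p] < d_pre[p] else 0.0
                 for p in open_pairs)
    return placement, reward
```

On a 4-node line with one interacting pair, a SWAP that shortens the pair's separation earns the (assumed) distance-reduction bonus, and a SWAP that makes the pair adjacent earns the (assumed) gate reward.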
Actions that would lead to a non-parallelisable set of SWAPs are immediately disqualified and removed from further consideration, as are those that would conflict with the current set of “protected” nodes, i.e., nodes inhabiting qubits that are currently involved in a CNOT gate.

The quality of a given action a, which acts on state s_0 to yield next state s_x, is just Q(s_0, s_x). The acceptance probability is then

P_acc(Q_x, Q_{x′}, t) = e^{(Q_{x′} − Q_x)/t} if Q_{x′} ≤ Q_x, and 1 otherwise, (8)

for qualities Q_x = Q(s_0, s_x) and Q_{x′} = Q(s_0, s_{x′}) (i.e., current action and new candidate action), and current “temperature” t. The temperature decays by a fixed multiplier upon each iteration, until a given minimum temperature is reached.

“Quality” refers to the value of the Q-function defined above, i.e., the output of the DQN model.

4.4 State Representation

The state tuple described above needs to be condensed into a fixed-length vector that can be learned from readily by a neural network. The processed state representation (or feature selection function ϕ) we propose here is a distance vector d, such that d[i] represents the number of qubits that are a distance of i from their targets. One benefit of this representation is that it scales well—rather than scaling with the number of qubits n, which the original state tuple would have, this state representation now scales with the diameter of the connectivity graph, which is O(√n) for a grid, and may be as little as O(log n) in some cases [17]. This representation is still not injective, since many different scenarios can map onto the same distance vector, hindering the learning process.
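The distance-vector component d can be computed directly from the placement and target maps; a minimal sketch, assuming an adjacency-list topology and dictionary-based placements:

```python
from collections import deque

def node_distances(adj, source):
    """BFS distances from one node to every other node."""
    dist, frontier = {source: 0}, deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

def distance_vector(adj, diameter, placement, targets):
    """d[i] = number of qubits whose target currently sits i nodes away."""
    d = [0] * (diameter + 1)
    for q, target_q in targets.items():
        d[node_distances(adj, placement[q])[placement[target_q]]] += 1
    return d
```

Note that the vector's length is fixed by the graph's diameter, not by the number of qubits, which is the scaling benefit described above.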
It is therefore helpful to allow the agent to distinguish situations in which its action choice will end up more or less constrained by the currently scheduled gates—we add another component to our proposed feature selection function, e, such that e[i] represents the number of nodes n that have i edges conforming to the following conditions:

(1) The edge neighbours n, and lies along the shortest path to n’s target.
(2) The edge does not involve a currently protected node.

The system concatenates the above state vectors for s_t and s_{t+1} and passes this new representation to the neural network. In other words, the system uses a feature selection function Φ(s_t, s_{t+1}) = (d_{s_t}, e_{s_t}, d_{s_{t+1}}, e_{s_{t+1}}).

4.5 Model Algorithm

At a high level, the DQN model combines the above concepts into a qubit routing procedure that works as follows. At each timestep, the model searches for an action to perform by carrying out the simulated annealing process described above—this process executes multiple passes of the neural network, once per candidate action, to search for an action that maximises the neural network’s output value Q. Once such an action, a, is selected, the environment updates the state in response to this action to yield a new state. This process continues until a terminal state is reached, i.e., all of the CNOT gates in the original circuit have been scheduled.

This algorithm is generally the same for both training and inference, apart from some minor differences. During training, when an action is selected, a reward signal is observed, and this experience tuple is then saved for replaying later when training on a batch of experiences. During inference, when an action is selected, the reward signal is disregarded, and instead the CNOT and SWAP gates represented by the action are added to the routed circuit.
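The annealing search of Section 4.3.2, with acceptance rule (8), can be sketched as follows; the candidate-generation step is simplified here to iterating over a supplied list, and the default temperature values mirror those given in Section 4.6:

```python
import math
import random

def accept(q_current, q_candidate, temperature):
    """Acceptance rule (8): always accept a better candidate; accept a
    worse one with probability e^((Q_x' - Q_x)/t)."""
    if q_candidate > q_current:
        return True
    return random.random() < math.exp((q_candidate - q_current) / temperature)

def anneal(candidates, quality, t0=60.0, t_min=0.1, cooling=0.95):
    """Walk a list of candidate actions, decaying the temperature by a
    fixed multiplier each iteration; return the last accepted candidate."""
    current, temperature = candidates[0], t0
    for candidate in candidates[1:]:
        if accept(quality(current), quality(candidate), temperature):
            current = candidate
        temperature = max(temperature * cooling, t_min)
    return current
```

In the full system, `quality` would be a forward pass of the DQN model on the state pair induced by the candidate action, so each annealing iteration costs one network evaluation.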
4.6 Model Structure and Hyperparameters The DQN model presented in this article is a neural network with three 32-node fully connected layers, with ReLU activation functions. The output is a single-node layer with linear activation function. The input vector size is |(d , e , d , e )| = 2· ((d + 1) + (e + 1)), for a connectivity s s s s t t t+1 t+1 graph with furthest distance d between two nodes, and max node degree e. The loss function is Mean Squared Error, and the model uses the Adam optimizer with a learning rate of 0.001. The model is constructed using the keras package, using a tensorflow backend. ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. Using Reinforcement Learning to Perform bit Qu Routing in antum Qu Compilers 10:11 Besides the learning rate, the model uses a set of default hyperparameters. The value of γ in the above equation is 0.6. The ϵ parameter of the model’s ϵ-greedy strategy begins at 1.0, and decays by a factor of 0.9 for each training episode, until it reaches a minimum value of 0.001. The annealer also has a set of default hyperparameters. Its initial temperature t is 60.0, with a minimum temperature of 0.1 and a cooling multiplier for each iteration of 0.95. Our model also makes use of the two mentioned improvements to DQN, namely DDQN and PER. 5 RESULTS In this section, we evaluate our DQN system on a variety of quantum circuits and architectures by comparing it to other routing algorithms in state-of-the-art compilers. The compilers we bench- mark against are CQC’s t|ket, IBM’s Qiskit, and Google’s Cirq. Other compilers exist, but Tan and Cong [38] found t|ket and Qiskit to be the leaders in the space, so this selection is sufficient to demonstrate how our approach compares to the industry standard. The code for our DQN system is available on GitHub [31]. 
5.1 Benchmarking Setup When referring to the performance of a given system, we mean how it behaves in terms of circuit depth overhead/ratio—that is, better performance means lower circuit depth overhead/ratio. We instead use the term runtime to refer to how long a given system takes to run. We consider a SWAP gate as a primitive operation taking a single timestep (same time as a CNOT)—this assumption represents the simplest fair method of comparison, and has precedent in the literature [12] (see Discussion for more on this). We have verified that the other compil- ers also output circuits with a mix of SWAPs and CNOTs, rather than performing SWAP decom- position. Throughout the following benchmarks, we have disabled every sort of compiler pass except for the routing process itself, to ensure that our comparison is fair and pertinent to the task at hand. This includes placement routines—we have chosen to use random initial placements instead. For the baseline systems, we downloaded the most recent versions compatible with Python 3.7. Qiskit comes with various routing algorithms—the most recent and performant are StochasticSwap and SabreSwap [24]. Which of the two performs better depends on the benchmark, so for fair comparison, we picked the one with the best CDO/CDR in each case. Where hyperparameters made a difference, we also chose values that maximised performance—in practice, we found that the number of trials for Qiskit’s StochasticSwap algorithm was the only parameter that heavily impacted the results. For CQC’s t|ket, we disabled the use of BRIDGE gates, so it could be fairly compared to our SWAP-only algorithm—we briefly assess the impact of such gates, as well as SWAP decomposition, in the Discussion section below. 5.2 A Word on Runtime It is worth noting that our DQN system requires re-training for each target architecture, and poten- tially a round of hyperparameter optimisations, which critics may view as a drawback. 
However, in NISQ-era quantum computing, classical (compiler) runtime is not such a concern if quantum run- time (CDO/CDR) can be improved at all. In addition, new quantum architectures are not developed every day, and once an RL agent (or “model”) has been trained for a given quantum architecture, it can be re-used on the same architecture indefinitely, to compile a limitless number of quantum circuits. ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. 10:12 M. G. Pozzi et al. Fig. 7. An example of a 16-qubit single full-layer circuit, and a 4 × 4 grid architecture for it to be executed on. 5.3 Training In RL, it is common practice to train up a few different models and pick the best. For each of the following benchmarks, we trained a series of models for each architecture on randomly-generated sets of circuits, and the models were then run on separate test sets. In situations where there was significant variation in quality between models in training, only the best-quality models were retained and subsequently used in testing. Such cases are clearly identified below, and we still run through the same total number of test circuits, for fairness. 5.4 Single Full-layer Circuits The first benchmark involves single full-layer circuits on increasing grid sizes. More precisely, these are n-qubit circuits, each with disjoint gates. This benchmark represents the worst kind of situation for a routing algorithm to deal with, since the original depth is very low (d = 1) but the number of gates to schedule is maximal for this depth. Figure 7(a) shows an example of such a circuit with 16 qubits. Clearly, single full-layer circuits can be scheduled immediately on grid architectures if the cor- rect initial placement is chosen, so a random placement is used to test the effectiveness of the routing scheme. Figure 7(b) shows an example of a 4 × 4 grid architecture that could be used to execute the circuit represented by Figure 7(a). 
For the random initial placement shown, only the gate between qubits q and q can be performed immediately—SWAP gates must be inserted to 2 3 make the other gates possible. Figure 8 shows how CDO increases with increasing grid size. For each system, five batches of 100 test circuits were executed and the results were averaged. For the DQN system, the model was retrained on a separate training set also consisting of similar circuits, once per batch, for each different grid size, using a number of training circuits that increased linearly with the number of qubits. Qiskit is clearly the best system for this benchmark, which is unsurprising, since its Stochastic- Swap is the only system here that schedules all of the necessary swaps before scheduling all of the original (logical) gates at once, for each layer, which is an effective strategy for maximally dense layers. However, the number of trials had to be increased significantly from the default value of 20 to reach this level of performance, which some might view as a drawback of the system. We were unable to replicate the results in CQC’s paper [12], so we include their reported results as well as the results we obtained from running t|ket as described above. The DQN system ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. Using Reinforcement Learning to Perform bit Qu Routing in antum Qu Compilers 10:13 Fig. 8. Single full-layer performance on various grid topologies. Qiskit here is StochasticSwap. outperforms the current version of t|ket (although it is unable to outperform their reported results), and its performance scales sublinearly with the number of qubits, which is encouraging. Cirq’s performance is poor, even with a better parameter value—an r of more than 3 rapidly max becomes intractable, with minimal benefit. We therefore exclude Cirq from the benchmarks that follow. 
5.5 Multi-layer Circuits Another important type of circuit to consider is the multi-layer circuit. These are circuits composed of a series of N layers with density ρ, such that each layer will have ρ gates. A density of ρ = 1.0 thus yields a series of full layers, and the number of gates in such a circuit is maximal for the given depth. Lower densities naturally lead to a less strict layer structure, reducing the number of gates in a circuit of given depth. Figure 9 shows the performance data for three circuit densities. The system was trained on two- layer circuits with ρ = 1.0, since we found that training on circuits with more layers actually worsened performance. The three best models of five were selected for testing. Qiskit’s StochasticSwap exhibits good performance on full-layer circuits, since it employs its single-layer strategy for each (maximally dense) layer in sequence. Such behaviour is evident from its perfectly constant CDR in Figure 9(a). The DQN system also performs well in this case, outper- forming t|ket by about a third. It comes within about 20% of Qiskit’s performance on circuits with 10 layers. As density is decreased, the performance of the DQN system begins to surpass that of Qiskit on deeper circuits. This is not a surprise, since Qiskit rigidly schedules each layer in sequence and ignores gates in future layers. DQN’s performance improvement is a good sign as we turn towards random circuits below, which mostly have layer densities in the range [0.25, 0.45]. 5.6 Random Circuits Random circuits are a reasonable simulation of real quantum circuits. They are generated by adding gates between random qubits, leading to circuits with low layer densities. Figure 10 shows the performance data for each system on four different quantum architectures. For each datapoint, the systems were executed on five batches of 100 test circuits each, and the results were averaged. For ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. 
Publication date: May 2022. 10:14 M. G. Pozzi et al. Fig. 9. Multi-Layer performance on the IBM Q20 Tokyo with differing circuit densities. Qiskit here is Stochas- ticSwap. each architecture, we trained 16 DQN models with different hyperparameters on separate training sets consisting of random circuits, and retained the highest-quality models for testing. The DQN system has the best performance across all architectures and circuit sizes in the above plots, which is very encouraging. In particular, it is encouraging to see that the DQN system is able to maintain best-in-class performance on larger quantum architectures. For random circuits, average layer density increases with the number of gates in the circuit—despite this, the DQN system’s CDR still remains lowest on random circuits with 1,000 gates, across all architectures. As the grid size increases, hyperparameter optimisation becomes ever more important—in par- ticular, it is important to slow down the annealer’s temperature decay, so that optimal actions can indeed be found. Nevertheless, these results demonstrate that the best DQN models are able to sur- pass the best competitor (Qiskit’s StochasticSwap, here) on architectures of up to about 50 qubits, even as circuit depth increases, which suggests that the DQN system will remain competitive on most near-term quantum architectures and circuits. 5.7 Realistic Test Set As a final benchmark, we sought to test each system on a set of real quantum circuits. We chose the test set of 158 circuits used by Zulehner [2] and filtered out any circuits with a depth of 200 or ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. Using Reinforcement Learning to Perform bit Qu Routing in antum Qu Compilers 10:15 Fig. 10. Each compiler’s performance on random circuits of different sizes, for four quantum architectures. Qiskit here is StochasticSwap. more (due to runtime constraints). 
The final benchmark set thus consisted of 95 realistic quantum circuits, ranging from 3 to 16 qubits, depths of 5 to 199, and 5 to 240 CNOT gates. We ran each system on four different realistic architectures. For each architecture, we trained 5 DQN models on random circuits with 50 gates, and generated five sets of random initial place- ments (one placement per circuit in each set). We then ran the system on the benchmark set, once per model and initial placement set, yielding a total of 25 runs through each circuit. We chose not to isolate the best models here, to give an indication of average-case performance. For the other systems, we repeated each circuit and initial placement five times with the same parameters. We also used a fixed seed when generating the placements, such that each system would use the same placement sets. Figure 12 shows the mean CDR for each system and quantum architecture. The error bars rep- resent one standard deviation, obtained over the mean CDRs of each model. We only plotted error bars for the DQN system, since the error bars for the other systems were too small to plot—for the former, variance between models is far more significant than variance between different runs of the same model, meaning that most of the variance between runs arises from variation in the quality of models. ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. 10:16 M. G. Pozzi et al. Fig. 11. The connectivity graph of the Rigetti 19Q Acorn [ 12]. The performance gap between the DQN system and the other baselines on the 4 × 4gridand the IBM Q20 Tokyo is significant—as the error bars illustrate, even the worst model on each archi- tecture outperforms t|ket by at least 11% (and the best yields CDRs up to 17% lower). 
The DQN models are also able to outperform the other baselines by a good margin on the IBM Q16 Rüsch- likon, which is effectively a 2 × 8 grid and thus has worse connectivity than a 4 × 4grid(seethe next section for a discussion of architecture connectivities). The Rigetti 19Q Acorn (Figure 11) is an example of an architecture with sparse connectivity, with nodes of degree either 2 or 3, which poses difficulties for routing algorithms. The DQN system runs into some issues on this architecture—most notably, we found that training stability decreased, and that the performance gap between DQN and its competitors narrowed. Nonetheless, two models were somewhat higher quality than the other three—isolating these two (see Figure 12(d)) yields a performance that is still better than that of t|ket, albeit more marginally than on other architec- tures. The error bar was too small to plot here, since the two models have very similar performance, but the benchmark did still run through each placement set five times, for consistency. In any case, with both models, every run through the test set yielded a lower CDR that t|ket’s average, so we can be reasonably confident that the DQN system is indeed still able to outperform the others on this architecture when training a few models and selecting the best (which is standard practice in RL). Table 1 shows a breakdown of results for the deepest benchmark circuits tested. For the first three architectures, the DQN system outperforms the other systems on every benchmark circuit listed. For the Rigetti 19Q Acorn, the DQN system outperforms on 10 of 15 circuits. For reference, we have included a similar table for gate count—that is, displaying Circuit Gate Ratios (CGRs), the analogue of CDR for gate count. Table 2 shows a breakdown of results for the same benchmark circuits as in the previous table, i.e., the deepest. 
On average, the DQN approach tends to use more gates, yet has smaller circuit depth ratios—to help shed some light on this char- acteristic, we wrote some code to automatically display diagrams (such as Figure 13) representing the internal state of the RL algorithm at each timestep. From this process, we made the observation that the DQN algorithm sometimes fills up layers with gates that do not achieve much improve- ment in the overall state quality Q—for example, swapping two qubits in one given timestep, only to swap them back in the next. Such behaviour stems from the fact that the DQN method does not currently incorporate a penalty for adding gates to a given layer, and it is therefore free to fill up layers with operations that may not greatly improve the quality of the current state. This issue could be mitigated by adding an appropriate penalty in the RL reward function—this would then allow the DQN method to balance the increase in Q from imminently scheduled CNOTs with the decrease in Q from the addition of SWAPs in the layer. In the case of redundant gates, such a reward would certainly be unfavourable. It is important to bear in mind that minimising CGR was not the goal of this article, as outlined in the Introduction previously, and as such, Table 2 is really only for reference. Furthermore, we would speculate that minimal CDR is unlikely to coincide with minimal CGR—that is, to schedule ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. Using Reinforcement Learning to Perform bit Qu Routing in antum Qu Compilers 10:17 Fig. 12. Each compiler’s performance on a test set of 95 realistic circuits [2], for four quantum architectures. Qiskit here is SabreSwap. all operations in as few timesteps as possible, it will generally be the case that some additional swaps may be needed (compared to an algorithm that simply minimises swap count). 
Nevertheless, we are confident that our above proposed modification to the DQN method to incorporate an extra gate penalty would help decrease CGR significantly. In fact, we would argue that training an agent to minimise CGR would be relatively straightforward, and we would expect such an agent to still perform well—in this case, actions would simply be single gates, which would remove the need for simulated annealing altogether. 6 DISCUSSION Overall, the DQN system’s performance throughout the benchmarks has been very positive. Per- formance on the layerised benchmarks was good, and the system still fared well against its competi- tors; looking instead at the random and real circuit benchmarks, the DQN system outperforms the other state-of-the-art baselines across all of the quantum architectures we tried. We would argue that such benchmarks (especially the realistic circuits) are the most important, since they most ac- curately represent the type of circuit that a routing algorithm may end up tackling in a real-world context, and it is therefore very encouraging to see how well the DQN system performs here. ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. 10:18 M. G. Pozzi et al. Table 1. 
CDRs for Circuits in the Benchmark Set with an Original Depth above 100 for Each Architecture and Algorithm Circuit 4 × 4grid IBM Q20 Tokyo IBM Q16 Rüschlikon Rigetti 19Q Acorn Name d DQN Qiskit tket DQN Qiskit tket DQN Qiskit tket DQN Qiskit tket alu-v2_30 199 1.237 1.527 1.488 1.16 1.303 1.232 1.355 1.611 1.626 1.636 1.667 1.634 mod8-10_177 178 1.238 1.513 1.437 1.161 1.321 1.337 1.353 1.542 1.526 1.61 1.658 1.597 rd53_131 175 1.283 1.494 1.479 1.182 1.399 1.319 1.421 1.558 1.605 1.648 1.692 1.682 C17_204 173 1.31 1.526 1.492 1.215 1.366 1.369 1.44 1.611 1.599 1.683 1.688 1.674 alu-v2_31 172 1.268 1.533 1.502 1.154 1.301 1.329 1.392 1.585 1.535 1.637 1.697 1.699 4gt4-v0_73 160 1.277 1.533 1.493 1.166 1.326 1.351 1.366 1.542 1.568 1.627 1.674 1.661 ex3_229 157 1.239 1.497 1.511 1.173 1.278 1.279 1.376 1.606 1.545 1.631 1.688 1.701 cnt3-5_180 148 1.373 1.628 1.599 1.286 1.495 1.48 1.579 1.676 1.569 1.769 1.897 1.801 mod8-10_178 135 1.252 1.503 1.456 1.164 1.32 1.376 1.388 1.559 1.527 1.631 1.658 1.603 decod24-enable_126 134 1.252 1.557 1.51 1.159 1.396 1.361 1.384 1.587 1.509 1.635 1.693 1.637 ham7_104 134 1.245 1.53 1.497 1.179 1.332 1.26 1.37 1.563 1.554 1.618 1.626 1.637 one-two-three-v0_97 116 1.262 1.546 1.517 1.146 1.284 1.381 1.356 1.541 1.572 1.629 1.631 1.643 rd53_135 114 1.289 1.555 1.481 1.199 1.419 1.318 1.42 1.603 1.579 1.656 1.689 1.651 mini-alu_167 111 1.274 1.523 1.494 1.147 1.352 1.396 1.375 1.587 1.544 1.638 1.679 1.686 4gt4-v1_74 108 1.271 1.564 1.506 1.166 1.353 1.296 1.372 1.616 1.68 1.633 1.7 1.754 Table 2. 
CGRs for Circuits in the Benchmark Set with an Original Depth above 100 for Each Architecture and Algorithm Circuit 4 × 4grid IBM Q20 Tokyo IBM Q16 Rüschlikon Rigetti 19Q Acorn Name n DQN Qiskit tket DQN Qiskit tket DQN Qiskit tket DQN Qiskit tket alu-v2_30 223 3.726 1.593 1.49 3.275 1.296 1.222 3.365 1.656 1.63 3.641 1.7 1.622 mod8-10_177 196 3.59 1.592 1.463 3.212 1.342 1.322 3.288 1.632 1.548 3.548 1.674 1.611 rd53_131 200 4.151 1.58 1.485 3.846 1.455 1.301 3.957 1.642 1.62 4.075 1.684 1.697 C17_204 205 4.11 1.59 1.479 3.815 1.378 1.334 3.869 1.652 1.593 4.139 1.687 1.632 alu-v2_31 198 3.178 1.562 1.463 2.772 1.284 1.289 2.883 1.594 1.499 3.05 1.654 1.594 4gt4-v0_73 179 3.749 1.583 1.487 3.344 1.33 1.335 3.385 1.596 1.593 3.694 1.664 1.641 ex3_229 175 3.681 1.579 1.502 3.259 1.327 1.283 3.388 1.656 1.563 3.645 1.718 1.698 cnt3-5_180 215 4.102 1.618 1.545 4.29 1.378 1.353 4.346 1.657 1.591 5.045 1.767 1.752 mod8-10_178 152 3.726 1.608 1.468 3.351 1.328 1.372 3.45 1.627 1.55 3.735 1.689 1.607 decod24-enable_126 149 3.586 1.6 1.501 3.19 1.395 1.358 3.283 1.635 1.505 3.513 1.704 1.639 ham7_104 149 4.221 1.579 1.494 3.922 1.319 1.251 3.849 1.635 1.562 4.27 1.637 1.634 one-two-three-v0_97 128 3.265 1.607 1.47 2.885 1.276 1.364 2.97 1.636 1.572 3.138 1.68 1.628 rd53_135 134 4.055 1.612 1.531 3.758 1.442 1.272 3.733 1.647 1.604 4.043 1.714 1.67 mini-alu_167 126 3.221 1.576 1.459 2.835 1.346 1.335 2.991 1.63 1.522 3.135 1.685 1.679 4gt4-v1_74 119 3.77 1.626 1.524 3.426 1.36 1.316 3.454 1.686 1.703 3.666 1.744 1.773 Another key point to note is that the DQN approach is very flexible—it has a wealth of hyper- parameters that can be optimised for each specific architecture. However, the approaches used in state-of-the-art compilers tend to have very few hyperparameters and are somewhat fixed. 
In par- ticular, Qiskit’s strategy of focussing on each layer in sequence is wasteful on low-density circuits, since not all qubits will be involved in a gate in a given layer, so swapping idle qubits is neces- sary to help schedule future gates. In fact, one can obtain a lower bound for the CDR that such a method could achieve on a given architecture by generating a random layer of gates (of a fixed target density) and initial placement, and computing half of the average furthest distance between any pair of qubits involved in a gate. This bounds the number of layers of SWAP gates required to schedule a given layer of logical gates. Doing so for a 7 × 7 grid, using the correct density for random circuits with 1,000 gates, for example, yields a CDR that is still higher than that of the DQN system. This means that even with an infinite number of trials, Qiskit will not be able to outperform the DQN system for this architecture and circuit size. Such a result demonstrates the inherent limitation of treating layers separately when routing. ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. Using Reinforcement Learning to Perform bit Qu Routing in antum Qu Compilers 10:19 Fig. 13. A sample visualisation of the internal state of the RL algorithm at a given timestep, on an IBM Q20 Tokyo. On each physical node, the top number represents the logical qubit inhabiting that node, and the bottom number represents its logical target. As shown here, logical qubits 8 and 9 are currently performing an interaction—red links (such as that between logical qubits 2 and 4) represent ongoing swaps. A target of -1 denotes a logical qubit that has finished all of its interactions. 
In this timestep, qubit 13 will move to another node, despite the fact that this swap will have no impact on its distance to qubit 11—such a swap is redundant, since the overall value of Q will be the same with or without it, but in the absence of a swap count penalty, the agent cannot distinguish between the two states and is free to choose either action. The DQN system struggles slightly on architectures with poor connectivity, notably the Rigetti 19Q Acorn. One possible explanation is the fact that the DQN system cannot distinguish between situations in which shortest paths cross and those that do not and thus cannot predict upcoming conflicts. On such architectures, there are very few shortest paths between any two nodes (versus, e.g., a grid), so choosing the path that minimises conflict is key. Another problem is when multiple qubits are all waiting to interact with the same one—the system’s state representation has no way of prioritising the movement of such qubits, despite the fact that their interactions are crucial for the progress of the routing process. This drawback is especially hard-hitting on architectures with poor connectivities, where its action choice is heavily constrained—the system might get stuck in local minima, since it cannot tell which qubit is the source of the bottleneck. It is unclear what the future will hold in terms of architecture connectivities. The above points could help motivate future work on the system, especially with respect to very large grid sizes. While the DQN system’s performance was still best-in-class on the grid sizes we tried (which are sufficiently large to indicate near-term performance), breaking ties between shortest paths and tackling the wider issue of qubit priority will likely be essential to unlock better performance on even larger grid architectures, which is especially relevant as we move towards architectures with ever increasing numbers of qubits in the future. 
6.1 Applying SWAP Decomposition Throughout the Results section, we considered both CNOT and SWAP gates to take one timestep each. This was the simplest fair method of comparing routing procedures (with precedent in the literature [12]), since it is the most architecture-agnostic—for example, for some quantum tech- nologies, pulse-level optimisation can lead to SWAP gates executing in 1.5 timesteps (by using iSWAPgates)[15], and in future we can expect a wider variety of such optimisations to emerge. However, at the moment, SWAP gates must be performed via decomposition into CNOT gates, ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. 10:20 M. G. Pozzi et al. Fig. 14. Performance of each routing system, calculating CDR after SWAP decomposition into CNOT gates. For random circuits, we once again show Qiskit’s StochasticSwap; for realistic circuits, we once again show Qiskit’s SabreSwap. and they thus take three timesteps to execute (as illustrated in Figure 3(b)). While the DQN system has not been optimised with this in mind, we nonetheless sought to assess each routing system’s performance when performing such decomposition. Figure 14(a) shows each system’s performance on random circuits after performing SWAP de- composition, while Figure 14(b) shows performance on realistic circuits. We also tried enabling BRIDGE gates for t|ket, in both cases, and show these results separately. Encouragingly, the DQN system still outperforms its competitors on random circuits, even on such a large grid size—in fact, the general shape of the graph remains largely unchanged. However, the ordering of the systems is reversed for realistic circuits. In particular, t|ket is the best system when decomposing SWAPs into CNOTs, suggesting that its routing process might be optimised for this particular scenario. 
The relatively poor performance of DQN here (despite excellent performance when not perform- ing decomposition) could be explained by a variety of factors, perhaps most significantly by the fact that it makes no effort to minimise SWAP count, only depth (as demonstrated by Table 2). When considering SWAPs and CNOTs to take the same amount of time, performing a redundant SWAP carries no penalty in the RL formulation—however, when performing decomposition, an unnecessary depth cost may additionally be incurred from such redundancy, especially on sparse (e.g., realistic) circuits with low CDRs. Furthermore, the DQN system has no way of optimising its action choice specifically with decomposition in mind—awareness that a SWAP takes three timesteps would allow the system to choose SWAPs that allow for “pipelining,” that is, beginning a SWAP while another is ongoing, to minimise depth overhead. To reiterate, DQN’s poor performance relative to t|ket and Qiskit here is not surprising, since in this work, the DQN system was not optimised with SWAP decompositon in mind, nor was min- imising gate count prioritised. However, the RL formulation can certainly be modified to mitigate both of the above issues—much in the same way that this article proposes mixing SWAPs and CNOTs into the same timesteps, future work may well allow SWAP gates to be scheduled with an awareness of their decomposition, thus allowing them to occur “out of time,” enabling decom- positions that yield a lower depth overhead than otherwise possible. Equally, we could minimise gate count by incorporating an added reward (or rather, penalty) signal into the RL formulation to penalise the frivolous addition of gates. BRIDGE gates may also be added as potential actions for the annealer to choose from, and mod- els can be trained on realistic circuits to cope with common patterns that arise therein (rather than ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. 
Using Reinforcement Learning to Perform bit Qu Routing in antum Qu Compilers 10:21 training still on random circuits, as we do above). We firmly believe that with such improvements, a DQN system will still be able to remain competitive in the most realistic scenarios. Meanwhile, a more static method such as Qiskit’s presents no such opportunities for improvement, and its performance is therefore bounded. Overall, however, DQN’s best-in-class performance on random circuits, even when performing SWAP decomposition, demonstrates that the system is well ahead of its competitors when consid- ering routing as an isolated problem. We therefore believe that there are more gains to be had by adopting an RL approach in quantum compilers more generally, and certainly by implementing the improvements we outline above. 6.2 Dealing with Noise and Variability in bit Qu Errors Another thing worth considering in future work would be the fact that on realistic quantum ma- chines, different qubits/links have differing measurement/gate errors. It is therefore preferable to execute gates along links that have higher fidelity. Such details can be incorporated into the state formulation, for example by extending the state vector to include information about the fidelity of links along which gates are taking place. The reward signal can also be similarly extended to this effect. After such improvements, the DQN system would naturally learn to schedule SWAP gates with variable gate errors in mind, which would be beneficial when compiling quantum circuits for lower-fidelity quantum architectures. 6.3 Other Recommendations for Future Development The feature selection function is the component that we have found most impactful throughout the work—we thoroughly believe that with a more rich state representation, the DQN agent will be able to achieve even higher performance in complex scenarios. 
The main problem with the current representation is the fact that a lot of information is lost when converting the full state into a mere distance vector. The information about available swaps helps somewhat, but in practice this does little to help break ties when qubits are far away for their targets. A new representation might encode some information about shortest paths between mutually-targeting qubits and their potential for conflict, as well as information about which qubits should be prioritised. Ultimately, there is a balance here between the size of the representation and the information it is able to capture. It is also possible that a massive network with massive amounts of training could learn its own optimal representation, in the true spirit of “deep” learning—we would be curious to see whether this works. Furthermore, many RL methods employ some form of lookahead, in which a series of future episodes are simulated to choose the best action—this could help the DQN system to predict upcoming bottlenecks and react accordingly. Another possible improvement relates to automating the learning process. At the moment, the choice of the number of training episodes is somewhat arbitrary, but it would certainly be more useful (and reliable) to employ a deterministic scheme with some well-formed criteria, such as detecting when the weights of the neural network have converged. Furthermore, the system currently lacks the ability to use BRIDGE gates, or delay gates instead of scheduling them as soon as possible. Extending the RL paradigm to incorporate such charac- teristics is certainly achievable, and can simply be done by adding these as possible actions to the RL formulation. 
It is also worth noting that t|ket not only uses BRIDGE gates, but also has optimisation passes to simplify chains of CNOT gates—while we disabled such functionality to perform a fair comparison, it would certainly improve t|ket's performance when considering circuit depth after SWAP decomposition, a necessary step for current architectures. Adding similar functionality to the DQN system would be a simple yet important task to remain competitive as a full compilation process (i.e., beyond mere routing).

One thing we have not considered in this article is the impact of initial placement—instead, we have chosen to use random initial placements throughout. The motivation for this choice is that random initial placements allow the routing algorithms to be evaluated independently from any other optimisations, and that random placements effectively simulate a potential state of the system mid-way through execution of a larger quantum circuit. Besides, it is not unreasonable to assert that a separate optimised placement routine would benefit each routing algorithm equally, and therefore would not affect their comparison. However, this is something that future research should certainly look at—in fact, there could be ways of applying RL to the task of finding optimal initial placements. For example, one could pick a random placement, constrain the current DQN system to apply only SWAP gates initially until the highest-quality state possible is achieved, and use the final placement as the initial placement of the regular DQN routing procedure.

6.4 Another Word on Runtime

At the moment, the system is not very well optimised in terms of runtime—we have always preferred to run the system longer, or use a more exhaustive method, to minimise CDO and CDR.
Runtime optimisation would require significant future work, which is why we have shied away from directly comparing the runtime of our system to that of the other baselines in this article. That said, it is important to give at least some indication of the timescales involved. For the realistic test set on a 4 × 4 grid, disregarding training time, the DQN system took about 2,400 s to complete one run (of 100 circuits), while Qiskit (StochasticSwap) took about 36 s and t|ket took about 6 s. The DQN system is clearly much slower, but for perspective, Qiskit's LookaheadSwap routing method [6] (which came second in the Qiskit Developer Challenge) is almost 4 times as slow as the DQN system on the 4 × 4 realistic circuits benchmark, while achieving only the same CDR as Qiskit's faster StochasticSwap method. Equally, it is worth noting that compiling Tensorflow to take advantage of SIMD extensions (such as AVX or FMA) and GPUs could help improve the runtime of our method, without touching the code itself.

Besides, the runtime of the DQN system can certainly be reduced while still maintaining good performance. One such area for improvement is the annealer—an adaptive scheme with a variable number of iterations would help greatly. In fact, other combinatorial optimisation techniques could be used, such as random-restart hill climbing [34]. Another area for optimisation is clearly the size of the neural network used. In practice we found such changes to make little difference, but it is perfectly possible that the layer structure used at the moment is wasteful—a more principled search would be necessary for each quantum architecture. Once again, this would be a worthy time investment, since new quantum architectures are developed infrequently—the time required to develop a new architecture is clearly far greater than the time required for such a search.
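As one illustration of the alternative mentioned above, random-restart hill climbing [34] could replace the annealer for selecting an action from the combinatorial space. The sketch below is generic and hypothetical — `random_action`, `neighbours`, and `quality` stand in for the system's own action sampler, local-move generator, and Q-value evaluation, none of which are specified here:

```python
def hill_climb(random_action, neighbours, quality, restarts=5, steps=50):
    """Random-restart hill climbing over a combinatorial action space.

    random_action() -> a random valid action (e.g., a set of parallel swaps);
    neighbours(a)   -> candidate actions reachable by one local change;
    quality(a)      -> scalar score to maximise (e.g., the agent's Q-value).
    """
    best, best_q = None, float("-inf")
    for _ in range(restarts):
        current = random_action()
        current_q = quality(current)
        for _ in range(steps):
            cands = neighbours(current)
            if not cands:
                break
            nxt = max(cands, key=quality)   # greedy uphill move
            nxt_q = quality(nxt)
            if nxt_q <= current_q:          # local optimum: stop this restart
                break
            current, current_q = nxt, nxt_q
        if current_q > best_q:
            best, best_q = current, current_q
    return best
```

Unlike a fixed-schedule annealer, each restart terminates as soon as a local optimum is reached, which is one route to the adaptive iteration count suggested above.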
Equally, returning to a single-state approach (with some scalability improvements) may prove effective while greatly diminishing training times, due to the lack of annealing when replaying past experiences. Better parallelisation would also be useful—the literature on this topic includes several methods that could be of use in an RL context [11, 26, 30].

7 CONCLUSION

In this article, we have presented an RL approach to address the problem of routing qubits on near-term quantum architectures. We proposed a modified deep Q-learning formulation, in which actions are sets of parallelisable swaps/gates—the agent uses simulated annealing to select actions from this combinatorial space. We then benchmarked our DQN system against the qubit routing passes of state-of-the-art quantum compilers.

The key research question throughout has been: Can a DQN approach be used to perform qubit routing in quantum compilers, and, if so, is it able to compete with state-of-the-art approaches? We would say the answer is an emphatic yes. The results demonstrate that a DQN approach is able to surpass the performance of other industry-standard approaches in realistic near-term scenarios, with a level of adaptability that is not possible with other, more static approaches—this will be particularly useful as a wider variety of quantum architectures appears in future. Further work is required to maintain best-in-class performance when performing SWAP decomposition, but we are confident that this will be achievable with some modest improvements to the system. Overall, our work demonstrates the value of using an RL approach in the compilation of quantum circuits, and we hope that such an approach can bring further benefits in the space in future.
ACKNOWLEDGMENTS

The idea of using reinforcement learning (specifically Q-learning) for qubit routing was first proposed by a subset of the present authors (namely Herbert and Sengupta) in an arXiv preprint [18], which has not been published elsewhere (i.e., in a journal or the proceedings of a conference). Special thanks to Silas Dilkes from Cambridge Quantum Computing (CQC) for providing guidance on how to set up t|ket for our benchmarks.

REFERENCES

[1] 2018. IBM Q Devices and Simulators. Retrieved from https://web.archive.org/web/20181203023515/https://www.research.ibm.com/ibm-q/technology/devices/.
[2] 2018. Quantum Circuit Test Set (Zulehner). Retrieved May 2020 from https://iic.jku.at/eda/research/ibm_qx_mapping/.
[3] 2020. Cirq Documentation (accessed for 0.8.0). Retrieved May 2020 from https://cirq.readthedocs.io/en/stable/.
[4] 2020. CQC - Our Technology (accessed for pytket 0.5.4). Retrieved May 2020 from https://cambridgequantum.com/technology/.
[5] 2020. IBM Qiskit (accessed for 0.20.0). Retrieved May 2020 from https://qiskit.org.
[6] 2020. Jandura's routing method (LookaheadSwap documentation). Retrieved from https://qiskit.org/documentation/stubs/qiskit.transpiler.passes.LookaheadSwap.html#qiskit.transpiler.passes.LookaheadSwap.
[7] C. G. Almudever, L. Lao, X. Fu, N. Khammassi, I. Ashraf, D. Iorga, S. Varsamopoulos, C. Eichler, A. Wallraff, L. Geck, A. Kruth, J. Knoch, H. Bluhm, and K. Bertels. 2017. The engineering challenges in quantum computing. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE'17). IEEE, 836–845. https://doi.org/10.23919/DATE.2017.7927104
[8] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C. Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando G. S. L. Brandao, David A.
Buell, Brian Burkett, Yu Chen, Zijun Chen, Ben Chiaro, Roberto Collins, William Courtney, Andrew Dunsworth, Edward Farhi, Brooks Foxen, Austin Fowler, Craig Gidney, Marissa Giustina, Rob Graff, Keith Guerin, Steve Habegger, Matthew P. Harrigan, Michael J. Hartmann, Alan Ho, Markus Hoffmann, Trent Huang, Travis S. Humble, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Dvir Kafri, Kostyantyn Kechedzhi, Julian Kelly, Paul V. Klimov, Sergey Knysh, Alexander Korotkov, Fedor Kostritsa, David Landhuis, Mike Lindmark, Erik Lucero, Dmitry Lyakh, Salvatore Mandrà, Jarrod R. McClean, Matthew McEwen, Anthony Megrant, Xiao Mi, Kristel Michielsen, Masoud Mohseni, Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill, Murphy Yuezhen Niu, Eric Ostby, Andre Petukhov, John C. Platt, Chris Quintana, Eleanor G. Rieffel, Pedram Roushan, Nicholas C. Rubin, Daniel Sank, Kevin J. Satzinger, Vadim Smelyanskiy, Kevin J. Sung, Matthew D. Trevithick, Amit Vainsencher, Benjamin Villalonga, Theodore White, Z. Jamie Yao, Ping Yeh, Adam Zalcman, Hartmut Neven, and John M. Martinis. 2019. Quantum supremacy using a programmable superconducting processor. Nature 574, 7779 (October 2019), 505–510. https://doi.org/10.1038/s41586-019-1666-5
[9] Andrew M. Childs, Eddie Schoute, and Cem M. Unsal. 2019. Circuit transformations for quantum architectures. In Proceedings of the 14th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC'19), Leibniz International Proceedings in Informatics, Wim van Dam and Laura Mancinska (Eds.), Vol. 135. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 3:1–3:24. https://doi.org/10.4230/LIPIcs.TQC.2019.3
[10] Jerry M. Chow and Jay Gambetta. 2020. Quantum takes flight: Moving from laboratory demonstrations to building systems. Retrieved May 2020 from https://www.ibm.com/blogs/research/2020/01/quantum-volume-32/.
[11] Alfredo V. Clemente, Humberto N. Castejón, and Arjun Chandra. 2017. Efficient parallel methods for deep reinforcement learning. arXiv:1705.04862. Retrieved from http://arxiv.org/abs/1705.04862.
[12] Alexander Cowtan, Silas Dilkes, Ross Duncan, Alexandre Krajenbrink, Will Simmons, and Seyon Sivarajah. 2019. On the qubit routing problem. In Leibniz International Proceedings in Informatics, Vol. 135. 5:1–5:32. https://drops.dagstuhl.de/opus/volltexte/2019/10397/.
[13] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta. 2017. Open quantum assembly language. arXiv:1707.03429. Retrieved from http://arxiv.org/abs/1707.03429.
[14] D. P. Franke, J. S. Clarke, L. M. K. Vandersypen, and M. Veldhorst. 2019. Rent's rule and extensibility in quantum computing. Microprocess. Microsyst. 67 (June 2019), 1–7. https://doi.org/10.1016/j.micpro.2019.02.006
[15] Pranav Gokhale, Ali Javadi-Abhari, Nathan Earnest, Yunong Shi, and Frederic T. Chong. 2020. Optimized quantum compilation for near-term algorithms with openpulse. arXiv:2004.11205. Retrieved from http://arxiv.org/abs/2004.11205.
[16] Laszlo Gyongyosi and Sandor Imre. 2019. A survey on quantum computing technology. Comput. Sci. Rev. 31 (February 2019), 51–71. https://doi.org/10.1016/j.cosrev.2018.11.002
[17] Steven Herbert. 2020. On the depth overhead incurred when running quantum algorithms on near-term quantum computers with limited qubit connectivity. Quant. Inf. Computat. 20, 9 & 10 (August 2020), 787–806. https://doi.org/10.26421/QIC20.9-10-5
[18] Steven Herbert and Akash Sengupta. 2018. Using reinforcement learning to find efficient qubit routing policies for deployment in near-term quantum computers. arXiv:1812.11619. Retrieved from http://arxiv.org/abs/1812.11619.
[19] Toshinari Itoko, Rudy Raymond, Takashi Imamichi, and Atsushi Matsuo. 2020. Optimization of quantum circuit mapping using gate transformation and commutation. Integration 70 (2020), 43–50. https://doi.org/10.1016/j.vlsi.2019.10.004
[20] Alice Karnsund. 2019. DQN Tackling the Game of Candy Crush Friends Saga: A Reinforcement Learning Approach. Retrieved from http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1368129&dswid=-15%8.
[21] Sumeet Khatri, Ryan LaRose, Alexander Poremba, Lukasz Cincio, Andrew T. Sornborger, and Patrick J. Coles. 2019. Quantum-assisted quantum compiling. Quantum 3 (May 2019), 140. https://doi.org/10.22331/q-2019-05-13-140
[22] Jens Kober, J. Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 32, 11 (September 2013), 1238–1274. https://doi.org/10.1177/0278364913495721
[23] B. S. Landman and R. L. Russo. 1971. On a pin versus block relationship for partitions of logic graphs. IEEE Trans. Comput. C-20, 12 (December 1971), 1469–1479. https://doi.org/10.1109/T-C.1971.223159
[24] Gushu Li, Yufei Ding, and Yuan Xie. 2019. Tackling the qubit mapping problem for NISQ-era quantum devices. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'19), Iris Bahar, Maurice Herlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 1001–1014. https://doi.org/10.1145/3297858.3304023
[25] Margaret Martonosi and Martin Roetteler. 2019. Next steps in quantum computing: Computer science's role. arXiv:1903.10541. Retrieved from http://arxiv.org/abs/1903.10541.
[26] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), Vol. 48. 1928–1937. https://proceedings.mlr.press/v48/mniha16.html.
[27] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv:1312.5602. Retrieved from http://arxiv.org/abs/1312.5602.
[28] Matthias Möller and Cornelis Vuik. 2017. On the impact of quantum computing technology on future developments in high-performance scientific computing. Ethics Inf. Technol. 19, 4 (December 2017), 253–269. https://doi.org/10.1007/s10676-017-9438-0
[29] Prakash Murali, Jonathan M. Baker, Ali Javadi-Abhari, Frederic T. Chong, and Margaret Martonosi. 2019. Noise-adaptive compiler mappings for noisy intermediate-scale quantum computers. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'19). Association for Computing Machinery, New York, NY, 1015–1029. https://doi.org/10.1145/3297858.3304075
[30] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. 2015. Massively parallel methods for deep reinforcement learning. arXiv:1507.04296. Retrieved from http://arxiv.org/abs/1507.04296.
[31] Matteo Pozzi. 2020. Qubit Routing with Reinforcement Learning (GitHub Repository). Retrieved November 2020 from https://github.com/Macro206/qubit-routing-with-rl.
[32] John Preskill. 2012. Quantum computing and the entanglement frontier. arXiv:1203.5813. Retrieved from http://arxiv.org/abs/1203.5813.
[33] John Preskill. 2018. Quantum computing in the NISQ era and beyond. Quantum 2 (2018), 79. https://arxiv.org/abs/1801.00862
[34] Stuart Russell and Peter Norvig. 2010. Artificial Intelligence: A Modern Approach (3rd ed.). https://doi.org/10.1017/S0269888900007724
[35] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay.
In Proceedings of the 4th International Conference on Learning Representations (ICLR'16), Conference Track Proceedings. https://arxiv.org/abs/1511.05952.
[36] Eddie Schoute. 2019. Circuit Transformations for Quantum Architectures—Compiler Code (GitLab Repository). Retrieved May 2020 from https://gitlab.umiacs.umd.edu/amchilds/arct/-/blob/master/arct/compiler.py.
[37] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). The MIT Press.
[38] Bochen Tan and Jason Cong. 2021. Optimality study of existing quantum computing layout synthesis tools. IEEE Trans. Comput. 70, 9 (2021), 1363–1373. https://doi.org/10.1109/TC.2020.3009140
[39] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI'16). 2094–2100.
[40] Robert Wille, Lukas Burgholzer, and Alwin Zulehner. 2019. Mapping quantum circuits to IBM QX architectures using the minimal number of SWAP and H operations. In Proceedings of the 56th Annual Design Automation Conference (DAC'19). Association for Computing Machinery, New York, NY, Article 142, 6 pages. https://doi.org/10.1145/3316781.
[41] Alwin Zulehner. 2018. Quantum Information Software Kit (QISKit)—Compiler Code (GitHub Repository, fork). Retrieved May 2020 from https://github.com/azulehner/qiskit-sdk-py/blob/mapping/qiskit/mapper/%_mapping.py.
[42] Alwin Zulehner, Alexandru Paler, and Robert Wille. 2019. An efficient methodology for mapping quantum circuits to the IBM QX architectures. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 38, 7 (2019), 1226–1236. https://doi.org/10.1109/TCAD.2018.2846658
[43] Alwin Zulehner and Robert Wille. 2019. Compiling SU(4) quantum circuits to IBM QX architectures. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASPDAC'19). Association for Computing Machinery, New York, NY, 185–190.
https://doi.org/10.1145/3287624.3287704

Received November 2020; revised February 2022; accepted February 2022

In October 2019, Google announced they had achieved quantum supremacy with their 53-qubit Sycamore processor [8] (although IBM were quick to dispute this claim). Such near-term quantum computers support a series of one- and two-qubit operations, or “gates,” which can be assembled into quantum “circuits,” in a similar spirit to how logic gates that act on classical bits can be assembled into sequential logic circuits. A high-level circuit description (in a language such as OpenQASM [13]) must be compiled before it can be executed on a target quantum architecture—this process includes passes to satisfy the constraints of the target hardware. Specifically, each quantum architecture has an associated “topology,” or connectivity graph, consisting of a set of physical nodes and links between them. Qubits inhabit the nodes and can only interact with qubits on adjacent nodes—SWAP gates can swap the nodes they inhabit. To make an arbitrary quantum circuit executable on a given target architecture, a quantum compiler has to insert SWAP gates so that gates in the original circuit only ever occur between qubits located at adjacent nodes, a process known as “routing.” This will produce a new circuit, possibly with a greater depth, that implements the same unitary function as the original circuit while respecting the topological constraints.

The quantum architectures of today are extremely resource-constrained devices, with relatively low numbers and fidelities of qubits. Minimising the added circuit depth is a key goal in maximising the amount of useful work that can be done by today's systems before decoherence, so much so that in 2018, IBM offered a prize for the best qubit routing algorithm. Tan and Cong [38] recently compared the performance of various state-of-the-art routing algorithms on benchmarks with known optimal depth and concluded that even the most advanced algorithms are significantly lacking—scope therefore exists to improve upon them.
In this article, we frame the qubit routing problem as a reinforcement learning (RL) problem, employing a modified deep Q-learning approach to route qubits. We consider actions to be sets of parallelisable two-qubit quantum gates, namely Controlled-NOT (CNOT) and SWAP gates—the system uses simulated annealing, a combinatorial optimisation technique, to select actions (sets of gates) at each timestep to be scheduled in the routed circuit. We consider qubit routing as an abstracted problem and choose to minimise the added circuit depth rather than alternative metrics such as gate count. We believe minimising circuit depth to be the more significant goal, for several reasons: First, even idling qubits can decohere, and, second, one can readily envisage applications where long-running sparse circuits with low total gate count would not be favourable. It is also worth mentioning that minimising gate count is actually a trivial problem to solve with Q-learning, via an off-the-shelf formulation (by considering single gates to be actions, rather than sets/layers thereof)—part of this article's contribution thus lies in solving the interesting problem of Q-learning with a combinatorial action space, which is necessary to truly minimise added circuit depth. With these ideas in mind, we then benchmark our system against the routing passes of state-of-the-art quantum compilers and demonstrate that our system is able to outperform its competitors on the most pertinent benchmarks.

2 BACKGROUND

In this section, we begin by formalising and defining the terms used throughout the article.

Google AI Blog, Quantum Supremacy Using a Programmable Superconducting Processor (blog post); https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html. IBM Research Blog, On “Quantum Supremacy”; https://www.ibm.com/blogs/research/2019/10/on-quantum-supremacy/. IBM Research Blog, We Have Winners! ...
of the IBM Qiskit Developer Challenge; https://www.ibm.com/blogs/research/2018/08/winners-qiskit-developer-challenge/.

Fig. 1. An example of a quantum circuit and its decomposition into layers.

2.1 Quantum Circuits

A quantum circuit is composed of a series of operations, or gates, which transform the state of one or more logical qubits. Figure 1(a) shows an example of a quantum circuit with four qubits—this circuit contains two Hadamard gates and five CNOT gates in various orientations. Quantum circuits can be decomposed into a universal set of one- and two-qubit gates—on many architectures, the two-qubit gate of choice is the CNOT. In this article, we therefore only consider CNOT gates—single-qubit gates are not relevant in qubit routing, since they can occur at any node, and in any case, they are much quicker than two-qubit gates on all real quantum computers.

An important concept is the notion of circuit depth. For an ordered set of two-qubit gates $G$, with each gate $g_i = \{q_j, q_k\}$ acting on qubits $q_j$ and $q_k$, indexed from $i = 1$:

$$d_0(q) = 0 \text{ for all qubits } q \text{ in the circuit}, \tag{1}$$

$$d_{t+1}(q) = \begin{cases} \max(d_t(q_j), d_t(q_k)) + 1 & \text{for } q_j, q_k \in g_t, \text{ if } q \in g_t \\ d_t(q) & \text{otherwise,} \end{cases} \tag{2}$$

$$\text{Circuit depth } d = \max_q \left( d_{|G|}(q) \right). \tag{3}$$

This can be visualised as slicing the circuit into timesteps of gates that can be performed in parallel—the depth of a circuit is then the minimum number of timesteps it can be decomposed into, without any qubit performing more than one interaction in any given timestep (see Figure 1(b)).

2.2 Quantum Architectures

For our purposes, a quantum architecture is a connectivity graph, composed of a set of physical qubits, or nodes, and a set of links between them. Figure 2 provides an example of one of IBM's quantum architectures with 20 qubits.
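The depth recurrence in Equations (1)–(3) translates directly into a few lines of code. The following is a minimal sketch (the function name and gate encoding as qubit-index pairs are illustrative): each gate lands one timestep after the later of its two qubits' current depths.

```python
def circuit_depth(gates, qubits):
    """Depth of a circuit given as an ordered list of two-qubit gates,
    per Equations (1)-(3): a gate on (qj, qk) finishes one timestep after
    the later of the two qubits' current depths."""
    depth = {q: 0 for q in qubits}          # Equation (1): d_0(q) = 0
    for qj, qk in gates:                    # Equation (2): update both qubits
        d = max(depth[qj], depth[qk]) + 1
        depth[qj] = depth[qk] = d
    return max(depth.values())              # Equation (3): max over qubits
```

For example, gates (0,1) and (2,3) can share a timestep, while a following gate (1,2) must wait, giving depth 2 for that three-gate circuit.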
In this article, we only consider undirected qubit connectivity graphs—that is, CNOT gates can be performed in either direction. In practice, the direction of a CNOT gate can be inverted by using Hadamard gates if necessary, so this simplification makes little difference in our domain.

2.3 Placements and SWAP Gates

At any time $t$ during circuit execution, qubits $Q$ are mapped onto nodes $N$ according to some placement $p_t : Q \to N$. Gates may only occur between two qubits if they lie on adjacent nodes—that is, gate $g = \{q_0, q_1\}$ may only occur at time $t$ if $(p_t(q_0), p_t(q_1)) \in E$, where $E$ is the set of edges in the architecture's connectivity graph.

Fig. 2. The IBM Q20 Tokyo [1].

Fig. 3. A SWAP gate and its decomposition into three CNOT gates.

SWAP gates allow two qubits on adjacent nodes to switch positions. Formally, for SWAP gate $s = \{q_0, q_1\}$ at time $t$, $p_{t+1}(q_0) = p_t(q_1)$ and vice versa. Figure 3(a) shows the circuit symbol for a SWAP gate, which can be decomposed into three CNOT gates in sequence, as shown in Figure 3(b).

2.4 Qubit Routing

“Routing” denotes the task of inserting SWAP gates into quantum circuits so that every gate in the original circuit can be performed on a given target architecture. Qubit routing passes in quantum compilers generally accept a circuit together with a connectivity graph and initial layout, and output a new circuit that respects these architectural constraints. The routing process can thus be represented as a function $R : (c, g, l) \to c'$, with input circuit $c$, output circuit $c'$, connectivity graph $g$, and initial layout $l$. The goal is to minimise the added depth of the output circuit versus that of the original circuit.
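The two definitions above — the adjacency condition for scheduling a gate and the effect of a SWAP on the placement — can be sketched as follows. The helper names and the encoding of edges as sorted node tuples are assumptions for illustration:

```python
def can_schedule(gate, placement, edges):
    """A two-qubit gate {q0, q1} may fire at time t only if its qubits sit
    on adjacent nodes: (p_t(q0), p_t(q1)) must be an edge of the graph.

    edges: set of undirected links, stored as sorted (node, node) tuples.
    """
    q0, q1 = gate
    return tuple(sorted((placement[q0], placement[q1]))) in edges

def apply_swap(swap, placement):
    """A SWAP exchanges the nodes its two qubits inhabit:
    p_{t+1}(q0) = p_t(q1) and vice versa (mutates `placement` in place)."""
    q0, q1 = swap
    placement[q0], placement[q1] = placement[q1], placement[q0]
```

On a three-node line graph with qubits placed in order, a gate between the end qubits cannot be scheduled until a SWAP brings them onto adjacent nodes.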
This goal is important because we want to maximise the amount of useful work performed by qubits before decoherence, and thus maximise the Quantum Volume (QV) [10] our current systems can achieve. Added depth can be formalised as two metrics, circuit depth overhead (CDO) and circuit depth ratio (CDR):

$$\text{CDO} = d(c') - d(c), \qquad \text{CDR} = \frac{d(c')}{d(c)}, \tag{4}$$

where $d$ denotes circuit depth, $c$ is the original circuit, and $c'$ is the routed circuit. Figure 4 shows a quantum circuit of depth 6 before and after routing on the topology and initial layout in Figure 4(a). The routed circuit has depth 7 and therefore a CDO of 1 and a CDR of 7/6. Notice how two of the SWAP gates do not add any extra depth to the circuit—the routing procedure is able to perform these while CNOT gates are happening, thus minimising the depth overhead.

Fig. 4. An example of a quantum circuit before and after the routing procedure.

The process of routing qubits is inherently linked to the qubits' initial placement $p_0$. In this work, we mostly consider random initial placements for fair comparisons of the routing algorithms themselves, although many quantum compilers do provide strategies to optimise initial placement, and these are also important (for some simple circuits, inserting swaps might not be necessary at all).

3 RELATED WORK

3.1 Challenges of the NISQ Era

Quantum computing in the NISQ era presents several challenges. Gyongyosi and Imre [16] provide a recent summary of the space. Martonosi and Roetteler [25] give a broad overview of the importance of computer science principles in the development of quantum computing. Möller and Vuik [28] describe the lessons that can be learned from how classical architectures have evolved to do scientific computing. Almudéver et al. [7] provide a summary from an engineering perspective.
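The two metrics in Equation (4) amount to a subtraction and a ratio; a small helper (name illustrative) makes the worked example from Figure 4 explicit:

```python
def routing_overhead(original_depth, routed_depth):
    """Circuit depth overhead (CDO) and circuit depth ratio (CDR), Eq. (4):
    CDO = d(c') - d(c), CDR = d(c') / d(c)."""
    cdo = routed_depth - original_depth
    cdr = routed_depth / original_depth
    return cdo, cdr
```

For the circuit in the text, `routing_overhead(6, 7)` gives a CDO of 1 and a CDR of 7/6.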
The general theme among these papers is that quantum computing is currently facing many of the challenges that classical computing faced during its development. Franke et al. [14] highlight an important concern that came about in the late 1950s, called “the tyranny of numbers,” which refers to the massive increase in the number of components and interconnections required as architectures were scaled up, before the invention of the integrated circuit. The authors note that current quantum computing architectures will see a similar lack of scalability—they adapt Rent's rule [23] and define a quantum Rent exponent $p$ to quantify the progress made in this aspect of optimisation.

Another key challenge is the lack of qubit fidelity, which limits the achievable QV—a hardware-agnostic metric coined by IBM [10] that quantifies the limits of executable circuit size on a given quantum device. Minimising circuit depth provides an effective way of maximising QV, which is why the problem of routing qubits is such a key piece of the puzzle when it comes to maximising the usefulness of NISQ-era machines.

3.2 Qubit Routing

Herbert [17] provides some theoretical bounds on the depth overhead incurred from routing, while Tan and Cong [38] provide insight into the performance of various state-of-the-art routing systems on benchmarks with known optimal depth.

IBM's Qiskit [5] is considered to be the most advanced and complete open-source quantum compiler. In 2018, IBM offered a prize for whoever could write the best routing algorithm for their quantum architectures. The winner was Alwin Zulehner, with an algorithm based on A*

IBM Research Blog, Now Open: Get quantum ready with new scientific prizes for professors, students and developers; https://www.ibm.com/blogs/research/2018/01/quantum-prizes/. IBM Research Blog, We Have Winners! ... of the IBM Qiskit Developer Challenge; https://www.ibm.com/blogs/research/2018/08/winners-qiskit-developer-challenge/.
Fig. 5. An illustration of the RL process [37].

search [42]. Second place was tied between Sven Jandura and Eddie Schoute [9]. Zulehner claims that his background in Computer-Aided Design helped guide his prize-winning approach, which further supports the theme of Section 3.1 (above). Zulehner has gone on to publish an approach targeting SU(4) quantum circuits [43] and an approach for arbitrary quantum circuits based on Boolean satisfiability [40], which is only tractable for circuits with small numbers of qubits. IBM have currently integrated two approaches from the literature into Qiskit [6, 24], although open-source code exists for the others [36, 41].

Cambridge Quantum Computing's (CQC) t|ket [4] is a proprietary compiler with state-of-the-art performance. Its routing procedure is detailed by Cowtan et al. [12]—it effectively adds SWAPs to minimise some cost function based on proprietary heuristics. The t|ket documentation suggests that the system now uses BRIDGE gates—Itoko et al. [19] provide some insight into how the use of BRIDGE gates can improve the performance of SWAP-only routing algorithms.

Many other methods exist in addition to the ones above, such as that used by Google's Cirq [3] or that proposed by Li et al. [24], who also propose a technique for finding initial qubit placements. Research also exists regarding approaches that consider alternative factors when routing, such as differing qubit error rates [29], and even approaches that use quantum computers for quantum compilation [21].

3.3 Reinforcement Learning: Deep Q-learning

Reinforcement learning is a sub-field of Machine Learning that offers a powerful paradigm for learning to perform tasks in contexts with very little a priori knowledge of what the optimal strategy might be.
The paradigm has proven to be very useful in complex situations with lots of input data, such as robotics [22] and video games [20, 27]. Under the paradigm, agents learn to achieve some goal in an environment by freely interacting with it and observing rewards for performing actions in different states (see Figure 5). The Deep Q-learning (DQN) paradigm proposed by Mnih et al. [27] uses a convolutional neural network to learn a function Q(s, a) that represents the quality of being in state s and taking action a. The function is defined as follows:

Q(s, a) = r(s, a) + γ max_{a'} Q(s', a'),    (5)

where r is the reward conveyed to the agent for taking action a in state s, s' is the state resulting from taking that action, and γ is a discount factor for discounting the value of future rewards.

Sven Jandura, Improving a Quantum Compiler; https://medium.com/qiskit/improving-a-quantum-compiler-48410d7a7

Eddie Schoute, Constraints on Quantum Circuits and getting around them; https://medium.com/qiskit/constraints-on-quantum-circuits-and-getting-around-them-7de973bd1a18.

Alwin Zulehner, How Computer-Aided Design helped me winning the QISKit Developer Challenge; https://medium.com/qiskit/how-computer-aided-design-helped-me-winning-the-qiskit-developer-challenge-4b1b60c8930f.

When using a neural network to update the Q function in response to new experiences, the learning equation can be written as follows:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})).    (6)

In general, a neural network for DQN will be designed to take a state vector as input (e.g., an image bitmap) and will have one output for each possible action (e.g., buttons on a games console)—this allows the system to easily select the maximum Q-value from the neural network's outputs, rather than have to re-run a forward pass of the neural network for each action individually. State, action, and reward formulations depend on the problem type, but a common feature of DQN is to do some feature pre-processing via a feature selection function ϕ—for example, one may wish to convert an image to greyscale and down-sample it if dealing with large RGB images as input (as in Reference [27]). Such feature pre-processing may help the neural network to learn more effectively.

3.3.1 Improvements to DQN. There exist two common improvements to deep Q-learning, Double Deep Q-learning (DDQN) [39] and Prioritised Experience Replay (PER) [35]. The former helps improve the stability of the learning process by using two neural networks instead of one, while the latter enhances the learning process by replaying more useful experiences with a higher priority.

3.3.2 Action Selection Strategies. DQN methods often use an ϵ-greedy exploration policy during training (and perhaps also during testing): This means that a random action is taken with probability ϵ, and the action with the highest Q-value is taken with probability 1 − ϵ. The value of ϵ begins at 1 and gradually decays after each learning batch—this means that, initially, the agent often chooses random actions, while as training goes on, it increasingly chooses actions that maximise the Q-value.

4 QUBIT ROUTING WITH Q-LEARNING

In this section, we propose a new DQN paradigm to tackle the problem of routing qubits.
The problem fits quite naturally into an RL-based interpretation: The end goal is to schedule a set of CNOT gates, given an initial mapping of logical qubits onto physical nodes, by inserting SWAP gates as necessary. The environment thus consists of a partially scheduled circuit, and the agent can decide to schedule CNOT and SWAP gates as necessary, where physically possible.

4.1 RL Formulation for Qubit Routing

Consider a mapping of logical qubits onto physical nodes, such that each logical qubit effectively "inhabits" a given node at a given timestep. Given an initial such mapping, an RL agent is given the ability to "schedule" gates from the original logical circuit, i.e., CNOTs, but only if the hardware constraints permit—that is, only if the two qubits involved in the logical gate inhabit adjacent physical nodes in the architecture's topology. The agent is also able to swap the nodes that logical qubits inhabit, with the goal of perhaps resolving such hardware constraints—this corresponds to "scheduling" a SWAP gate. In this situation, "scheduling" a gate means adding it to an ordered list of gates—at the end of the process, we will be left with a complete list of gates that, when parallelised into layers, represent a routed version of the original logical circuit.

Under this formulation, the state of the environment at any given timestep consists of a layout of logical qubits on physical nodes, and a partially-routed circuit—that is, a set of gates that have already been scheduled by the agent, and a set of gates that have yet to be routed. This state can be represented by a tuple containing the following elements, for example:

Fig. 6. A circuit and target topology, and two possible routed circuits.

• Qubit locations: a mapping l : N → Q denoting which architectural nodes are currently holding which logical qubits.
• Qubit targets: a partial mapping t : Q ⇀ Q, with q_2 = t(q_1) iff q_1's next interaction in the circuit is with q_2, or undefined if q_1 has performed all of its interactions.

• Circuit progress: a mapping p : Q → N_{d+1}, for a circuit with depth d, with n = p(q) iff q has completed n interactions so far.

A reward signal can be issued to the agent for each CNOT it schedules. Actions can be formulated either as scheduling individual gates/swaps, or scheduling a layer of gates/swaps—in both cases, we assume that CNOTs and SWAPs both take one timestep to complete. We argue that for the problem at hand, the latter action formulation is preferable. Consider the following example.

Figure 6 shows a circuit to be routed on a target topology (with given initial layout) and two possible solutions. Figure 6(c) and (d) represent two different solutions that both require two "actions" under the former action formulation, and therefore appear equivalent, since they both schedule two gates. However, in reality, c_1 can occur in two timesteps, since the first CNOT and the SWAP can occur in parallel, while c_2 must take three timesteps. c_1 is therefore the optimal solution, in terms of added circuit depth, but the former RL formulation provides no way of telling that this is the optimal choice.

As this example demonstrates, an RL system could never hope to be optimal if it relies merely on implicit execution of single quantum gates, since it has no concrete way of minimising circuit depth. It is therefore important for actions to be formulated as sets of parallelisable quantum gates/swaps—in other words, an action is a layer of gates, potentially mixing logical CNOTs from the original circuit together with SWAPs.

4.2 Combinatorial Action Spaces

A key challenge of this formulation lies in the fact that the action space is combinatorial—that is, for a connectivity graph with n edges, there are O(2^n) possible parallelisable sets of gates to choose from.
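To make the combinatorial size of the action space concrete: the SWAP sets that can execute in one timestep are exactly the matchings (vertex-disjoint edge subsets) of the connectivity graph, and brute-force enumeration grows exponentially with the edge count. The sketch below is our own illustration, not the paper's code:

```python
from itertools import combinations

def parallelisable_swap_sets(edges):
    """Enumerate all SWAP sets executable in one timestep: subsets of
    edges that share no node (i.e., matchings of the connectivity graph)."""
    results = []
    for r in range(len(edges) + 1):
        for subset in combinations(edges, r):
            nodes = [n for e in subset for n in e]
            if len(nodes) == len(set(nodes)):  # vertex-disjoint check
                results.append(subset)
    return results

# A path graph on 4 nodes (3 edges) admits 5 such sets: the empty set,
# three singletons, and the pair of outer edges.
print(len(parallelisable_swap_sets([(0, 1), (1, 2), (2, 3)])))  # → 5
```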
It is therefore intractable to have a neural network learn a quality function based on state-action pairs—a one-hot encoding of actions on the output layer of the neural network would be infeasible. To mitigate this issue, we propose a modified learning equation that learns the quality of state transitions and uses a combinatorial optimisation technique (simulated annealing in this case) to search for the action leading to the highest-quality state transition after that. The quality function can therefore be expressed as follows:

Q(s_t, s_{t+1}) = r_t + γ max_{a_{t+1}} Q(s_{t+1}, env(s_{t+1}, a_{t+1})),    (7)

where the env function yields the resulting state when applying a given action to a given state (the resulting state would be s_{t+2} in this particular case). The DQN model presented in this article is trained to represent the above equation—its (single) output value is thus the "quality" of being in one state and transitioning to another under a given action. Like traditional Q-learning, this method attempts to capture the idea of quality being assigned to a state and an action, but by using the next state instead of the action itself and passing both state vectors as inputs to the neural network, the system is able to overcome the issue of the combinatorial action space.

4.3 Scheduling Gates

In our system, CNOT gates are automatically scheduled when their logical qubits inhabit adjacent nodes. Actions can therefore be considered as a mandatory set of CNOT gates, together with a set of SWAP gates that can be chosen by the agent to be performed in the same timestep (as long as they do not involve qubits already locked in a CNOT gate in the current timestep).

4.3.1 Performing the Actions.
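A sketch of the target value implied by Equation (7), assuming hypothetical `q_network` and `env` callables that score and generate candidate next transitions (all names and the toy integer state encoding are ours, for illustration only):

```python
def transition_target(reward, gamma, q_network, env, s_next, candidate_actions):
    """Target for Q(s_t, s_{t+1}) per Eq. (7): the observed reward plus the
    discounted best transition quality reachable from s_{t+1}."""
    best = max(q_network(s_next, env(s_next, a)) for a in candidate_actions)
    return reward + gamma * best

# Toy usage with stand-in functions: states are ints, an action adds its value.
env = lambda s, a: s + a
q_network = lambda s, s2: float(s2 - s)  # pretend quality = progress made
print(transition_target(1.0, 0.6, q_network, env, 5, [1, 2, 3]))  # → 2.8
```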
In standard RL style, the environment takes a state s_t and action a_t, and outputs a tuple (s_{t+1}, r_t), as follows:

(1) Schedule any CNOT gates between adjacent logical qubits—consider these nodes to be "protected," so that no SWAPs can be scheduled involving them in the current timestep.

(2) Calculate the total distance between each pair of mutually-targeting nodes (i.e., gates), which are not in the protected set—call this the total pre-swap distance d_pre.

(3) Perform the swaps in a_t, and calculate the total post-swap distance d_post.

(4) Compute reward r_t.

Effectively, this amounts to scheduling some gates, scheduling some swaps that don't conflict with them, and then updating the state in response to this action—this means that gates and swaps are mixed into the same actions, but the gates are mandatory, i.e., gates are performed as soon as their two qubits land next to each other.

A fixed gate reward is issued for each pair of mutually targeting qubits that are brought next to each other (i.e., when a gate is made possible). In the absence of other reward signals, the reward for a gate whose qubits are very distant would be discounted to such an extent that it would essentially be lost in the noise. To ensure that the system scales well to quantum architectures with a higher diameter, we also introduce a distance reduction reward (when d_post < d_pre)—this is just a fixed constant for each qubit that is brought closer to (but not next to) its target.

4.3.2 Simulated Annealing. This is the combinatorial optimisation process used to find actions to perform. The process searches for higher-quality states by first swapping a random edge in the architecture and then probing neighbouring solutions (i.e., actions that are one further edge swap away) and accepting them based on some acceptance probability.
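The reward computed in step (4) can be sketched as follows; the two constants are illustrative placeholders of ours, since the article does not state their magnitudes here:

```python
def compute_reward(gates_enabled, qubits_brought_closer,
                   gate_reward=10.0, distance_reward=1.0):
    """Reward sketch: a fixed bonus per CNOT made possible this timestep,
    plus a fixed bonus per qubit moved closer to (but not next to) its
    target. The constant values are illustrative, not the paper's."""
    return (gate_reward * gates_enabled
            + distance_reward * qubits_brought_closer)

# One gate made possible and two qubits brought closer to their targets.
print(compute_reward(gates_enabled=1, qubits_brought_closer=2))  # → 12.0
```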
Actions that would lead to a non-parallelisable set of SWAPs are immediately disqualified and removed from further consideration, as are those that would conflict with the current set of "protected" nodes, i.e., nodes inhabiting qubits that are currently involved in a CNOT gate.

The quality of a given action a, which acts on state s_0 to yield next state s_x, is just Q(s_0, s_x). The acceptance probability is then

P_acc(Q_x, Q_x', t) = e^{(Q_x' − Q_x)/t} if Q_x' ≤ Q_x, and 1 otherwise,    (8)

for qualities Q_x = Q(s_0, s_x) and Q_x' = Q(s_0, s_x') (i.e., current action and new candidate action), and current "temperature" t. The temperature decays by a fixed multiplier upon each iteration, until a given minimum temperature is reached.

"Quality" refers to the value of the Q-function defined above, i.e., the output of the DQN model.

4.4 State Representation

The state tuple described above needs to be condensed into a fixed-length vector that can be learned from readily by a neural network. The processed state representation (or feature selection function ϕ) we propose here is a distance vector d, such that d[i] represents the number of qubits that are a distance of i from their targets. One benefit of this representation is that it scales well—rather than scaling with the number of qubits n, which the original state tuple would have, this state representation now scales with the diameter of the connectivity graph, which is O(√n) for a grid, and may be as little as O(log n) in some cases [17]. This representation is still not injective, since many different scenarios can map onto the same distance vector, hindering the learning process.
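Equation (8) is a Metropolis-style acceptance rule: a better candidate is always accepted, and a worse one is accepted with a probability that shrinks as the temperature cools. A minimal sketch (function and variable names are ours), using the annealer defaults quoted in Section 4.6:

```python
import math
import random

def accept(q_current, q_candidate, temperature):
    """Acceptance rule of Eq. (8): always accept a better candidate; accept
    a worse one with probability exp((Q_candidate - Q_current) / t)."""
    if q_candidate > q_current:
        return True
    return random.random() < math.exp((q_candidate - q_current) / temperature)

# Cooling schedule from Section 4.6: start at 60.0, multiply by 0.95 per
# iteration, stop once the minimum temperature of 0.1 is reached.
t, t_min, cooling = 60.0, 0.1, 0.95
while t > t_min:
    t *= cooling
```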
It is therefore helpful to allow the agent to distinguish situations in which its action choice will end up more or less constrained by the currently scheduled gates—we add another component to our proposed feature selection function, e, such that e[i] represents the number of nodes n that have i edges conforming to the following conditions:

(1) The edge neighbours n, and lies along the shortest path to n's target.

(2) The edge does not involve a currently protected node.

The system concatenates the above state vectors for s_t and s_{t+1} and passes this new representation to the neural network. In other words, the system uses a feature selection function Φ(s_t, s_{t+1}) = (d_{s_t}, e_{s_t}, d_{s_{t+1}}, e_{s_{t+1}}).

4.5 Model Algorithm

At a high level, the DQN model combines the above concepts into a qubit routing procedure that works as follows. At each timestep, the model searches for an action to perform by carrying out the simulated annealing process described above—this process executes multiple passes of the neural network, once per candidate action, to search for an action that maximises the neural network's output value Q. Once such an action, a, is selected, the environment updates the state in response to this action to yield a new state. This process continues until a terminal state is reached, i.e., all of the CNOT gates in the original circuit have been scheduled.

This algorithm is generally the same for both training and inference, apart from some minor differences. During training, when an action is selected, a reward signal is observed, and this experience tuple is then saved for replaying later when training on a batch of experiences. During inference, when an action is selected, the reward signal is disregarded, and instead the CNOT and SWAP gates represented by the action are added to the routed circuit.
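The distance-vector component d of the feature selection function can be computed in a few lines (a sketch under our own naming, assuming the per-qubit distances to targets have already been computed from the topology):

```python
from collections import Counter

def distance_vector(target_distances, diameter):
    """d[i] = number of qubits currently at distance i from their targets.
    The vector length scales with the graph diameter, not the qubit count."""
    counts = Counter(target_distances)
    return [counts.get(i, 0) for i in range(diameter + 1)]

# Four qubits at distances 1, 1, 2, and 3 from their targets, on a
# topology of diameter 3.
print(distance_vector([1, 1, 2, 3], diameter=3))  # → [0, 2, 1, 1]
```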
4.6 Model Structure and Hyperparameters

The DQN model presented in this article is a neural network with three 32-node fully connected layers, with ReLU activation functions. The output is a single-node layer with linear activation function. The input vector size is |(d_{s_t}, e_{s_t}, d_{s_{t+1}}, e_{s_{t+1}})| = 2 · ((d + 1) + (e + 1)), for a connectivity graph with furthest distance d between two nodes, and max node degree e. The loss function is Mean Squared Error, and the model uses the Adam optimizer with a learning rate of 0.001. The model is constructed using the keras package, using a tensorflow backend.

Besides the learning rate, the model uses a set of default hyperparameters. The value of γ in the above equation is 0.6. The ϵ parameter of the model's ϵ-greedy strategy begins at 1.0, and decays by a factor of 0.9 for each training episode, until it reaches a minimum value of 0.001. The annealer also has a set of default hyperparameters. Its initial temperature t is 60.0, with a minimum temperature of 0.1 and a cooling multiplier for each iteration of 0.95. Our model also makes use of the two mentioned improvements to DQN, namely DDQN and PER.

5 RESULTS

In this section, we evaluate our DQN system on a variety of quantum circuits and architectures by comparing it to other routing algorithms in state-of-the-art compilers. The compilers we benchmark against are CQC's t|ket, IBM's Qiskit, and Google's Cirq. Other compilers exist, but Tan and Cong [38] found t|ket and Qiskit to be the leaders in the space, so this selection is sufficient to demonstrate how our approach compares to the industry standard. The code for our DQN system is available on GitHub [31].
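The input-size formula above can be sanity-checked with a one-line helper (ours, for illustration); for instance, a 4 × 4 grid has a furthest inter-node distance of 6 (corner to corner) and a maximum node degree of 4:

```python
def input_vector_size(furthest_distance, max_degree):
    """|(d_st, e_st, d_st+1, e_st+1)| = 2 * ((d + 1) + (e + 1))."""
    return 2 * ((furthest_distance + 1) + (max_degree + 1))

# A 4 x 4 grid: furthest distance 6, max degree 4.
print(input_vector_size(6, 4))  # → 24
```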
5.1 Benchmarking Setup

When referring to the performance of a given system, we mean how it behaves in terms of circuit depth overhead/ratio—that is, better performance means lower circuit depth overhead/ratio. We instead use the term runtime to refer to how long a given system takes to run.

We consider a SWAP gate as a primitive operation taking a single timestep (same time as a CNOT)—this assumption represents the simplest fair method of comparison, and has precedent in the literature [12] (see Discussion for more on this). We have verified that the other compilers also output circuits with a mix of SWAPs and CNOTs, rather than performing SWAP decomposition. Throughout the following benchmarks, we have disabled every sort of compiler pass except for the routing process itself, to ensure that our comparison is fair and pertinent to the task at hand. This includes placement routines—we have chosen to use random initial placements instead.

For the baseline systems, we downloaded the most recent versions compatible with Python 3.7. Qiskit comes with various routing algorithms—the most recent and performant are StochasticSwap and SabreSwap [24]. Which of the two performs better depends on the benchmark, so for fair comparison, we picked the one with the best CDO/CDR in each case. Where hyperparameters made a difference, we also chose values that maximised performance—in practice, we found that the number of trials for Qiskit's StochasticSwap algorithm was the only parameter that heavily impacted the results. For CQC's t|ket, we disabled the use of BRIDGE gates, so it could be fairly compared to our SWAP-only algorithm—we briefly assess the impact of such gates, as well as SWAP decomposition, in the Discussion section below.

5.2 A Word on Runtime

It is worth noting that our DQN system requires re-training for each target architecture, and potentially a round of hyperparameter optimisations, which critics may view as a drawback.
However, in NISQ-era quantum computing, classical (compiler) runtime is not such a concern if quantum runtime (CDO/CDR) can be improved at all. In addition, new quantum architectures are not developed every day, and once an RL agent (or "model") has been trained for a given quantum architecture, it can be re-used on the same architecture indefinitely, to compile a limitless number of quantum circuits.

Fig. 7. An example of a 16-qubit single full-layer circuit, and a 4 × 4 grid architecture for it to be executed on.

5.3 Training

In RL, it is common practice to train up a few different models and pick the best. For each of the following benchmarks, we trained a series of models for each architecture on randomly-generated sets of circuits, and the models were then run on separate test sets. In situations where there was significant variation in quality between models in training, only the best-quality models were retained and subsequently used in testing. Such cases are clearly identified below, and we still run through the same total number of test circuits, for fairness.

5.4 Single Full-layer Circuits

The first benchmark involves single full-layer circuits on increasing grid sizes. More precisely, these are n-qubit circuits, each with n/2 disjoint gates. This benchmark represents the worst kind of situation for a routing algorithm to deal with, since the original depth is very low (d = 1) but the number of gates to schedule is maximal for this depth. Figure 7(a) shows an example of such a circuit with 16 qubits.

Clearly, single full-layer circuits can be scheduled immediately on grid architectures if the correct initial placement is chosen, so a random placement is used to test the effectiveness of the routing scheme. Figure 7(b) shows an example of a 4 × 4 grid architecture that could be used to execute the circuit represented by Figure 7(a).
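A single full-layer circuit of the kind described above can be generated by randomly pairing up the qubits (an illustrative sketch of ours, not the authors' benchmark generator):

```python
import random

def single_full_layer(n):
    """Generate a single full-layer circuit on n qubits (n even): a random
    perfect pairing, i.e. n/2 disjoint two-qubit gates."""
    qubits = list(range(n))
    random.shuffle(qubits)
    return [(qubits[i], qubits[i + 1]) for i in range(0, n, 2)]

layer = single_full_layer(16)
print(len(layer))  # → 8
```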
For the random initial placement shown, only the gate between qubits q_2 and q_3 can be performed immediately—SWAP gates must be inserted to make the other gates possible.

Figure 8 shows how CDO increases with increasing grid size. For each system, five batches of 100 test circuits were executed and the results were averaged. For the DQN system, the model was retrained on a separate training set also consisting of similar circuits, once per batch, for each different grid size, using a number of training circuits that increased linearly with the number of qubits.

Qiskit is clearly the best system for this benchmark, which is unsurprising, since its StochasticSwap is the only system here that schedules all of the necessary swaps before scheduling all of the original (logical) gates at once, for each layer, which is an effective strategy for maximally dense layers. However, the number of trials had to be increased significantly from the default value of 20 to reach this level of performance, which some might view as a drawback of the system.

We were unable to replicate the results in CQC's paper [12], so we include their reported results as well as the results we obtained from running t|ket as described above.

Fig. 8. Single full-layer performance on various grid topologies. Qiskit here is StochasticSwap.

The DQN system outperforms the current version of t|ket (although it is unable to outperform their reported results), and its performance scales sublinearly with the number of qubits, which is encouraging. Cirq's performance is poor, even with a better parameter value—an r_max of more than 3 rapidly becomes intractable, with minimal benefit. We therefore exclude Cirq from the benchmarks that follow.
5.5 Multi-layer Circuits

Another important type of circuit to consider is the multi-layer circuit. These are circuits composed of a series of N layers with density ρ, such that each layer will have ρn/2 gates. A density of ρ = 1.0 thus yields a series of full layers, and the number of gates in such a circuit is maximal for the given depth. Lower densities naturally lead to a less strict layer structure, reducing the number of gates in a circuit of given depth.

Figure 9 shows the performance data for three circuit densities. The system was trained on two-layer circuits with ρ = 1.0, since we found that training on circuits with more layers actually worsened performance. The three best models of five were selected for testing.

Qiskit's StochasticSwap exhibits good performance on full-layer circuits, since it employs its single-layer strategy for each (maximally dense) layer in sequence. Such behaviour is evident from its perfectly constant CDR in Figure 9(a). The DQN system also performs well in this case, outperforming t|ket by about a third. It comes within about 20% of Qiskit's performance on circuits with 10 layers.

As density is decreased, the performance of the DQN system begins to surpass that of Qiskit on deeper circuits. This is not a surprise, since Qiskit rigidly schedules each layer in sequence and ignores gates in future layers. DQN's performance improvement is a good sign as we turn towards random circuits below, which mostly have layer densities in the range [0.25, 0.45].

5.6 Random Circuits

Random circuits are a reasonable simulation of real quantum circuits. They are generated by adding gates between random qubits, leading to circuits with low layer densities. Figure 10 shows the performance data for each system on four different quantum architectures. For each datapoint, the systems were executed on five batches of 100 test circuits each, and the results were averaged. For
each architecture, we trained 16 DQN models with different hyperparameters on separate training sets consisting of random circuits, and retained the highest-quality models for testing.

Fig. 9. Multi-layer performance on the IBM Q20 Tokyo with differing circuit densities. Qiskit here is StochasticSwap.

The DQN system has the best performance across all architectures and circuit sizes in the above plots, which is very encouraging. In particular, it is encouraging to see that the DQN system is able to maintain best-in-class performance on larger quantum architectures. For random circuits, average layer density increases with the number of gates in the circuit—despite this, the DQN system's CDR still remains lowest on random circuits with 1,000 gates, across all architectures.

As the grid size increases, hyperparameter optimisation becomes ever more important—in particular, it is important to slow down the annealer's temperature decay, so that optimal actions can indeed be found. Nevertheless, these results demonstrate that the best DQN models are able to surpass the best competitor (Qiskit's StochasticSwap, here) on architectures of up to about 50 qubits, even as circuit depth increases, which suggests that the DQN system will remain competitive on most near-term quantum architectures and circuits.

5.7 Realistic Test Set

As a final benchmark, we sought to test each system on a set of real quantum circuits. We chose the test set of 158 circuits used by Zulehner [2] and filtered out any circuits with a depth of 200 or more (due to runtime constraints).

Fig. 10. Each compiler's performance on random circuits of different sizes, for four quantum architectures. Qiskit here is StochasticSwap.
The final benchmark set thus consisted of 95 realistic quantum circuits, ranging from 3 to 16 qubits, depths of 5 to 199, and 5 to 240 CNOT gates.

We ran each system on four different realistic architectures. For each architecture, we trained 5 DQN models on random circuits with 50 gates, and generated five sets of random initial placements (one placement per circuit in each set). We then ran the system on the benchmark set, once per model and initial placement set, yielding a total of 25 runs through each circuit. We chose not to isolate the best models here, to give an indication of average-case performance. For the other systems, we repeated each circuit and initial placement five times with the same parameters. We also used a fixed seed when generating the placements, such that each system would use the same placement sets.

Figure 12 shows the mean CDR for each system and quantum architecture. The error bars represent one standard deviation, obtained over the mean CDRs of each model. We only plotted error bars for the DQN system, since the error bars for the other systems were too small to plot—for the former, variance between models is far more significant than variance between different runs of the same model, meaning that most of the variance between runs arises from variation in the quality of models.

Fig. 11. The connectivity graph of the Rigetti 19Q Acorn [12].

The performance gap between the DQN system and the other baselines on the 4 × 4 grid and the IBM Q20 Tokyo is significant—as the error bars illustrate, even the worst model on each architecture outperforms t|ket by at least 11% (and the best yields CDRs up to 17% lower).
The DQN models are also able to outperform the other baselines by a good margin on the IBM Q16 Rüschlikon, which is effectively a 2 × 8 grid and thus has worse connectivity than a 4 × 4 grid (see the next section for a discussion of architecture connectivities).

The Rigetti 19Q Acorn (Figure 11) is an example of an architecture with sparse connectivity, with nodes of degree either 2 or 3, which poses difficulties for routing algorithms. The DQN system runs into some issues on this architecture—most notably, we found that training stability decreased, and that the performance gap between DQN and its competitors narrowed. Nonetheless, two models were somewhat higher quality than the other three—isolating these two (see Figure 12(d)) yields a performance that is still better than that of t|ket, albeit more marginally than on other architectures. The error bar was too small to plot here, since the two models have very similar performance, but the benchmark did still run through each placement set five times, for consistency. In any case, with both models, every run through the test set yielded a lower CDR than t|ket's average, so we can be reasonably confident that the DQN system is indeed still able to outperform the others on this architecture when training a few models and selecting the best (which is standard practice in RL).

Table 1 shows a breakdown of results for the deepest benchmark circuits tested. For the first three architectures, the DQN system outperforms the other systems on every benchmark circuit listed. For the Rigetti 19Q Acorn, the DQN system outperforms on 10 of 15 circuits. For reference, we have included a similar table for gate count—that is, displaying Circuit Gate Ratios (CGRs), the analogue of CDR for gate count. Table 2 shows a breakdown of results for the same benchmark circuits as in the previous table, i.e., the deepest.
On average, the DQN approach tends to use more gates, yet has smaller circuit depth ratios—to help shed some light on this characteristic, we wrote some code to automatically display diagrams (such as Figure 13) representing the internal state of the RL algorithm at each timestep. From this process, we made the observation that the DQN algorithm sometimes fills up layers with gates that do not achieve much improvement in the overall state quality Q—for example, swapping two qubits in one given timestep, only to swap them back in the next. Such behaviour stems from the fact that the DQN method does not currently incorporate a penalty for adding gates to a given layer, and it is therefore free to fill up layers with operations that may not greatly improve the quality of the current state. This issue could be mitigated by adding an appropriate penalty to the RL reward function—this would then allow the DQN method to balance the increase in Q from imminently scheduled CNOTs against the decrease in Q from the addition of SWAPs in the layer. In the case of redundant gates, such a penalty would make the net reward unfavourable, discouraging the agent from adding them. It is important to bear in mind that minimising CGR was not the goal of this article, as outlined in the Introduction, and as such, Table 2 is really only for reference. Furthermore, we would speculate that minimal CDR is unlikely to coincide with minimal CGR—that is, to schedule all operations in as few timesteps as possible, it will generally be the case that some additional swaps may be needed (compared to an algorithm that simply minimises swap count).
Fig. 12. Each compiler's performance on a test set of 95 realistic circuits [2], for four quantum architectures. Qiskit here is SabreSwap.
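The gate-penalty modification proposed above might look as follows; `shaped_reward`, the quality values, and the penalty weight are hypothetical stand-ins for the system's actual RL reward, included only to illustrate the idea.

```python
def shaped_reward(q_before: float, q_after: float,
                  swaps_added: int, penalty: float = 0.1) -> float:
    """Reward = improvement in state quality Q, minus a per-SWAP penalty.

    A redundant swap (q_after == q_before) now yields a negative reward,
    so the agent is discouraged from filling layers with idle swaps.
    """
    return (q_after - q_before) - penalty * swaps_added

# A swap that improves Q remains worthwhile...
print(shaped_reward(1.0, 1.5, swaps_added=1))  # 0.4
# ...but a redundant swap is now penalised.
print(shaped_reward(1.0, 1.0, swaps_added=1))  # -0.1
```

The penalty weight would need tuning so that genuinely useful swaps (those enabling imminently scheduled CNOTs) still receive positive reward.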
Nevertheless, we are confident that our above proposed modification to the DQN method to incorporate an extra gate penalty would help decrease CGR significantly. In fact, we would argue that training an agent to minimise CGR would be relatively straightforward, and we would expect such an agent to still perform well—in this case, actions would simply be single gates, which would remove the need for simulated annealing altogether.
6 DISCUSSION
Overall, the DQN system's performance throughout the benchmarks has been very positive. Performance on the layerised benchmarks was good, and the system still fared well against its competitors; looking instead at the random and realistic circuit benchmarks, the DQN system outperforms the other state-of-the-art baselines across all of the quantum architectures we tried. We would argue that such benchmarks (especially the realistic circuits) are the most important, since they most accurately represent the type of circuit that a routing algorithm may end up tackling in a real-world context, and it is therefore very encouraging to see how well the DQN system performs here.
Table 1.
CDRs for Circuits in the Benchmark Set with an Original Depth above 100, for Each Architecture and Algorithm

Circuit                  | 4 × 4 grid          | IBM Q20 Tokyo       | IBM Q16 Rüschlikon  | Rigetti 19Q Acorn
Name                   d | DQN    Qiskit tket  | DQN    Qiskit tket  | DQN    Qiskit tket  | DQN    Qiskit tket
alu-v2_30            199 | 1.237  1.527  1.488 | 1.160  1.303  1.232 | 1.355  1.611  1.626 | 1.636  1.667  1.634
mod8-10_177          178 | 1.238  1.513  1.437 | 1.161  1.321  1.337 | 1.353  1.542  1.526 | 1.610  1.658  1.597
rd53_131             175 | 1.283  1.494  1.479 | 1.182  1.399  1.319 | 1.421  1.558  1.605 | 1.648  1.692  1.682
C17_204              173 | 1.310  1.526  1.492 | 1.215  1.366  1.369 | 1.440  1.611  1.599 | 1.683  1.688  1.674
alu-v2_31            172 | 1.268  1.533  1.502 | 1.154  1.301  1.329 | 1.392  1.585  1.535 | 1.637  1.697  1.699
4gt4-v0_73           160 | 1.277  1.533  1.493 | 1.166  1.326  1.351 | 1.366  1.542  1.568 | 1.627  1.674  1.661
ex3_229              157 | 1.239  1.497  1.511 | 1.173  1.278  1.279 | 1.376  1.606  1.545 | 1.631  1.688  1.701
cnt3-5_180           148 | 1.373  1.628  1.599 | 1.286  1.495  1.480 | 1.579  1.676  1.569 | 1.769  1.897  1.801
mod8-10_178          135 | 1.252  1.503  1.456 | 1.164  1.320  1.376 | 1.388  1.559  1.527 | 1.631  1.658  1.603
decod24-enable_126   134 | 1.252  1.557  1.510 | 1.159  1.396  1.361 | 1.384  1.587  1.509 | 1.635  1.693  1.637
ham7_104             134 | 1.245  1.530  1.497 | 1.179  1.332  1.260 | 1.370  1.563  1.554 | 1.618  1.626  1.637
one-two-three-v0_97  116 | 1.262  1.546  1.517 | 1.146  1.284  1.381 | 1.356  1.541  1.572 | 1.629  1.631  1.643
rd53_135             114 | 1.289  1.555  1.481 | 1.199  1.419  1.318 | 1.420  1.603  1.579 | 1.656  1.689  1.651
mini-alu_167         111 | 1.274  1.523  1.494 | 1.147  1.352  1.396 | 1.375  1.587  1.544 | 1.638  1.679  1.686
4gt4-v1_74           108 | 1.271  1.564  1.506 | 1.166  1.353  1.296 | 1.372  1.616  1.680 | 1.633  1.700  1.754

Table 2.
CGRs for Circuits in the Benchmark Set with an Original Depth above 100, for Each Architecture and Algorithm

Circuit                  | 4 × 4 grid          | IBM Q20 Tokyo       | IBM Q16 Rüschlikon  | Rigetti 19Q Acorn
Name                   n | DQN    Qiskit tket  | DQN    Qiskit tket  | DQN    Qiskit tket  | DQN    Qiskit tket
alu-v2_30            223 | 3.726  1.593  1.490 | 3.275  1.296  1.222 | 3.365  1.656  1.630 | 3.641  1.700  1.622
mod8-10_177          196 | 3.590  1.592  1.463 | 3.212  1.342  1.322 | 3.288  1.632  1.548 | 3.548  1.674  1.611
rd53_131             200 | 4.151  1.580  1.485 | 3.846  1.455  1.301 | 3.957  1.642  1.620 | 4.075  1.684  1.697
C17_204              205 | 4.110  1.590  1.479 | 3.815  1.378  1.334 | 3.869  1.652  1.593 | 4.139  1.687  1.632
alu-v2_31            198 | 3.178  1.562  1.463 | 2.772  1.284  1.289 | 2.883  1.594  1.499 | 3.050  1.654  1.594
4gt4-v0_73           179 | 3.749  1.583  1.487 | 3.344  1.330  1.335 | 3.385  1.596  1.593 | 3.694  1.664  1.641
ex3_229              175 | 3.681  1.579  1.502 | 3.259  1.327  1.283 | 3.388  1.656  1.563 | 3.645  1.718  1.698
cnt3-5_180           215 | 4.102  1.618  1.545 | 4.290  1.378  1.353 | 4.346  1.657  1.591 | 5.045  1.767  1.752
mod8-10_178          152 | 3.726  1.608  1.468 | 3.351  1.328  1.372 | 3.450  1.627  1.550 | 3.735  1.689  1.607
decod24-enable_126   149 | 3.586  1.600  1.501 | 3.190  1.395  1.358 | 3.283  1.635  1.505 | 3.513  1.704  1.639
ham7_104             149 | 4.221  1.579  1.494 | 3.922  1.319  1.251 | 3.849  1.635  1.562 | 4.270  1.637  1.634
one-two-three-v0_97  128 | 3.265  1.607  1.470 | 2.885  1.276  1.364 | 2.970  1.636  1.572 | 3.138  1.680  1.628
rd53_135             134 | 4.055  1.612  1.531 | 3.758  1.442  1.272 | 3.733  1.647  1.604 | 4.043  1.714  1.670
mini-alu_167         126 | 3.221  1.576  1.459 | 2.835  1.346  1.335 | 2.991  1.630  1.522 | 3.135  1.685  1.679
4gt4-v1_74           119 | 3.770  1.626  1.524 | 3.426  1.360  1.316 | 3.454  1.686  1.703 | 3.666  1.744  1.773

Another key point to note is that the DQN approach is very flexible—it has a wealth of hyperparameters that can be optimised for each specific architecture. However, the approaches used in state-of-the-art compilers tend to have very few hyperparameters and are somewhat fixed.
In particular, Qiskit's strategy of focussing on each layer in sequence is wasteful on low-density circuits: not all qubits will be involved in a gate in a given layer, and swapping these idle qubits ahead of time is necessary to help schedule future gates. In fact, one can obtain a lower bound for the CDR that such a method could achieve on a given architecture by generating a random layer of gates (of a fixed target density) and a random initial placement, and computing half of the average furthest distance between any pair of qubits involved in a gate. This bounds the number of layers of SWAP gates required to schedule a given layer of logical gates. Doing so for a 7 × 7 grid, using the correct density for random circuits with 1,000 gates, for example, yields a CDR that is still higher than that of the DQN system. This means that even with an infinite number of trials, Qiskit will not be able to outperform the DQN system for this architecture and circuit size. Such a result demonstrates the inherent limitation of treating layers separately when routing.
Fig. 13. A sample visualisation of the internal state of the RL algorithm at a given timestep, on an IBM Q20 Tokyo. On each physical node, the top number represents the logical qubit inhabiting that node, and the bottom number represents its logical target. As shown here, logical qubits 8 and 9 are currently performing an interaction—red links (such as that between logical qubits 2 and 4) represent ongoing swaps. A target of -1 denotes a logical qubit that has finished all of its interactions.
In this timestep, qubit 13 will move to another node, despite the fact that this swap will have no impact on its distance to qubit 11—such a swap is redundant, since the overall value of Q will be the same with or without it, but in the absence of a swap count penalty, the agent cannot distinguish between the two states and is free to choose either action.
The DQN system struggles slightly on architectures with poor connectivity, notably the Rigetti 19Q Acorn. One possible explanation is that the DQN system cannot distinguish between situations in which shortest paths cross and those in which they do not, and thus cannot predict upcoming conflicts. On such architectures, there are very few shortest paths between any two nodes (versus, e.g., a grid), so choosing the path that minimises conflict is key. Another problem arises when multiple qubits are all waiting to interact with the same one—the system's state representation has no way of prioritising the movement of such qubits, despite the fact that their interactions are crucial for the progress of the routing process. This drawback is especially hard-hitting on architectures with poor connectivity, where the system's action choice is heavily constrained—it might get stuck in local minima, since it cannot tell which qubit is the source of the bottleneck.
It is unclear what the future will hold in terms of architecture connectivities. The above points could help motivate future work on the system, especially with respect to very large grid sizes. While the DQN system's performance was still best-in-class on the grid sizes we tried (which are sufficiently large to indicate near-term performance), breaking ties between shortest paths and tackling the wider issue of qubit priority will likely be essential to unlock better performance on even larger architectures, which will become especially relevant as qubit counts continue to grow.
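The layer-wise CDR lower bound described earlier in this discussion can be estimated with a short Monte Carlo sketch. The grid indexing, the way density translates into a gate count, the random pairing scheme, and the "+1 for the gate layer itself" term are all illustrative assumptions here, not the exact procedure used in the paper.

```python
import random

def grid_distance(a: int, b: int, cols: int) -> int:
    """Manhattan distance between two nodes of a grid, indexed row-major."""
    ax, ay = divmod(a, cols)
    bx, by = divmod(b, cols)
    return abs(ax - bx) + abs(ay - by)

def layer_cdr_lower_bound(rows, cols, density, trials=1000, seed=0):
    """Estimate a CDR lower bound for layer-by-layer routing on a grid.

    For each trial: place qubits randomly, pair some of them into a random
    layer of two-qubit gates, and take half the furthest gate distance as a
    bound on the SWAP layers needed before that logical layer can run.
    """
    rng = random.Random(seed)
    n = rows * cols
    total = 0.0
    for _ in range(trials):
        qubits = list(range(n))
        rng.shuffle(qubits)                    # random initial placement
        n_gates = max(1, int(density * n / 2)) # gates in this layer
        pairs = [(qubits[2 * i], qubits[2 * i + 1]) for i in range(n_gates)]
        furthest = max(grid_distance(a, b, cols) for a, b in pairs)
        total += furthest / 2
    return 1 + total / trials  # swap layers per logical layer, plus the layer itself
```

For example, `layer_cdr_lower_bound(7, 7, 0.5)` gives an estimate of the best CDR any method that routes each layer in isolation could achieve on a 7 × 7 grid at that gate density.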
6.1 Applying SWAP Decomposition
Throughout the Results section, we considered both CNOT and SWAP gates to take one timestep each. This was the simplest fair method of comparing routing procedures (with precedent in the literature [12]), since it is the most architecture-agnostic—for example, for some quantum technologies, pulse-level optimisation can lead to SWAP gates executing in 1.5 timesteps (by using iSWAP gates) [15], and in future we can expect a wider variety of such optimisations to emerge. However, at the moment, SWAP gates must be performed via decomposition into CNOT gates, and they thus take three timesteps to execute (as illustrated in Figure 3(b)). While the DQN system has not been optimised with this in mind, we nonetheless sought to assess each routing system's performance when performing such decomposition.
Fig. 14. Performance of each routing system, calculating CDR after SWAP decomposition into CNOT gates. For random circuits, we once again show Qiskit's StochasticSwap; for realistic circuits, we once again show Qiskit's SabreSwap.
Figure 14(a) shows each system's performance on random circuits after performing SWAP decomposition, while Figure 14(b) shows performance on realistic circuits. We also tried enabling BRIDGE gates for t|ket, in both cases, and show these results separately. Encouragingly, the DQN system still outperforms its competitors on random circuits, even on such a large grid size—in fact, the general shape of the graph remains largely unchanged. However, the ordering of the systems is reversed for realistic circuits. In particular, t|ket is the best system when decomposing SWAPs into CNOTs, suggesting that its routing process might be optimised for this particular scenario.
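The decomposition referred to here is the standard identity expressing a SWAP as three back-to-back CNOTs; a minimal sketch (the gate-tuple representation is an illustrative choice, not the system's internal format):

```python
def decompose_swap(a: int, b: int) -> list:
    """Decompose a SWAP on qubits (a, b) into three back-to-back CNOTs.

    Each CNOT occupies one timestep, so each layer of SWAPs can add up to
    three timesteps of depth to the routed circuit after decomposition.
    """
    return [("CX", a, b), ("CX", b, a), ("CX", a, b)]

print(decompose_swap(0, 1))  # [('CX', 0, 1), ('CX', 1, 0), ('CX', 0, 1)]
```

Because the three CNOTs on the same qubit pair cannot be parallelised with each other, a routing pass that is unaware of the decomposition can see its depth overhead roughly triple for swap-heavy layers.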
The relatively poor performance of DQN here (despite excellent performance when not performing decomposition) could be explained by a variety of factors, perhaps most significantly by the fact that it makes no effort to minimise SWAP count, only depth (as demonstrated by Table 2). When considering SWAPs and CNOTs to take the same amount of time, performing a redundant SWAP carries no penalty in the RL formulation—however, when performing decomposition, an unnecessary depth cost may additionally be incurred from such redundancy, especially on sparse (e.g., realistic) circuits with low CDRs. Furthermore, the DQN system has no way of optimising its action choice specifically with decomposition in mind—awareness that a SWAP takes three timesteps would allow the system to choose SWAPs that allow for "pipelining," that is, beginning a SWAP while another is ongoing, to minimise depth overhead.
To reiterate, DQN's poor performance relative to t|ket and Qiskit here is not surprising, since in this work, the DQN system was not optimised with SWAP decomposition in mind, nor was minimising gate count prioritised. However, the RL formulation can certainly be modified to mitigate both of the above issues—much in the same way that this article proposes mixing SWAPs and CNOTs into the same timesteps, future work may well allow SWAP gates to be scheduled with an awareness of their decomposition, thus allowing them to occur "out of time," enabling decompositions that yield a lower depth overhead than otherwise possible. Equally, we could minimise gate count by incorporating an added reward (or rather, penalty) signal into the RL formulation to penalise the frivolous addition of gates.
BRIDGE gates may also be added as potential actions for the annealer to choose from, and models can be trained on realistic circuits to cope with common patterns that arise therein (rather than
training still on random circuits, as we do above). We firmly believe that with such improvements, a DQN system will still be able to remain competitive in the most realistic scenarios. Meanwhile, a more static method such as Qiskit's presents no such opportunities for improvement, and its performance is therefore bounded.
Overall, however, DQN's best-in-class performance on random circuits, even when performing SWAP decomposition, demonstrates that the system is well ahead of its competitors when considering routing as an isolated problem. We therefore believe that there are more gains to be had by adopting an RL approach in quantum compilers more generally, and certainly by implementing the improvements we outline above.
6.2 Dealing with Noise and Variability in Qubit Errors
Another consideration for future work is the fact that on realistic quantum machines, different qubits/links have differing measurement/gate errors. It is therefore preferable to execute gates along links that have higher fidelity. Such details can be incorporated into the state formulation, for example by extending the state vector to include information about the fidelity of links along which gates are taking place. The reward signal can also be similarly extended to this effect. After such improvements, the DQN system would naturally learn to schedule SWAP gates with variable gate errors in mind, which would be beneficial when compiling quantum circuits for lower-fidelity quantum architectures.
6.3 Other Recommendations for Future Development
The feature selection function is the component that we have found most impactful throughout the work—we thoroughly believe that with a richer state representation, the DQN agent will be able to achieve even higher performance in complex scenarios.
The main problem with the current representation is that a lot of information is lost when converting the full state into a mere distance vector. The information about available swaps helps somewhat, but in practice this does little to help break ties when qubits are far away from their targets. A new representation might encode some information about shortest paths between mutually-targeting qubits and their potential for conflict, as well as information about which qubits should be prioritised. Ultimately, there is a balance here between the size of the representation and the information it is able to capture. It is also possible that a very large network with a massive amount of training could learn its own optimal representation, in the true spirit of "deep" learning—we would be curious to see whether this works. Furthermore, many RL methods employ some form of lookahead, in which a series of future episodes are simulated to choose the best action—this could help the DQN system to predict upcoming bottlenecks and react accordingly.
Another possible improvement relates to automating the learning process. At the moment, the choice of the number of training episodes is somewhat arbitrary, but it would certainly be more useful (and reliable) to employ a deterministic scheme with some well-formed criteria, such as detecting when the weights of the neural network have converged.
Furthermore, the system currently lacks the ability to use BRIDGE gates, or to delay gates instead of scheduling them as soon as possible. Extending the RL paradigm to incorporate such characteristics is certainly achievable, and can simply be done by adding these as possible actions to the RL formulation.
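To illustrate the representational bottleneck discussed above, the following sketch reduces a placement and target map to a distance vector via BFS over the connectivity graph. The function names and data layout are assumptions for illustration, not the system's actual code.

```python
from collections import deque

def all_pairs_distances(adj: dict) -> dict:
    """BFS shortest-path distances on an architecture's connectivity graph."""
    dist = {}
    for src in adj:
        d = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[src] = d
    return dist

def distance_vector(placement: dict, targets: dict, dist: dict) -> list:
    """One distance per logical qubit with an outstanding interaction.

    placement: logical qubit -> physical node; targets: logical -> logical.
    Paths, crossings, and qubit priorities are all discarded here, which is
    exactly the information loss discussed above.
    """
    return [dist[placement[q]][placement[t]] for q, t in sorted(targets.items())]

# A 2 x 2 grid: physical nodes 0 and 3 are diagonal, so their distance is 2.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
dist = all_pairs_distances(adj)
print(distance_vector({0: 0, 1: 3}, {0: 1}, dist))  # [2]
```

Note how two very different board states (e.g., with crossing versus non-crossing shortest paths) can collapse to the same vector, which is why the agent cannot anticipate conflicts from this feature alone.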
It is also worth noting that t|ket not only uses BRIDGE gates, but also has optimisation passes to simplify chains of CNOT gates—while we disabled such functionality to perform a fair comparison, it would certainly improve t|ket's performance when considering circuit depth after SWAP decomposition, a necessary step for current architectures. Adding similar functionality to the DQN system would be a simple yet important task, to remain competitive as a full compilation process (i.e., beyond mere routing).
One thing we have not considered in this article is the impact of initial placement—instead, we have chosen to use random initial placements throughout. The motivation for this choice is that random initial placements allow the routing algorithms to be evaluated independently from any other optimisations, as well as the fact that random placements effectively simulate a potential state of the system mid-way through execution of a larger quantum circuit. Besides, it is not unreasonable to assert that a separate optimised placement routine will benefit each routing algorithm equally, and therefore will not have any effect on their comparison. However, this is something that future research should certainly look at—in fact, there could be ways of applying RL to the task of finding optimal initial placements. For example, one could do so by picking a random placement, constraining the current DQN system to apply only SWAP gates initially until the highest-quality state possible is achieved, and using the final placement as the initial placement of the regular DQN routing procedure.
6.4 Another Word on Runtime
At the moment, the system is not very well optimised in terms of runtime—we have always preferred to run the system longer, or use a more exhaustive method, to minimise CDO and CDR.
Runtime optimisation would require significant future work, which is why we have shied away from directly comparing the runtime of our system to that of the other baselines in this article. That said, it is important to give at least some indication of the timescales involved. For the realistic test set on a 4 × 4 grid, disregarding training time, the DQN system took about 2,400 s to complete one run (of 100 circuits), while Qiskit (StochasticSwap) took about 36 s and t|ket took about 6 s. The DQN system is clearly much slower, but for perspective, Qiskit's LookaheadSwap routing method [6] (which came second in the Qiskit Developer Challenge) is almost 4 times as slow as the DQN system on the 4 × 4 realistic circuits benchmark, despite only achieving as good a CDR as Qiskit's faster StochasticSwap method. Equally, it is worth noting that compiling TensorFlow to take advantage of SIMD extensions (such as AVX or FMA) and GPUs could help improve the runtime of our method, without touching the code itself.
Besides, the runtime of the DQN system can certainly be reduced while still maintaining good performance. One such area for improvement is the annealer—an adaptive scheme with a variable number of iterations would help greatly. In fact, other combinatorial optimisation techniques could be used, such as random-restart hill climbing [34]. Another area for optimisation is clearly the size of the neural network used. In practice we found such changes to make little difference, but it is perfectly possible that the layer structure used at the moment is wasteful—a more principled search would be necessary for each quantum architecture. Once again, this would be a worthy time investment, since new quantum architectures are developed infrequently—the time required to develop a new architecture is clearly far greater than the time required for such a search.
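Random-restart hill climbing [34], mentioned above as a cheaper alternative to the annealer, can be sketched as follows; `quality` and `neighbours` stand in for the system's Q-value estimate and swap-set perturbations, and the toy problem at the end is purely illustrative.

```python
import random

def random_restart_hill_climb(candidates, quality, neighbours, restarts=5, seed=0):
    """Random-restart hill climbing over a combinatorial action space.

    From several random starting actions, greedily move to an improving
    neighbouring action until no neighbour improves, keeping the best
    action found across all restarts.
    """
    rng = random.Random(seed)
    best, best_q = None, float("-inf")
    for _ in range(restarts):
        current = rng.choice(candidates)
        current_q = quality(current)
        improved = True
        while improved:
            improved = False
            for nxt in neighbours(current):
                q = quality(nxt)
                if q > current_q:
                    current, current_q, improved = nxt, q, True
        if current_q > best_q:
            best, best_q = current, current_q
    return best

# Toy usage: maximise -(x - 3)^2 over the integers 0..10.
best = random_restart_hill_climb(
    list(range(11)),
    quality=lambda x: -(x - 3) ** 2,
    neighbours=lambda x: [x - 1, x + 1],
)
print(best)  # 3
```

Unlike annealing, each climb terminates as soon as it reaches a local optimum, so the number of quality evaluations adapts to the landscape rather than being fixed in advance.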
Equally, returning to a single-state approach (with some scalability improvements) may prove effective while greatly diminishing training times, due to the lack of annealing when replaying past experiences. Better parallelisation would also be useful—the literature on this topic includes several methods that could be of use in an RL context [11, 26, 30].
7 CONCLUSION
In this article, we have presented an RL approach to address the problem of routing qubits on near-term quantum architectures. We proposed a modified deep Q-learning formulation, in which actions are sets of parallelisable swaps/gates—the agent uses simulated annealing to select actions from this combinatorial space. We then benchmarked our DQN system against the qubit routing passes of state-of-the-art quantum compilers.
The key research question throughout has been: Can a DQN approach be used to perform qubit routing in quantum compilers, and, if so, is it able to compete with state-of-the-art approaches? We would say the answer is an emphatic yes. The results demonstrate that a DQN approach is able to surpass the performance of other industry-standard approaches in realistic near-term scenarios, with a level of adaptability that is not possible with other more static approaches, which will be particularly useful as a wider variety of quantum architectures appear in future. Further work is required to maintain best-in-class performance when performing SWAP decomposition, but we are confident that this will be achievable with some modest improvements to the system. Overall, our work demonstrates the value of using an RL approach in the compilation of quantum circuits, and we hope that such an approach can bring further benefits in the space in future.
ACKNOWLEDGMENTS The idea of using reinforcement learning (specifically Q-learning) for qubit routing was first pro- posed by a subset of the present authors (namely Herbert and Sengupta) in an arXiv preprint [18], which has not been published elsewhere (i.e., in a journal or the proceedings of a conference). Spe- cial thanks to Silas Dilkes from Cambridge Quantum Computing (CQC) for providing guidance on how to set up t|ket for our benchmarks. REFERENCES [1] 2018. IBM Q Devices and Simulators. Retrieved from https://web.archive.org/web/20181203023515/https://www. research.ibm.com/ibm-q/technology/devices/. [2] 2018. Quantum Circuit Test Set (Zulehner). Retrieved May 2020 from https://iic.jku.at/eda/research/ibm_qx_ mapping/. [3] 2020. Cirq Documentation (accessed for 0.8.0). Retrieved May 2020 from https://cirq.readthedocs.io/en/stable/. [4] 2020. CQC - Our Technology (accessed for pytket 0.5.4). Retrieved May 2020 from https://cambridgequantum.com/ technology/. [5] 2020. IBM Qiskit (accessed for 0.20.0). Retrieved May 2020 from https://qiskit.org. [6] 2020. Jandura’s routing method (LookaheadSwap documentation). Retrieved from https://qiskit.org/documentation/ stubs/qiskit.transpiler.passes.LookaheadSwap.html#qiskit.transpiler.passes.LookaheadSwap. [7] C.G.Almudever,L.Lao,X.Fu, N. Khammassi,I.Ashraf, D. Iorga, S. Varsamopoulos, C. Eichler,A.Wallra,L ff .Geck, A. Kruth, J. Knoch, H. Bluhm, and K. Bertels. 2017. The engineering challenges in quantum computing. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE’17). IEEE, 836–845. https://doi.org/10.23919/ DATE.2017.7927104 [8] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C. Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando G. S. L. Brandao, David A. 
Buell, Brian Burkett, Yu Chen, Zijun Chen, Ben Chiaro, Roberto Collins, William Courtney, Andrew Dunsworth, Edward Farhi, Brooks Foxen, Austin Fowler, Craig Gidney, Marissa Giustina, Rob Graff, Keith Guerin, Steve Habegger, Matthew P. Harrigan, Michael J. Hartmann, Alan Ho, Markus Hoffmann, Trent Huang, Travis S. Humble, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Dvir Kafri, Kostyantyn Kechedzhi, Julian Kelly, Paul V. Klimov, Sergey Knysh, Alexander Korotkov, Fedor Kostritsa, David Landhuis, Mike Lindmark, Erik Lucero, Dmitry Lyakh, Salvatore Mandrà, Jarrod R. McClean, Matthew McEwen, Anthony Megrant, Xiao Mi, Kristel Michielsen, Masoud Mohseni, Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill, Murphy Yuezhen Niu, Eric Ostby, Andre Petukhov, John C. Platt, Chris Quintana, Eleanor G. Rieffel, Pedram Roushan, Nicholas C. Rubin, Daniel Sank, Kevin J. Satzinger, Vadim Smelyanskiy, Kevin J. Sung, Matthew D. Trevithick, Amit Vainsencher, Benjamin Villalonga, Theodore White, Z. Jamie Yao, Ping Yeh, Adam Zalcman, Hartmut Neven, and John M. Martinis. 2019. Quantum supremacy using a programmable superconducting processor. Nature 574, 7779 (October 2019), 505–510. https://doi.org/10.1038/s41586-019-1666-5 [9] Andrew M. Childs, Eddie Schoute, and Cem M. Unsal. 2019. Circuit transformations for quantum architectures. In Pro- ceedings of the 14th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC’19), Leibniz International Proceedings in Informatics, Wim van Dam and Laura Mancinska (Eds.), Vol. 135. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 3:1–3:24. https://doi.org/10.4230/LIPIcs.TQC.2019.3 [10] Jerry M. Chow and Jay Gambetta. 2020. Quantum takes flight: Moving from laboratory demonstrations to building systems. Retrieved May 2020 from https://www.ibm.com/blogs/research/2020/01/quantum-volume-32/. ACM Transactions on Quantum Computing, Vol. 3, No. 2, Article 10. Publication date: May 2022. 10:24 M. G. 
Pozzi et al. [11] Alfredo V. Clemente, Humberto N. Castejón, and Arjun Chandra. 2017. Efficient parallel methods for deep reinforce- ment learning. arXiv:1705.04862. Retrieved from http://arxiv.org/abs/1705.04862. [12] Alexander Cowtan, Silas Dilkes, Ross Duncan, Alexandre Krajenbrink, Will Simmons, and Seyon Sivarajah. 2019. On the qubit routing problem. In Leibniz International Proceedings in Informatics. Vol. 135. 5:1–5:32. https://drops. dagstuhl.de/opus/volltexte/2019/10397/. [13] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta. 2017. Open quantum assembly language. arXiv:1707.03429. Retrieved from http://arxiv.org/abs/1707.03429. [14] D. P. Franke, J. S. Clarke, L. M. K. Vandersypen, and M. Veldhorst. 2019. Rent’s rule and extensibility in quantum computing. Microprocess. Microsyst. 67 (June 2019), 1–7. https://doi.org/10.1016/j.micpro.2019.02.006 [15] Pranav Gokhale, Ali Javadi-Abhari, Nathan Earnest, Yunong Shi, and Frederic T. Chong. 2020. Optimized quantum compilation for near-term algorithms with openpulse. arXiv:2004.11205. Retrieved from http://arxiv.org/abs/2004. [16] Laszlo Gyongyosi and Sandor Imre. 2019. A survey on quantum computing technology. Comput. Sci. Rev. 31 (February 2019), 51–71. https://doi.org/10.1016/j.cosrev.2018.11.002 [17] Steven Herbert. 2020. On the depth overhead incurred when running quantum algorithms on near-term quantum computers with limited qubit connectivity. Quant. Inf. Computat. 20, 9 & 10 (August 2020), 787–806. https://doi.org/ 10.26421/QIC20.9-10-5 [18] Steven Herbert and Akash Sengupta. 2018. Using reinforcement learning to find efficient qubit routing policies for deployment in near-term quantum computers. arXiv:1812.11619. Retrieved from http://arxiv.org/abs/1812.11619. [19] Toshinari Itoko, Rudy Raymond, Takashi Imamichi, and Atsushi Matsuo. 2020. Optimization of quantum circuit mapping using gate transformation and commutation. Integration 70 (2020), 43–50. https://doi.org/10.1016/j.vlsi.2019. 
10.004 [20] Alice Karnsund. 2019. DQN Tackling the Game of Candy Crush Friends Saga: A Reinforcement Learning Approach. Retrieved from http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1368129&dswid=-15%8. [21] Sumeet Khatri, Ryan LaRose, Alexander Poremba, Lukasz Cincio, Andrew T. Sornborger, and Patrick J. Coles. 2019. Quantum-assisted quantum compiling. Quantum 3 (May 2019), 140. https://doi.org/10.22331/q-2019-05-13-140 [22] Jens Kober, J. Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 32, 11 (September 2013), 1238–1274. https://doi.org/10.1177/0278364913495721 [23] B. S. Landman and R. L. Russo. 1971. On a pin versus block relationship for partitions of logic graphs. IEEE Trans. Comput. C-20, 12 (December 1971), 1469–1479. https://doi.org/10.1109/T-C.1971.223159 [24] Gushu Li, Yufei Ding, and Yuan Xie. 2019. Tackling the qubit mapping problem for NISQ-era quantum devices. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’19), Iris Bahar, Maurice Herlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 1001–1014. https://doi.org/10.1145/3297858.3304023 [25] Margaret Martonosi and Martin Roetteler. 2019. Next steps in quantum computing: Computer science’s role. arXiv:1903.10541. Retrieved from http://arxiv.org/abs/1903.10541. [26] Volodymyr Mnih, Adria Puigdomenech Badia, Lehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), Vol. 48. 1928–1937. https://proceedings.mlr.press/v48/ mniha16.html. [27] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv:1312.5602. Retrieved from http://arxiv.org/ abs/1312.5602. 
[28] Matthias Möller and Cornelis Vuik. 2017. On the impact of quantum computing technology on future developments in high-performance scientific computing. Ethics Inf. Technol. 19, 4 (December 2017), 253–269. https://doi.org/10.1007/s10676-017-9438-0
[29] Prakash Murali, Jonathan M. Baker, Ali Javadi-Abhari, Frederic T. Chong, and Margaret Martonosi. 2019. Noise-adaptive compiler mappings for noisy intermediate-scale quantum computers. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'19). Association for Computing Machinery, New York, NY, 1015–1029. https://doi.org/10.1145/3297858.3304075
[30] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. 2015. Massively parallel methods for deep reinforcement learning. arXiv:1507.04296. Retrieved from http://arxiv.org/abs/1507.04296.
[31] Matteo Pozzi. 2020. Qubit Routing with Reinforcement Learning (GitHub Repository). Retrieved November 2020 from https://github.com/Macro206/qubit-routing-with-rl.
[32] John Preskill. 2012. Quantum computing and the entanglement frontier. arXiv:1203.5813. Retrieved from http://arxiv.org/abs/1203.5813.
[33] John Preskill. 2018. Quantum computing in the NISQ era and beyond. Quantum 2 (2018), 79. https://arxiv.org/abs/1801.00862
[34] Stuart Russell and Peter Norvig. 2010. Artificial Intelligence: A Modern Approach (3rd ed.). https://doi.org/10.1017/S0269888900007724
[35] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay. In Proceedings of the 4th International Conference on Learning Representations (ICLR'16), Conference Track Proceedings. https://arxiv.org/abs/1511.05952.
[36] Eddie Schoute. 2019. Circuit Transformations for Quantum Architectures—Compiler Code (GitLab Repository). Retrieved May 2020 from https://gitlab.umiacs.umd.edu/amchilds/arct/-/blob/master/arct/compiler.py.
[37] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). The MIT Press.
[38] Bochen Tan and Jason Cong. 2021. Optimality study of existing quantum computing layout synthesis tools. IEEE Trans. Comput. 70, 9 (2021), 1363–1373. https://doi.org/10.1109/TC.2020.3009140
[39] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI'16). 2094–2100.
[40] Robert Wille, Lukas Burgholzer, and Alwin Zulehner. 2019. Mapping quantum circuits to IBM QX architectures using the minimal number of SWAP and H operations. In Proceedings of the 56th Annual Design Automation Conference (DAC'19). Association for Computing Machinery, New York, NY, Article 142, 6 pages. https://doi.org/10.1145/3316781.
[41] Alwin Zulehner. 2018. Quantum Information Software Kit (QISKit)—Compiler Code (GitHub Repository, fork). Retrieved May 2020 from https://github.com/azulehner/qiskit-sdk-py/blob/mapping/qiskit/mapper/_mapping.py.
[42] Alwin Zulehner, Alexandru Paler, and Robert Wille. 2019. An efficient methodology for mapping quantum circuits to the IBM QX architectures. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 38, 7 (2019), 1226–1236. https://doi.org/10.1109/TCAD.2018.2846658
[43] Alwin Zulehner and Robert Wille. 2019. Compiling SU(4) quantum circuits to IBM QX architectures. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASPDAC'19). Association for Computing Machinery, New York, NY, 185–190.
https://doi.org/10.1145/3287624.3287704

Received November 2020; revised February 2022; accepted February 2022