
Fusion of Array Operations at Runtime

Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, and James Avery
Niels Bohr Institute, University of Copenhagen, Denmark
{madsbk/safl/blum/avery}@nbi.ku.dk

Abstract—We address the problem of fusing array operations based on criteria such as shape compatibility, data reusability, and communication. We formulate the problem as a graph partition problem that is general enough to handle loop fusion, combinator fusion, and other types of subroutines.

I. INTRODUCTION

Array operation fusion is a program transformation that combines, or fuses, multiple array operations into a kernel of operations. When it is applicable, the technique can drastically improve cache utilization through temporal data locality and enables other program transformations such as streaming and array contraction [5]. In scalar programming languages, such as C, array operation fusion typically corresponds to loop fusion, where multiple computation loops are combined into a single loop. The effect is a reduction of array traversals (Fig. 1). Similarly, in functional programming languages it typically corresponds to fusing individual combinators. In array programming languages, such as HPF [11] and ZPL [2], fusing array operations is crucial, since a program written in these languages will consist almost exclusively of array operations. Lewis et al. demonstrate an execution time improvement of up to 400% when optimizing for array contraction at the array rather than the loop level [10].

    (a) Two forward iterating loops:
        #define N 1000
        double A[N], B[N], T[N];
        for(int i=0; i<N; ++i)
            T[i] = B[i] * A[i];
        for(int i=0; i<N; ++i)
            A[i] += T[i];

    (b) A forward and a reverse iterating loop:
        #define N 1000
        double A[N], B[N], T[N];
        int j = N;
        for(int i=0; i<N; ++i)
            T[i] = B[i] * A[i];
        for(int i=0; i<N; ++i)
            A[i] += T[--j];

    (c) Loop fusion: the two loops from Fig. 1a fused into one:
        for(int i=0; i<N; ++i){
            T[i] = B[i] * A[i];
            A[i] += T[i];
        }

    (d) Array contraction: the temporary array T from Fig. 1c is contracted into the scalar t:
        for(int i=0; i<N; ++i){
            double t = B[i] * A[i];
            A[i] += t;
        }

Fig. 1. Loop fusion and array contraction in C.

However, not all fusions of operations are allowed. Consider the two loops in Fig. 1b; since the second loop traverses the result of the first loop in reverse, we must compute the complete result of the first loop before continuing to the second loop, preventing fusion. Clever analysis sometimes allows transforming the program into a form that is amenable to fusion, but such analysis is outside the scope of the present work. Throughout the remainder of this paper, we assume that any such optimizations have already been performed.

Deciding which operations to fuse together is the same as finding a partition of the set of operations in which the blocks obey the same execution dependency order as the individual operations, and in which no block contains two operations that may not be fused. Out of all such partitions, we want to find one that enables us to save the most computation or memory. It is not an easy problem, in part because fusibility is not transitive. That is, even when it is legal to fuse subroutines x,y and y,z, it may be illegal for all three of x,y,z to be executed together. Thus, one local decision can have global consequences on future possible partitions.

The problem can be stated in a quite general way: "Given a mixed graph, find a legal partition of vertices that cuts all non-directed edges and minimizes the cost of the partition."[1] We call this problem the Weighted Subroutine Partition problem, abbreviated WSP.

[1] See Sec. III for the definition of a legal partition and legal cost function.

The general formulation is applicable to a broad range of optimization objectives. The cost function can penalize any aspect of the partitions, e.g. data accesses, memory use, communication bandwidth, and/or communication latency. The only requirement on the cost function is monotonicity:

  • Everything else equal, it must be cost neutral or a cost advantage to place two subroutines within the same partition block.

Similarly, the definition of partition legality is flexible:

  • Any aspect of a pair of subroutines can make them illegal to have in the same partition block, such as preventing mixing of sequential and parallel loops, different array shapes, or access patterns.
  • Subroutines may have dependencies that impose a partial order. A legal partition must then observe this order, i.e. must not introduce cycles.

The remainder of the paper is structured as follows: In Section III, we formally define the Weighted Subroutine Partition problem, which unifies array operation, loop, and combinator fusion, and prove that it is NP-hard. Section IV shows how WSP is used to solve array operation fusion for the Bohrium automatic parallelization framework, and gives a correctness proof. In Section V, we describe a branch-and-bound algorithm that computes an optimal solution, as well as two approximation algorithms that compute good results rapidly enough for use in JIT-compilation. All the algorithms are implemented in Bohrium and work for any choice of monotonic cost function, which allows us to compare directly with other fusion schemes from the literature. Section VI shows measurements performed on 15 benchmark programs, comparing the optimal solution both to the approximation schemes and to three other fusion schemes.
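The effect of the transformations in Fig. 1 can be illustrated with a small Python sketch (ours, not Bohrium's machinery; the access-counting wrapper is invented for the example). It counts how many temporary-array accesses the unfused version of Fig. 1a performs, and shows that the fused and contracted version of Fig. 1d eliminates them entirely while producing the same result:

```python
# Illustrative sketch (not part of Bohrium): count element accesses to show
# the effect of loop fusion and array contraction from Fig. 1.

class CountingArray:
    """A list wrapper that counts element reads and writes."""
    def __init__(self, data):
        self.data = list(data)
        self.accesses = 0
    def __getitem__(self, i):
        self.accesses += 1
        return self.data[i]
    def __setitem__(self, i, v):
        self.accesses += 1
        self.data[i] = v

def unfused(A, B, N):
    # Fig. 1a: two separate loops communicating through a temporary array T.
    T = CountingArray([0.0] * N)
    for i in range(N):
        T[i] = B[i] * A[i]
    for i in range(N):
        A[i] += T[i]
    return T.accesses

def fused_contracted(A, B, N):
    # Fig. 1d: one loop; T is contracted into the scalar t.
    for i in range(N):
        t = B[i] * A[i]
        A[i] += t
    return 0  # no temporary-array accesses remain

N = 8
A1 = CountingArray([1.0] * N); B1 = CountingArray([2.0] * N)
A2 = CountingArray([1.0] * N); B2 = CountingArray([2.0] * N)
t_unfused = unfused(A1, B1, N)
t_fused = fused_contracted(A2, B2, N)
assert A1.data == A2.data      # both versions compute the same result
assert t_unfused == 2 * N      # T is written N times and read N times
assert t_fused == 0            # the temporary has vanished entirely
```

The 2N saved temporary accesses correspond to the reduction of array traversals mentioned above; contraction additionally removes the N-element memory footprint of T.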
II. RELATED WORK

The WSP problem presented in this paper generalizes the Weighted Loop Fusion (WLF) problem first described by Kennedy in [6] (under the name Fusion for Reuse). The method aims to maximize data locality through loop fusion (corresponding to the Max Locality cost model in Section VI-A). The WLF problem is described as a graph problem where vertices represent computation loops, directed edges represent data dependencies between loops, and undirected edges represent data sharing between loops. Edges that connect fusible loops have a non-negative weight that represents the cost saving associated with fusing the two loops. Edges that connect non-fusible loops are marked as fuse-preventing. The objective is then to find a partition of the vertices into blocks such that no block has vertices connected by fuse-preventing edges and such that the weight sum of edges connecting vertices in different blocks is minimized.

Megiddo et al. have shown that it is possible to formulate the WLF problem as an integer linear program (ILP) [13]. Based on the WLF graph, the idea is to transform the edges into linear constraints that implement the dependency and fusibility relations between the vertices, and to transform the weights into ILP objective variables. The values of the objective variables are either the values of the weights, when the associated vertices are in different partitions, or zero, when they are in the same partition. The objective of the ILP is then to minimize the value of the objective variables. The problem is NP-hard, but the hope is that with an efficient ILP solver, such as lp-solve [1], and a modest problem size, it might be practical as a compile-time optimization.

Darte et al. [4] proved that the WLF problem is NP-hard through a reduction from multiway cut [3]. Furthermore, since maximizing data locality may not maximize the number of array contractions (Fig. 2), they introduce an ILP formulation with the sole objective of maximizing the number of array contractions (the Max Contract cost model in Section VI-A). Robinson et al. [14] describe an ILP scheme that combines the objectives of Megiddo and Darte: both maximizing data locality and array contractions while giving priority to data locality (corresponding to our Robinson cost model).

However, optimization using WLF has a significant limitation: it only allows static edge weights. That is, when building the WLF graph, the values of the edge weights are assigned once and for all. This limitation is the main reason we needed to develop the Weighted Subroutine Partition formalism, because static edge weights are in fact inappropriate for an accurate measurement of data locality.

Consider the WLF example in Fig. 2, which consists of six loops and three arrays A, B, C of size 1. The objective is to maximize data locality, represented by weight edges connecting the loops that access the same arrays. Fig. 2b shows the optimal WLF solution to the example, which reduces the total weight from 13 to 3. However, the actual number of array accesses is only reduced from 10 to 7. A better strategy is to fuse loops 1-2 (Fig. 2c), which reduces the actual number of array accesses from 10 to 4.

Fig. 2. A WLF example where the objective is to maximize data locality. (a) shows the initial graph. (b) shows a partition where loop 1 is in one block and loops 2-6 are in another block. (c) shows a partition where loops 1-2 are in one block and loops 3-6 are in another block.

The problem with the WLF formulation here is that all the loops that read the same data must be pair-wise connected with a weight, leading to an over-estimate of the potential data reuse. In the Weighted Subroutine Partition formulation, we work with partitions instead of individual merges, and assign a cost to a partition as a whole. The cost saving of a merge is then the difference in cost between the partitions before and after merging, allowing accurate descriptions of data reuse through the cost function.

III. THE WEIGHTED SUBROUTINE PARTITION PROBLEM

The Weighted Subroutine Partition (WSP) problem is an extension of the Weighted Loop Fusion problem [6] in which we include the weight function in the problem formulation. In this section, we formally define the WSP problem and show that it is NP-hard.

Definition 1 (WSP graph). A WSP graph is a triplet G = (V, E_d, E_f) such that (V, E_d) is a directed acyclic graph describing dependency order, and (V, E_f) is an undirected graph of forbidden edges.

Definition 2 (WSP order). A WSP graph, G = (V, E_d, E_f), induces a partial order <_d on V as follows: v <_d v' iff there exists a path from v to v' in (V, E_d). Since (V, E_d) is acyclic, this partial order is strict.

Definition 3 (Partitions). A partition of a set V is a set P = {B_1, B_2, ..., B_k} such that V is the disjoint union of the blocks B_1, ..., B_k. The set Π_V of all partitions of V is partially ordered as P ≤ P' iff ∀B ∈ P ∃B' ∈ P': B ⊆ B', i.e. if each block in P is a subset of a block in P'.

The set of partitions Π_V is a lattice with bottom and top elements ⊥ = {{v} : v ∈ V} and ⊤ = {V}. The successors of a partition P in the partition order are those partitions that are identical to P except for merging two of the blocks. Conversely, splitting a block results in a predecessor. We write P ≺ P' if P' is a successor of P. This defines a binary merge operator:

Definition 4 (Block merge operator). Given a partition P = {B_1, B_2, ...}, define P/(B_1, B_2) = {B_1 ∪ B_2, ...} to be the successor of P in which B_1 and B_2 are merged and all other blocks are left the same.

Definition 5 (Legal partition). Given a WSP graph, G = (V, E_d, E_f), we say that the partition P ∈ Π_V is legal when the following holds for every block B ∈ P:
  1) ∄ v_1, v_2 ∈ B : (v_1, v_2) ∈ E_f, i.e. no block contains both endpoints of a forbidden edge.
  2) If v_1 <_d v_2 <_d v_3 and v_1, v_3 ∈ B, then v_2 ∈ B, i.e. the directed edges between blocks must not form cycles.

Definition 6 (WSP cost). Given a partition, P, of the vertices in a WSP graph, a cost function cost(P) returns the cost of the partition and respects the following conditions:
  1) cost(P) ≥ 0
  2) P ≤ P' ⇒ cost(P) ≥ cost(P')

Definition 7 (WSP problem). Given a WSP graph, G = (V, E_d, E_f), and a cost function, cost(P), the WSP problem is the problem of finding a legal partition, P*, of V with minimal cost:

    P* ∈ argmin_{P ∈ Π̂_V} cost(P)    (1)

where Π̂_V denotes the set of legal partitions of V.

A. Complexity

In order to prove that the WSP problem is NP-hard, we perform a reduction from the Multiway Cut (MWC) problem [3], which Dahlhaus et al. have shown is NP-hard for fixed k ≥ 3.

Definition 8 (Multiway Cut). Given a tuple (V, E, S, w) consisting of a graph (V, E), a terminal set S = {s_1, ..., s_k} of vertices, and a non-negative weight w(u, v) for each edge (u, v) ∈ E, a multiway cut is an edge set E' whose removal leaves each terminal in a separate component. The solutions to the MWC problem are the multiway cuts of minimal total weight.

Theorem 1. The WSP problem is NP-hard for graphs G = (V, E_d, E_f) that have a chain of three or more edges in E_f.

Proof. We prove NP-hardness through a reduction from multiway cut. Given an MWC instance, (V, E, S, w), we build a WSP instance as follows. Let G = (V, E_d, E_f), with E_f = {(s_i, s_j) : 1 ≤ i < j ≤ k} and E_d = ∅. Define the cut of a partition as the set of edges that connect the blocks:

    cut(P) = {(u, v) ∈ E | ∄B ∈ P : (u, v) ∈ B}

The cuts of the legal WSP partitions Π̂_V are exactly the set of multiway cuts:
  • The set of directed edges E_d is empty, which makes Def. 2 and Def. 5(2) trivially satisfied.
  • The fuse-preventing edges E_f connect each terminal in S and no other vertices. Hence, by Def. 5(1), Π̂_V are exactly those partitions for which no block contains two terminals.

Now let the cost function be the total weight of the cut:

    cost(P) = Σ_{(u,v) ∈ cut(P)} w(u, v)

This is a valid WSP cost function (by Def. 6): it is non-negative, and if P ≤ P' in the partition order, then cut(P) ⊇ cut(P'), whereby cost(P) ≥ cost(P'). Since cost(P) is the MWC total weight, Eq. (1) gives the multiway cuts of minimal total weight, concluding the proof.
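The partition order, the merge operator of Def. 4, and the cut-weight cost used in the proof of Theorem 1 can be sketched as follows (an illustrative sketch; the small weighted graph is invented for the example):

```python
# Sketch of Defs. 3-6: partitions as sets of frozenset blocks, the merge
# operator P/(B1,B2), and the cut-weight cost from the proof of Theorem 1.
# The small weighted graph is invented for illustration.

def merge(P, B1, B2):
    """Def. 4: the successor P/(B1,B2) in which B1 and B2 are merged."""
    assert B1 in P and B2 in P and B1 != B2
    return (P - {B1, B2}) | {B1 | B2}

def cut(P, edges):
    """The edges whose endpoints lie in different blocks of P."""
    block_of = {v: B for B in P for v in B}
    return {(u, v) for (u, v) in edges if block_of[u] != block_of[v]}

def cost(P, weights):
    """Total weight of the cut: a valid, monotone WSP cost (Def. 6)."""
    return sum(weights[e] for e in cut(P, weights))

w = {("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 2}
bottom = {frozenset({v}) for v in "abcd"}            # the ⊥-partition
P1 = merge(bottom, frozenset({"a"}), frozenset({"b"}))

assert cost(bottom, w) == 6     # every edge is cut
assert cost(P1, w) == 3         # merging can only keep or lower the cost
```

Merging any two blocks removes the weight of the edges that become internal, which is exactly the monotonicity condition 2) of Def. 6.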
IV. WSP USED TO OPTIMIZE ARRAY OPERATION FUSION IN BOHRIUM

Stating the WSP problem formulation in a general way allows a great deal of flexibility, as long as the cost function is monotonic. In this section, we use WSP to solve a concrete optimization problem, demonstrating its real-world use. The concrete problem is an optimization phase within the Bohrium runtime system [9] in which a set of array operations is partitioned into computation kernels -- the Fusion of Array Operations (FAO) problem:

Definition 9. Given a set of array operations, A, equipped with a strict partial order imposed by the data dependencies between them, (A, <_d), find a partition, P, of A for which:
  1) All operations within a block in P are fusible (Def. 12).
  2) For all blocks, B ∈ P, if a_1 <_d a_2 <_d a_3 and a_1, a_3 ∈ B, then a_2 ∈ B (i.e. the partition obeys the dependency order).
  3) The cost of the partition (Def. 13) is minimal.

In the following, we provide a brief description of Bohrium and show that the WSP problem solves the FAO problem (Theorem 2).

A. Fusion of Array Operations in Bohrium

Bohrium is a computation backend for array programming languages and libraries that supports a range of languages, such as Python, C++, and .NET, and a range of computer architectures, such as CPU, GPU, and clusters of these. The idea is to decouple the domain-specific frontend implementation from the hardware-specific backend implementation in order to provide a high-productivity and high-performance framework.

Similar to NumPy [15], a Bohrium array operation operates on a set of inputs and produces a set of outputs [9]. Both input and output operands are views of arrays. An array view is a structured way to observe the whole or parts of an underlying base array. A base array is always a contiguous one-dimensional array, whereas views can have any shape, stride, and dimensionality [9]. In the following, when we refer to an array, we mean an array view; when we refer to identical arrays, we mean identical array views that point to the same base array; and when we refer to overlapping arrays, we mean array views that point to some of the same elements in a common base array.

Fig. 3a shows a Python application that uses Bohrium as a drop-in replacement for NumPy. The application allocates and initiates four arrays (lines 4-7), manipulates those arrays through array operations (lines 8-14), and prints the content of one of the arrays (line 16).

    (a) Python code:
        1  import bohrium as bh
        2
        3  def synthetic():
        4      A = bh.zeros(4)
        5      B = bh.zeros(4)
        6      D = bh.zeros(5)
        7      E = bh.zeros(5)
        8      A += D[:-1]
        9      A[:] = D[:-1]
        10     B += E[:-1]
        11     B[:] = E[:-1]
        12     T = A * B
        13     bh.maximum(T, E[1:], out=D[1:])
        14     bh.minimum(T, D[1:], out=E[1:])
        15     return D
        16 print synthetic()

    (b) Corresponding Bohrium array bytecode:
        1  COPY A, 0
        2  COPY B, 0
        3  COPY D, 0
        4  COPY E, 0
        5  ADD  A, A, D[:-1]
        6  COPY A, D[:-1]
        7  ADD  B, B, E[:-1]
        8  COPY B, E[:-1]
        9  MUL  T, A, B
        10 MAX  D[1:], T, E[1:]
        11 MIN  E[1:], T, D[1:]
        12 DEL  A
        13 DEL  B
        14 DEL  E
        15 DEL  T
        16 SYNC D
        17 DEL  D

Fig. 3. A Python application that utilizes the Bohrium runtime system. In order to demonstrate various challenges and trade-offs, the application is synthetic. (a) shows the Python code and (b) shows the corresponding Bohrium array bytecode.

As Bohrium is language agnostic, it translates the Python array operations into bytecode (Fig. 3b) that the Bohrium backend can execute[2]. In the case of Python, the Python array operations and the Bohrium array bytecode are almost in one-to-one correspondence. The first bytecode operand is the output array, and the remaining operands are either input arrays or input literals. Since there is no scope in the bytecode, Bohrium uses DEL to destroy arrays and SYNC to move array data into the address space of the frontend language -- in this case triggered by the Python print statement (Fig. 3a, line 16). There is no explicit bytecode for constructing arrays; on first encounter, Bohrium constructs them implicitly.

[2] For a detailed description of this Python-to-bytecode translation we refer to previous work [8], [7].

In the next phase, Bohrium partitions the list of array operations into blocks that consist of fusible array operations -- the FAO problem. As long as the precedence constraints between the array operations are preserved, Bohrium is free to reorder them as it sees fit, making code optimizations based on data locality, array contraction, and streaming possible. In the final phase, the hardware-specific backend implementation JIT-compiles each block of array operations and executes them.

1) Fusibility: In order to utilize data-parallelism, Bohrium and most other array programming languages and libraries require data-parallelism of the array operations that are to be executed together. The property ensures that the runtime system can calculate each output element independently, without any communication between threads or processors. In Bohrium, all array operations must have this property.

We first introduce some notation that keeps track of memory allocation, deallocation, reads, and writes:

Definition 10. Given an array operation f, the notation in[f] denotes the set of arrays that f reads; out[f] denotes the set of arrays that f writes; new[f] denotes the set of new arrays that f allocates; and del[f] denotes the set of arrays that f deletes (or de-allocates).

Furthermore, given a set of array operations, B, we define the following:

    out[B] ≡ ∪_{f∈B} out[f]
    in[B]  ≡ ∪_{f∈B} in[f]
    new[B] ≡ ∪_{f∈B} new[f]
    del[B] ≡ ∪_{f∈B} del[f]
    ext[B] ≡ (in[B] \ new[B]) ⊔ (out[B] \ del[B])

Here, ext[B] gives the set of external data accesses. "⊔" is disjoint union: arrays that are both read and written are counted twice. DEL and SYNC are counted as having no input or output.

This allows us to formulate the data-parallelism property that determines when array operation fusion is allowed:

Definition 11. A Bohrium array operation, f, is data parallel, i.e. each output element can be calculated independently, when the following holds:

    ∀i ∈ in[f], ∀o, o' ∈ out[f]:
        (i ∩ o = ∅ ∨ i = o) ∧ (o ∩ o' = ∅ ∨ o = o')    (2)

In other words, if an input and an output, or two output arrays, overlap, they must be identical.

Fusing array operations must preserve data-parallelism:

Definition 12. In Bohrium, two array operations, f and f', are said to be fusible when the following holds:
  1) ∀i' ∈ in[f'], ∀o ∈ out[f]: i' ∩ o = ∅ ∨ i' = o
  2) ∀o' ∈ out[f'], ∀o ∈ out[f]: o' ∩ o = ∅ ∨ o' = o
  3) ∀o' ∈ out[f'], ∀i ∈ in[f]: o' ∩ i = ∅ ∨ o' = i

It follows from Definition 11 that fusible operations are those that can be executed together without losing independent data-parallelism. In addition to the data-parallelism property, the current implementation of Bohrium also requires that the length and dimensionality of the fusible array operations be the same.
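Definition 12's overlap conditions can be sketched by modelling every view as a pair of a base-array name and the set of elements it touches (a simplification: real Bohrium views carry shape and stride information, which we reduce to plain index sets here):

```python
# Simplified sketch of Def. 12: views are modelled as (base, index-set)
# pairs. Real Bohrium views carry shape/stride; index sets suffice here.

def overlaps(a, b):
    """Two views overlap if they share a base array and common elements."""
    return a[0] == b[0] and bool(a[1] & b[1])

def compatible(a, b):
    """Defs. 11/12 building block: disjoint, or exactly identical views."""
    return not overlaps(a, b) or a == b

def fusible(f, g):
    """Def. 12: f and g may execute together in one data-parallel kernel."""
    return (all(compatible(i, o) for i in g["in"] for o in f["out"]) and
            all(compatible(o2, o) for o2 in g["out"] for o in f["out"]) and
            all(compatible(o2, i) for o2 in g["out"] for i in f["in"]))

# Views over a 5-element base array "D" (cf. Fig. 3): D[:-1] and D[1:].
D_head = ("D", frozenset({0, 1, 2, 3}))
D_tail = ("D", frozenset({1, 2, 3, 4}))
A      = ("A", frozenset({0, 1, 2, 3}))

f = {"in": [D_head], "out": [A]}        # A[:] = D[:-1]
g = {"in": [A],      "out": [D_tail]}   # writes D[1:] (cf. MAX in Fig. 3b)

assert fusible(f, f)       # an operation is always fusible with itself
assert not fusible(f, g)   # g writes D[1:], which partially overlaps D[:-1]
```

The second assertion mirrors why the MAX/MIN bytecodes of Fig. 3b, which shift their views by one element, constrain fusion in the example.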
2) Cost Model: The motivation for fusing array operations is to reduce the overall execution time. To accomplish this, Bohrium implements two techniques:

  Data Locality: When a kernel accesses an array multiple times, Bohrium will only read and/or write that array once, avoiding accesses to main memory. Consider the two for-loops in Fig. 1a, which each traverse A and T. Fusing the loops avoids one traversal of A and one traversal of T (Fig. 1c). Furthermore, the compiler can reduce the accesses to main memory by 2N elements, since it can keep the last read element of A and T in a register.

  Array Contraction: When an array is created and destroyed within a single partition block, Bohrium will not allocate the array memory, but calculates the result in-place in one single temporary register variable per parallel computing thread. Consider the program transformation from Fig. 1a to 1d, in which, besides loop fusion, the temporary array T is replaced by the scalar variable t. In this case, the transformation reduces the number of accessed elements by 3N and the memory requirement by N elements.

In order to utilize these optimization techniques, we introduce a WSP cost function that penalizes memory accesses from different partition blocks. For simplicity, we do not differentiate between reads and writes, and we do not count accesses to literals or register variables.

Definition 13. In Bohrium, the cost of a partition, P = {B_1, B_2, ..., B_k}, of array operations is given by:

    cost(P) = Σ_{B ∈ P} ‖ext[B]‖    (3)

where the length ‖x‖ is the total number of bytes accessed by the set of arrays in x.

The Bohrium cost saving when merging two partition blocks depends only on the blocks:

Proposition 1 (Merge-savings). Let P be a partition and P' = P/(B_1, B_2) be its successor derived by merging B_1 and B_2. Using the cost function of Def. 13, the difference in cost between the two partitions is:

    cost(P) − cost(P') = ‖ext[B_1] ∩ ext[B_2]‖
                       + ‖new[B_1] ∩ in[B_2]‖
                       + ‖out[B_1] ∩ del[B_2]‖    (4)

Since this cost reduction depends only on B_1 and B_2, we define a function, saving(B_1, B_2), that counts the savings from merging B_1 and B_2, and which is independent of the rest of the partition.

Proof. If P = {B_1, B_2, ...} and P' = {B_1 ∪ B_2, ...}, then the reduction in cost is

    cost(P) − cost(P') = ‖ext[B_1]‖ + ‖ext[B_2]‖ − ‖ext[B_1 ∪ B_2]‖

since all other blocks are the same. Using the fact that B_1 must be executed before B_2, whereby in[B_1] ∩ new[B_2] = ∅ and del[B_1] ∩ out[B_2] = ∅, as well as the new's and del's being disjoint, direct calculation yields Eq. (4).

Note that Prop. 1 directly implies that the cost function of Def. 13 is positive and monotonically decreasing, as required by Def. 6. We next show how the problem can be formulated as a WSP instance.

3) Constructing a WSP problem from Bohrium bytecode: Given a list A of Bohrium array operations, a WSP problem G = (V, E_d, E_f) is constructed as follows:
  1) The data dependencies between the array operations define a partial order: a <_d a' iff a must be executed before a'.
  2) Each array operation a ∈ A defines a vertex v(a) ∈ V.
  3) The dependency graph E_d has an edge (v(a), v(a')) for each pair a, a' ∈ A with a <_d a'.
  4) The fuse-prevention graph E_f has an edge (v(a), v(a')) for each non-fusible pair a, a' ∈ A.

The cost function is as in Def. 13; note that it can be calculated incrementally using Prop. 1. The complexity of this transformation is O(V²), since we may have to check all pairs of array operations for dependencies, fusibility, and cost savings, each of which is O(1). Fig. 4 shows the trivial partition, ⊥, of the Python example, where every array operation has its own block. The cost is 94.

Fig. 4. A partition graph of the Python application in Fig. 3. For illustrative purposes, the graph does not include ignored weight edges (cf. Fig. 5).

4) WSP solves Fusion of Array Operations: Finally, we show that a solution to the WSP problem is also a solution to the FAO problem.

Theorem 2. WSP solves Fusion of Array Operations.

Proof. It is clear from Def. 5 and the construction above that the legal partitions Π̂_V are exactly those that fulfill Properties (1) and (2) of Def. 9. Thus, Def. 7 yields a global minimum over all such partitions, fulfilling also Def. 9(3).
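A sketch of the merge-savings computation: reading ext[B] as the pair (in[B] − new[B], out[B] − del[B]) and using unit-size arrays, so that ‖x‖ simply counts accesses rather than bytes (both simplifications are ours), Eq. (4) can be checked against the direct cost difference of Def. 13 for the two operations of Fig. 1a:

```python
# Sketch of Def. 13 and Prop. 1 on the two operations of Fig. 1a, with each
# block described by its in/out/new/del sets. Arrays are taken to have unit
# size, so the norm counts accesses; real Bohrium counts bytes.

def ext(B):
    """External accesses as a (reads, writes) pair of sets."""
    return (B["in"] - B["new"], B["out"] - B["del"])

def norm(B):
    """‖ext[B]‖ with unit sizes: reads and writes counted separately."""
    r, w = ext(B)
    return len(r) + len(w)

def union(B1, B2):
    return {k: B1[k] | B2[k] for k in ("in", "out", "new", "del")}

def saving(B1, B2):
    """Eq. (4): merge savings computed from B1 and B2 alone."""
    r1, w1 = ext(B1)
    r2, w2 = ext(B2)
    return (len(r1 & r2) + len(w1 & w2)        # shared external accesses
            + len(B1["new"] & B2["in"])        # reads that become internal
            + len(B1["out"] & B2["del"]))      # writes that become internal

# B1: "T = B * A" allocates and writes T;  B2: "A += T" reads and deletes T.
B1 = {"in": {"A", "B"}, "out": {"T"}, "new": {"T"}, "del": set()}
B2 = {"in": {"A", "T"}, "out": {"A"}, "new": set(), "del": {"T"}}

direct = norm(B1) + norm(B2) - norm(union(B1, B2))
assert direct == saving(B1, B2) == 3   # Prop. 1: both computations agree
```

The three savings of Eq. (4) are visible in the example: the shared read of A, the read of T that becomes internal, and the write of T that becomes internal -- exactly the array contraction of Fig. 1d.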
V. ALGORITHMS

In this section, we present an exact algorithm for finding an optimal solution to WSP (with exponential worst-case execution time), and two fast algorithms that find approximate solutions. We use the Python application shown in Fig. 3 to demonstrate the results of each partition algorithm.

A. Partition graphs and chains of block merges

All three algorithms work on data structures called partition graphs, defined as follows:

Definition 14 (Partition graph). Given a graph G = (V, E) and a partition P of V, the corresponding partition graph is the graph Ĝ(P) = (P, Ê(P)) that has an edge (B, B') if there is an edge (u, v) ∈ E with u ∈ B and v ∈ B'. That is, the vertices are the blocks, connected by the edges that cross block boundaries.

From this we build the state needed in the WSP computations:

Definition 15 (WSP state). Given a WSP instance G = (V, E_d, E_f, cost) and a partition P, the WSP state is the partition graph Ĝ(P) = (P, Ê_d(P), Ê_f(P)) together with a complete weighted graph Ê_w(P) with weights w(B_1, B_2) = cost(P) − cost(P/(B_1, B_2)).

Notice that w(B_1, B_2) = saving(B_1, B_2) for the Bohrium cost function, as shown in Prop. 1, and thus does not require a full cost calculation.

Definition 16 (Merge operator on partition graphs). We extend the merge operator of Def. 4 to partition graphs as Ĝ(P)/(B_1, B_2) = Ĝ(P/(B_1, B_2)). This acts exactly as a vertex contraction on the partition graph.

The merge operator is commutative in the sense that the order in a sequence of successive vertex contractions does not affect the result [16]. An auxiliary function, MERGE, is used in each algorithm to update the state.

Definition 17. Let S = (Ĝ, Ê_w) be a WSP state. We define

    MERGE((S, u, v)) = (Ĝ/(u, v), Ê'_w)

where Ê'_w is the updated weight graph on the edges incident to the new vertex z = u ∪ v.

The complexity of MERGE is dominated by the weight update, which requires a saving computation per edge incident to the merged vertex, and is bounded by O(V²).

We next need a local condition for when a merge is allowed:

Lemma 1 (Legal merge). Let P_{1,2} = P/(B_1, B_2) be the successor of a legal partition P ∈ Π̂_V, derived by merging blocks B_1 and B_2. Then P_{1,2} ∈ Π̂_V if and only if
  1) (B_1, B_2) ∉ Ê_f(P), and
  2) there is no path of length ≥ 2 from B_1 to B_2 in the partition graph Ê_d(P).

Proof. Recall that Π̂_V is the subset of partitions in Π_V that satisfy Def. 5. Because P is legal, no block contains an edge in E_f. Hence P_{1,2} obeys Def. 5(1) if and only if no two vertices u ∈ B_1 and v ∈ B_2 are connected in E_f, or equivalently, (B_1, B_2) ∉ Ê_f(P). Similarly, by assumption, there are no cycles in Ê_d(P). Thus, P_{1,2} violates Def. 5(2) if and only if Ê_d(P) contains a path B_1 → B' → ··· → B_2, forming the cycle B_{1,2} → B' → ··· → B_{1,2} in Ê_d(P_{1,2}) (where B_{1,2} = B_1 ∪ B_2).

    function LEGAL(G, e)
        (u, v) ← e
        l ← length of the longest path between u and v in Ed[G]
        if l > 1 then
            return false
        else
            return true
        end if
    end function

Fig. 5. A helper function that determines whether the weight edge e ∈ Ew[G] may legally be merged over (cf. Lemma 1).

Proposition 2 (Reachability through legal merges). Given two legal partitions P < P', there exists a successor chain P ≺ P_1 ≺ P_2 ≺ ··· ≺ P' entirely contained in Π̂_V, i.e. corresponding only to legal block merges.

Proof. A successor chain P ≺ P_1 ≺ ··· ≺ P_{n−1} ≺ P' always exists in the total set of partitions Π_V, and all such chains are of the same length n. Any such chain contains no partition that violates Def. 5(1): each step is a merge, so once a fuse-preventing edge is placed inside a block, it would be included also in a block of P'. Hence we only need to worry about Def. 5(2).

We now show by induction that a successor chain consisting of only legal partitions exists. First, if P ≺ P', it is trivially so. Assume now that the statement is true for all n ≤ N, and consider P < P' of distance N + 1. Pick any successor chain from P to P'. If any step violates Def. 5(2), then let P_{i+1} = P_i/(B_1, B_2) be the first partition in the chain that does so. Then there is a path B_1 → B → ··· → B_2 in the transitive reduction of Ê_d(P_i). Because P' satisfies Def. 5(2), B_1 ∪ B ∪ B_2 is contained in a block of P', whereby P'_{i+1} ≡ P_i/(B_1, B) < P'. This merge introduces no cycles, because the path is in the transitive reduction. Now let P'_{i+1} ≺ P'_{i+2} ≺ ··· ≺ P' be a legal successor chain of length N − i, known to exist by hypothesis. Then P ≺ P_1 ≺ ··· ≺ P_i ≺ P'_{i+1} ≺ ··· ≺ P' is a length-(N+1) successor chain consisting of only legal partitions, concluding the proof by induction.

In particular, the optimal solutions can be reached in this way from the bottom partition ⊥ = {{v_1}, {v_2}, ..., {v_n}}, which we will use in the design of the algorithms.

B. Unintrusive Partition Algorithm

In order to reduce the size of the partition graph to be analyzed, we apply an unintrusive strategy where we merge vertices that are guaranteed to be part of an optimal solution. Consider the two vertices, a and e, in Fig. 4. The only beneficial merge possibility a has is with e, so if a is merged in the optimal solution, it is with e. Now, since fusing a and e will not impose any restriction on future possible vertex merges in the graph, the two vertices are said to be unintrusively fusible. We formalize this property using the non-fusible sets:

Definition 18 (θ, non-fusible set). The non-fusible set, θ[b], for a block b, is the set of blocks connected with b in Ê_d through a path containing a non-fusible edge.

Theorem 3. Given a partition graph Ĝ, let z = u ∪ v be the merged vertex in Ĝ/(u, v). The vertices u and v are unintrusively fusible whenever:
  1) θ[u] = θ[v] = θ[z], i.e. the non-fusible sets are unchanged.
  2) Either u or v is a pendant vertex in Ê_d, i.e. the degree of either u or v must be 1.

Proof. If Condition 1 is satisfied, any merge that is disallowed at a later stage due to Def. 5(1) would be disallowed also without the merge. Similarly, merging a pendant vertex with its parent does not affect the possibility of introducing cycles through future merges (Def. 5(2)). Finally, since the cost function is monotonic, the merge cannot adversely affect a future cost.

    function FINDCANDIDATE(G)
        for (v, u) ← Ew[G] do
            if not LEGAL(G, (u, v)) then
                Remove edge (u, v) from Ew
            end if
        end for
        for (v, u) ← Ew[G] do
            if the degree is less than 2 for either u or v, when only counting edges in Ed[G], then
                if θ[u] = θ[v] then
                    return (u, v)
                end if
            end if
        end for
        return (NIL, NIL)
    end function

    function UNINTRUSIVE(G)
        while (u, v) ← FINDCANDIDATE(G) ≠ (NIL, NIL) do
            G ← MERGE(G, u, v)
        end while
        return G
    end function

Fig. 6. The unintrusive merge algorithm, which only merges unintrusively fusible vertices.

Fig. 6 shows the unintrusive partitioning algorithm. It uses a helper function, FINDCANDIDATE, to find two vertices that are unintrusively fusible. The complexity of FINDCANDIDATE is O(E(E+V)), which dominates the while-loop in UNINTRUSIVE, whereby the overall complexity of the unintrusive merge algorithm is O(E²(E+V)).
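Lemma 1's local legality test, which LEGAL (Fig. 5) and FINDCANDIDATE rely on, can be sketched with a plain depth-first search (an illustrative sketch; the block names are invented for the example):

```python
# Sketch of Lemma 1's local legality test on a partition graph: merging u
# and v is legal iff they share no fuse-preventing edge and the dependency
# graph has no path of length >= 2 between them.

def has_long_path(ed, src, dst):
    """DFS: is there a directed path src -> ... -> dst with >= 2 edges?"""
    stack = [(n, 1) for n in ed.get(src, ())]
    seen = set()
    while stack:
        node, length = stack.pop()
        if node == dst and length >= 2:
            return True
        if node in seen or node == dst:
            continue                    # never walk through dst itself
        seen.add(node)
        stack.extend((m, length + 1) for m in ed.get(node, ()))
    return False

def legal_merge(ed, ef, u, v):
    """Lemma 1: condition 1 (no fuse edge) and condition 2 (no long path)."""
    if (u, v) in ef or (v, u) in ef:
        return False
    return not (has_long_path(ed, u, v) or has_long_path(ed, v, u))

# A -> B -> C plus a direct edge A -> C: merging A and C would trap B.
ed = {"A": ["B", "C"], "B": ["C"]}
ef = set()
assert legal_merge(ed, ef, "A", "B")        # only the direct edge A -> B
assert not legal_merge(ed, ef, "A", "C")    # path A -> B -> C has length 2
```

The illegal case is exactly the cycle-forming situation of the proof: merging A and C while B lies between them would create A∪C → B → A∪C in the contracted graph.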
Note that there is little O need to further optimize UNINTRUSIVE since we will only In order to reduce the size of the partition graph to be use it as a preconditioner for the optimal solution, which will analyzed, we apply an unintrusive strategy where we merge dominate the computation time. vertices that are guaranteed to be part of an optimal solution. Fig.9showsanunintrusivepartitionofthePythonexample Consider the two vertices, a,e, in Fig. 4. The only beneﬁcial with a partition cost of 70. However, the signiﬁcant improve- merge possibility a has is with e, so if a is merged in the ment is the reduction of the number of weight edges in the optimal solution, it is with e. Now, since fusing a,e will not graph. As we shall see next, in order to ﬁnd an optimal graph impose any restriction to future possible vertex merges in the partition in practical time, the number of weight edges in the graph,thetwoverticesaresaidtobeunintrusivelyfusible.We graph must be modest. formalize this property using the non-fusible sets: C. Greedy Partition Algorithm Deﬁnition18(θ,non-fusibleset). Thenon-fusibleset,θ[b]for a block b is the set of blocks connected with b in Eˆ through Fig.7showsagreedymergealgorithm.Itusesthefunction d a path containing a non-fusible edge. FIND-HEAVIEST to ﬁnd the edge in Ew with the greatest 1: functionGREEDY(G) 2: whileEw[G](cid:54)=∅do 3: (u,v)←FIND-HEAVIEST(Ew[G]) 4: ifLEGAL(G,(u,v))then 5: G←MERGE(G,u,v) 6: else 7: Removeedge(u,v)fromEw 8: endif 9: endwhile 10: returnG 11: endfunction Fig. 7. The greedy merge algorithm that greedily merges the vertices connectedwiththeheaviestweightedgeinG. Fig.9. ApartitiongraphoftheunintrusivemergeofthegraphinFig.4. Fig.8. ApartitiongraphofthegreedymergeofthegraphinFig.4. weight and either remove it or merge over it. Note that FIND-HEAVIEST must search through Ew in each iteration since MERGE might change the weights. Thenumberofiterationsinthewhileloop(line2)is (E) O Fig.10. 
The complexity of finding the heaviest edge (line 3) is O(E), calling LEGAL is O(E+V), and calling MERGE is O(V²); thus the overall complexity is O(V²E).

Fig. 8 shows a greedy partition of the Python example. The partition cost is 58, which is a significant improvement over no merge. However, it is not the optimal partitioning, as we shall see later.

Fig. 10. A branch-and-bound search tree of the unintrusively merged partition graph (Fig. 9). Each vertex lists a sequence of vertex merges that builds a specific graph partition. The grayed-out area indicates the part of the search tree that a depth-first search can skip because of the cost bound.

D. Optimal Partition Algorithm

Because the WSP problem is NP-hard, we cannot in general hope to solve it exactly in polynomial time. However, we may be able to solve the problem within reasonable time in common cases, given a carefully chosen search strategy through the 2^E possible partitions. For this purpose, we have implemented a branch-and-bound algorithm, exploiting the monotonicity of the partition cost (Def. 6(2)). It is shown in Fig. 11 and proceeds as follows:

Before starting, the largest unintrusive partition is found. This is the largest partition that we can ensure is included in an optimal partition. The blocks of the unintrusive partition will be the vertices in our initial partition graph. Second, a good suboptimal solution is computed. We use the greedy algorithm for this purpose, but any scheme will do. We now start a search rooted in the ⊤-partition, where everything is one block. This has the lowest cost but will in general be illegal. Each recursion step cuts a weight edge that has not been considered before, if doing so yields a cost that is strictly lower than the currently best partition Gmin (if the cost is higher than that of Gmin, no further splitting will yield a better partition, and its search subtree can be ignored). If we reach a legal partition, this will be the new best candidate, and no further splitting of it will yield a better one. When the work queue is empty, Gmin holds an optimal solution to WSP.
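The search just described can be sketched as follows, under a deliberately simplified model: each weight edge is either merged over or cut, the cost of a candidate is the total weight of its cut edges, and a hypothetical `legal` predicate judges the set of merged edges. This is an illustrative sketch using depth-first recursion rather than the work queue of the paper's mask-based algorithm in Fig. 11.

```python
def optimal(weights, legal):
    """Branch-and-bound over subsets of cut edges. Cost only grows as
    more edges are cut, so a subtree whose cost already reaches the best
    known legal solution can be pruned."""
    edges = list(weights)
    best = {"cost": float("inf"), "cut": None}

    def search(cut, start):
        cost = sum(weights[e] for e in cut)
        if cost >= best["cost"]:
            return                   # bound: no better solution below here
        merged = [e for e in edges if e not in cut]
        if legal(merged):
            best["cost"], best["cut"] = cost, set(cut)
            return                   # further cuts can only raise the cost
        for i in range(start, len(edges)):
            search(cut | {edges[i]}, i + 1)   # branch: cut one more edge

    search(frozenset(), 0)
    return best["cost"], best["cut"]
```

For example, if merging over both of two edges is illegal but merging over either one alone is legal, the search cuts the cheaper edge and prunes the rest of the tree.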
Fig. 11 shows the implementation, Fig. 10 shows an example of a branch-and-bound search tree, and Fig. 12 shows an optimal partition of the Python example with a partition cost of 38.

 1: function MERGEBYMASK(G, M)
 2:   f ← true                          ▷ Flag that indicates fusibility
 3:   for i ← 0 to |Ew[G]| − 1 do
 4:     if M_i = 1 then
 5:       (u, v) ← the i'th edge in Ew[G]
 6:       if not FUSIBLE(G, u, v) then
 7:         f ← false
 8:       end if
 9:       G ← MERGE(G, u, v)
10:     end if
11:   end for
12:   return (G, f)
13: end function

 1: function OPTIMAL(G)
 2:   G ← UNINTRUSIVE(G)
 3:   for (u, v) ∈ Ew[G] do
 4:     if not LEGAL(G, (u, v)) then
 5:       Remove edge (u, v) from Ew
 6:     end if
 7:   end for
 8:   Gmin ← GREEDY(G)                  ▷ Good guess
 9:   M_{0..|Ew[G]|} ← 1                ▷ Fill array M
10:   o ← 0                             ▷ The mask offset
11:   Q ← ∅
12:   ENQUEUE(Q, (M, o))
13:   while Q ≠ ∅ do
14:     (M, o) ← DEQUEUE(Q)
15:     (G′, f) ← MERGEBYMASK(G, M)
16:     if cost(G′) < cost(Gmin) then
17:       if f and G′ is acyclic then
18:         Gmin ← G′                   ▷ New best partitioning
19:       end if
20:     end if
21:     for i ← o to |M| − 1 do
22:       M′ ← M
23:       M′_i ← 0
24:       ENQUEUE(Q, (M′, i+1))
25:     end for
26:   end while
27:   return Gmin
28: end function

Fig. 11. The optimal merge algorithm, which optimally merges the vertices in G. The function cost(G) returns the partition cost of the partition graph G.

E. Linear Merge

For completeness, we also implement a partition algorithm that does not use a graph representation. In this naïve approach, we simply go through the array operation list and add each array operation to the current partition block, unless the array operation makes the current block illegal, in which case we add the array operation to a new partition block, which then becomes the current one. The asymptotic complexity of this algorithm is O(n²), where n is the number of array operations. Fig. 13 shows the result of partitioning the Python example with a cost of 58.

Fig. 13. A partition graph of a Linear partition of the Python example (Fig. 3).

F. Merge Cache

In order to amortize the execution time of applying the merge algorithms, Bohrium implements a merge cache of previously found partitions of array operation lists. It is often the case that scientific applications use large calculation loops such that an iteration in the loop corresponds to a list of array operations. Since the loop contains many iterations, the cache can amortize the overall execution time.
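For illustration only, the linear scheme and the merge cache can be sketched together: a cache keyed by a fingerprint of the operation list falls back to a single-pass linear partitioner on a miss. All names here are hypothetical, and `block_legal` stands in for the paper's legality rules.

```python
def linear_merge(ops, block_legal):
    """Single pass over the operation list: extend the current block
    while it stays legal, otherwise start a new block. Re-checking the
    growing block on every append is what makes this O(n^2) worst case."""
    blocks, current = [], []
    for op in ops:
        if current and not block_legal(current + [op]):
            blocks.append(current)
            current = [op]
        else:
            current.append(op)
    if current:
        blocks.append(current)
    return blocks

class MergeCache:
    """Cache partitions keyed by the (hashable) operation list, so
    repeated loop iterations reuse the partition found the first time."""
    def __init__(self, partition_fn):
        self.partition_fn = partition_fn
        self.cache = {}
        self.hits = 0

    def partition(self, ops):
        key = tuple(ops)
        if key in self.cache:
            self.hits += 1
        else:
            self.cache[key] = self.partition_fn(list(ops))
        return self.cache[key]
```

A loop that emits the same operation list every iteration pays the partitioning cost once; every later iteration is a cache hit.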
VI. EVALUATION

In this section, we evaluate the different partition algorithms both theoretically and practically. We execute a range of scientific Python benchmarks, which are part of an open-source benchmark tool and suite named Benchpress³. Table I shows the specific benchmarks that we use, and Table II specifies the host machine. When reporting execution times, we use the mean of 10 identical executions, with error bars showing two standard deviations from the mean. We would like to point out that even though we are using benchmarks implemented in pure Python/NumPy, the performance is comparable to that of traditional high-performance languages such as C and Fortran. This is because Bohrium overloads NumPy array operations [7] in order to JIT compile and execute them in parallel seamlessly [12].

³Available at http://benchpress.bh107.org. For reproducibility, the exact version can be obtained from the source code repository at https://github.com/bh107/benchpress.git revision b6e9b83.

Fig. 12. A partition graph of the optimal merge of the graph in Fig. 4.
TABLE I
BENCHMARK APPLICATIONS

Benchmark | Input size (in 64-bit floats) | Iterations
Black Scholes | 1.5×10⁶ | 20
Game of Life | 10⁸ | 20
Heat Equation | 1.44×10⁸ | 20
Leibnitz PI | 10⁸ | 20
Gauss Elimination | 2800 | 2799
LU Factorization | 2800 | 2799
Monte Carlo PI | 10⁸ | 20
27 Point Stencil | 4.2875×10⁷ | 20
Shallow Water | 1.024×10⁷ | 20
Rosenbrock | 2×10⁸ | 20
Successive over-relaxation | 1.44×10⁸ | 20
NBody | 6000 | 20
NBody Nice | 40 planets, 2×10⁶ asteroids | 20
Lattice Boltzmann D3Q19 | 3.375×10⁶ | 20
Water-Ice Simulation | 6.4×10⁵ | 20

Fig. 14. Theoretical cost of the different partition algorithms. NB: the last five benchmarks, Lattice Boltzmann, NBody, NBody Nice, SOR, Water-Ice Simulation, do not show an optimal solution.

TABLE II
SYSTEM SPECIFICATIONS

Processor: Intel Core i7-3770
Clock: 3.4 GHz
#Cores: 4
Peak performance: 108.8 GFLOPS
L3 Cache: 16MB
Memory: 128GB DDR3
Operating system: Ubuntu Linux 14.04.2 LTS
Software: GCC v4.8.4, Python v2.7.6, NumPy 1.8.2

Theoretical Partition Cost: Fig. 14 shows the theoretical partition cost (Def. 13) of the four different partition algorithms previously presented.
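As an aside, the reporting scheme described in the evaluation setup, the mean of 10 identical runs with error bars of two standard deviations, can be sketched as follows (illustrative only; the function name is not from the paper):

```python
import statistics

def report(times):
    """Summarize repeated timings as (mean, error-bar half-width),
    where the error bar spans two sample standard deviations."""
    return statistics.mean(times), 2 * statistics.stdev(times)
```

For instance, `report([1.0, 2.0, 3.0])` yields a mean of 2.0 with an error bar of ±2.0.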
Please note that the last five benchmarks do not show an optimal solution. This is because their search trees are too large for our branch-and-bound algorithm to explore, even with the cost bound cutting away the vast majority of the search tree.

As expected, we observe that the three algorithms that do fusion, Linear, Greedy, and Optimal, have a significantly smaller cost than the non-fusing algorithm Singleton. The difference between Linear and Greedy is significant in some of the benchmarks, but the difference between Greedy and Optimal is almost nonexistent.

Fig. 15. Execution time of the different partition algorithms using a warm cache.

Practical Execution Time: In order to evaluate the full picture, we do three execution time measurements: one with a warm fuse cache, one with a cold fuse cache, and one with no fuse cache. Fig. 15 shows the execution time when using a warm fuse cache; thus we can compare the theoretical partition cost with the practical execution time without the overhead of running the partition algorithm. Looking at Fig. 14 and Fig. 15, it is evident that our cost model, which is a measurement of unique array accesses (Def. 13), compares well to the practical execution time in this specific benchmark setup. However, there are some outliers: the Monte Carlo Pi benchmark has a theoretical partition cost of 1 when using the Greedy and Optimal algorithms but has a significantly greater practical execution time. This is because the execution becomes computation bound rather than memory bound; thus a further reduction in memory accesses does not improve performance. Similarly, in the 27 Point Stencil benchmark, the theoretical partition cost is identical for Linear, Greedy, and Optimal, but in practice Optimal is marginally better. This is an artifact of our cost model, which defines the cost of reads and writes identically.

With the cold fuse cache, the partition algorithm runs once in the first iteration of the computation. The results show that 20 iterations, which most of the benchmarks use, are enough to amortize the partition overhead (Fig. 16). Whereas if we run with no fuse cache, i.e. we execute the partition algorithm in each iteration (Fig. 17), the Linear partition algorithm outperforms both the Greedy and Optimal algorithms because of its smaller time complexity.
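To make the cost-model discussion concrete, here is a hypothetical sketch in the spirit of Def. 13 (not the paper's exact definition): the cost of a partition is the number of unique array accesses per block, with reads and writes costed identically. That identical costing is why a compute-bound benchmark like Monte Carlo Pi is not captured by the model.

```python
def partition_cost(blocks):
    """blocks: list of blocks; each block is a list of operations, and
    each operation is the set of array names it reads or writes.
    The cost of a block is its number of unique array accesses."""
    return sum(len(set().union(*block)) for block in blocks)
```

Fusing the two singleton blocks in the second example below into one block would reduce the cost from 4 to 3, mirroring how fusion removes repeated array traversals.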
