[{"@context":"http:\/\/schema.org\/","@type":"BlogPosting","@id":"https:\/\/wiki.edu.vn\/en\/wiki19\/parallel-breadth-first-search-wikipedia\/#BlogPosting","mainEntityOfPage":"https:\/\/wiki.edu.vn\/en\/wiki19\/parallel-breadth-first-search-wikipedia\/","headline":"Parallel breadth-first search – Wikipedia","name":"Parallel breadth-first search – Wikipedia","description":"The breadth-first-search algorithm is a way to explore the vertices of a graph layer by layer. It is a basic","datePublished":"2019-03-05","dateModified":"2019-03-05","author":{"@type":"Person","@id":"https:\/\/wiki.edu.vn\/en\/wiki19\/author\/lordneo\/#Person","name":"lordneo","url":"https:\/\/wiki.edu.vn\/en\/wiki19\/author\/lordneo\/","image":{"@type":"ImageObject","@id":"https:\/\/secure.gravatar.com\/avatar\/c9645c498c9701c88b89b8537773dd7c?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/c9645c498c9701c88b89b8537773dd7c?s=96&d=mm&r=g","height":96,"width":96}},"publisher":{"@type":"Organization","name":"Enzyklop\u00e4die","logo":{"@type":"ImageObject","@id":"https:\/\/wiki.edu.vn\/wiki4\/wp-content\/uploads\/2023\/08\/download.jpg","url":"https:\/\/wiki.edu.vn\/wiki4\/wp-content\/uploads\/2023\/08\/download.jpg","width":600,"height":60}},"image":{"@type":"ImageObject","@id":"https:\/\/upload.wikimedia.org\/wikipedia\/en\/thumb\/b\/be\/PRAM-Model.png\/220px-PRAM-Model.png","url":"https:\/\/upload.wikimedia.org\/wikipedia\/en\/thumb\/b\/be\/PRAM-Model.png\/220px-PRAM-Model.png","height":"85","width":"220"},"url":"https:\/\/wiki.edu.vn\/en\/wiki19\/parallel-breadth-first-search-wikipedia\/","about":["Wiki"],"wordCount":6321,"articleBody":"The breadth-first-search algorithm is a way to explore the vertices of a graph layer by layer. It is a basic algorithm in graph theory which can be used as a part of other graph algorithms. For instance, BFS is used by Dinic’s algorithm to find maximum flow in a graph. Moreover, BFS is also one of the kernel algorithms in Graph500 benchmark, which is a benchmark for data-intensive supercomputing problems.[1] This article discusses the possibility of speeding up BFS through the use of parallel computing.Serial breadth-first search[edit]In the conventional sequential BFS algorithm, two data structures are created to store the frontier and the next frontier. The frontier contains all vertices that have the same distance (also called “level”) from the source vertex, these vertices need to be explored in BFS. Every neighbor of these vertices will be checked, some of these neighbors which are not explored yet will be discovered and put into the next frontier. At the beginning of the BFS algorithm, a given source vertex s is the only vertex in the frontier. All direct neighbors of s are visited in the first step, which form the next frontier. After each layer-traversal, the “next frontier” is switched to the frontier and new vertices will be stored in the new next frontier. 
The following pseudo-code outlines the same procedure, in which the data structures for the frontier and the next frontier are called FS and NS respectively.

 1  define bfs_sequential(graph(V,E), source s):
 2      for all v in V do
 3          d[v] = -1;
 4      d[s] = 0; level = 1; FS = {}; NS = {};
 5      push(s, FS);
 6      while FS != empty do
 7          for u in FS do
 8              for each neighbour v of u do
 9                  if d[v] = -1 then
10                      push(v, NS);
11                      d[v] = level;
12          FS = NS, NS = {}, level = level + 1;

First step of parallelization

As a simple and intuitive solution, the classic Parallel Random Access Machine (PRAM) approach is just an extension of the sequential algorithm shown above. The two for-loops (line 7 and line 8) can be executed in parallel. The update of the next frontier (line 10) and the increase of distance (line 11) need to be atomic. Atomic operations are program operations that either run entirely or not at all, and cannot be interrupted or paused.

However, there are two problems with this simple parallelization. First, the distance-checking (line 9) and distance-updating (line 11) operations introduce two benign races. The reason for the races is that a neighbor of one vertex in the frontier can also be the neighbor of another frontier vertex; as a result, the distance of this neighbor may be examined and updated more than once. Although these races waste resources and cause unnecessary overhead, with the help of synchronization they do not affect the correctness of BFS, so they are benign. Second, despite the speedup of each layer-traversal due to parallel processing, a barrier synchronization is needed after every layer in order to completely discover all neighbor vertices in the frontier. This layer-by-layer synchronization means that the number of required communication steps equals the longest distance from the source to any vertex, O(d), where d is the graph diameter.
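The following Python sketch shows where these pieces fit: the loop over the frontier runs in parallel, a lock plays the role of the atomic check-and-update of d[v], and waiting for the thread pool acts as the per-level barrier. The thread pool, the lock, and all names are my own illustrative choices; because of CPython's global interpreter lock and the single coarse lock, this sketch demonstrates the structure rather than an actual speedup.

from concurrent.futures import ThreadPoolExecutor
from threading import Lock

def bfs_parallel(adj, s, num_workers=4):
    d = {v: -1 for v in adj}
    d[s] = 0
    lock = Lock()

    def expand(u, level):
        discovered = []
        for v in adj[u]:
            with lock:              # atomic distance check-and-update
                if d[v] == -1:
                    d[v] = level
                    discovered.append(v)
        return discovered

    frontier, level = [s], 1
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while frontier:
            # map() returns only after every task has finished:
            # this is the barrier synchronization after each layer
            parts = pool.map(expand, frontier, [level] * len(frontier))
            frontier = [v for part in parts for v in part]
            level += 1
    return d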
This simple parallelization's asymptotic complexity is the same as that of the sequential algorithm in the worst case, but some optimizations can be made to achieve a better parallel BFS, for example:

- Mitigating barrier synchronization. Barrier synchronization is necessary after each layer-traversal to ensure the correctness of parallel BFS, so reducing the cost of barrier synchronization is an effective way to speed up parallel BFS.
- Load-balancing for neighbor discovery. Because there is a barrier synchronization after each layer-traversal, every processing entity must wait until the last of them finishes its work. Therefore, the parallel entity with the most neighbors determines the time consumed by that layer, and load-balancing can reduce the time of a layer-traversal.
- Improving the locality of memory references. In a parallel system with distributed memory, remote memory references fetch data from other processing entities, which usually carries extra communication cost compared to local memory references; local references are therefore faster than remote ones. Designing a better data structure or improving the organization of data can lead to more local memory references and reduce the communication needed for remote memory references.

Parallel BFS with shared memory

Compared to parallel BFS with distributed memory, shared memory provides higher memory bandwidth and lower latency. Because all processors share the memory, all of them have direct access to it. Thus, developers do not need to program the message-passing that distributed memory requires to obtain data from remote local memory, and the overhead of messages is avoided.[2] However, the number of vertices in each layer and the number of neighbors of each vertex have been shown to be highly irregular, which leads to highly irregular memory accesses and work distribution in BFS. In parallel BFS, this feature reduces the benefit of parallelization due to unbalanced load. As a result, it is very important to make parallel BFS on shared memory load-balanced. Moreover, exploiting data locality can also speed up the parallel process.

Many parallel BFS algorithms on shared memory can be divided into two types: container-centric approaches and vertex-centric approaches.[3] In the container-centric approach, two data structures are created to store the current frontier and the next frontier, and the next frontier is switched to the current frontier at the end of each step. There is a trade-off between the cost of synchronization and data locality depending on where the data is stored. The two data structures can be held in every processing entity (such as a thread), which supports data locality but requires extra load-balancing mechanisms. Alternatively, they can be global, providing implicit load balancing, where special data structures allow concurrent access by the processing entities. But then those processing entities work concurrently, and more effort is required for synchronization.

Besides, the data organization of the containers can be optimized. The typical data structure in serial BFS and some parallel BFS implementations is the FIFO queue, as it is simple and fast: insertion and deletion cost only constant time.

Another alternative is the bag structure.[4] An insertion into a bag takes O(log n) time in the worst case but only constant amortized time, as fast as a FIFO. Furthermore, the union of two bags takes Θ(lg n) time, where n is the number of elements in the smaller bag, and the bag-split operation also takes Θ(lg n) time. With the help of the bag structure, a certain number of vertices (according to a granularity parameter) are stored in one bag, and the bag becomes the basic parallel entity. Moreover, a reducer can be combined with the bag structure to write vertices in parallel and traverse them efficiently.
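The following compact Python sketch shows one way such a bag can be built from "pennants" (complete binary trees with an extra root node), with the spine of the bag maintained like the bits of a binary counter. It is a sequential toy version written from my reading of the cited bag structure; the class layout and all names are mine.

class Pennant:
    # A pennant of rank k holds exactly 2**k elements: a root whose
    # child ("left") is the root of a complete binary tree.
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def pennant_union(x, y):
    # Combine two pennants of equal rank k into one of rank k+1 in O(1).
    y.right = x.left
    x.left = y
    return x

def _full_adder(x, y, z):
    # Add up to three rank-k pennants; return (rank-k result, rank-(k+1)
    # carry), exactly like a binary full adder operating on bits.
    present = [p for p in (x, y, z) if p is not None]
    if not present:
        return None, None
    if len(present) == 1:
        return present[0], None
    if len(present) == 2:
        return None, pennant_union(present[0], present[1])
    return x, pennant_union(y, z)

class Bag:
    # spine[k] is either None or a pennant of rank k, so insertion works
    # like incrementing a binary counter: O(log n) worst case, constant
    # amortized time.  union() merges two bags in O(log n).
    def __init__(self, max_rank=64):
        self.spine = [None] * max_rank

    def insert(self, value):
        carry = Pennant(value)                 # new rank-0 pennant
        k = 0
        while self.spine[k] is not None:       # ripple carry, as in x + 1
            carry = pennant_union(self.spine[k], carry)
            self.spine[k] = None
            k += 1
        self.spine[k] = carry

    def union(self, other):
        carry = None
        for k in range(len(self.spine)):
            self.spine[k], carry = _full_adder(self.spine[k], other.spine[k], carry)

    def __iter__(self):
        def walk(node):
            if node is not None:
                yield node.value
                yield from walk(node.left)
                yield from walk(node.right)
        for pennant in self.spine:
            yield from walk(pennant)

For example, inserting vertices 0..4 into one bag and 5..7 into another and calling union on the first yields a bag over all eight vertices, and sorted(bag) recovers them in order.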
The vertex-centric approach treats each vertex as a parallel entity, which enables parallel iteration: every vertex is assigned to a parallel entity. This approach might only work well if the graph depth is very low, where graph depth in BFS is defined as the maximum distance of any vertex in the graph from the source vertex. Therefore, the vertex-centric approach is well-suited for GPUs if every thread is mapped to exactly one vertex.[3]

Parallel BFS with distributed memory

In the distributed memory model, each processing entity has its own memory. Because of this, processing entities must send and receive messages to each other to share their local data or to access remote data.

[Figure: a distributed memory model]

1-D partitioning

1-D partitioning is the simplest way to combine parallel BFS with distributed memory. It is based on partitioning the vertices. Load balancing is still an important issue in data partitioning, as it determines how much we can benefit from parallelization. In other words, each processor should be in charge of approximately the same number of vertices and their outgoing edges. For data storage, each processor can store an adjacency matrix of its local vertices, in which each row corresponds to a local vertex and holds its outgoing edges, represented by destination vertex indices.

Unlike shared-memory BFS, a neighbor vertex of one processor may be stored on another processor. As a result, each processor is responsible for telling those processors about the traversal status by sending them messages. Moreover, each processor must also process the messages from all other processors to construct its local next frontier. Obviously, one all-to-all communication (in which each entity has different messages for all others) is necessary in each step when exchanging the current frontier and the next frontier.

The following pseudo-code of a 1-D distributed-memory BFS[5] was originally designed for IBM BlueGene/L systems, which have a 3D torus network architecture. Because synchronization is the main extra cost of parallelized BFS, the authors of this paper also developed a scalable all-to-all communication based on point-to-point communications, and then reduced the number of point-to-point communications by taking advantage of the high-bandwidth torus network.

The main steps of the BFS traversal in the following algorithm are:

- processor view (line 8): construct the frontier FS from vertices in local storage
- global view (lines 10-11): terminate the traversal if the FS of every processor is empty
- processor view (line 13): construct the next frontier based on the neighbors of vertices in FS, even though some of these neighbors may be stored on other processors
- global view (lines 15-18): run an all-to-all communication to let each processor know which local vertices should be put into its local next frontier NS
- processor view (lines 20-22): receive messages from all other processors, update the distance values of the local vertices in the next frontier, and switch NS to FS

 1  define 1_D_distributed_memory_BFS( graph(V,E), source s):
 2      //normal initialization
 3      for all v in V do
 4          d[v] = -1;
 5      d[s] = 0; level = 0; FS = {}; NS = {};
 6      //begin BFS traversal
 7      while True do:
 8          FS = {the set of local vertices with level}
 9          //all vertices traversed
10          if FS = {} for all processors then:
11              terminate the while loop
12          //construct the NS based on local vertices in current frontier
13          NS = {neighbors of vertices in FS, both local and not local vertices}
14          //synchronization: all-to-all communication
15          for 0 <= j < p do
16              N_j = {vertices in NS owned by processor j}
17              send N_j to processor j
18              receive N_j_rcv from processor j
19          //combine the received messages into the local next frontier
20          NS = union of all N_j_rcv
21          for v in NS and d[v] = -1 do
22              d[v] = level + 1
23          level = level + 1
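As an illustration of this scheme, here is a hedged Python sketch of the level loop using mpi4py (an MPI binding for Python). The modulo ownership rule, the graph representation, and all names are my own assumptions rather than details from the cited paper; it would be launched with something like mpiexec -n 4 python bfs_1d.py.

from mpi4py import MPI

def bfs_1d(local_adj, source, comm=MPI.COMM_WORLD):
    # local_adj maps each locally owned vertex (an int) to its full
    # neighbour list; neighbours may be owned by other ranks.
    rank, p = comm.Get_rank(), comm.Get_size()
    owner = lambda v: v % p            # assumed 1-D vertex partition
    d = {v: -1 for v in local_adj}
    if owner(source) == rank:
        d[source] = 0
    level = 0
    while True:
        # line 8: local vertices in the current frontier
        fs = [v for v in local_adj if d[v] == level]
        # lines 10-11: global termination test
        if comm.allreduce(len(fs)) == 0:
            break
        # lines 13 and 15-16: bucket neighbours by owning processor
        outgoing = [[] for _ in range(p)]
        for u in fs:
            for v in local_adj[u]:
                outgoing[owner(v)].append(v)
        # lines 17-18: all-to-all exchange of frontier candidates
        incoming = comm.alltoall(outgoing)
        # lines 20-22: update distances of newly discovered local vertices
        for message in incoming:
            for v in message:
                if d[v] == -1:
                    d[v] = level + 1
        level += 1
    return d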