Parallel Searching
In general, when parallelizing a sequential algorithm we face at least the following main problems:

- We have to find some way of dividing the work that has to be done, so that it can be done in parallel. We want to keep all available processors as busy as possible, doing useful work at all times.
- There will be some wasted processor time related to the startup of the parallel algorithm; not all processors will be doing useful work right from the start. We would like to minimize this waste.
- There will be some waste related to splitting a problem up into sub-problems to be searched in parallel. We would like to minimize this waste too.
- There will be some waste related to combining the solutions to sub-problems solved in parallel into a solution for the original problem. Of course, we would like to minimize this waste as well.

In the following, we will address these problems in the specific case of parallelizing the alpha-beta algorithm with the modifications and heuristics described earlier.
6.1 Parallel Alpha-Beta
The alpha-beta algorithm is inherently sequential in nature; the tree is traversed in a certain order, and knowledge about bounds on the real value is propagated forward through the search. Still, opportunities for parallelism do exist. Many subtrees have to be searched even if we have near perfect move ordering, and searching different subtrees in parallel is one way to obtain parallelism: we split up the searching of all the legal moves of a node into several threads. There are problems with this approach, though, that prevent us from using it trivially. Some of these problems are equivalent to the general problems mentioned earlier, and some are specific to this domain:
- If a node contains a move generating a cutoff, the remaining moves would not get searched in the sequential alpha-beta algorithm. In the parallel algorithm, if such a node gets split, some or all of these remaining moves will be partially or fully searched. We want as few of these cases as possible, and we want to be able to stop the superfluous searches as quickly as possible when we detect them.
- Time spent splitting is not spent searching. If it takes a very long time to split, average search speed will be low; we want splitting to be as fast as possible.
- Some global data structures, most notably the hash table, have access issues that we have to deal with in the parallel case. If two threads write to the same hash table entry at the same time, the result can be an inconsistent entry.
- Even if splitting is fast, it will take time. We have to be careful not to split in cases where we could have searched the node in the same or a smaller amount of time without splitting. Not only does an unnecessary split waste time, it also ties up threads that could possibly be used better elsewhere.

We will address these points in the following sections.
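To make the last point concrete, the decision to split can be guarded by a cheap test before any state is copied. The following is a minimal sketch; the names are illustrative, and the 3-ply threshold anticipates the rule described later in this chapter.

```cpp
// Minimal sketch of a split guard; names and threshold are illustrative.
constexpr int kMinSplitDepth = 3;   // plies of remaining depth below which a split does not pay off

bool should_split(int remaining_depth, int idle_threads)
{
    if (remaining_depth < kMinSplitDepth)
        return false;               // the subtree is small enough to search sequentially
    return idle_threads > 0;        // splitting without helpers only adds overhead
}
```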
6.2 Alpha-Beta Revisited
In [13], Knuth & Moore analyzed the properties of randomly and perfectly ordered game trees. Since we can achieve near perfect move ordering in the case of chess, the properties of perfectly ordered trees are interesting to us. Knuth & Moore found that nodes can be divided into three categories:
Type 1
All the nodes on the principal variation (PV) are type 1 nodes. This means that the root node is a type 1 node. The first child of a type 1 node is also a type 1 node, because we assume a perfectly ordered tree. The rest of the children of a type 1 node are type 2 nodes. Because a type 1 node lies on the principal variation, no cutoffs can occur; all of the children of a type 1 node have to be searched.
Type 2
A type 2 node is a child of a type 1 node or a type 3 node. A type 2 node is a node where a cutoff occurs. Since we assume perfect move ordering, we only have to search one child of a type 2 node. The children of type 2 nodes are type 3 nodes.
Type 3
Type 3 nodes are children of type 2 nodes. No cutoffs will occur, so they require all of their children to be searched. Move ordering is irrelevant at these nodes, because all children have to be searched and none of them are good enough to raise alpha; they will all be searched with the same bounds.

In [16], nodes of type 1, 2 and 3 are called PV, CUT and ALL nodes respectively. These names describe the difference between the node types in a concise manner. The categories outline a basic strategy for doing alpha-beta search in parallel. Firstly, it is clear that type 2 (CUT) nodes are not suited for splitting: when the first child has been searched, a cutoff is produced, and any parallel searches of the rest of the children are wasted work. Secondly, type 3 nodes are excellent candidates for splitting: all their children have to be searched, and we can gain a lot by doing this in parallel. Thirdly, since all children of type 1 nodes must also be searched, these seem to be good candidates for splitting too. But they differ from type 3 nodes in an important way: in type 3 nodes, alpha and beta stay the same throughout the searching of the children, whereas in type 1 nodes the alpha and beta bounds have to be established by searching the first child before the rest of the children can be searched with these bounds. This means that the first child of a type 1 node should be searched sequentially; after this has been done and bounds have been established, we can split the node and search the rest of the children in parallel.

Of course, we are not searching perfectly ordered trees. But because of the good move ordering heuristics in general, and the hash table in particular, we will in most cases actually search the best move first in type 1 nodes. And, as mentioned earlier, in about 90% of the nodes where a cutoff occurs (type 2 nodes), it happens already after the first move, as in the perfect case. Nodes of type 3 will have to be searched thoroughly in any case.
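The splitting strategy above can be summarized in a small sketch. The enum and helper function below are illustrative only; in practice the type of a node can only be guessed before it has been searched.

```cpp
// Sketch of the three node types and when a node of each type may be split.
enum class NodeType { PV, CUT, ALL };   // type 1, type 2 and type 3 respectively

bool may_split(NodeType type, int children_already_searched)
{
    switch (type) {
    case NodeType::ALL: return true;                            // all children must be searched anyway
    case NodeType::PV:  return children_already_searched >= 1;  // wait for the first child to set the bounds
    case NodeType::CUT: return false;                           // the first child is expected to cut off
    }
    return false;
}
```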
6.3 Young Brothers Wait Concept
A basic algorithm that takes the above knowledge into account is called the Young Brothers Wait Concept, or YBWC [8]. It works by always searching the first child sequentially; it then searches the rest of the children in parallel. This fits type 1 nodes perfectly. It also fits type 2 (CUT) nodes perfectly, because there will be a cutoff after the first child has been searched. It does not fit type 3 nodes as perfectly: the first child will be searched sequentially when we could instead search all the children in parallel. This is of course not optimal, but it is a simple splitting strategy that works fairly well. We could try to improve on this by guessing the type of a node and acting accordingly, but it might not be easy to guess correctly often enough for it to be an advantage; if we guess wrong at a type 2 (CUT) node, we will search a lot of superfluous nodes.
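To illustrate the control flow of YBWC, here is a minimal, self-contained sketch over an abstract game tree. It is not the program's actual search function: the node representation, the use of std::async and the handling of the window are simplifications.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

struct Node {
    std::vector<Node> children;
    int leaf_value = 0;   // value of a leaf from the side to move's point of view
};

int ybwc_search(const Node& n, int alpha, int beta)
{
    if (n.children.empty())
        return n.leaf_value;

    // Young Brothers Wait: the eldest child is always searched sequentially,
    // establishing a real bound (and, at a CUT node, hopefully the cutoff).
    alpha = std::max(alpha, -ybwc_search(n.children[0], -beta, -alpha));
    if (alpha >= beta)
        return alpha;   // cutoff: the younger brothers are never searched

    // The younger brothers may now be searched in parallel with the tightened window.
    std::vector<std::future<int>> jobs;
    for (std::size_t i = 1; i < n.children.size(); ++i)
        jobs.push_back(std::async(std::launch::async,
            [&n, i, alpha, beta] { return -ybwc_search(n.children[i], -beta, -alpha); }));

    // Combine the results; a real engine would also share an improving alpha between
    // the siblings and signal the others to stop as soon as one of them fails high.
    for (auto& j : jobs)
        alpha = std::max(alpha, j.get());
    return alpha;
}
```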
6.4 Splitting With Bitboards
The time it takes to perform a split depends on various implementational and architectural details. A very important part of splitting, though, is the transfer of information about the node from the splitting thread to the idle threads that will help search it. The necessary information to transfer is the board state, the side to move, the en passant square, the castling rights, the half move clock (to be able to detect draws by the 50 move rule), and enough of the game history to enable us to detect repetitions. Memory transfers are never free, so we want to minimize the amount of information transferred. The bitboard implementation is in direct opposition here, because it takes up much more memory to represent the board state than the simple representation does. In the simple case, the board state consists of 64 squares, usually represented as bytes, taking a total of 64 bytes. In the bitboard implementation, we have this array too, and in addition we have the bitboards: 12 bitboards for the pieces, 2 occupied bitboards (one for each color), and 3 rotated occupied bitboards. In total: 17 bitboards of 8 bytes (64 bits) each. This means that we have to transfer 136 bytes more than with the simple representation. It is difficult to assess in advance whether bitboards will be an overall advantage or a drawback in a parallel chess program; splitting will take longer, but faster search and evaluation may balance this out. We will try to answer this question in the next chapter on experiments.
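As an illustration of what a split has to copy with bitboards, the following sketch shows a possible layout of the board state inside a split block. The field names and layout are illustrative; the point is the 136 extra bytes contributed by the 17 bitboards.

```cpp
#include <cstdint>

// Sketch of the board state copied to helper threads when splitting with bitboards.
struct SplitBlockState {
    std::uint8_t  board[64];        // square-centric board, 64 bytes (also used by the simple scheme)
    std::uint64_t piece_bb[12];     // one bitboard per piece type and color
    std::uint64_t occupied_bb[2];   // occupancy per color
    std::uint64_t rotated_bb[3];    // rotated occupancy boards for sliding-piece attacks
    std::uint8_t  side_to_move;
    std::uint8_t  en_passant_square;
    std::uint8_t  castling_rights;
    std::uint8_t  halfmove_clock;   // for the 50-move rule
    // ...plus enough of the game history (hash keys) to detect repetitions
};

// The 17 bitboards alone account for the 136 extra bytes compared to the simple
// 64-byte representation, which is why splitting is more expensive with bitboards.
static_assert(17 * sizeof(std::uint64_t) == 136, "17 bitboards of 8 bytes each");
```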
6.5 Global Data Structures
In the sequential case, the hash table and the data structures for the killer heuristic and the history heuristic are globally available to the search function at all times. In the parallel case, we have to decide what to do with the access issues that surface with these structures. Multiple simultaneous writes to them may have disastrous effects for the program.
6.5.1 The Hash Table
There are many different ways to deal with the hash table in the parallel case. One is to distribute the hash table among the different threads, so that each thread is responsible for a part of the table. Another idea is to keep the table global and then ensure mutually exclusive access from different threads with mutex locks. There are problems with these solutions, though. In the distributed case, the inter-thread communication needed when one thread probes an entry that belongs to another thread is slow. The mutex solution also introduces a performance penalty, because mutex locks are slow on most architectures.

A recent idea [11], and the idea we use, is to keep the hash table global in shared memory, ignore the multiple simultaneous writes issue, and instead do a cheap but efficient form of error detection on the hash table entries. Each hash table entry is 128 bits in size; we can think of this as consisting of two 64-bit parts. The first part is the hash key of the node about which information is stored, and the second part is the actual information about the node. When storing a new entry, instead of just storing both parts normally, we store only the second 64-bit part normally; the first 64-bit part is XORed with the second 64-bit part before storing. When the table is later probed and this entry is found, we can check the consistency of the entry: the second part is read normally, while the first part is read and then XORed with the second part. This way, if the entry is consistent, we get the original first part back, because XORing with the same value twice is the same as doing nothing. The original first part was the hash key of the position about which information is stored in the entry. If the hash key of the position that is probing the hash table matches the hash key recovered from the entry, we have a hash hit and the information in the entry can be used. If it does not match, it can either be because the information was stored from a different position hashing to the same slot, or because the entry has been corrupted by simultaneous writes. In either case, we have a hash miss and we do not use the stored information.

A potential problem could be that too many entries are corrupted; while we do not use corrupted entries thanks to the error detection, the number of useful entries could become so low that the hash table no longer fulfills its purpose. On the other hand, corrupted entries will sooner or later be overwritten with consistent entries, so the problem may not be serious. We will return to this question in the next chapter.
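The scheme can be summarized in a few lines of code. The entry layout below is a sketch of the idea from [11], not the exact layout used in the program.

```cpp
#include <cstdint>

// Lockless hashing sketch: the key half is stored XORed with the data half, so a
// torn write from two simultaneous stores shows up as a key mismatch on probe.
struct HashEntry {
    std::uint64_t key_xor_data;   // first 64-bit part: key ^ data
    std::uint64_t data;           // second 64-bit part: packed score, depth, bound, move
};

void store(HashEntry& e, std::uint64_t key, std::uint64_t data)
{
    e.data         = data;        // stored normally
    e.key_xor_data = key ^ data;  // the key is XORed with the data before storing
}

bool probe(const HashEntry& e, std::uint64_t key, std::uint64_t& data_out)
{
    // XORing with the data again recovers the original key, unless the entry was
    // corrupted by simultaneous writes; then the keys fail to match and the probe
    // is treated as an ordinary hash miss.
    if ((e.key_xor_data ^ e.data) != key)
        return false;
    data_out = e.data;
    return true;
}
```

The attraction of the scheme is that stores and probes remain plain memory operations; the only extra cost is the two XORs, and a corrupted entry simply behaves like a hash miss.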
6.5.2 Killers And History
The data structures used by the killer heuristic and the history heuristic face the same multiple simultaneous writes issue as the hash table. The difference is that while a corrupted hash table entry could make the search return an untrue value, or even crash the program if it attempts to perform a nonsensical best move from the hash table, the killer and history data are only used for move ordering purposes. The worst that can happen is bad move ordering. On the other hand, this can be bad enough; we do not want alpha-beta (or PVS) to degenerate into plain minimax. What we try is this:
The killer heuristic data structures are kept local to each thread. This means that each thread only writes to its own data structures, but also that we do not get the benefit of results from other threads. We could copy the killer heuristic data when splitting, but this would mean even more splitting time overhead, so we do not do it. In fact, we do not even reset the killer data structures in the helper threads. This means that they will contain bogus move ordering information from the last time the thread was active, in another part of the tree. We do this because limited testing has indicated that it does not affect move ordering in any serious way, and we would rather not waste time on this if it is not necessary; the killer data structures of these threads will quickly be filled with useful information as search progresses. The history heuristic data structures are kept globally, with the risk of potentially corrupt entries. This could potentially affect move ordering, but the history heuristic is the least important move ordering heuristic, and limited testing has indicated that this solution does not affect move ordering in any serious way either. We will return to these questions of move ordering in the next chapter.
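A possible layout of these move ordering tables in the parallel program is sketched below; the sizes, names and use of thread_local storage are illustrative.

```cpp
#include <cstdint>

// Two killer moves per ply, private to each thread; not reset on a split, so stale
// entries from earlier work are simply overwritten as the new search progresses.
struct KillerSlots { std::uint32_t killer[64][2]; };
thread_local KillerSlots killers;

// Shared from-square/to-square history counters, updated by all threads without
// locking; a corrupted counter only hurts move ordering, never correctness.
std::uint32_t history_table[64][64];
```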
6.6 Design
The overall algorithmic design of our parallel implementation is YBWC, but there are still many design decisions to make at the more detailed level. In this chapter we will describe our parallel design in more detail, without bothering too much with implementation details.
6.6.1 Inter-thread Relationship
The YBWC idea can be used in a variety of different ways; the parallel model could be message passing, pipes, master/slave and so on. In our case we use interacting peers: all threads can split a node and deploy other threads to help out. Of course, a main thread sets up the data structures and creates the other threads at the start of the program, and this main thread is always the one starting the search at the root. But during search, all threads are equal. This avoids centralized bottlenecks in the parallel infrastructure.
6.6.2 Parallel Operation
The important parallel operations of the program are as follows. When the program starts, data structures are initialized and all threads are started. All threads except the main thread go directly to an idle loop where they wait for work. The main thread eventually calls the alpha-beta search.

We then follow the YBWC idea at every node. The first child is searched sequentially; when the first child has been searched, it is time to split. To avoid wasting too much time splitting instead of searching, we only split if the remaining depth of the subtree is 3 plies or more. Even with extensions and quiescence search, it is still faster to search subtrees with a remaining depth of two plies or less sequentially. Splitting is done by copying all necessary information about the node into so-called split blocks. Each idle thread gets a split block to work on; the splitting thread marks itself idle when splitting, so it gets a split block too.

A thread that has been assigned a split block calls a special search function that does essentially what the sequential search would do at a normal node: it repeatedly searches a child. The difference is that instead of generating legal moves, it asks the original splitting node for a move to search; it then just gets the next move in the list. All threads participating in the split do this until there are no more moves left. Searching a child is done by performing the move and then calling the regular search function. From here on, the thread acts exactly the same way as the main thread described above. Most importantly, it can itself split at some point if there are idle threads; the helping threads can split too, and so on recursively. When a thread requests a new move to search and finds that there are none left, it returns its findings to the splitting node. The splitting node remembers the best move (and its value) found so far by any helper thread. In this way, all the children of the node get searched in parallel, and the best move and value are found. The helping thread then returns to the idle loop, waiting for new work. Note that the subtrees of the different children can vary greatly in size, so one helper thread can finish long before another. Since it goes back into the idle loop, it can help out other helper threads that are not yet finished searching, when these split.

Finally, we have to deal with the unfortunate case of finding a cutoff in a helper thread. Ideally, we should never find a cutoff in a node that has been split, but when it happens we want to stop the other helper threads as quickly as possible. We do this by signaling stop to all the other helper threads before returning. The helper threads may have split at some point too, so they tell their own helper threads to stop, and so on recursively.
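The hand-out of moves from a split block and the stop signaling can be sketched as follows. All names are illustrative and the actual recursive search call is elided; the sketch only shows the bookkeeping described above.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

struct Move { int from = 0, to = 0; };

struct SplitBlock {
    std::vector<Move> moves;        // the remaining (younger-brother) moves of the split node
    std::size_t       next = 0;     // index of the next move to hand out
    int               best_value = -32000;
    Move              best_move;
    std::mutex        lock;         // protects move hand-out and the best result
    std::atomic<bool> stop{false};  // set when a cutoff makes further work superfluous

    // Called by every thread working on this split; returns false when no moves remain.
    bool next_move(Move& out) {
        std::lock_guard<std::mutex> guard(lock);
        if (next >= moves.size()) return false;
        out = moves[next++];
        return true;
    }

    // The splitting node remembers the best move and value found by any helper.
    void report(const Move& m, int value) {
        std::lock_guard<std::mutex> guard(lock);
        if (value > best_value) { best_value = value; best_move = m; }
    }
};

// Body of a thread (the splitting thread itself or a helper) working on a split block.
void search_split_point(SplitBlock& sb)
{
    Move m;
    while (!sb.stop.load() && sb.next_move(m)) {
        int value = 0;  // in the real program: make the move and call the regular
                        // (recursive) search function, which may itself split again
        sb.report(m, value);
        // If the value causes a cutoff at the split node, the real program sets
        // sb.stop and recursively tells the helpers' own helpers to stop as well.
    }
}
```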