types of data will need to be sent and received. Support and mechanisms vary based on the MPI implementation, but most fundamental data types such as integers, doubles, characters, and Booleans are incorporated into the MPI implementation. While this does simplify some of the messages that need to be sent and received in the MPI approaches of attack and compliance graph generation, it does not cover the vast majority of them when using RAGE.

RAGE implements many custom classes and structs that are used throughout the generation process. Qualities, topologies, network states, and exploits are a few such examples. Rather than breaking each of these classes down into fundamental MPI types, serialization was used to handle most of this. RAGE already incorporates Boost graph libraries for auxiliary support, and this work extended this further to utilize the serialization libraries also provided by Boost. These libraries also include support for serializing all STL classes, and many of the RAGE classes have members that make use of the STL classes. One additional advantage of the Boost library approach is that many of the RAGE classes are nested. For example, the NetworkState class has a member vector of Quality classes, and the Quality class has a Keyvalue class as a member. When serializing the NetworkState class, Boost will recursively serialize all members, including the custom class members, assuming they also have serialization functions.

When using the serialization libraries, this work opted to use the intrusive route, where the class instances are altered directly. This was preferable to the non-intrusive approach, since the class instances were able to be altered with relative ease, and many of the classes did not expose enough information for the non-intrusive approach to be viable.

\TUsubsection{Data Consistency}

\TUsection{Tasking Approach} \label{sec:Tasking-Approach}
\TUsubsection{Introduction to the Tasking Approach}
The high-level overview of the attack and compliance graph generation process can be broken down into six main tasks. These tasks are described in Figure \ref{fig:tasks}. Prior works, such as those by the authors of \cite{li_concurrency_2019}, \cite{9150145}, and \cite{7087377}, parallelize the graph generation using OpenMP, CUDA, and hyper-graph partitioning. This approach, however, utilizes Message Passing Interface (MPI) to perform distributed attack and compliance graph generation.
\begin{figure}[htp]
\includegraphics[width=\linewidth]{"./Chapter5_img/horiz_task.drawio.png"}
\vspace{.2truein} \centerline{}
\caption{Task Overview of the Attack and Compliance Graph Generation Process}
\label{fig:tasks}
\end{figure}

\TUsubsection{Algorithm Design}
The design of the tasking approach is to leverage a pipeline structure built from the six tasks and MPI nodes. After completing its work, each stage of the pipeline passes the necessary data to the next stage through various MPI messages, where the next stage's nodes receive the data and execute their tasks. The pipeline is considered fully saturated when each task has a node dedicated solely to executing work for that task. When there are fewer nodes than tasks, some nodes will process multiple tasks. When there are more nodes than tasks, additional nodes are assigned to Tasks 1 and 2: timings collected in the serial approach for various networks showed that Tasks 1 and 2 require the most time, with larger network sizes requiring vastly more time in these two tasks. Node allocation can be seen in Figure \ref{fig:node-alloc}.
For determining which tasks should be handled by the root node, two main considerations were made: minimizing communication cost and avoiding unnecessary complexity. In the serial approach, the frontier queue was the primary data structure for the majority of the generation process. Rather than using a distributed queue or passing multiple sub-queues between nodes, the minimum-cost option is to pass states individually. This approach also reduces complexity: managing multiple frontier queues would require duplication checks, multiple nodes requesting data from and storing data into the database, and a strategy to maintain proper queue ordering, all of which would also increase the communication cost. As a result, the root node is dedicated to Tasks 0 and 3.

\begin{figure}[htp]
\includegraphics[width=\linewidth]{"./Chapter5_img/node-alloc.png"}
\caption{Node Allocation for the Tasking Approach}
\label{fig:node-alloc}
\end{figure}

\TUsubsubsection{Communication Structure}
The underlying communication structure for the tasking approach relies on a pseudo-ring structure. As seen in Figure \ref{fig:node-alloc}, nodes n$_2$, n$_3$, and n$_4$ are derived from the previous task's greatest node rank. To keep the development abstract, a custom send function checks the world size before sending. If the rank of the node that would receive a message is greater than the world size, and therefore does not exist, the rank is ``looped around'' and corrected to fit within the world size constraints. After the rank correction, the MPI Send function is invoked with the proper node rank.
\TUsubsubsection{Task 0}
Task 0 is performed by the root node and is a conditional task; it is not guaranteed to be executed at each pipeline iteration. Task 0 is only executed when the frontier is empty but the database still holds unexplored states. This occurs when there are memory constraints and database storage is performed during execution to offload the demand, as discussed in Section \ref{sec:db-stor}. After Task 0 completes, a state is popped from the frontier, and the root node sends that state to n$_1$. If the frontier is still empty, the root node sends the finalize signal to all nodes.
\TUsubsubsection{Task 1}
Task 1 begins by distributing the workload between nodes based on the local task communicator.

\begin{figure}[htp]
\caption{Task 1 Data Distribution}
\label{fig:Task1-Data-Dist}
\end{figure}

Once the computation work of Task 1 is completed, each node must send its compiled applicable exploit list to Task 2. Rather than merging all lists and splitting them back out in Task 2, each node in Task 1 sends an applicable exploit list to at most one node allocated to Task 2. Based on the allocation of nodes seen in Figure \ref{fig:node-alloc}, there are two potential cases: the number of nodes allocated to Task 1 is equal to the number of nodes allocated to Task 2, or the number of nodes allocated to Task 1 is one greater than the number of nodes allocated to Task 2. In the first case, each node in Task 1 sends the applicable exploit list to its global rank+n$_1$. This case can be seen in Figure \ref{fig:Task1-Case1}. In the second case, since there are more nodes allocated to Task 1 than Task 2, node n$_1$ scatters its partial applicable exploit list in the local Task 1 communicator, and all other Task 1 nodes follow the same pattern seen in the first case. This second case can be seen in Figure \ref{fig:Task1-Case2}.

\begin{figure}[htp]
\includegraphics[width=\linewidth]{"./Chapter5_img/Task1-Case1.png"}
\caption{Task 1 to Task 2 Communication: Case 1}
\label{fig:Task1-Case1}
\end{figure}

\begin{figure}[htp]
\includegraphics[width=\linewidth]{"./Chapter5_img/Task1-Case2.png"}
\caption{Task 1 to Task 2 Communication: Case 2}
\label{fig:Task1-Case2}
\end{figure}

\TUsubsubsection{Task 2}
Each node in Task 2 iterates through the received partial applicable exploit list and creates new states with edges to the current state. However, Synchronous Firing work is performed during this task, and syncing multiple exploits that could be distributed across multiple nodes leads to additional overhead and complexity. To prevent these difficulties, each node checks its partial applicable exploit list for exploits that are part of a group, removes these exploits from its list, and sends the exploits belonging to a group to the Task 2 local communicator root. Since the Task 2 local root now holds all group exploits, it can execute the Synchronous Firing work without additional communication or synchronization with the other MPI nodes in the Task 2 stage. Other than the additional setup steps required for Synchronous Firing by the local root, the work performed during this task by all MPI nodes is that seen in the Synchronous Firing figure (Figure \ref{fig:sync-fire}).
\TUsubsubsection{Task 3}
Task 3 is performed only by the root node, and no division of work is necessary. The root node continuously checks for new states until the Task 2 finalize signal is detected. This task consists of setting the new state's ID, adding it to the frontier, adding its information to the instance, and inserting information into the hash map. When the root node has processed all states and has received the Task 2 finalize signal, it completes Task 3 by sending the instance and/or frontier to Tasks 4 and/or 5, respectively, if applicable, and then proceeds to Task 0.
\TUsubsubsection{Task 4 and Task 5} \label{sec:T4T5}
Intermediate database operations, though infrequent (and possibly never occurring for small graphs), are lengthy when they do occur. As discussed in Section \ref{sec:db-stor}, the two main memory consumers are the frontier and the instance, both of which reside in the root node's memory. Since the database storage requests are blocking, the pipeline would halt for a lengthy period while waiting for the root node to finish potentially two large storage operations. Tasks 4 and 5 alleviate the stall by executing independently of the regular pipeline execution flow. Since Tasks 4 and 5 do not send any data, no other tasks must wait for these tasks to complete. The root node can then asynchronously send the frontier and instance to the appropriate nodes as needed, clear its memory, and continue execution without delay. Initial testing determined that the communication cost of asynchronously sending the data for Tasks 4 and 5 is less than the time required by a database storage operation if performed by the root node.
\TUsubsubsection{MPI Tags} \label{sec:tasking-tag}
To ensure that the intended message is received by each node, the MPI message envelopes have their tag fields specified. When a node sends a message, it specifies a tag that corresponds to the data and the intent for which it is sent. The tag values were arbitrarily chosen, and tags can be added to the existing list or removed as desired. When receiving a message, a node can specify to only look for messages whose envelopes have a matching tag field. Not only do tags ensure that nodes receive the correct messages, they also reduce the complexity of the program design. Table \ref{table:tasking-tag} displays the list of tags used for the MPI Tasking approach.
\begin{table}[htp]
\caption{MPI Tags Used in the Tasking Approach}
\label{table:tasking-tag}
\end{table}

\TUsubsection{Performance Expectations and Use Cases} \label{sec:Task-perf-expec}
Due to the amount of communication between nodes to distribute the necessary data through all stages of the tasking pipeline, this approach is not expected to outperform the serial approach in all cases. The tasking approach was specifically designed to reduce the computation time when the generation of each individual state grows more expensive. This approach does not offer any guarantee of processing through the frontier at an increased rate; its main objective is to distribute the workload of individual state generation. As discussed in Section \ref{sec:Intro}, the number of entries in the National Vulnerability Database, plus any custom vulnerability testing needed to ensure adequate examination of all assets in the network, sums to a large number of exploits in the exploit list. Likewise, for compliance graphs and compliance examinations, Section \ref{sec:CG-diff} discussed that the checks for SOX, HIPAA, GDPR, PCI DSS, and/or any other regulatory compliance also sum to a large number of compliance checks in the exploit list. Since the generation of each state is largely dependent on the number of exploits present in the exploit list, this approach is best suited for when the exploit list grows in size. As will be discussed later, it is also hypothesized that this approach is well suited to runs where many database operations occur.

\TUsubsection{Results}
A series of tests were conducted on the platform described at the beginning of Section \ref{sec:test-platform}, and results were collected regarding the effect of the MPI Tasking approach on increasing sizes of exploit lists for a varying number of nodes. The exploit list initially began with 6 items, and each test scaled the number of exploits by a factor of 2, with the final test using an exploit list of 49,152 entries. If all of the items in these exploit lists were applicable, the runtime would be too great for feasible testing due to the state space explosion. To prevent state space explosion but still gather valid results, each exploit list in the tests contained 6 exploits that could be applicable, and all remaining exploits were not applicable. The not-applicable exploits were created in a fashion similar to that seen in Figure \ref{fig:NA-exp}. By creating a multitude of not-applicable exploits, testing can safely be conducted by ensuring state space explosion does not occur while still observing the effectiveness of the tasking approach.
The results of the Tasking Approach can be seen in Figure \ref{fig:Spd-Eff-Task}. In terms of speedup, when the number of entries in the exploit list is small, the serial approach has better performance. This is expected, since the communication cost requires more time than generating a state does, as discussed in Section \ref{sec:Task-perf-expec}. However, as the number of items in the exploit list increases, the Tasking Approach quickly begins to outperform the serial approach. It is notable that even when the tasking pipeline is not fully saturated (when there are fewer compute nodes assigned than tasks), the performance is still approximately equal to that of the serial approach. The other noticeable feature is that as more compute nodes are assigned, the speedup continues to increase.

\begin{figure}[htp]
\caption{Speedup and Efficiency of the Tasking Approach}
\label{fig:Spd-Eff-Task}
\end{figure}

In terms of efficiency, 2 compute nodes offer the greatest value since the speedup \ldots

\TUsection{Subgraphing Approach}
\TUsubsection{Introduction to the Subgraphing Approach}
As opposed to the Tasking Approach described in Section \ref{sec:Tasking-Approach}, this section introduces the Subgraphing Approach as a means of reducing runtime through frontier distribution via subgraphing. Section \ref{sec:db-stor} discusses that the frontier is expanded at a rate faster than it can be processed. This approach attempts to distribute the frontier by assigning MPI nodes a starting state; each node generates a subgraph up to a designated depth limit and then returns its generated subgraph to a merging process. The author of \cite{li_concurrency_2019} presented an alternative method of frontier processing by utilizing OpenMP on a shared-memory system to assign each thread an individual state to explore, which would then pass through a critical section. This approach is instead intended for a distributed system, and additionally differs in that each node explores multiple states to form a subgraph rather than exploring one individual state. This approach was implemented in two versions, and both collected results to draw conclusions regarding speedup, efficiency, and scalability for attack graph and compliance graph generation.

\TUsubsection{Algorithm Design}
The design of the subgraphing approach consists of three main components: worker nodes, the root node, and a database node. The original design did not include a database node, but testing through the implementation of the tasking approach, discussed in Section \ref{sec:T4T5}, led to the inclusion of a dedicated database node. The overall design is predicated on the root node distributing data to all worker nodes and merging the worker nodes' subgraphs. Figure \ref{fig:subg} displays an example graph that utilizes three worker nodes with a depth limit of 3, where each worker node corresponds to a different state color in the figure. Each worker node explored a varying number of states, but did not proceed to explore a depth that exceeded the specified depth limit of 3. The final cluster of four states at the bottom of the graph represents one of the three worker nodes exploring the final states, while the other two nodes wait for further instruction. The following three subsections describe each component in further detail.

\begin{figure}[htp]
\centering
\caption{Example Subgraph Generation with Three Worker Nodes and a Depth Limit of 3}
\label{fig:subg}
\end{figure}

\TUsubsubsection{Worker Nodes}
Each worker node starts each iteration of its loop by checking for the finalize message or for a new state to receive. If no message is available, it continues to wait until one is available. Each worker node continues this process until the finalize message is received, at which point it exits the generation process. When a worker node receives a new state, it follows the original graph generation process with a modification to the exploration approach. Since other nodes will also be generating subgraphs, using a breadth-first search is not ideal, as the amount of duplicate work would increase. Instead, the graph generation utilizes a depth-first search approach, where each node explores up to a specified depth level. This depth level is specified by the user before the generation process begins, so additional tuning can be performed as desired. After either the depth limit is reached or no other states are available to be explored, the node returns its local frontier (if the node still has unexplored states it did not explore) and its generated subgraph. The node then proceeds to wait for a new message from the root node.
\TUsubsubsection{Root Node}
|
||||
The root node is responsible for two main portions of the subgraphing approach - the distribution of work and the merging of results. The subgraphing approach begins by distributing the only state that is known at the beginning of the generation process to a single node. Once the node returns, two functions may occur. If the node indicates that there are still more states to explore, then the worker node's frontier gets merged with the root node's frontier. If the node discovered new states, then its subgraph gets merged with the root node's graph. This process occurs as long as there are more states to explore. The frontier merging process takes the worker node's frontier and emplaces it to the back of the root node's frontier, with an additional marker. Since each worker node follows a depth-first search, if two worker nodes pop from the front of the queue, there is a high likelihood that the worker nodes are exploring the same segment of the graph, resulting in more duplicate work. To prevent this, each worker node attempts to explore the same segment of the graph throughout the generation process. When the frontiers are merged, the root node includes an additional marker for each worker node to indicate where the end of a node's frontier is in relation to the overall root frontier. When distributing work to a worker node, the root node will pull a state from the queue at a position equal to the node marker. If no marker is found for a worker node and the queue is not empty, then a random state is pulled from the queue instead. Figure \ref{fig:front-merg} displays the frontier merging process along with the data distribution.
@ -210,7 +209,7 @@ Similar to Section \ref{sec:tasking-tag} that discussed the usage of MPI Tags fo
\end{table}
\TUsubsection{Performance Expectations and Use Cases} \label{sec:perf_expec_subg}
This approach is intended to reduce runtime when the frontier grows at a rate faster than it can be explored. However, since this approach is designed for distributed systems, there is no guarantee that speedup can be achieved when duplicate work is performed. Not only do the worker nodes waste time when duplicate work occurs, but duplicate work also increases the communication between nodes needed to adequately explore the graph, and leads to an increased number of merging calls by the root node. The ideal scenario for the subgraphing approach is when the graph is sparse and aligns more closely with an N-ary tree structure, where each node has only one parent. When the graph is sparse, there is a lower likelihood of duplicate work occurring, since worker nodes have a lower probability of exploring a graph state that connects to a different graph state that has been (or will be) explored by another worker node. If each graph state could have only one parent, there would be a lower likelihood that worker nodes explore the same graph cluster.
\TUsubsection{Results}
A series of tests were conducted on the platform described at the beginning of Section \ref{sec:test-platform}, and results were collected regarding the effect of the MPI Subgraphing approach on the 4 example networks discussed in Section \ref{sec:Sync-Test}. All tests used synchronous firing. The initial results are seen in Figure \ref{fig:Subg_base}, which displays the results in terms of their runtimes. Only the serial runtime from the 2 Service test is displayed for conciseness. The results in terms of speedup and efficiency are seen in Figure \ref{fig:Subg_SE}.
@ -223,13 +222,14 @@ A series of tests were conducted on the platform described at the beginning of S
\end{figure}
\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./Chapter5_img/no_DHT_Spd.png"}
\includegraphics[width=\linewidth]{"./Chapter5_img/no_DHT_eff.png"}
\caption{First iteration results of MPI Subgraphing in terms of Speedup and Efficiency}
\label{fig:Subg_SE}
\end{figure}
As noted from the figures, the performance of this approach is quite poor. The serial approach has greater performance in all cases, and the resulting speedups for all 4 service tests are below 1.0 when using the subgraphing approach. Likewise, the efficiency continues to worsen as more compute nodes are added to the system. There are a few reasons for this poor performance. Figure \ref{fig:subg} illustrates an example graph that is considered favorable to this approach in that branches are relatively distinct and the graph is not fully connected. As a result, in this example graph each compute node works on independent graph structures that do not overlap, and all work is distinct. This graph can quickly lead to issues through a few modifications. Figure \ref{fig:subg_mod} utilizes the same example graph from Figure \ref{fig:subg}, but adds two edges between the outside branches. With this arrangement, the 1st and 3rd compute nodes will perform work that overlaps with the work performed by the 2nd compute node. Both compute node 1 and compute node 3 will explore State 5, and depending on the depth limit, they will continue to explore State 5's children, leading to almost all of compute node 2's work being duplicated twice. This duplicate work occurs at an alarming rate in the service tests that were performed. Figure \ref{fig:subg_dup} illustrates the extraordinarily large amount of duplicate work occurring in the testing, which substantially increases the runtime of this approach. The duplicate work, as discussed in Section \ref{sec:perf_expec_subg}, not only wastes the compute nodes' time and leads to a longer exploration process, but also requires the root node to perform more merging and cleanup work.
When using mpiP (a lightweight MPI profiler provided by Lawrence Livermore National Laboratory) \cite{lawrence_livermore_national_laboratory_mpip_nodate}, it was measured that this extra merging and cleanup work performed by the root causes additional delays in distributing data; in the 1 service test, the compute nodes spent a combined 35\% of the total application runtime just waiting to receive more data from the root node.
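The overlap described above can be reproduced with a toy adjacency loosely modeled on Figure \ref{fig:subg_mod} (the state numbers here are hypothetical): workers starting at States 1 and 3 both reach State 5 through the added edges and re-explore everything reachable from it, duplicating the work of the worker that starts at State 2.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using State = int;
using Graph = std::map<State, std::vector<State>>;

// Plain DFS collecting every state a worker would visit when there is no
// depth limit and no coordination between workers.
std::set<State> visit_all(const Graph& g, State start) {
    std::set<State> seen;
    std::vector<State> stack{start};
    while (!stack.empty()) {
        State s = stack.back();
        stack.pop_back();
        if (!seen.insert(s).second) continue;
        auto it = g.find(s);
        if (it == g.end()) continue;
        for (State c : it->second) stack.push_back(c);
    }
    return seen;
}
```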
\begin{figure}[htp]
\includegraphics[width=\linewidth]{"./Chapter5_img/dup.drawio.png"}
@ -245,7 +245,7 @@ As noted from the Figures, the performance from this approach appears quite poor
\label{fig:subg_dup}
\end{figure}
To minimize the duplicate work performed, a second approach using a distributed hash table (DHT) was attempted. With a DHT, each compute node would check to ensure that it was not duplicating work. This would limit the work needed by the root node, but each worker node would need to search the DHT. Using a DHT would increase the communication overhead, but if that overhead were less than the time taken for duplicate work, or minimal enough to still process the frontier at a greater rate than the serial approach, then the distributed hash table would be considered advantageous. Rather than devising a unique strategy for a distributed hash table, this work made use of the Berkeley Container Library (BCL), which is open-source and provides distributed data structures with easy-to-use interfaces. Since BCL is a header-only library, it allowed for minimal code alterations and primarily just needed to be dropped into the system. Testing was repeated with an identical setup to the approach without a DHT. The results in terms of speedup and efficiency are seen in Figure \ref{fig:subg_DHT_Spd}. Results in terms of runtime between the DHT approach and the base approach are seen in Figure \ref{fig:subg_DHT_base}.
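The claim-before-explore check can be sketched with a process-local stand-in for the DHT; BCL's actual distributed interface is not reproduced here, and a mutex-guarded set simply plays its role within one process. A worker would call a check like this before expanding a state and skip the state when another worker already owns it.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_set>

using State = int;

// Process-local stand-in for the distributed hash table: the mutex-guarded
// set plays the role BCL's distributed structures play across compute nodes.
struct ClaimTable {
    std::unordered_set<State> claimed;
    std::mutex m;

    // Returns true if this worker is the first to claim `s` and may explore
    // it; false means another worker already owns it, so the caller skips the
    // state to avoid duplicate work.
    bool try_claim(State s) {
        std::lock_guard<std::mutex> lock(m);
        return claimed.insert(s).second;
    }
};
```

The cost trade-off discussed above shows up directly in this pattern: every `try_claim` on the real DHT is a remote query, so each frontier state adds a round trip even when no duplicate would have occurred.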
\begin{figure}
\centering
@ -262,4 +262,4 @@ To minimize the duplicate work performed, a second approach using a distributed
\label{fig:subg_DHT_base}
\end{figure}
Implementing the DHT did prevent duplicate work, but the communication cost from repeated DHT queries by each worker node was far greater than the serial approach, and was also greater than the first MPI Subgraphing approach without the DHT. As a result, the MPI Subgraphing approach is not viable as it stands. Future improvements or an entire reworking will be needed, and this is discussed further in Section \ref{sec:FW}.
BIN
Chapter5_img/no_DHT_Spd.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 106 KiB
BIN
Chapter5_img/no_DHT_eff.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 107 KiB
@ -2,6 +2,7 @@
\TUsection{Future Work} \label{sec:FW}
%Sync Fire more assets
%Large networks for the Tasking approach to give Task 2 a realistic workload
%Blending OpenMP and MPI
%Subgraphing Work
%DHT alts