% MSThesis/Chapter5.tex

\TUchapter{UTILIZATION OF MESSAGE PASSING INTERFACE}

\TUsection{Introduction to MPI Utilization for Attack Graph Generation}

\TUsection{Necessary Components}
\TUsubsection{Serialization}
In order to distribute workloads across nodes in a distributed system, various
types of data must be sent and received. Support and mechanisms vary based
on the MPI implementation, but fundamental data types such as integers, doubles,
characters, and Booleans are supported natively. While this
simplifies some of the messages that are sent and received in the MPI approaches to
attack graph generation, it does not cover the vast majority of them.

RAGE implements many custom classes and structs that are used throughout the generation process.
Qualities, topologies, network states, and exploits are a few such examples. Rather than manually breaking
each of these down into fundamental types, serialization functions are leveraged to handle
most of this work. RAGE already incorporates the Boost graph libraries for auxiliary support, so this work
extended that dependency to the serialization library also provided by Boost. This
library includes support for serializing the STL containers, which many of the RAGE
classes use as members. One additional advantage of the Boost
approach is its handling of nested members, which are common in the RAGE classes. For example, the NetworkState
class has a member vector of Quality objects. When serializing a NetworkState object, Boost
recursively serializes all members, including the custom class members, provided they also define
serialization functions.

When using the serialization library, this work opted for the intrusive route, where the
class definitions are altered directly to add the serialization functions. This was preferable to the non-intrusive approach, since
the classes could be altered with relative ease, and many of them
did not expose enough information for the non-intrusive approach to be viable.
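
The recursive behavior described above can be illustrated with a minimal, self-contained C++ sketch. It deliberately does not use Boost: \texttt{Writer} and \texttt{Reader} are hypothetical stand-ins for an archive, and the members of \texttt{Quality} and \texttt{NetworkState} are assumed for illustration rather than taken from RAGE. The sketch only mirrors the shape of the intrusive approach, in which each class gains a \texttt{serialize} member that lists its fields with the \texttt{\&} operator. The Boost form of this member also takes a version argument, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical text writer; a stand-in for an archive, NOT Boost's API.
class Writer {
public:
    std::string str() const { return out_.str(); }
    Writer& operator&(int& v) { out_ << v << ' '; return *this; }
    Writer& operator&(std::string& s) {
        out_ << s.size() << ' ' << s << ' ';  // length-prefixed string
        return *this;
    }
    template <class T>
    Writer& operator&(std::vector<T>& v) {
        int n = static_cast<int>(v.size());
        *this & n;                    // element count first
        for (auto& e : v) *this & e;  // then each element, recursively
        return *this;
    }
    // Any other type is assumed to provide an intrusive serialize() member.
    template <class T>
    Writer& operator&(T& obj) { obj.serialize(*this); return *this; }
private:
    std::ostringstream out_;
};

// Hypothetical reader that mirrors Writer field by field.
class Reader {
public:
    explicit Reader(const std::string& data) : in_(data) {}
    Reader& operator&(int& v) { in_ >> v; return *this; }
    Reader& operator&(std::string& s) {
        std::size_t n = 0;
        in_ >> n;
        in_.get();  // skip the separator that follows the length
        s.resize(n);
        if (n > 0) in_.read(&s[0], static_cast<std::streamsize>(n));
        return *this;
    }
    template <class T>
    Reader& operator&(std::vector<T>& v) {
        int n = 0;
        *this & n;
        v.resize(static_cast<std::size_t>(n));
        for (auto& e : v) *this & e;
        return *this;
    }
    template <class T>
    Reader& operator&(T& obj) { obj.serialize(*this); return *this; }
private:
    std::istringstream in_;
};

// Assumed members; the real RAGE classes may differ.
struct Quality {
    int asset_id = 0;
    std::string name;
    std::string value;
    template <class Archive>
    void serialize(Archive& ar) { ar & asset_id & name & value; }
};

struct NetworkState {
    int id = 0;
    std::vector<Quality> qualities;
    // Listing `qualities` is enough: each Quality serializes itself.
    template <class Archive>
    void serialize(Archive& ar) { ar & id & qualities; }
};
```

Writing a state with \texttt{Writer w; w \& state;} and reading it back with \texttt{Reader r(w.str()); r \& copy;} restores the nested \texttt{Quality} entries without any per-member code outside the two \texttt{serialize} functions.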
\TUsubsection{Data Consistency}

\TUsection{Tasking Approach}
\TUsubsection{Introduction to the Tasking Approach}
The high-level overview of the compliance graph generation process can be broken down into six main tasks.
These tasks are described in Figure \ref{fig:tasks}. Prior works such as
\cite{li_concurrency_2019}, \cite{9150145}, and \cite{7087377} parallelize graph generation using
OpenMP, CUDA, and hypergraph partitioning. This approach instead uses the Message Passing Interface (MPI)
to distribute the six identified tasks of RAGE and examines the effect on speedup, efficiency, and scalability for
attack and compliance graph generation.

\begin{figure}[htp]
    \includegraphics[width=\linewidth]{"./Chapter5_img/horiz_task.drawio.png"}
    \vspace{.2truein} \centerline{}
        \caption{Task Overview of the Attack Graph Generation Process}
        \label{fig:tasks}
\end{figure}

\TUsubsection{Algorithm Design}
The tasking approach leverages a pipeline structure built from the six tasks and the MPI nodes. Each stage of the pipeline passes the necessary data to the next stage through MPI messages, and the next stage's nodes receive the data and execute their tasks. The pipeline is considered fully saturated when each task has a dedicated node. When there are fewer nodes than tasks, some nodes process multiple tasks. When there are more nodes than tasks, the additional nodes are assigned to Tasks 1 and 2. This choice follows from timings collected with the serial approach on various networks, which showed that Tasks 1 and 2 require the most time and that larger networks require vastly more time in these two tasks. Node allocation can be seen in Figure \ref{fig:node-alloc}.

In determining which tasks should be handled by the root node, the two main considerations were minimizing communication cost and avoiding unnecessary complexity. In the serial approach, the frontier queue is the primary data structure for the majority of the execution. Rather than using a distributed queue or passing multiple sub-queues between nodes, the minimal option is to pass states individually. This approach also reduces complexity. Managing multiple frontier queues would require duplication checks, multiple nodes requesting data from and storing data into the database, and a strategy to maintain proper queue ordering, all of which would also increase the communication cost. As a result, the root node is dedicated to Tasks 0 and 3.

\begin{figure}[htp]
    \includegraphics[width=\linewidth]{"./Chapter5_img/node-alloc.png"}
    \vspace{.2truein} \centerline{}
        \caption{Node Allocation for each Task}
        \label{fig:node-alloc}
\end{figure}

\TUsubsubsection{Communication Structure}

\TUsubsubsection{Task 0}
Task 0 is performed by the root node and is a conditional task; it is not guaranteed to be executed at each pipeline iteration. Task 0 is only executed when the frontier is empty but the database still holds unexplored states. This occurs when there are memory constraints and database storage is performed during execution to offload the demand, as discussed in Section \ref{sec:db-stor}. After the completion of Task 0, a state is popped from the frontier, and the root node sends the state to n$_1$. If the frontier is empty at this point, the root node instead sends the finalize signal to all nodes.
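
The root node's decision at the start of each pipeline iteration can be sketched as follows. This is a simplified illustration, not the RAGE implementation: states are reduced to integer IDs, the database is replaced by a hypothetical in-memory vector named \texttt{stored}, and the refill is assumed to load every stored state at once. The names \texttt{RootAction}, \texttt{RootStep}, and \texttt{next\_root\_step} are introduced here for illustration only.

```cpp
#include <cassert>
#include <deque>
#include <vector>

enum class RootAction { SendState, Finalize };

struct RootStep {
    RootAction action;
    int state_id;  // -1 when no state is sent
};

// `frontier` holds unexplored state IDs in memory; `stored` is a hypothetical
// stand-in for the unexplored states that were offloaded to the database.
RootStep next_root_step(std::deque<int>& frontier, std::vector<int>& stored) {
    // Task 0 (conditional): runs only when the frontier is empty but the
    // database still holds unexplored states.
    if (frontier.empty() && !stored.empty()) {
        frontier.assign(stored.begin(), stored.end());
        stored.clear();
    }
    // Nothing left in memory or in the database: signal all nodes to finalize.
    if (frontier.empty()) return {RootAction::Finalize, -1};
    // Otherwise pop one state; the caller would send it on to Task 1.
    int id = frontier.front();
    frontier.pop_front();
    return {RootAction::SendState, id};
}
```

In an MPI program, the \texttt{SendState} branch would correspond to a send toward the Task 1 nodes and the \texttt{Finalize} branch to the finalize signal described above.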
\TUsubsubsection{Task 1}
Task 1 begins by distributing the workload between nodes based on the local task communicator rank. Rather than splitting the exploit list at the root node and sending sub-lists to each node allocated to Task 1, each node performs a modulo operation on the current iteration of the exploit loop with the number of nodes allocated to Task 1, and compares the result against its local communicator rank to determine whether it should proceed with that iteration. Since the exploit list is static, each node has the exploit list initialized prior to the generation process, and the communication cost of sending sub-lists to each node is avoided. Each node in Task 1 works to compile a reduced exploit list that is applicable to the current network state. A breakdown of the Task 1 distribution can be seen in Figure \ref{fig:Task1-Data-Dist}.

\begin{figure}[htp]
    \includegraphics[width=\linewidth]{"./Chapter5_img/Task1-Data-Dist.png"}
    \vspace{.2truein} \centerline{}
        \caption{Data Distribution of Task One}
        \label{fig:Task1-Data-Dist}
\end{figure}
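
The modulo-based distribution of the exploit loop can be sketched in a few lines of C++. The sketch assumes that the loop index is the value taken modulo the number of Task 1 nodes. \texttt{Exploit}, \texttt{reduced\_exploit\_list}, and the \texttt{is\_applicable} callback are hypothetical names, and the rank and communicator size are passed in as plain integers rather than queried from MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a RAGE exploit; only a name is kept here.
struct Exploit {
    std::string name;
};

// Returns the exploits that this node owns AND that are applicable to the
// current network state. `local_rank` is the node's rank in the Task 1
// communicator and `task1_nodes` is the size of that communicator.
std::vector<Exploit> reduced_exploit_list(
    int local_rank, int task1_nodes,
    const std::vector<Exploit>& exploits,
    const std::function<bool(const Exploit&)>& is_applicable) {
    assert(task1_nodes > 0 && local_rank >= 0 && local_rank < task1_nodes);
    std::vector<Exploit> reduced;
    for (std::size_t i = 0; i < exploits.size(); ++i) {
        // Skip iterations that belong to another Task 1 node.
        if (i % static_cast<std::size_t>(task1_nodes) !=
            static_cast<std::size_t>(local_rank)) {
            continue;
        }
        if (is_applicable(exploits[i])) reduced.push_back(exploits[i]);
    }
    return reduced;
}
```

Because every node already holds the full, static exploit list, each node can call this function with its own rank and no exploit data needs to be exchanged before the loop starts.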

Once the computation work of Task 1 is completed, each node must send its compiled applicable exploit list to Task 2. Rather than merging all lists and splitting them back out in Task 2, each node in Task 1 sends an applicable exploit list to at most one node allocated to Task 2. Based on the allocation of nodes seen in Figure \ref{fig:node-alloc}, there are two potential cases: the number of nodes allocated to Task 1 is equal to the number of nodes allocated to Task 2, or the number of nodes allocated to Task 1 is one greater than the number of nodes allocated to Task 2. For the first case, each node in Task 1 sends its applicable exploit list to the node at global rank (rank + n$_1$). This case can be seen in Figure \ref{fig:Task1-Case1}. For the second case, since there are more nodes allocated to Task 1 than to Task 2, node n$_1$ scatters its partial applicable exploit list in the local Task 1 communicator, and all other Task 1 nodes follow the same pattern seen in the first case. This second case can be seen in Figure \ref{fig:Task1-Case2}.

\begin{figure}[htp]
    \includegraphics[width=\linewidth]{"./Chapter5_img/Task1-Case1.png"}
    \vspace{.2truein} \centerline{}
        \caption{Communication From Task 1 to Task 2 when the Number of Nodes Allocated is Equal}
        \label{fig:Task1-Case1}
\end{figure}

\begin{figure}[htp]
    \includegraphics[width=\linewidth]{"./Chapter5_img/Task1-Case2.png"}
    \vspace{.2truein} \centerline{}
        \caption{Communication From Task 1 to Task 2 when Task 1 Has More Nodes Allocated}
        \label{fig:Task1-Case2}
\end{figure}
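
The two cases can be expressed as a single destination computation. The sketch below rests on an assumed rank layout that the text implies but does not state: the root is global rank 0, Task 1 occupies ranks 1 through n$_1$, and Task 2 occupies the n$_2$ ranks that follow. Under that assumption, node n$_1$ is the last Task 1 node, and it is the only one whose target (rank + n$_1$) falls outside the Task 2 range when n$_2$ is one less than n$_1$. The names \texttt{task2\_destination} and \texttt{kScatterWithinTask1} are hypothetical.

```cpp
#include <cassert>

// Sentinel: this Task 1 node has no Task 2 partner and must scatter its list
// to the other Task 1 nodes instead.
const int kScatterWithinTask1 = -1;

// Assumed rank layout (not stated explicitly in the text):
//   rank 0               : root
//   ranks 1 .. n1        : Task 1
//   ranks n1+1 .. n1+n2  : Task 2
int task2_destination(int global_rank, int n1, int n2) {
    assert(n2 >= 1 && (n1 == n2 || n1 == n2 + 1));
    assert(global_rank >= 1 && global_rank <= n1);
    int dest = global_rank + n1;
    // When n1 == n2 + 1, the last Task 1 rank maps past the Task 2 range.
    return dest <= n1 + n2 ? dest : kScatterWithinTask1;
}
```

With three nodes in each task, Task 1 ranks 1, 2, and 3 map to ranks 4, 5, and 6. With three Task 1 nodes and two Task 2 nodes, ranks 1 and 2 still map to 4 and 5, while rank 3 receives the sentinel and scatters its list.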

\TUsubsubsection{Task 2}
Each node in Task 2 iterates through the received partial applicable exploit list and creates new states with edges to the current state. However, Synchronous Firing work is performed during this process, and syncing multiple exploits that could be distributed across multiple nodes leads to additional overhead and complexity. To prevent these difficulties, each node checks its partial applicable exploit list for exploits that are part of a group, removes these exploits from its list, and sends them as a new partial list to the Task 2 local communicator root. Since the Task 2 local root now holds all group exploits, it can execute the Synchronous Firing work without additional communication or synchronization with other MPI nodes in the Task 2 stage. Other than the additional setup steps required for Synchronous Firing on the local root, all work performed during this task by all MPI nodes is that shown in the Synchronous Firing figure (Figure \ref{fig:sync-fire}).
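
The filtering step that each Task 2 node performs before any Synchronous Firing work can be sketched as below. \texttt{TaggedExploit} is a hypothetical record in which an empty \texttt{group} string marks an exploit that belongs to no group; the way RAGE actually represents group membership may differ. The function name \texttt{extract\_group\_exploits} is likewise introduced only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical exploit record: an empty `group` means the exploit is not
// part of any Synchronous Firing group.
struct TaggedExploit {
    std::string name;
    std::string group;
};

// Removes the grouped exploits from `partial` (in place) and returns them,
// so the caller can forward them to the Task 2 local root. Relative order
// is preserved in both lists.
std::vector<TaggedExploit> extract_group_exploits(
    std::vector<TaggedExploit>& partial) {
    std::vector<TaggedExploit> grouped;
    std::vector<TaggedExploit> ungrouped;
    for (auto& e : partial) {
        (e.group.empty() ? ungrouped : grouped).push_back(std::move(e));
    }
    partial = std::move(ungrouped);
    return grouped;
}
```

After the call, the node keeps only ungrouped exploits and can process them locally, while the returned list is what would be sent to the local root.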
\TUsubsubsection{Task 3}
Task 3 is performed only by the root node, and no division of work is necessary. The root node continuously checks for new states until the Task 2 finalize signal is detected. This task consists of setting the new state's ID, adding it to the frontier, adding its information to the instance, and inserting information into the hash map. When the root node has processed all states and has received the Task 2 finalize signal, it completes Task 3 by sending the instance to Task 4 and the frontier to Task 5, if applicable, and then proceeds to Task 0.

\TUsubsubsection{Task 4 and Task 5}
Intermediate database operations, though infrequent and possibly absent for small graphs, are time-consuming when they do occur. As discussed in Section \ref{sec:db-stor}, the two main memory consumers are the frontier and the instance, both of which are held by the root node. Since the database storage requests are blocking, the pipeline would halt for a lengthy period of time while waiting for the root node to finish potentially two large storage operations. Tasks 4 and 5 alleviate the stall by executing independently of the regular pipeline execution flow. Since Tasks 4 and 5 do not send any data, no other tasks must wait for these tasks to complete. The root node can then asynchronously send the frontier and instance to the appropriate nodes as needed, clear its memory, and continue execution without delay.
\TUsubsubsection{MPI Tags}

\TUsubsection{Performance Expectations}

\TUsubsection{Results}
The communication cost of the asynchronous sends to Tasks 4 and 5 is less than the time required for the root node to perform a database storage itself.

\TUsection{Subgraphing Approach}
\TUsubsection{Introduction to the Subgraphing Approach}

\TUsubsection{Algorithm Design}
\TUsubsubsection{Communication Structure}
\TUsubsubsection{Worker Nodes}
\TUsubsubsection{Root Node}
\TUsubsubsection{Database Node}

\TUsubsection{Performance Expectations}