\documentclass[conference]{IEEEtran}
\RequirePackage{setspace}

\usepackage{graphicx} % Images
\graphicspath{ {./images/} }

\usepackage{float} % Table captions on top
\floatstyle{plaintop}
\restylefloat{table}

\usepackage{ifpdf} % Detect PDF or DVI mode
\usepackage{babel} % Bibliography
\usepackage{dsfont} % mathbb

\usepackage[utf8]{inputenc}
\usepackage{indentfirst}
\setlength{\parskip}{\baselineskip}

% Table of Contents/Figure Spacing
\usepackage[titles]{tocloft}
\cftsetindents{figure}{0em}{3.5em}
\cftsetindents{table}{0em}{3.5em}

\usepackage{amsmath}
\usepackage[linesnumbered,commentsnumbered,ruled,vlined]{algorithm2e}

\begin{document}

\title{Compliance Graph Analysis Using Network Science and Structure Variations}

\author{
\IEEEauthorblockN{Noah L. Schrick}
\IEEEauthorblockA{
\textit{Tandy School of Computer Science} \\
\textit{The University of Tulsa}\\
Tulsa, USA \\
noah-schrick@utulsa.edu
}
\and
\IEEEauthorblockN{Peter J. Hawrylak}
\IEEEauthorblockA{
\textit{Tandy School of Computer Science} \\
\textit{The University of Tulsa}\\
Tulsa, USA \\
peter-hawrylak@utulsa.edu
}
\and
\IEEEauthorblockN{Brett A. McKinney}
\IEEEauthorblockA{
\textit{Tandy School of Computer Science} \\
\textit{The University of Tulsa}\\
Tulsa, USA \\
brett-mckinney@utulsa.edu
}
}

\maketitle

\begin{abstract}
Compliance graphs are generated graphs (or networks) that represent systems' compliance or regulation standings at present, with expected changes, or both. These graphs are generated as directed acyclic graphs (DAGs), and can be used to identify possible correction or mitigation schemes for environments necessitating compliance to mandates or regulations.

DAGs complicate the analysis process due to their underlying graph structures and asymmetry. This work presents network science centralities compatible with DAGs, and structure variations as a means to analyze three different example compliance graphs. Each centrality measure and structural change offers a unique importance ranking that can be used for prioritizing correction or mitigation schemes.
\end{abstract}

\begin{IEEEkeywords}
Attack Graph; Compliance Graph; Cybersecurity; Compliance and Regulation; Network Theory; Centrality
\end{IEEEkeywords}

\section{Introduction} \label{sec:Intro}

Compliance graphs are an alternate form of attack graphs, utilized specifically for examining the compliance and regulation statuses of systems. Like attack graphs, compliance graphs can be used to determine all ways that systems may fall out of compliance or violate regulations, or to highlight the ways in which violations are already present. These graphs are notably useful for cyber-physical systems due to the increased need for compliance. As the authors of \cite{j_hale_compliance_nodate}, \cite{baloyi_guidelines_2019}, and \cite{allman_complying_2006} discuss, cyber-physical systems have seen greater usage, especially in areas such as critical infrastructure and the Internet of Things. The challenge of cyber-physical systems lies not only in the demand for cybersecurity of these systems, but also in the concern for safe, stable, and undamaged equipment. The industry in which these devices are used can lead to additional compliance guidelines that must be followed, increasing the complexity required for examining compliance statuses. Compliance graphs are promising tools that can aid in minimizing the overhead caused by these systems and the regulations they must follow.

The state-space explosion and large-scale nature of compliance graphs lead to additional overhead for analysis approaches. Simplistic, initial approaches for compliance graph analysis quickly result in difficulties in terms of spatial and runtime complexity. Comparing every edge of every node in a graph containing upwards of hundreds of millions of nodes and edges makes analysis techniques with exponential complexities in either spatial or runtime terms largely infeasible. Brute-force tactics, manual evaluations, and consideration of all permutations will yield lackluster output, or may fail to complete in a reasonable amount of time. To reduce the problem space, prioritization of nodes can be performed as a pre-processing step for further analysis work. A compliance graph can undergo an initial analysis process to determine which nodes or edges should undergo a more rigorous investigation. In addition, the structure of the compliance graph can be altered to limit or refine the goal of the analysis process. This work will proceed as follows: Section \ref{sec:corr-priorities} will present the prioritization metrics through a Network Science lens. Section \ref{sec:graph-xform} will present graph transformations for altering and refining compliance graphs to other structures. Section \ref{sec:cent-res} will present and discuss the centrality results, and Section \ref{sec:xform-res} will present and discuss the transformation results.

\section{Related Works} \label{sec:rel-works}

Compliance graphs have yet to be formally investigated for analysis purposes. However, compliance graphs share many similarities with attack graphs. As Section \ref{sec:Intro} discusses, attack and compliance graphs are both directed acyclic graphs (DAGs) that exhaustively walk through all changes in a system or set of systems. Attack graphs examine cybersecurity postures, while compliance graphs examine compliance or regulation standings. These graphs are generated and processed similarly, but are focused and refined on different fields of interest. Many researchers have developed or applied analysis techniques to attack graphs in order to analyze various features and reveal information regarding common trends or possible corrections. These techniques, though applied to attack graphs, can be applied to compliance graphs with slight modifications. This Section highlights a few of the research routes that have been undertaken to accomplish these goals, and categorizes these related investigations to highlight the availability and novelty of the research methods used in this work.

After the generation of graphs, it is reasonable to visualize the possible violation paths an environment may endure. However, assuming not all paths can be removed, deciding which paths to address is a cumbersome task. One analysis technique to overcome this difficulty is minimization. Minimization can be employed to determine if a given security countermeasure increases the security or regulatory standing of a network \cite{Jha2002TwoFA}. Given a security countermeasure or correction scheme and the generated graph, if the proposed option prevents a transition from one graph state to another, the connecting edge is removed. After repeating for all possible edge removals, if the number of attacker or violation goal states has decreased, then the security countermeasure or correction scheme does improve the network. If the number of goal states remains the same, then the security countermeasure or correction scheme is not sufficient to improve the network's standing with regard to security or compliance.

Another technique for minimization analysis is identifying the smallest subset of security countermeasures or correction schemes that produce a desired network threshold \cite{Jha2002TwoFA}. However, the authors of \cite{Jha2002TwoFA} discuss that determining this subset is an NP-complete problem. This approach becomes increasingly infeasible as the graphs grow to high numbers of possible attack vectors or regulatory violation conditions. If the minimum subset were known, then a set of countermeasures or correction schemes could be processed to identify the smallest number of resolutions that would prevent all attacks or violations in the minimum set.

Though finding the minimum subset is an NP-complete problem, approximations can still be derived, and various works have taken approaches toward the cost minimization problem. The authors of \cite{10.1016/j.comcom.2006.06.01837} presented an approach utilizing disjunctive normal forms, in which countermeasures and correction schemes are represented as disjunctive clauses. The authors of \cite{Islam2008AHA36} developed a heuristic approach for attacks of multiple steps, in which a cost is associated with the beginning nodes, with partial costs being distributed and propagated through state transitions. The authors of \cite{10.1109/IAS.2008.38} implemented a graphical approach for cost minimization analysis, leveraging Boolean functions and Shannon decomposition with the use of source and sink nodes.

Regarding network science approaches, the authors of \cite{GCAI-2018:Analysis_of_Attack_Graph} use betweenness centrality specifically for logical attack graphs. Using the importance derived from the centrality results, the authors were able to employ a correction scheme with greater efficiency as compared to prioritizing a shortest-path approach. The author of \cite{ming_diss} presents three centrality measures that were applied to various attack graphs. The centrality measures implemented were Katz, K-path Edge, and Adapted PageRank, with the authors of \cite{10.1145/3491257} expanding on the Adapted PageRank approach. Each of these centrality measures is applicable to the directed format of compliance graphs, and the author of \cite{ming_diss} drew conclusions regarding patching schemes for preventing exploits in attack graphs. As an approach for avoiding complex eigenvalues, the authors of \cite{Guo2017HermitianAM} present work examining directed, undirected, and mixed graphs through their Hermitian adjacency matrices. Other works, such as that discussed by the author of \cite{Mieghem2018DirectedGA}, include mathematical manipulation of directed graph spectra (originally presented by the author of \cite{Brualdi2010SpectraOD}) with Schur's Theorem to bound eigenvalues and allow for explicit computation, which can then be used for additional analysis metrics.

\section{Example Networks} \label{sec:example-networks}

\section{Identifying Correction Priority Through Network Centralities} \label{sec:corr-priorities}

In order to generate a correction scheme from a compliance graph, correction priorities first need to be obtained. For a correction scheme to be useful, it is imperative to tailor it around the most important concerns for a system or set of systems. Given a prior knowledge network and a compliance graph consisting of nodes and edges, it is possible to ascertain importance based on various information. Though nodes flagged as ``in violation'' have importance, compliant upstream nodes and edges may have greater importance. Figure \ref{fig:topo-ex} illustrates an example subgraph of a compliance graph. An initial approach could be to assign nodes 67, 71, 74, and 75 (the shaded, ``in violation'' nodes) the highest priority. If nodes 74 and 75 were to be independently addressed, both edge ``e'' and edge ``b'' would need to be prevented. However, though node 73 is compliant, removing this node from generation would prevent nodes 74 and 75 from being generated. Based on constraints and other factors present in a prior knowledge network, edge ``a'' could have insignificant impact on available resources. Preventing edge ``a'' would additionally prevent node 71 (another shaded, ``in violation'' node) from being generated. Obtaining a priority list using a prior knowledge network and topographical information makes a network science approach very appealing. This Section discusses using centrality and graph transformations as a means of obtaining a correction priority. Figure \ref{fig:obj3} displays this approach as it relates to obtaining the correction priority.

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/Topographical.png"}
\caption[Example Subgraph of a Compliance Graph]{Example Subgraph of a Compliance Graph. Prioritization of flagged nodes (shaded nodes) is one approach at minimizing severity in a system or set of systems. However, accounting for topographical information and upstream nodes that are in compliance can serve as better approaches for minimizing severity.}
\label{fig:topo-ex}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/Obj3.png"}
\caption[Obtaining a Correction Priority]{Obtaining a Correction Priority. A correction priority can be obtained through a compliance graph and a prior knowledge network. By using graph transformations and network centralities, importance can be assigned to nodes to serve as a correction priority.}
\label{fig:obj3}
\end{figure}

\subsection{Introduction to Network Centralities} \label{sec:net-cents}

Within the field of network science, centralities are often used to determine the importance of a node or edge of a graph (or network). Various centrality metrics assign importance based on differing characteristics of a graph or network, such as the ability a node or edge has for transferring information, or the influence it may have on other nodes or edges, whether local or global. The author of \cite{PMID:30064421} provides a survey of centrality measures, and discusses how various centrality measures have been implemented to determine node importance in networks. By determining the importance of nodes, various conclusions can be drawn regarding the network, and noteworthy hubs can be identified. In the case of compliance graphs, conclusions can be drawn regarding the prioritization of patching or correction schemes. If one node directs to many other nodes, a mitigation enforcement may be considered imperative to prevent further opportunities for compliance violation. This work discusses centrality measures across various structural changes, and contextualizes their applications to compliance graphs.

\subsubsection{Network Centralities for Directed Graphs} \label{sec:NC-dir-challenges}

Compliance graphs, like attack graphs, are directed acyclic graphs, and analysis of directed graphs is notably more involved compared to their undirected counterparts. The primary contributor to the increased difficulty is the asymmetric adjacency matrix present in directed graphs. Figure \ref{fig:symm-adj} displays an undirected graph with a symmetric adjacency matrix, and Figure \ref{fig:asymm-adj} displays a directed form of the same graph with an asymmetric adjacency matrix. With undirected graphs, simplifications can be made in the analysis process both computationally and conceptually. Since the ``in'' degrees are equal to the ``out'' degrees, less work is required both in terms of parsing the adjacency matrix and in terms of determining the importance of nodes. The author of \cite{newman2010networks} discusses that common analysis techniques such as eigenvector centrality are often inapplicable to directed acyclic graphs. As the author of \cite{Mieghem2018DirectedGA} discusses, the difficulty of directed graphs also extends to the graph Laplacian, which is not uniquely defined for asymmetric adjacency matrices: either the row sums or the column sums can be made to compute to zero, but not both. The author of \cite{Mieghem2018DirectedGA} continues to discuss that directed graphs lead to complex eigenvalues, and can lead to adjacency matrices that cannot be diagonalized. These challenges require different approaches for typical clustering or centrality measures.

\begin{figure}[htp]
\centering
\includegraphics[scale=0.5]{"./images/Symm.png"}
\caption[Undirected Graph and its Symmetric Adjacency Matrix]{For undirected graphs, the resulting adjacency matrix is symmetric. For all nodes that have a connecting edge, the corresponding cell in the matrix is marked with a ``1''. This value is present both when traversing by row and by column. All nodes that do not have a connecting edge have a value of ``0'' in their corresponding cell. This value is also present both when traversing by row and by column. Therefore, the halves of the matrix across its diagonal are mirrored.}
\label{fig:symm-adj}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[scale=0.5]{"./images/Asymm.png"}
\caption[Directed Graph and its Asymmetric Adjacency Matrix]{For directed graphs, the resulting adjacency matrix is asymmetric. Node ``A'' has a directed edge to Node ``B'', and therefore a value of ``1'' is found in the corresponding cell. However, since this edge is not reciprocated, there is a value of ``0'' when examining the cell between Node ``B'' and Node ``A''. This behavior is repeated for other, similar node relationships in this example graph. The halves of the matrix across its diagonal are not mirrored.}
\label{fig:asymm-adj}
\end{figure}

While Section \ref{sec:rel-works} discusses a few options for approaching the challenges with directed graphs, this work opted for centrality metrics that are natively compatible with compliance graphs. Specifically, this work employed centrality metrics that are compatible with directed graphs and assign importance to nodes, rather than edges. Recall that for compliance graphs, nodes represent the state of an environment, with all relevant asset information encoded within the node. Edges (exploits) represent the changes that can be made to a system state. When generating a compliance graph, the edge specification is performed in advance, where the exploit is known, or the behavior (such as for zero-day modeling) can be described. The edges can have probability values assigned through likelihood analysis, through links to known CVEs, or through historical data. The unique set of exploits is orders of magnitude smaller than the total number of edges in a graph, techniques for assigning edge values already exist, and edges are already contextualized in a compliance graph; assigning importance to edges is therefore far less beneficial than examining prioritization at the node level. Since the goal of compliance graph analysis is securing systems and limiting the regulation violations observed at the node level, centrality metrics that assign importance to nodes were chosen over metrics that assign importance to edges. Section \ref{sec:FW} discusses future avenues that include assigning importance to edges in addition to nodes.

\subsection{Degree} \label{sec:degree}

Degree centrality is a trivial, localized measure of node importance based on the number of edges that a node has. In an undirected graph, the degree centrality is predicated solely on the number of edges. In the case of a directed graph, however, a distinction is drawn between a degree centrality oriented on the number of edges entering a node and another measure focused on the number of edges leaving a node. Both of these cases provide useful information for compliance graphs. When a node directs to a large number of other nodes, it may be prioritized since it creates further opportunity for violation. When a node has a large number of edges pointing to it, it may be prioritized since the larger number of system changes leading to this state increases the probability that systems enter it.

Degree centrality for the example networks presented in Section \ref{sec:example-networks} was implemented in R. The attack and compliance graph generator, RAGE, outputs the graph in a Graphviz \cite{Graphviz} DOT file. R's igraph \cite{igraph} package includes functionality to import from a DOT file into an igraph network, and this functionality was used for this purpose. Once the graph was imported, the built-in igraph function for degree centrality was called, and its output was stored in a list. With standard approaches, degree centrality has a spatial and time complexity of $\mathcal{O}(n^2)$, since standard approaches compute degree centrality through the adjacency matrix representation of a graph, which is an \textit{n} x \textit{n} matrix. Apart from the ease-of-use that igraph provides, it also includes optimizations for commonly-used graph operations. Due to the compliance graph structure, the graph can be stored in a column-compressed format as a sparse matrix. This sparse matrix is able to significantly reduce the memory footprint. Rather than storing an integer (or Boolean) value for each node-to-node connection (in the standard \textit{n} x \textit{n} matrix), a series of reduced columns can be used, which reduces the spatial complexity to $\mathcal{O}(K)$, where \textit{K} is the number of nonzero elements. Operations on a sparse matrix likewise have reduced time complexities, including degree centrality. When using a sparse matrix with igraph's degree centrality function, the time complexity is bound by $\mathcal{O}(n*d)$, where \textit{d} is the average degree.

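
To make the in- and out-degree distinction concrete, the following is a minimal sketch in Python with NetworkX (the pipeline above uses R and igraph; the toy edge list here is hypothetical):

\begin{verbatim}
import networkx as nx

# Hypothetical toy compliance subgraph: an edge points from a
# system state to a state reachable by one change (exploit).
G = nx.DiGraph([(67, 71), (67, 73), (73, 74),
                (73, 75), (71, 74)])

# Out-degree: states opening further opportunities for violation.
# In-degree: states that many system changes can lead into.
out_rank = sorted(G.out_degree(), key=lambda d: d[1], reverse=True)
in_rank = sorted(G.in_degree(), key=lambda d: d[1], reverse=True)
print(out_rank)  # e.g., [(67, 2), (73, 2), (71, 1), ...]
print(in_rank)
\end{verbatim}
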
\subsection{Betweenness}\label{sec:between}

Betweenness centrality ranks node importance based on the node's ability to transfer information in a network. For all pairs of nodes in a network, a shortest path is determined. A node that is in this shortest path is considered to have importance. The total betweenness centrality is based on the number of shortest paths that pass through a given node. For compliance graphs, the shortest paths are useful to identify the quickest way (least number of steps) that systems may fall out of compliance. By prioritizing the nodes that fall in the highest number of shortest paths, correction schemes can be employed to prolong or prevent systems from falling out of compliance.

Betweenness centrality is given in Equation \ref{eq:between}, where \textit{i} and \textit{j} are two different, individual nodes in the network, $\sigma_{ij}$ is the total number of shortest paths from \textit{i} to \textit{j}, and $\sigma_{ij}(v)$ is the number of those shortest paths that include a node \textit{v}.

\begin{equation}
\sum_{i \neq j \neq v} \frac{\sigma_{ij}(v)}{\sigma_{ij}}
\label{eq:between}
\end{equation}

The implementation details for betweenness centrality are largely similar to the details described in Section \ref{sec:degree} for degree centrality. Betweenness centrality was computed using igraph's betweenness function, due to the ease-of-use and graph algorithm optimizations offered by the library. For the applications described in Section \ref{sec:example-networks}, the compliance graphs are unweighted. The igraph library implements Brandes' algorithm \cite{brandes} for the centrality computation, which provides drastic improvements over other algorithms for computing betweenness. Using the igraph package, and coupled with the unweighted nature of the given compliance graphs, the time complexity of betweenness centrality is $\mathcal{O}(|n|*|e|)$, where \textit{n} is the number of nodes and \textit{e} is the number of edges, and the spatial complexity is $\mathcal{O}(n+e)$.

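
A corresponding sketch (again NetworkX rather than the R/igraph implementation described above; the DAG is hypothetical) computes the raw shortest-path counts of Equation \ref{eq:between}:

\begin{verbatim}
import networkx as nx

G = nx.DiGraph([(1, 2), (2, 3), (2, 4),
                (3, 5), (4, 5)])  # hypothetical DAG

# Unweighted betweenness via Brandes' algorithm;
# normalized=False keeps the raw shortest-path sums.
bc = nx.betweenness_centrality(G, normalized=False)
priority = sorted(bc, key=bc.get, reverse=True)
print(priority)  # node 2 lies on every path from 1 to 3, 4, 5
\end{verbatim}
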
\subsection{Katz} \label{sec:katz}

Katz centrality was first introduced by the author of \cite{Katz}, and measures the importance of nodes through all paths in a network. Katz centrality differs in that its measure is not limited solely to the shortest path between any two given nodes. The original work defines Katz centrality as seen in Equation \ref{eq:Katz}, where \textit{i} and \textit{j} are nodes in the network, \textit{n} is the total number of nodes in the network, \textit{A} is the adjacency matrix, and $\alpha$ is an attenuation factor with a value between 0 and 1. A value of 1 is assigned in \textit{A} if node \textit{i} is connected to node \textit{j}.

\begin{equation}
C_{\mathrm {Katz} }(i)=\sum _{k=1}^{\infty }\sum _{j=1}^{n}\alpha ^{k}(A^{k})_{ji}
\label{eq:Katz}
\end{equation}

Later works have expanded the original Katz centrality to include a $\beta$ vector that allows for additional scaling in the instance that prior knowledge of the network exists. The modified equation implemented by the authors of \cite{ModKatz} can be seen in Equation \ref{eq:mod_katz}.

\begin{equation}
\vec{x} = \left(I - \alpha A \right)^{-1}\vec{\beta}
\label{eq:mod_katz}
\end{equation}

For compliance graphs, Katz centrality represents the total number of paths that exist from a given node to any other downstream nodes, scaled by the attenuation factor as well as the prior knowledge vector $\beta$. When the Katz centrality of a given node is high, prioritizing a correction scheme for the node would be useful to prevent the opportunity of future compliance violations that may be many steps ahead, but still reachable from the current state. Additional weighting and scaling can be applied to nodes known in advance to have greater importance through the $\beta$ vector, and through tuning the attenuation factor to give greater weight to the local or global reach of nodes.

For Katz centrality, difficulties can be encountered when computing the eigenvalues. When using the igraph ``eigen'' function, $n^2$ memory is required. This function calls upon a row summing helper function, which requires an intermediate matrix to be held in memory. For large-scale graphs, this quickly becomes problematic. In addition, computing eigenvalues is bound by a time complexity of $\mathcal{O}(n^3)$, which is cumbersome for the large-scale compliance and attack graphs that are generated \cite{laug}. igraph does make use of the LAPACK \cite{laug} routines, which reduces the time complexity to $\mathcal{O}(n^2)$. However, for the compliance graphs that are generated in this work, there are a few ways to work around the time and spatial complexity. Directed acyclic graphs (DAGs) have special properties in their spectral analysis. As the authors of \cite{stankovic2023fourier} and \cite{seifert2023causal} discuss, the eigenvalues of the adjacency matrix of a DAG are all 0. As a result, a column vector of 0s of size \textit{n} can be initialized in place of an eigenvalue computation. For computing Katz centrality, a custom method that follows the original approach of \cite{Katz} was created. Though the eigenvalue computation can be omitted, Katz centrality is still bound by matrix multiplication, which has a direct definitional time complexity of $\mathcal{O}(n^3)$ \cite{MACEDO2016999}. Recent works have shown that the time complexity can theoretically be reduced to $\mathcal{O}(n^{2.371552})$ \cite{williams2023new}; however, the implemented, tested, and confirmed algorithm with a time complexity of $\mathcal{O}(n^{2.3728596})$ \cite{alman2020refined} is more common in practice.

For this implementation, since sparse matrices are employed, the time complexity can be reduced even further. Though the eigenvalue vector is initialized to zero, its values are updated to match the adjacency matrix values each iteration. The adjacency matrix will be denoted as \textit{A}, and the eigenvalue vector will be denoted as \textit{B}. For each of these matrices, \textit{a} shall represent the number of nonzero elements in \textit{A}, and \textit{b} shall represent the number of nonzero elements in \textit{B}. Rather than the definitional time complexity being represented as $\mathcal{O}(n*n*n)$, the definitional time complexity for sparse matrices can be represented as $\mathcal{O}(a*b*n)$. Regarding parameters, $\alpha$ was set to 0.5 to allow for a balance between short and long distance edge traversals. The $\beta$ vector was trivially set: if a node was in violation of a mandate, regulation, or some other form of compliance requirement, a value of 5.0 was assigned; otherwise, the node had a value of 1.0.

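
The direct solve of Equation \ref{eq:mod_katz} is also available in NetworkX, shown below as an illustrative sketch (the custom sparse R implementation is described above; the graph and the violating node are hypothetical):

\begin{verbatim}
import networkx as nx

G = nx.DiGraph([(1, 2), (1, 3), (2, 4), (3, 4)])  # hypothetical

# beta mirrors the trivial prior-knowledge vector described
# above: 5.0 for states in violation (here node 4), else 1.0.
beta = {n: (5.0 if n == 4 else 1.0) for n in G}

# alpha = 0.5 balances short and long traversals; since a DAG's
# adjacency spectrum is all zeros, any alpha in (0, 1) is valid.
katz = nx.katz_centrality_numpy(G, alpha=0.5, beta=beta)
print(sorted(katz, key=katz.get, reverse=True))
\end{verbatim}
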
\subsection{Adapted PageRank} \label{sec:pr}

The original PageRank algorithm was first designed by the authors of \cite{PageRank} for the Google prototype for ranking web pages. The authors of \cite{Adapted_PageRank} later introduced an Adapted PageRank algorithm that was designed to measure both the number and quality of connections specifically for an urban network. Equation \ref{eq:PR} displays the PageRank algorithm, where $\gamma$ is a damping factor with a value between 0 and 1, \textit{n} is the total number of nodes in the network, \textit{A} is the adjacency matrix of the network, \textit{i} and \textit{j} represent the row and column of the adjacency matrix, $x_i$ is the score of a given node in the network, and $k_j$ is the out degree (row sum) of node \textit{j}. Since the Adapted PageRank algorithm measures the quality of connections, there is increased application to directed networks such as compliance graphs. As seen in Equation \ref{eq:PR}, the $k_j$ term is a penalizing factor. Importance is based on the in degree of a node, with a penalty for the out degree. If many nodes point to a given node, then that node is considered important due to its accessibility.

\begin{equation}
x_i = \frac{1-\gamma}{n} + \gamma\sum_{j = 1}^{n}\frac{A_{ij}}{k_j}x_j
\label{eq:PR}
\end{equation}

The Adapted PageRank algorithm includes additional data that may be present in an urban network, such as geographical position, resource availability, and proximity to facilities. This data is user-defined, and may not be present in the network. Equation \ref{eq:APC} displays the Adapted PageRank algorithm in matrix form, where \textit{D} is the user-defined data matrix, \textit{I} is the identity matrix, and $\mathds{1}$ is a column matrix comprised of 1s.

\begin{equation}
(I-\gamma A D)\vec{x} = \frac{1-\gamma}{n}\mathds{1}
\label{eq:APC}
\end{equation}

For compliance graphs, the Adapted PageRank algorithm is useful for a few reasons. First, it is able to include user-defined data regarding the network. This could include scaling certain nodes to have greater weight, such as those known to be in a noncompliant state. Second, since nodes are penalized for pointing to other nodes, this algorithm is useful for determining nodes that are likely to be visited. If a state has a greater in-degree, it may require greater prioritization since the system has a higher likelihood of falling into this state.

The implementation details for PageRank centrality are largely similar to the details described in Section \ref{sec:degree} for degree centrality. PageRank centrality was computed using igraph's ``pagerank'' function, due to the ease-of-use and graph algorithm optimizations offered by the library. For the applications described in Section \ref{sec:example-networks}, the time complexity of the Adapted PageRank centrality when using igraph is $\mathcal{O}(e)$, where \textit{e} is the number of edges, and the spatial complexity is $\mathcal{O}(n)$ for the results vector, plus the spatial requirements of holding the graph object. The igraph computation features improvements over traditional implementations; the traditional time complexity is often $\mathcal{O}(n+e)$, where \textit{n} is the number of nodes and \textit{e} is the number of edges.

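
NetworkX does not implement the Adapted PageRank of \cite{Adapted_PageRank} directly; the sketch below uses the standard PageRank with a personalization vector as a rough, hypothetical stand-in for the user-defined data matrix \textit{D} of Equation \ref{eq:APC}:

\begin{verbatim}
import networkx as nx

G = nx.DiGraph([(1, 2), (1, 3), (3, 2), (2, 4)])  # hypothetical

# Standard PageRank with damping factor 0.85.
pr = nx.pagerank(G, alpha=0.85)

# Up-weight states known to be noncompliant (here node 4;
# an assumption standing in for the data matrix D).
weights = {n: (5.0 if n == 4 else 1.0) for n in G}
pr_w = nx.pagerank(G, alpha=0.85, personalization=weights)
print(sorted(pr_w, key=pr_w.get, reverse=True))
\end{verbatim}
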
\subsection{Percolation Centrality} \label{sec:perc}

Percolation centrality was originally presented by the authors of \cite{10.1371/journal.pone.0053095}, and has continued to see usage in works such as that presented by the authors of \cite{10.1145/3288599.3295597} for percolation centrality approximation, and in the work presented by the authors of \cite{9680376} for parallel programming approaches. Percolation centrality aims to measure the importance of nodes through their topographical connectivity, as well as through percolation theory. As a contagion travels through a network, it has the capacity to alter the state of each node. This alteration, and any residual effects, can cause nodes to become percolated, which can then themselves cause other nodes to also become percolated. Equation \ref{eq:PercC} displays the formal definition for percolation centrality, where $x_{i}^t$ is the percolation state of node \textit{i} at time \textit{t}, \textit{s} and \textit{r} are source and target nodes distinct from \textit{v}, \textit{N} is the number of nodes, $\sigma_{s,r}$ is the total number of shortest paths from \textit{s} to \textit{r}, and $\sigma_{s,r}(v)$ is the number of those shortest paths that include a node \textit{v}.

\begin{equation} \label{eq:PercC}
PC^t(v)=\frac{1}{(N-2)}\sum_{s \neq v \neq r}\frac{\sigma_{s,r}(v)}{\sigma_{s,r}}\frac{x_{s}^t}{[\sum{x_{i}^t}]-x_{v}^t}
\end{equation}

For compliance graphs, percolation centrality is able to examine and consider the dependencies that violations may have. Some compliance or regulation mandates rely on the statuses of other mandates. When nodes are flagged as ``at risk'' of a violation or are actively violating a mandate, this percolation will spread to surrounding nodes. This measure is able to prioritize nodes based on their surrounding connections and their standings in regard to a mandate.

Percolation centrality required additional pre-processing before computing centrality values. Though NetworkX \cite{NetworkX} includes a percolation centrality function, percolation attributes need to be embedded within the graph object prior to calling the function. For this centrality metric, all work was performed in Python, and the final centrality vector was passed back to R through the Reticulate library \cite{reticulate}, which acts as an interface between R and Python.

The pre-processing component of percolation centrality required the parsing of the prior-knowledge network. This file was opened and parsed within Python in order to determine which edge labels marked a violation of a compliance requirement. The graph was processed to determine which nodes were considered ``in violation'', identifiable through an ``in-edge'' with a label that marked a violation state. In addition, the pre-processing identified nodes that were in contact with a violation node through \textit{n}-step reachability. A node was considered to have exposure to a percolated state if it was \textit{n} steps away from a node in violation. This work made use of a 2-step reachability scheme. As the graph was processed, node attributes were assigned with a ``percolation'' label. Nodes in violation were assigned a percolation value of 0.99, exposed nodes were assigned a value of 0.50, and all other nodes were assigned a percolation value of 0.01. This approach can be seen in Algorithm \ref{alg:prePC}, which expands the algorithm into an unoptimized format to showcase the process for simplicity.

\IncMargin{1em}
\begin{algorithm}[htbp]
\SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}
\SetKwComment{comment}{\#}{}
\Input{Prior-Knowledge Network (PKN), Network, n (Reachability Step)}
\Output{Network}
\BlankLine

\textbf{STEP 1:} Parse PKN and identify exploits that denote a violation. \\
violationEdges = array[]; \\
\For{$exploit$ in PKN}{
\If {$exploit$ causesVio}
{add $exploit$ ID or label to violationEdges array;}
}

\textbf{STEP 2:} Identify nodes in violation. \\
violationNodes = array[];

\For{$node$ in Network}{
\uIf {any in-edge is in the violationEdges array}
{add $node$ to violationNodes array; \\
set $node$[percolation attribute] = 0.99;}
\Else
{set $node$[percolation attribute] = 0.01;}
}

\textbf{STEP 3:} Identify nodes exposed to violation nodes. \\
exposedNodes = array[];

\If{$n$ is not provided}
{set $n$ = 2;}

\For{$node$ in violationNodes array}{
\If{\textbf{find} {nodes that are within $n$ path length away from $node$ and not in violationNodes array}}
{add to exposedNodes array;}
}

\textbf{STEP 4:} Update percolation label for exposed nodes. \\
\For{$node$ in exposedNodes array}{
{set $node$[percolation attribute] = 0.50;}
}

\caption{Expanded, Unoptimized Approach for Pre-Processing the Network for Percolation Centrality}
\label{alg:prePC}
\end{algorithm}
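
A minimal Python/NetworkX sketch of Algorithm \ref{alg:prePC} follows. The edge-attribute key \texttt{label} and the set of violation labels are assumptions about the imported DOT data; the actual implementation parses the prior-knowledge network as described above.

\begin{verbatim}
import networkx as nx

def label_percolation(G, violation_labels, n=2):
    # STEP 2: a node is in violation if any in-edge carries a
    # label marked as causing a violation (STEP 1 is assumed
    # to have produced violation_labels from the PKN).
    violating = set()
    for node in G:
        ins = G.in_edges(node, data=True)
        if any(d.get("label") in violation_labels
               for _, _, d in ins):
            violating.add(node)
            G.nodes[node]["percolation"] = 0.99
        else:
            G.nodes[node]["percolation"] = 0.01
    # STEP 3: nodes within n steps of a violating node.
    exposed = set()
    for v in violating:
        reach = nx.single_source_shortest_path_length(
            G, v, cutoff=n)
        exposed |= set(reach) - violating
    # STEP 4: update the percolation label for exposed nodes.
    for node in exposed:
        G.nodes[node]["percolation"] = 0.50
    return G

# pc = nx.percolation_centrality(
#     label_percolation(G, {"vio"}), attribute="percolation")
\end{verbatim}
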

Identifying violation nodes is bound by a time complexity of $\mathcal{O}(e)$, where \textit{e} is the number of edges, due to iterating through the graph's edge labels. Identifying exposed nodes has a time complexity of $\mathcal{O}(v*k*\log{v})$, where \textit{v} is the number of nodes in violation and \textit{k} is the \textit{n}-step reachability cutoff length. Since the adjacency matrix has already been obtained through the graph object and through edge labels, there is no edge exploration cost incurred. The NetworkX implementation of percolation centrality uses the algorithm presented by the authors of \cite{10.1371/journal.pone.0053095}, which also includes Brandes' algorithm \cite{brandes}. For the applications described in Section \ref{sec:example-networks}, the NetworkX time complexity for percolation centrality is $\mathcal{O}(|n|*|e|)$, where \textit{n} is the number of nodes and \textit{e} is the number of edges. The spatial complexity is $\mathcal{O}(e)$.

\subsection{Centrality Aggregation} \label{sec:cent-aggr}

Each centrality metric assigns importance based on different features of a network, focusing on and highlighting different aspects of the network's topographical properties, with some centrality approaches also relying on external data matrices and prior-knowledge networks. Due to the utility and strengths of each approach, an aggregation of scores in a meta-centrality fashion allows importance to be obtained as a collection of all centrality approaches, rather than choosing a single centrality metric to use. The aggregation of centralities has been investigated in the works of the authors of \cite{AUDRITO2021102584}, \cite{LI2018512}, and \cite{MO2019121538}.

To aggregate the importance scores, a few approaches are possible. The authors of \cite{bordacent} use a Borda-count based aggregation system to study the effect of super-spreaders in a network through a meta-centrality approach. Other approaches include the Kemeny-Young method \cite{6023c4f8-ecc1-3dbe-9f88-265b318523d2}, \cite{doi:10.1137/0135023}; however, its computational requirements render it infeasible for the large-scale compliance graphs. Though the Borda-count approach would be feasible, this work opted for a mean-based rank method for simplicity. Equation \ref{eq:CentAgg} displays the approach for computing an aggregated centrality score. In this approach, a proportion is computed for each node in relation to the overall centrality score for that metric. That proportion is then adjusted based on a weighting for the metric, where each weighting is a value between 0.0 and 1.0 and all weightings sum to 1.0. This approach allows each centrality metric to contribute to the aggregated centrality score, but additional tuning can be employed to assign greater contributions to metrics that utilize prior knowledge of the embedded network information (such as Katz or Percolation Centrality).

\begin{equation} \label{eq:CentAgg}
\begin{split}
\mathrm{importance}_{i} = \left(\frac{\frac{\mathrm{degree}_{i}}{\Sigma\, \mathrm{degree}}*\mathrm{weight}_{\mathrm{degree}}}{\mathrm{length}(\mathrm{CentralityMetrics})}\right) + \\
\left(\frac{\frac{\mathrm{betweenness}_{i}}{\Sigma\, \mathrm{betweenness}}*\mathrm{weight}_{\mathrm{betweenness}}}{\mathrm{length}(\mathrm{CentralityMetrics})}\right)+\\
\left(\frac{\frac{\mathrm{Katz}_{i}}{\Sigma\, \mathrm{Katz}}*\mathrm{weight}_{\mathrm{Katz}}}{\mathrm{length}(\mathrm{CentralityMetrics})}\right)+\\
\left(\frac{\frac{\mathrm{PageRank}_{i}}{\Sigma\, \mathrm{PageRank}}*\mathrm{weight}_{\mathrm{PageRank}}}{\mathrm{length}(\mathrm{CentralityMetrics})}\right)+\\
\left(\frac{\frac{\mathrm{percolation}_{i}}{\Sigma\, \mathrm{percolation}}*\mathrm{weight}_{\mathrm{percolation}}}{\mathrm{length}(\mathrm{CentralityMetrics})}\right)
\end{split}
\end{equation}

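
A sketch of Equation \ref{eq:CentAgg} follows (illustrative Python; the implementation described in this work is in R, and the metric names and weights here are placeholders):

\begin{verbatim}
def aggregate(metrics, weights):
    # metrics: {"degree": {node: score, ...}, ...}
    # weights: {"degree": 0.2, ...}, summing to 1.0.
    nodes = next(iter(metrics.values())).keys()
    m = len(metrics)  # length(CentralityMetrics)
    totals = {name: sum(vec.values())
              for name, vec in metrics.items()}
    return {node: sum((vec[node] / totals[name])
                      * weights[name] / m
                      for name, vec in metrics.items())
            for node in nodes}
\end{verbatim}
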
Post-processing is performed on the aggregated centrality vector. Since prior-knowledge networks are implemented for all example applications, it is useful to further tune the aggregated scores. In order for a graph node to be fully mitigated, all ``in-edges'' must be prevented. Within the prior-knowledge network, all exploits (edges) have information specifying the cost of mitigation, if mitigation is possible. This prior-knowledge network is parsed in order to identify exploits that cannot be prevented. The graph is then processed to identify nodes that have an unpreventable exploit as an in-edge, and are therefore unpreventable nodes. Since these nodes cannot be removed from the network or mitigated, their centrality value is removed from the aggregated vector, as analysis computation will yield no beneficial results for these nodes. The removed aggregated value is equally distributed to all other nodes in the aggregated vector that have a nonzero value. This process is shown in Algorithm \ref{alg:redistcent}.

\IncMargin{1em}
\begin{algorithm}[htbp]
\SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}
\SetKwComment{comment}{\#}{}
\Input{Prior-Knowledge Network (PKN), aggregatedCentralityVector}
\Output{aggregatedCentralityVector}
\BlankLine

\textbf{STEP 1:} Parse PKN and identify unpreventable exploits. \\
\textbf{STEP 2:} Identify unpreventable nodes and gather/reset their centrality score. \\
redistributeValue = 0;

\For{$node$ in Network}{
\If {any in-edge is unpreventable or unable to be mitigated}
{redistributeValue += aggregatedCentralityVector[$node$]; \\
aggregatedCentralityVector[$node$] = 0;}
}
\textbf{STEP 3:} Redistribute Centrality Scores \\
redistributeValue /= number of nonzero aggregated scores;

\For{$score$ in aggregatedCentralityVector}{
\If{$score$ is nonzero}
{$score$ += redistributeValue;}
}

\caption{Redistribute Aggregated Centrality Scores}
\label{alg:redistcent}
\end{algorithm}
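
The redistribution step of Algorithm \ref{alg:redistcent}, as a short Python sketch (names are hypothetical):

\begin{verbatim}
def redistribute(agg, unpreventable):
    # STEP 2: gather and zero the scores of unpreventable nodes.
    pool = 0.0
    for node in unpreventable:
        pool += agg.get(node, 0.0)
        agg[node] = 0.0
    # STEP 3: spread the pooled value equally over the
    # remaining nonzero scores.
    nonzero = [n for n, s in agg.items() if s != 0.0]
    if nonzero:
        share = pool / len(nonzero)
        for n in nonzero:
            agg[n] += share
    return agg
\end{verbatim}
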

\subsection{Centrality Results and Analysis} \label{sec:cent-res}

\begin{table}[htbp]
\scriptsize
\centering
\caption{Properties of the Aggregated Centrality Scores for the Three Example Networks}
\begin{tabular}{|c|c|c|c|}
\hline
 & \textbf{\begin{tabular}[c]{@{}c@{}}Automobile\\ Maintenance\end{tabular}} & \textbf{HIPAA} & \textbf{OSHA 1910H} \\ \hline
\textbf{\begin{tabular}[c]{@{}c@{}}Number of Nonzero\\ Elements\end{tabular}} & 4245 & 9215 & 4603 \\ \hline
\textbf{\begin{tabular}[c]{@{}c@{}}Percent of Total\\ Elements that are Nonzero\end{tabular}} & 6.341\% & 14.81\% & 9.516\% \\ \hline
\textbf{\begin{tabular}[c]{@{}c@{}}Number of Zero\\ Elements\end{tabular}} & $6.270 \times 10^{4}$ & $5.300 \times 10^{4}$ & $4.377 \times 10^{4}$ \\ \hline
\textbf{Nonzero Minimum} & $2.214 \times 10^{-4}$ & $9.723 \times 10^{-5}$ & $2.076 \times 10^{-4}$ \\ \hline
\textbf{Maximum} & $1.936 \times 10^{-3}$ & $2.713 \times 10^{-4}$ & $2.592 \times 10^{-4}$ \\ \hline
\textbf{Nonzero Mean} & $2.356 \times 10^{-4}$ & $1.085 \times 10^{-4}$ & $2.172 \times 10^{-4}$ \\ \hline
\textbf{\begin{tabular}[c]{@{}c@{}}Nonzero Element\\ Standard Deviation\end{tabular}} & $6.248 \times 10^{-5}$ & $5.428 \times 10^{-6}$ & $2.855 \times 10^{-6}$ \\ \hline
\end{tabular}
\label{tab:aggCentScores}
\end{table}


Table \ref{tab:aggCentScores} displays the statistical properties of the aggregated centrality scores. In all example networks, performing post-processing drastically reduces the number of states that must be analyzed. In all cases, due to unpreventable exploits in the network, severities or importance values can be set to 0 to prevent any further analysis of the given states. This reduction in state space has implications that require contextual understanding of the input data; it could be considered beneficial, a negative indication, or neutral. Though it reduces the state space and alleviates additional computation strain on future analysis work, it can be indicative of insufficient mitigation information or of a large set of zero-day or critical issues with no known remedy. Alternatively, the pruned nodes could be neutral states. In each network, there are flag-setting states, states that progress time, and states that reflect normal, expected behavior. In the prior-knowledge network, these states have no mitigations since their execution is expected or required. For all three example networks in this work, all pruned nodes are neutral states. Section \ref{sec:example-networks} describes the networks in more detail, and describes how all exploits that are known to cause a violation have at least one mitigation.

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/carDist.png"}
\caption[Aggregated Centrality Score Distribution for the Automobile Maintenance Network]{Aggregated Centrality Score Distribution for the Automobile Maintenance Network. The resulting distribution of the aggregated centrality scores when using the centrality metrics presented in Section \ref{sec:net-cents}.}
\label{fig:carAggCentDist}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/carDistCF.png"}
\caption[Cullen and Frey Plot for the Nonzero Elements of the Aggregated Centralities for the Automobile Maintenance Network]{Cullen and Frey Plot for the Nonzero Elements of the Aggregated Centralities for the Automobile Maintenance Network. 500 bootstrap values (random selection with replacement) are used. This Figure displays skewness\textsuperscript{2} versus kurtosis to characterize the aggregated centrality scores with various distributions.}
\label{fig:carAggCentDistCF}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/hipaaDist.png"}
\caption[Aggregated Centrality Score Distribution for the HIPAA Network]{Aggregated Centrality Score Distribution for the HIPAA Network. The resulting distribution of the aggregated centrality scores when using the centrality metrics presented in Section \ref{sec:net-cents}.}
\label{fig:hipaaAggCentDist}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/hipaaDistCF.png"}
\caption[Cullen and Frey Plot for the Nonzero Elements of the Aggregated Centralities for the HIPAA Network]{Cullen and Frey Plot for the Nonzero Elements of the Aggregated Centralities for the HIPAA Network. 500 bootstrap values (random selection with replacement) are used. This Figure displays skewness\textsuperscript{2} versus kurtosis to characterize the aggregated centrality scores with various distributions.}
\label{fig:hipaaAggCentDistCF}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/oshaDist.png"}
\caption[Aggregated Centrality Score Distribution for the OSHA 1910H Network]{Aggregated Centrality Score Distribution for the OSHA 1910H Network. The resulting distribution of the aggregated centrality scores when using the centrality metrics presented in Section \ref{sec:net-cents}.}
\label{fig:oshaAggCentDist}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{"./images/oshaDistCF.png"}
\caption[Cullen and Frey Plot for the Nonzero Elements of the Aggregated Centralities for the OSHA 1910H Network]{Cullen and Frey Plot for the Nonzero Elements of the Aggregated Centralities for the OSHA 1910H Network. 500 bootstrap values (random selection with replacement) are used. This Figure displays skewness\textsuperscript{2} versus kurtosis to characterize the aggregated centrality scores with various distributions.}
\label{fig:oshaAggCentDistCF}
\end{figure}

Figures \ref{fig:carAggCentDist}, \ref{fig:hipaaAggCentDist}, and \ref{fig:oshaAggCentDist} display the distribution of the aggregated centrality scores. These Figures depict a distribution of all elements, including the elements with a score of zero. Though no further analysis was conducted with the aggregated centrality scores in this work, additional insight or analysis could be performed on these results. To add insight, Cullen and Frey plots (Figures \ref{fig:carAggCentDistCF}, \ref{fig:hipaaAggCentDistCF}, and \ref{fig:oshaAggCentDistCF}) were generated using only the nonzero elements of all three example networks' aggregated centrality scores. Each plot uses 500 bootstrap values in its generation, sampled randomly with replacement from the aggregated centrality vector to aid in characterizing the potential uncertainty of the data set. Though this work makes no direct analysis of these plots other than displaying their distribution characterizations, they could yield promising results for new techniques or approaches using statistical analysis. This is discussed further in Section \ref{sec:FW}.

\subsection{Validation} \label{sec:cent-valid}

In order to validate the aggregated centrality scores, the following characteristics were examined, and test cases were created to compare against expected behavior. The results of these tests are not included in this work, since each test result was a boolean ``pass'' or ``fail''. If a failed test was encountered, the validation process failed, and the methodology was flawed and in need of correction. For the work presented, each test returned a successful outcome. A sketch of these checks appears after the list.

\begin{itemize}
\item The sum across the aggregated centrality scores vector is 1.0.
\item All individual centrality metric scores are greater than or equal to 0.0.
\item All aggregated centrality scores for preventable nodes are greater than 0.0.
\item All individual centrality metric scores contain nonzero values.
\item Select a random node and check the PKN to determine if this node is preventable:
\begin{itemize}
\item If preventable, ensure that the aggregated centrality score is nonzero.
\item If unpreventable, ensure that the aggregated centrality score is zero.
\end{itemize}
\item The aggregated centrality score for the root node is zero.
\item All individual centrality metric score vectors match in length to the number of nodes in the network.
\item The aggregated centrality metric score vector matches in length to the number of nodes in the network.
\end{itemize}

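
A sketch of these checks as runnable assertions (illustrative Python; \texttt{preventable} and \texttt{root} stand in for information drawn from the PKN and the graph):

\begin{verbatim}
import math

def validate(agg, metrics, n_nodes, preventable, root):
    assert math.isclose(sum(agg.values()), 1.0)
    for vec in metrics.values():
        assert all(s >= 0.0 for s in vec.values())
        assert any(s != 0.0 for s in vec.values())
        assert len(vec) == n_nodes
    assert len(agg) == n_nodes
    assert agg[root] == 0.0
    for node, ok in preventable.items():
        assert (agg[node] > 0.0) if ok else (agg[node] == 0.0)
    return True
\end{verbatim}
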
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Graph Transformations} \label{sec:graph-xform}

Compliance graphs are generated as DAGs purposefully. DAGs are useful representations of relationships and dependencies, and the authors of \cite{stankovic2023fourier} reaffirm this standing. DAGs and their traversals reveal deeper understandings of the causal relationships between nodes and events, and can aid in the analysis and prediction of known or expected events. However, for compliance graphs, it may be useful to transform the DAG into an alternate structure for additional analysis. It is still important to generate the compliance graph as a DAG initially to obtain the relationships of the network, and only after its initial generation is the graph transformation investigated. These transformations can be useful for determining which nodes are most important when an adversarial action can be considered to have infinite time and resources to perform changes to the original system. Alternatively, they can be useful for determining which nodes are most important from an information flow perspective, where adversarial actions must pass through a series of nodes to reach any other node in the network. This Section presents transformation options and contextualizations for compliance graphs to aid in the analysis process. Section \ref{sec:TC} presents the Transitive Closure, and Section \ref{sec:DT} presents the Dominant Tree.

\subsection{Transitive Closure} \label{sec:TC}

Transitive closure represents the transitive relation derived from a binary relation on a given set, and can be used to determine the reachability of a given network. Figure \ref{fig:TC} displays an example output when performing transitive closure. In the context of compliance graphs, it is useful to consider that an adversary (whether an internal or external malicious actor, poor policy execution by an organization, accidental misuse, or any other adversarial occurrence) could have no time constraints. That is, for any given state of the system or set of systems, an adversarial act could have ``infinite'' time to perform a series of actions. If no prior knowledge is known about the network, it can be assumed that all changes performed on the systems are equally likely. In practice, specifying the probability that a change can occur has been performed through a Markov Decision Process, such as that seen by the authors of \cite{li_combining_2019} and \cite{zeng_cyber_2017}. Under these assumptions, it is useful to then consider which nodes are important, assuming they have 1-step reachability to any downstream node they may have a transitive connection to.

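
For DAGs, NetworkX provides a closure routine that exploits topological order, sketched here on a hypothetical chain:

\begin{verbatim}
import networkx as nx

G = nx.DiGraph([(1, 2), (2, 3), (3, 4)])  # hypothetical DAG

# Add an edge for every multi-step reachability relation,
# giving each node 1-step reach to all downstream nodes.
TC = nx.transitive_closure_dag(G)
print(sorted(TC.edges()))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
\end{verbatim}
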
\begin{figure}[htp]
\centering
\includegraphics[scale=0.6]{"./images/TC.png"}
\caption[Illustration of an Example DAG and its Transitive Closure]{Illustration of an Example DAG and its Transitive Closure. In the resulting transitive closure, each node of the original DAG has 1-step reachability to any downstream node it has a transitive connection to.}
\label{fig:TC}
\end{figure}

\subsection{Dominant Tree} \label{sec:DT}

Dominance, as initially introduced by the author of \cite{dominance} in terms of flow, is defined by a node that lies in every path to another node. If a node \textit{i} is a destination node, and every path to \textit{i} from a source node includes node \textit{j}, then node \textit{j} is said to dominate node \textit{i}. Figure \ref{fig:domNet} displays an example starting network. With node 1 as the source node, it is evident that node 2 immediately dominates nodes 3, 4, 5, and 6, since all messages from node 1 must pass through node 2. By definition, each node must also dominate itself, so node 2 also dominates node 2.

Following the properties of dominance, a dominator tree can be derived. In a dominator tree, each node has children that it immediately dominates. The immediate dominator of a node is the node that strictly dominates it, but does not strictly dominate any other node that also strictly dominates it. Figure \ref{fig:domTree} displays the dominant tree of the network seen in Figure \ref{fig:domNet}.

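
Immediate dominators, and from them the dominator tree, can be computed with NetworkX; the sketch below mirrors the structure of Figures \ref{fig:domNet} and \ref{fig:domTree}, though the exact edge list is an assumption:

\begin{verbatim}
import networkx as nx

# Node 1 is the source; every path from 1 to nodes 3-6
# passes through node 2 (hypothetical edges).
G = nx.DiGraph([(1, 2), (2, 3), (2, 4),
                (3, 5), (4, 5), (3, 6), (4, 6)])

# Map each node to its immediate dominator, then invert the
# mapping to obtain the dominator tree's parent->child edges.
idom = nx.immediate_dominators(G, 1)
tree = [(d, n) for n, d in idom.items() if n != d]
print(sorted(tree))
# [(1, 2), (2, 3), (2, 4), (2, 5), (2, 6)]
\end{verbatim}
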
\begin{figure}[htp]
\centering
\includegraphics[scale=0.6]{"./images/dom_net_unshaded.png"}
\caption[Example ``Base'' Network for Illustrating Dominance]{Example ``Base'' Network for Illustrating Dominance. An arbitrary DAG that has not yet undergone transformation.}
\label{fig:domNet}
\end{figure}

\begin{figure}[htp]
\centering
\includegraphics[scale=0.6]{"./images/dom_tree_unshaded.png"}
\caption[Dominant Tree Derived from the ``Base'' Network]{Dominant Tree Derived from the Network Displayed in Figure \ref{fig:domNet}. Node 1 dominates Node 2, and Node 2 possesses immediate dominance over Nodes 3, 4, 5, and 6.}
\label{fig:domTree}
\end{figure}

Dominant trees do alter the structure of compliance graphs, and lead to leaf nodes and branches that do not exist in the original network. As a result, some nodes that have directed edges to other nodes may be moved to a position where the edge no longer points to the original nodes. However, in dominant trees, all node parents dominate their children. In this format, the information flow is guided predominantly by the upstream nodes, and all parents in the dominant tree exist as upstream nodes in the original compliance graph. While some downstream nodes may be altered, the importance of nodes can be reexamined in the dominant tree to see how importance differs when information flow is refined.
\subsection{Results and Analysis} \label{sec:xform-res}
To analyze the changes from the original DAG, network properties were collected for the transitive closure and dominant tree representations. Table \ref{tab:auto-prop} displays the properties for the automobile maintenance example, Table \ref{tab:hipaa-prop} for the HIPAA example, and Table \ref{tab:osha-prop} for the OSHA 1910H example. For each graph, the numbers of nodes and edges were collected to examine how the size of the network structures changes. In all examples, the number of edges was expected to increase substantially for the transitive closure and to decrease for the dominant tree.

Efficiency in terms of network science was introduced by the authors of \cite{PhysRevLett.87.198701}. The efficiency of a graph measures its ability to exchange information: as the distance between nodes increases, efficiency decreases. Global efficiency is a measure of communication exchange within the entire network. Local efficiency of a node quantifies the impact on information exchange if that node were removed from the network, and is therefore a measure of fault tolerance. For the transitive closures, the removal of any node is expected to have minimal impact on communication efficiency: since each node has a connecting edge to all downstream nodes, removing any one midstream node should not degrade the ability to exchange information throughout the network. For dominant trees, the opposite is expected. The dominant tree is generated through the concept of dominance, which captures how information is passed through nodes; since the tree is hierarchical with respect to communication exchange, the removal of a node has more severe impacts on communication efficiency.
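
As a concrete sketch of the efficiency computation, the following Python fragment evaluates global efficiency for a directed graph directly from shortest-path lengths, treating unreachable pairs as contributing zero. NetworkX's built-in global efficiency routine supports undirected graphs only, so this directed variant, and whether it matches the exact convention behind the tabled values, is an assumption:

\begin{verbatim}
import networkx as nx

def global_efficiency_directed(g):
    # Mean of 1/d(u, v) over ordered pairs, with
    # unreachable pairs contributing 0.
    n = g.number_of_nodes()
    if n < 2:
        return 0.0
    total = 0.0
    for u in g:
        d = nx.shortest_path_length(g, source=u)
        total += sum(1.0 / x for v, x in d.items()
                     if v != u)
    return total / (n * (n - 1))

dag = nx.DiGraph([(1, 2), (2, 3), (2, 4), (4, 5)])
tc = nx.transitive_closure_dag(dag)
print(global_efficiency_directed(dag))
print(global_efficiency_directed(tc))  # higher
\end{verbatim}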
The radius and diameter of a graph are computed through eccentricity. Each node in a network has an eccentricity value: the length of the shortest path from that node to the farthest reachable node. The radius of a graph is the smallest eccentricity value, and the diameter is the largest. For transitive closures, since new edges are drawn from all upstream nodes to all reachable downstream nodes, the radius and diameter are expected to decrease. The density of a graph is the proportion of actual edges to theoretically possible edges; due to the addition of edges in the transitive closure, its density is expected to increase compared to the original DAG. No conclusions were drawn about the radius, diameter, or density of the dominant trees, since the dominance structure of the original DAG may vary, which could result in either an increase or decrease in these properties.
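
The eccentricity-based measures are undefined for DAGs under the textbook definition, which requires strong connectivity. The following Python sketch adopts one plausible convention, restricting eccentricity to reachable nodes and excluding sink nodes from the radius; both choices are assumptions rather than necessarily the convention used to produce the tables:

\begin{verbatim}
import networkx as nx

def eccentricity_reachable(g, node):
    # Longest shortest path from node to anything
    # it can reach; 0 for sink nodes (assumed
    # convention for DAGs).
    d = nx.shortest_path_length(g, source=node)
    del d[node]  # drop the zero self-distance
    return max(d.values(), default=0)

dag = nx.DiGraph([(1, 2), (2, 3), (2, 4), (4, 5)])
ecc = [eccentricity_reachable(dag, n) for n in dag]
radius = min(e for e in ecc if e > 0)  # skip sinks
diameter = max(ecc)
density = nx.density(dag)  # m/(n*(n-1)) if directed
print(radius, diameter, density)
\end{verbatim}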
% Automobile
\begin{table}[]
\scriptsize
\centering
\caption{Network Property Comparisons of the Original DAG, Transitive Closure, and Dominant Tree for the Automobile Maintenance Network}
\begin{tabular}{|c|c|c|c|}
\hline
 & \textbf{DAG} & \textbf{\begin{tabular}[c]{@{}c@{}}Transitive\\ Closure\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Dominant\\ Tree\end{tabular}} \\ \hline
\textbf{Number of Nodes} & $6695 \times 10^{1}$ & $6695 \times 10^{1}$ & $6695 \times 10^{1}$ \\ \hline
\textbf{Number of Edges} & $4682 \times 10^{2}$ & $3958 \times 10^{4}$ & $6694 \times 10^{1}$ \\ \hline
\textbf{Global Efficiency} & $1.541 \times 10^{-3}$ & $8.831 \times 10^{-3}$ & $2.465 \times 10^{-5}$ \\ \hline
\textbf{\begin{tabular}[c]{@{}c@{}}Average Local\\ Efficiency\end{tabular}} & $1.515 \times 10^{-1}$ & $2.164 \times 10^{-1}$ & 0.000 \\ \hline
\textbf{Radius} & 13.00 & 1.000 & 7.000 \\ \hline
\textbf{Diameter} & 18.00 & 1.000 & 8.000 \\ \hline
\textbf{Density} & $1.045 \times 10^{-4}$ & $8.831 \times 10^{-3}$ & $1.494 \times 10^{-5}$ \\ \hline
\end{tabular}
\label{tab:auto-prop}
\end{table}
% HIPAA
\begin{table}[]
\scriptsize
\centering
\caption{Network Property Comparisons of the Original DAG, Transitive Closure, and Dominant Tree for the HIPAA Network}
\begin{tabular}{|c|c|c|c|}
\hline
 & \textbf{DAG} & \textbf{\begin{tabular}[c]{@{}c@{}}Transitive\\ Closure\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Dominant\\ Tree\end{tabular}} \\ \hline
\textbf{Number of Nodes} & $6222 \times 10^{1}$ & $6222 \times 10^{1}$ & $6222 \times 10^{1}$ \\ \hline
\textbf{Number of Edges} & $4009 \times 10^{2}$ & $2475 \times 10^{4}$ & $6222 \times 10^{1}$ \\ \hline
\textbf{Global Efficiency} & $1.417 \times 10^{-3}$ & $6.394 \times 10^{-3}$ & $1.935 \times 10^{-5}$ \\ \hline
\textbf{\begin{tabular}[c]{@{}c@{}}Average Local\\ Efficiency\end{tabular}} & $1.225 \times 10^{-1}$ & $1.866 \times 10^{-1}$ & 0.000 \\ \hline
\textbf{Radius} & 17.00 & 1.000 & 6.000 \\ \hline
\textbf{Diameter} & 19.00 & 1.000 & 6.000 \\ \hline
\textbf{Density} & $1.036 \times 10^{-4}$ & $6.394 \times 10^{-3}$ & $1.607 \times 10^{-5}$ \\ \hline
\end{tabular}
\label{tab:hipaa-prop}
\end{table}
% OSHA
\begin{table}[]
\scriptsize
\centering
\caption{Network Property Comparisons of the Original DAG, Transitive Closure, and Dominant Tree for the OSHA 1910H Network}
\begin{tabular}{|c|c|c|c|}
\hline
 & \textbf{DAG} & \textbf{\begin{tabular}[c]{@{}c@{}}Transitive\\ Closure\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Dominant\\ Tree\end{tabular}} \\ \hline
\textbf{Number of Nodes} & $4837 \times 10^{1}$ & $4837 \times 10^{1}$ & $4837 \times 10^{1}$ \\ \hline
\textbf{Number of Edges} & $4083 \times 10^{2}$ & $3584 \times 10^{4}$ & $4837 \times 10^{1}$ \\ \hline
\textbf{Global Efficiency} & $3.187 \times 10^{-3}$ & $1.532 \times 10^{-2}$ & $3.533 \times 10^{-5}$ \\ \hline
\textbf{\begin{tabular}[c]{@{}c@{}}Average Local\\ Efficiency\end{tabular}} & $1.616 \times 10^{-1}$ & $2.200 \times 10^{-1}$ & 0.000 \\ \hline
\textbf{Radius} & 7.000 & 1.000 & 4.000 \\ \hline
\textbf{Diameter} & 25.00 & 1.000 & 5.000 \\ \hline
\textbf{Density} & $1.745 \times 10^{-4}$ & $1.532 \times 10^{-2}$ & $2.067 \times 10^{-5}$ \\ \hline
\end{tabular}
\label{tab:osha-prop}
\end{table}
\subsection{Validation} \label{sec:xform-valid}
Since no direct analysis is conducted on the transformed compliance graphs, validation of the transformation is limited. Though centrality metrics can be collected on the transformed graphs, the same validation techniques employed for the original network are used. To validate the transformed graphs, several network properties were examined, and test cases were created to compare observed behavior against expectations. The results of these tests are not included in this work, since each test returns a boolean ``pass" or ``fail": any failed test would indicate that the methodology was flawed and in need of correction. For the work presented, each test returned a successful outcome. The properties examined are listed below; a sketch of how such checks might be encoded follows the list.
\begin{itemize}
\item{The root node in the original DAG is the root node of the Transitive Closure and Dominant Tree representations.}
\item{The number of nodes in the Transitive Closure and Dominant Tree representations does not exceed the number of nodes in the original DAG.}
\item{The number of nodes in the Dominant Tree representation is equal to the number of nodes in the original DAG.}
\item{The number of edges in the Dominant Tree representation does not exceed the number of edges in the original DAG.}
\item{The number of edges in the Transitive Closure representation does exceed the number of edges in the original DAG.}
\item{For the Transitive Closure representation, the root node has a number of outgoing edges equal to the number of nodes minus 1.}
\item{The diameter and radius of the Transitive Closure representation are both 1.}
\end{itemize}
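
As a minimal sketch of how such checks might be encoded, the following Python fragment asserts each property on a hypothetical toy DAG (assuming NetworkX and a single root with in-degree zero):

\begin{verbatim}
import networkx as nx

# Hypothetical toy DAG with a single root node.
dag = nx.DiGraph([(1, 2), (2, 3), (2, 4), (4, 5)])

def root_of(g):
    return next(n for n in g
                if g.in_degree(n) == 0)

root = root_of(dag)
tc = nx.transitive_closure_dag(dag)
idom = nx.immediate_dominators(dag, root)
dt = nx.DiGraph((d, n) for n, d in idom.items()
                if n != d)

# The root is preserved by both transformations.
assert root_of(tc) == root and root_of(dt) == root

# Node counts never exceed the DAG's; the dominant
# tree keeps exactly the DAG's nodes.
n = dag.number_of_nodes()
assert tc.number_of_nodes() <= n
assert dt.number_of_nodes() == n

# Dominant tree edges shrink; transitive closure
# edges exceed the original.
assert dt.number_of_edges() <= dag.number_of_edges()
assert tc.number_of_edges() > dag.number_of_edges()

# Root reaches all other nodes in one step, and
# every reachable pair is one step apart
# (radius = diameter = 1).
assert tc.out_degree(root) == n - 1
for u in tc:
    assert nx.descendants(tc, u) == set(tc[u])
\end{verbatim}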
\section{Future Work} \label{sec:FW}

Future work will investigate new analysis techniques that operate on the transformed compliance graphs, including measures of immediate importance on the transitive closure representation and reexamined node importance under the refined information flow of the dominant tree. The computational savings achieved by the implemented centrality solutions, together with the significant reduction of nonzero elements in each example network, will be carried forward into these future analysis techniques.
\section{Conclusions}
This work presented and implemented a methodology for obtaining violation priorities in a compliance graph. Specifically, this work analyzed and validated the results on three example networks: an automotive maintenance example, a HIPAA example, and an OSHA 1910H example. Each network centrality metric provides unique insight into the topological information in a compliance graph, and three of the metrics (Katz (Section \ref{sec:katz}), Adapted PageRank (Section \ref{sec:pr}), and Percolation (Section \ref{sec:perc})) are also able to work with the embedded information of the compliance graph. The unique scorings of each compliance graph node from the centrality metrics are then aggregated and processed as part of the work presented in Section \ref{sec:cent-aggr}, and the results were validated through the process shown in Section \ref{sec:cent-valid}. These results are provided in the form of violation priorities under constraints, and showcase significant computational savings due to the implemented solutions for each centrality metric. Additional computational savings will be carried forward into future analysis techniques due to the significant reduction of nonzero elements for each example network.

In addition, this work presented and implemented two transformation options for compliance graphs. The transitive closure transformation presented in Section \ref{sec:TC} is useful for determining all possible routes to noncompliance given unlimited time and resources; by reducing all chains of events to single-step reachability, new analysis techniques focusing on immediate importance could be investigated. The dominant tree transformation presented in Section \ref{sec:DT} is useful for providing a graph structure based on information flow. This transformation alters the original structure of the compliance graph, leading to a new hierarchy of nodes based on dominance; using it, node importance can be reexamined to determine how importance differs when information flow is refined. The results were validated through the process shown in Section \ref{sec:xform-valid}, and the results in Section \ref{sec:xform-res} showcase notable differences in network properties, which allow new investigations to uncover further information through future analysis techniques.
\addcontentsline{toc}{section}{Bibliography}
\bibliography{Bibliography}
\bibliographystyle{ieeetr}
\end{document} |