Graph Preprocessing
Normalization
Graph filters all use the same default graph normalization scheme
that performs symmetric (i.e. Laplacian-like) normalization
for undirected graphs and column-wise normalization that
follows a true probabilistic formulation of transition probabilities
for directed graphs, such as DiGraph instances. The type of
normalization can be specified by passing a normalization
argument to constructors of ranking algorithms. This parameter
can have the following values:
| Normalization | Description |
|---|---|
"auto" |
The above-described default behavior. |
"col" |
Column-wise normalization. |
"symmetric" |
Symmetric normalization. |
"laplacian" |
Generates the Laplacian of the graph. |
"salsa" |
The row-wise normalization employed by the salsa algorithm. |
"none" |
(A string with text "none".) Avoids any normalization, for example, because edge weights already hold the normalization. |
| callable | A callable applied to a scipy sparse adjacency matrix of the "numpy" backend (irrespective of the actually active backend). When applied, it ignores the preprocessor's reduction argument. |
Additionally, a transform_adjacency method can be provided
to modify the final adjacency matrix after all computations conclude.
This method runs within the currently active backend.
By default this is a tautology lambda x: x.
To create smoother versions of adjacency matrices,
renormalization argument may be provided
to add a multiple of the unit matrix to the adjacency matrix,
a concept called the renormalization trick.
This by default 0, but can help shrink the spectrum.For example,
you can create a strongly local version of the adjacency matrix like this:
alg = Algorithm(renormalization=2)
In all cases, adjacency matrix normalization involves the
computationally intensive operation of converting the graph
into a scipy sparse matrix each time node
ranking algorithms are called. pygrank
provides a way to avoid recomputing the normalization
during large-scale experiments by the same algorithm for
the same graphs by passing an argument assume_immutability=True
to the algorithms's constructor, which indicates that
the the graph does not change between runs of the algorithm
and hence computes the normalization only once for each given
graph, a process known as hashing.
Hashing only uses the Python object's hash method,
so a different instance of the same graph will recompute the
normalization if it points at a different memory location.
Warning
Do not alter graph objects after passing them to
calls of node ranking algorithms for the first time
if you set assume_immutability=True. If altering the
graph is necessary midway through your code, create a copy
instance, for example with one of networkx's in-built methods and
edit that one.
Hashing the outcome of graph normalization to speed up multiple calls to the same graph can be achieved as per the following code:
import pygrank as pg
graph, personalization1, personalization2 = ...
algorithm = pg.PageRank(alpha=0.85, normalization="col", assume_immutability=True)
ranks1 = algorithm(graph, personalization1)
ranks2 = algorithm(graph, personalization2) # does not re-compute the normalization
Caching
Sometimes, many different algorithms are applied on the
same graph. In this case, to prevent each one
from recomputing the hashing already calculated by others,
they can be made to share the same normalization method. This
can be done by using a shared instance of the
normalization preprocessing pg.preprocessor,
which can be passed as the preprocessor argument of ranking algorithm
constructors. In this case, the normalization, renormalization
and assume_immutability
arguments should be passed to the preprocessor and will be ignored by the
constructors (what would otherwise happen is that the constructors
would create a prerpocessor with these arguments).
Basically, when the default value preprocessor=None is passed to ranking algorithm
constructors, these create a new preprocessing instance
with the normalization, renormalization and assume_immutability
values passed
to their constructor. These two arguments are completely ignored
if a preprocessor instance is passed to the ranking algorithm.
Direct use of these arguments without needing to instantiate a
preprocessor was demonstrated in the previous code example.
Using the same outcome of graph preprocessing to speed up multiple rank calls to the same graph by different ranking algorithms can be done as:
import pygrank as pg
graph, personalization1, personalization2 = ...
pre = pg.preprocessor(normalization="col", assume_immutability=True)
algorithm1 = pg.PageRank(alpha=0.85, preprocessor=pre)
algorithm2 = pg.HeatKernel(alpha=0.85, preprocessor=pre)
ranks1 = algorithm1(graph, personalization1)
ranks2 = algorithm2(graph, personalization2) # does not re-compute the normalization
Info
When benchmarking in the above code you can call pre(graph)
before the first rank(...) call to make sure that that call
does not also perform the first normalization whose outcome will
be hashed and immediately retrieved by subsequent calls.
Cross-origin resources
Preprocessing admits a cors argument that toggles an optimization for huge speedups when constantly switching between backends at the cost of additional memory and badly-defined behavior. Default is False, which saves a lot of memory. However, the maximum used memory when using cors remains the same when processing one graph and switching between up to two backends out of which one is "numpy".
Warning
Enabling cors will not run certain parts of preprocessing, because it performs immediate backend switching with cached versions of the adjacency matrix. There is no error checking involved. For this reason:
- Remember that the only preprocessor that will actually run computations is the first one that runs. If you are unsure, either set the same arguments or generate the preprocessor first and share it between all algorithms.
- Make sure that
cors=Trueholds for every preprocessor and graph filter that will use cors graphs.
YOU HAVE BEEN WARNED.
If cors is enabled (set to True), backend primitives holding the outcome of graph preprocessing are enriched with additional private metadata that enable their usage as base graphs when passing through other preprocessors in other backends. There is one main usage pattern: obtaining an adjacency matrix that represents the graph, but which is a thin wrapper around fast-switching backend primitives. That is, you need to preprocess your graph per:
graph = pg.preprocessor(cors=True, ...)(graph)
This is technically an adjacency matrix but has
all the necessary ducktyping to be considered
a graph by pygrank. If you plan to stay within
one backend, you can also run the same command
without enabling cors anywhere and
construct signals per normal with
pg.to_signal(graph, personalization_data).
However, enabling cors is mandatory
when creating or planning to use
graph signals with the adjacency
you defined at different backends.
Usefulness of cross-origin resources in demonstrated in the
following graph neural network example, where training
time is sped up by 25% when cors=True in the
constructor of the GenericGraphFilter:
import pygrank as pg
import tensorflow as tf
from tensorflow.keras.layers import Dropout, Dense
from tensorflow.keras.regularizers import L2
class APPNP(tf.keras.Sequential):
def __init__(self, num_inputs, num_outputs, hidden=64):
super().__init__(
[
Dropout(0.5, input_shape=(num_inputs,)),
Dense(hidden, activation="relu", kernel_regularizer=L2(0.005)),
Dropout(0.5),
Dense(num_outputs),
]
)
self.ranker = pg.ParameterTuner(
lambda par: pg.GenericGraphFilter(
[par[0] ** i for i in range(int(10))],
error_type="iters",
max_iters=10,
cors=True, # DON'T FORGET THIS
),
max_vals=[1],
min_vals=[0.5],
verbose=False,
measure=pg.Mabs,
deviation_tol=0.01,
tuning_backend="numpy", # TUNING IN NUMPY
)
def call(self, features, graph, training=False):
predict = super().call(features, training=training)
propagate = self.ranker.propagate(graph, predict, graph_dropout=0.5 * training)
return tf.nn.softmax(propagate, axis=1)
from timeit import default_timer as time
graph, features, labels = pg.load_feature_dataset("citeseer")
# this is how e cache the cross-origin resource
pre = pg.preprocessor(renormalize=True, assume_immutability=True, cors=True)
graph = pre(graph)
training, test = pg.split(list(range(len(graph))), 0.8, seed=5)
training, validation = pg.split(training, 1 - 0.2 / 0.8)
model = APPNP(features.shape[1], labels.shape[1])
with pg.Backend("tensorflow"): # pygrank computations in tensorflow backend
tic = time()
pg.gnn_train(
model,
features,
graph,
labels,
training,
validation,
epochs=300,
optimizer=tf.optimizers.Adam(learning_rate=0.01),
verbose=True,
test=test,
)
print("Accuracy", pg.gnn_accuracy(labels, model(features, graph=graph), test))
print("Time", time()-tic)