Ranking

Node ranking

In ranking tasks, nodes obtain ordinal values 1,2,3,... based on their importance in the graph structure (1 is the most important node). To see how ranks can be obtained, load a graph and a community of nodes to test with. Below we define a personalized PageRank filter, which scores the structural relatedness of graph nodes to the ones of the given set, and apply a postprocessor that creates node ranks based on their score. Notice the time it took for the base ranker to converge. (Depends on backend and device characteristics.)

import pygrank as pg

_, graph, group = next(pg.load_datasets_one_community(["citeseer"]))
alpha = 0.85  # define this parameter to reuse the same value later

ranker = pg.PageRank(alpha=alpha, tol=1.E-20, max_iters=1000) >> pg.Ordinals()
ranks = ranker(graph, group)
print(ranker.convergence)  # 70 iterations (0.012790299995685928 sec)
print(ranks)  # {'100157': 107.0, '364207': 416.0, '38848': 626.0, 'bradshaw97introduction': 567.0, ... }

Robust stopping

We set an arbitrary desired numerical tolerance tol=1.E-20 because we were unsure if smaller values would yield a robust node order; we needed a very tight numerical tolerance to make sure. On the other hand, the number of iterations can be reduced when investigating node order convergence. In particular, we need to define a different measure than pg.Mabs, which is the default of the filter's convergence manager.

To stop at a robust node order, we set a new measure of node score deviations between consecutive PageRank iterations per error_type=pg.OrderAccuracy. This is is passed to the internally constructed convergence manager. We keep the default tol=1.E-6. which translates to stopping iterations only when only 1/10,000 of nodes with adjacent tanks change order. This time we find an earlier point in terms of the number of iterations. However, pg.OrderAccuracy performs heavyweight sorting operations whose running time grows \(O(n \log n)\) where \(n\) is the number of nodes. When compared to a graph with an upper bound node degree, this sorting ends up being five times slower:

early_stop_ranker = pg.PageRank(alpha=alpha, error_type=pg.OrderAccuracy) >> pg.Ordinals()
fastranks = early_stop_ranker(graph, group)
print(early_stop_ranker.convergence)  # 76 iterations (0.15961270000843797 sec)
print(pg.OrderAccuracy(ranks)(fastranks))  # 0.9996993385447985

Early stopping PageRank

We can instead estimate how premature the convergence should be to stop the PageRank filter's diffusion at a robust enough node order. To this end, we create a new convergence manager to pass to the filter with the class pg.RankOrderConvergenceManager. This manager stops at a diffusion step where an equivalent random walk with restart process would have accounted for a large percentage of random walks. When stopped, pairs of nodes are set to not lose node order with confidence=0.98. Our code now stops much earlier and, importantly, the new convergence criterion lets us enjoy the computational speedups we would expect from early stopping:

convergence = pg.RankOrderConvergenceManager(pagerank_alpha=alpha, confidence=0.98)
smart_early_stop_ranker = pg.PageRank(alpha=alpha, convergence=convergence) >> pg.Ordinals()
fastranks = smart_early_stop_ranker(graph, group)
print(smart_early_stop_ranker.convergence)  # 54 iterations (0.012241799995535985 sec)
print(pg.OrderAccuracy(ranks)(fastranks))  # 0.9882742032471438

Warning

In terms of asymptotic time complexity considerations, the variation we run above still performs sorting. However, it employs a mechanism of estimating when to check for stopping again, which means that the sorting happens very infrequently.

To avoid sorting completely and get a convergence manager with \(O(1)\) overhead use the argument criterion='fraction_of_walks', as shown below. This may stop a little too soon, given that it inserts a precomputed number of iterations that arise from theoretical analysis. However, it still accounts for the vast majority of random walks and creates an empirically robust node order.

convergence = pg.RankOrderConvergenceManager(pagerank_alpha=alpha, criterion='fraction_of_walks')
smart_early_stop_ranker = pg.PageRank(alpha=alpha, convergence=convergence) >> pg.Ordinals()
fastranks = smart_early_stop_ranker(graph, group)
print(smart_early_stop_ranker.convergence)  # 0.9882742032471438
print(pg.OrderAccuracy(ranks)(fastranks))  # 0.7269993986770896

from matplotlib import pyplot as plt  # install matplotlib separately, as it's not a pygrank dependency
plt.scatter(pg.to_numpy(ranks.np), pg.to_numpy(fastranks.np), marker='.')
plt.show()

pipeline

Warning

pg.RankOrderConvergenceManager relies on theoretical analysis that is applicable only to the pg.PageRank filter. Furthermore, the pagerank_alpha parameter needs to match the alpha parameter of the filter.