Big Graphs

This documents provides a FAQ on how to handle big graphs with millions of edges.

Which backend?

Use "numpy" (the default backend). It is the most well-tested and memory efficient. Prefer running big graph experiments on intel's Python distribution. This can be up to two times faster than regular Python distributions.

Which algorithm?

pygrank algorithms are split into graph filters and postprocessors to augment their outcome. Here we touch on graph filters. If you don't know anything else about your data, the following graph filter is recommended:

import pygrank as pg

algorithm = pg.PageRank(alpha=0.9, tol=1.E-9, max_iters=1000)

For compatibility with networkx and other historical practices, these are not the default parameters values. However, they tend to work well in big graphs. As a little explanation on the choices, personalized PageRank is equivalent to stochastic random walks with average length 1/(1-alpha) hops away from seed nodes. At the same time, you need a small enough numerical tolerance to make sure that your numer of seeds divided by your number of nodes is not immediately smaller than that. Finally, higher diffusion parameters alpha and lower numerical tolerances increase the number of iterations it takes for recursive graph filters to converge. Thus, a higher cap to the computational limit should be placed to make sure that this is not exceeded before convergence.

Info

As a last failsafe against unforeseen convergence properties, set tractable limits on the number of iterations.

I have only a few data.

If your communities have too few members, try to increase the receptive field of node ranking algorithms, for example by increasing alpha in pagerank. If you have increased the receptive field but require more expansions try applying the SeedOversampling() and, if you are fine with its computational demands, BoostedSeedOversampling() postprocessors on your algorithms.

My graph is already a scipy sparse matrix.

Note that node ranking algorithms and graph signals typically require graphs. However, sometimes it is more computationally efficient to construct and move around sparse scipy adjacency matrices, for example to avoid additional memory allocations.

For these situations, given an adjacency matrix adj, you can convert it to a graph wrapping its data in O(1) time and memory with the following code:

import pygrank as pg

adj = ...  # a square sparse scipy array
graph = pg.AdjacencyWrapper(adj, directed=True)

In this case, the graph's nodes are considered to be the numerical values 0,1,..,adj.shape[0]-1. The directed argument in the constructor only affects the type of "auto" normalization in preprocessors.

Warning

The AdjacencyWrapper can only wrap scipy matrices and not, say, matrices generated by different backends.