Setup
Create a (virtual) environment with Python 3.9 or later
and install or upgrade to the latest version of pygrank with:
pip install --upgrade pygrank
Creating graphs
When working on practical problems,
use networkx to construct graphs
by adding edges between Python objects.
For example, you can construct a graph
that pygrank can process with the
following pattern, which we use throughout
our documentation for ease of development:
import networkx as nx
graph = nx.Graph()  # nx.Graph is always undirected; use nx.DiGraph() for directed graphs
graph.add_edge('A', 'B')
graph.add_edge('A', 'C')
Graphs like the above require a lot of memory to keep track of relations
between data,
which can be an issue when processing large graphs.
On the other hand,
pygrank is typically interested in
converting those graphs to sparse matrices of respective
backends. For this reason, the library provides its own
trimmed-down pygrank.Graph class that implements the subset
of graph operations needed for node ranking algorithms
and speeds up the add_edge method. Instances of this class
can be created with the pattern:
import pygrank as pg
graph = pg.Graph(directed=False) # undirected is also the default
graph.add_edge('A', 'B')
graph.add_edge('A', 'C')
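The conversion from edges between Python objects to a sparse matrix can be sketched without any dependencies. The code below is an illustration of the idea only, not pygrank's internal conversion: `edges_to_coo` is a hypothetical helper that assigns integer positions to nodes and emits coordinate-format (COO) triplets.

```python
def edges_to_coo(edges, directed=False):
    """Map hashable nodes to integer ids and return COO rows, columns, and values."""
    index = {}  # node object -> row/column position, in order of first appearance
    rows, cols, vals = [], [], []
    for u, v in edges:
        for node in (u, v):
            index.setdefault(node, len(index))
        rows.append(index[u])
        cols.append(index[v])
        vals.append(1.0)
        if not directed and u != v:
            # Undirected edges are stored in both directions.
            rows.append(index[v])
            cols.append(index[u])
            vals.append(1.0)
    return index, rows, cols, vals


index, rows, cols, vals = edges_to_coo([("A", "B"), ("A", "C")])
print(index)                  # {'A': 0, 'B': 1, 'C': 2}
print(list(zip(rows, cols)))  # [(0, 1), (1, 0), (0, 2), (2, 0)]
```

Triplets like these are the input that `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))` expects, which gives a sense of why a lightweight graph class that only tracks edges is enough for node ranking.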
Backends
Several popular computational backends are supported.
To avoid bloat of the main package,
these should be installed separately as needed.
Only "numpy" can be used immediately out of the box as
the default. Find instructions on how to install
and enable the rest below.
Info
First-time users can stick to the default and skip the rest of this section. However, setting up graph analysis on GPU with respective backends can be hundreds of times faster.
To switch between backends, either use the load_backend(name)
command or define an execution context that temporarily switches
to the specified backend and then reverts to the previous one.
The execution context is the recommended approach, as demonstrated below.
Switching backends only affects how new operations are executed. Data types are automatically converted as needed, and caching optimizations are tied to the backend.
import pygrank as pg
algorithm = pg.PageRank()
with pg.Backend("tensorflow"):  # tensorflow needs to be installed
    scores = algorithm(...)
    print(scores.np)  # a tensor
print(scores.np)  # an array now that we switched back
When importing pygrank, a message appears indicating that "numpy" is the default backend.
The same message points to a JSON configuration file stored under home/.pygrank,
alongside any automatically downloaded content. The configuration
file specifies the default backend to be set upon the library's
first import, initialization parameters for that backend, and the option
to silence the reminder message. These options can either be
edited directly on the file or programmatically set with:
pg.set_backend_preference(name, reminder=True, **init) # essentially call pg.load_backend(name, **init) on pygrank's first import
The init dictionary holds parameters passed to backend initialization.
The configuration file's contents look like this:
{
    "backend": "numpy",
    "reminder": "true",
    "init": {}
}
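As a rough illustration of how such a file could be interpreted, the sketch below parses the same three fields from a string. It does not touch the real file, and the helper name `parse_preferences`, the fallback defaults, and the tolerance for a string-valued "reminder" are assumptions made for this example rather than pygrank's actual loading code.

```python
import json

# Same fields as the configuration shown above, inlined as a string.
config_text = '{"backend": "numpy", "reminder": "true", "init": {}}'


def parse_preferences(text):
    """Return (backend name, reminder flag, init kwargs) from JSON text."""
    config = json.loads(text)
    name = config.get("backend", "numpy")  # assumed fallback
    reminder = config.get("reminder", True)  # assumed fallback
    if isinstance(reminder, str):
        # Accept "true"/"false" strings as well as JSON booleans.
        reminder = reminder.strip().lower() == "true"
    init = config.get("init", {})
    return name, reminder, init


name, reminder, init = parse_preferences(config_text)
print(name, reminder, init)  # numpy True {}
```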
Below is a list of supported backends with installation instructions and comments.
numpy
About
This is the default backend and is enabled by default. Internally,
it employs scipy for sparse-dense matrix operations. All other backends rely on scipy sparse matrices
as an intermediate step when initializing their own sparse matrix types. This backend is
best suited to general-purpose numerical computations and
handling very large graphs with memory efficiency, but is not
the fastest option.
Links
numpy
scipy
tensorflow
About
Performs computations within the tensorflow execution environment.
The latter is an open-source platform for machine learning developed by the Google Brain team.
There are two modes in which this backend can be executed: "dense" (default) and "sparse".
The mode may be provided as an additional argument to the backend loading call like this:
import pygrank as pg
with pg.Backend("tensorflow", mode="dense", device="auto"):
    ...  # code to run on tensorflow here
In dense mode, the tensorflow backend attempts to store graphs in dense square
matrices that take full advantage of tensorflow's parallelization.
If there is not enough memory to allocate a dense adjacency matrix,
the backend generates a sparse version and issues a warning.
The backend's initialization also accepts a device string or object to
which computations should be internally transferred. If provided, this needs to
be a tensorflow device name.
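To get a feel for when dense mode runs out of memory, the following back-of-the-envelope sketch compares storage for a dense square adjacency matrix against a coordinate-format sparse one. The helper names are hypothetical, and the sizes assume 4-byte values and 8-byte indices; actual framework storage will differ.

```python
def dense_bytes(num_nodes, bytes_per_value=4):
    """Bytes for a dense num_nodes x num_nodes matrix (float32 assumed)."""
    return num_nodes * num_nodes * bytes_per_value


def sparse_coo_bytes(num_edges, bytes_per_value=4, bytes_per_index=8):
    """Bytes for COO storage: one value plus a row and a column index per entry."""
    return num_edges * (bytes_per_value + 2 * bytes_per_index)


nodes, edges = 100_000, 1_000_000
print(dense_bytes(nodes) / 1e9)       # 40.0 (gigabytes)
print(sparse_coo_bytes(edges) / 1e6)  # 20.0 (megabytes)
```

Dense storage grows with the square of the number of nodes regardless of how many edges exist, which is why sparse mode is the practical choice for large graphs.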
Installation
pip install tensorflow[and-cuda]
On Windows install WSL2 (Windows Subsystem for Linux) first.
Links
tensorflow
pytorch
About
Performs computations within the pytorch execution environment.
The latter is an open-source platform for machine learning developed by Meta's AI Research lab.
Similarly to "tensorflow", there
are two modes in which this backend can be executed: "dense" (default) and "sparse".
The mode may be provided as an additional argument to the backend loading call like this:
import pygrank as pg
with pg.Backend("pytorch", mode="dense", device="auto"):
    ...  # code to run on pytorch
In dense mode, the pytorch backend attempts to store graphs in dense square
matrices that take full advantage of pytorch's device parallelization.
If there is not enough memory to allocate a dense adjacency matrix,
the backend generates a sparse version and issues a warning.
The backend's initialization also accepts a device string or object to
which computations should be internally transferred. If provided, this needs to
be one among pytorch's available devices (typically "cuda" or "cpu").
If not provided, the device will be the same as the one selected during the
last time this backend was loaded. If this is the first time,
the device will be automatically selected to be "cuda"
if the latter is properly integrated, and "cpu" otherwise.
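The device selection order described above can be summarized in a few lines. This is a sketch of the stated rules, not the backend's source code: `resolve_device` and its parameters are hypothetical, and in practice the availability flag could come from `torch.cuda.is_available()`.

```python
def resolve_device(requested=None, last_used=None, cuda_available=False):
    """Pick a device following the documented order of precedence."""
    if requested is not None:
        return requested  # an explicitly provided device always wins
    if last_used is not None:
        return last_used  # otherwise reuse the device of the previous load
    # First load without a device: prefer "cuda" when it is usable.
    return "cuda" if cuda_available else "cpu"


print(resolve_device(requested="cpu", last_used="cuda", cuda_available=True))  # cpu
print(resolve_device(last_used="cuda"))                                        # cuda
print(resolve_device(cuda_available=False))                                    # cpu
```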
Installation
For full installation instructions visit pytorch's website in the links below.
Links
pytorch
torch_sparse
About
Performs computations within the pytorch execution environment
(an open-source platform for machine learning developed by Meta's AI Research lab),
but, contrary to the "pytorch" backend, uses the sparse computations of the torch_sparse library.
This backend always executes in sparse mode
and its initialization accepts a device string or object to
which computations should be internally transferred. This follows the
same conventions as "pytorch" to determine the employed device. For example,
use this backend like this:
import pygrank as pg
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
with pg.Backend("torch_sparse", device=device):
    ...  # code to run on torch_sparse
Info
"torch_sparse" is near-identical to "pytorch"
in sparse mode but is much faster in preprocessing adjacency matrices.
Installation
For full installation instructions visit pytorch's website in the links below.
Links
pytorch
torch_sparse
matvec
About
Offers multithreaded implementations and memory reuse that are much faster than "numpy"
when processing extremely sparse graphs. It is very fast when the number of edges is a small multiple of the number of nodes,
but is slower than other backends for dense graphs.
Installation
pip install matvec
Links
matvec
dask
About
Offers the distributed computational model of dask.distributed.
Enables distributed computing and parallel processing, making it ideal for very large graphs that need
to be processed in a distributed manner.
This backend's instantiation accepts a keyword argument chunks=8 that denotes
the number of chunks into which sparse matrices are split (the maximum number of engaged
distributed workers), as well as additional positional and keyword arguments to pass to the dask client's constructor.
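To illustrate what splitting a matrix into chunks can mean, the sketch below divides the rows of a matrix into near-equal contiguous ranges, one per chunk. The partitioning scheme and the `row_chunks` helper are assumptions made for illustration; the backend's actual strategy may differ.

```python
def row_chunks(num_rows, chunks=8):
    """Split range(num_rows) into contiguous (start, stop) ranges of near-equal size."""
    chunks = max(1, min(chunks, num_rows))  # never create more chunks than rows
    base, extra = divmod(num_rows, chunks)
    bounds, start = [], 0
    for i in range(chunks):
        # The first `extra` chunks take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


print(row_chunks(20, chunks=8))
# [(0, 3), (3, 6), (6, 9), (9, 12), (12, 14), (14, 16), (16, 18), (18, 20)]
```

Each range could then be processed by a separate worker, which is why the number of chunks bounds the number of workers that can be engaged at once.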
Installation
pip install dask[distributed]
Links
dask.distributed
sparse_dot_mkl
About
Runs computations on parallelized scipy multiplications.
Provides speedups for sparse matrix multiplications by utilizing optimized MKL routines.
Best suited when Intel's hardware and software stack are available.
Installation
pip install sparse_dot_mkl
Links
mkl
Info
If you use Intel's Python distribution, "sparse_dot_mkl" is only marginally faster than "numpy".