A graph representation of the Web

There are multiple ways we could represent the World Wide Web, but since the web consists of HTML pages linked together, a natural choice is a directed graph.

A graph is an ordered pair $G = (V, E)$, where $V$ is the set of vertices in the graph, and $E \subseteq \{(x,y) \mid x,y \in V, x \neq y\}$ is the set of edges.

A graph can have several characteristics:

  1. Directed/Undirected
  2. Weighted/Unweighted
  3. Network flow

For our purposes, we consider the WWW as a directed graph in which each webpage is represented as a node, and if page A links to page B, we draw an edge from node A to node B.

Let's take a look at a graph representation of a miniature version of the web with 8 webpages.

To make our computations more convenient, we can represent the directed graph as a square matrix: if there is an edge from node A to node B, we give entry (A, B) a value of 1, and otherwise 0. Applying this rule to the graph above, we obtain the matrix:
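As a concrete sketch of this construction, the adjacency matrix of a small directed graph can be built as follows. The page names and links here are a hypothetical example, not the 8-page network from the figure:

```python
# Build an adjacency matrix for a small directed graph.
# Pages are indexed 0..n-1; links is a list of (source, target) pairs.
# These pages and links are hypothetical, chosen only for illustration.
pages = ["A", "B", "C", "D"]
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

index = {name: i for i, name in enumerate(pages)}
n = len(pages)

# matrix[i][j] == 1 exactly when there is an edge from page i to page j.
matrix = [[0] * n for _ in range(n)]
for src, dst in links:
    matrix[index[src]][index[dst]] = 1
```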

The PageRank algorithm

To determine the importance of a webpage, we need to first assign a numerical value to each of the pages in our network. Let us call the importance of a webpage A the PageRank of A, denoted $PR(A)$.

Let the total PageRank of the network equal 1. Initially, each page gets a PageRank value of $1/n$, where $n$ is the total number of webpages in our network. We then perform the following iterative algorithm to calculate the PageRank of each webpage at step $k$.

We perform the sequence of $k$ updates as follows:

Basic PageRank Update Rule: At each step, each page divides its current PageRank equally across its out-going links, and passes these equal shares to the pages it points to. (If a page has no out-going links, it passes all its current PageRank to itself.) Each page updates its new PageRank to be the sum of the shares it receives.
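In symbols (using $PR_k(A)$ for the PageRank of page $A$ after $k$ steps and $\mathrm{outdeg}(B)$ for the number of out-going links of $B$; this notation is ours, not from the original), the rule for a page $A$ reads:

$$PR_{k+1}(A) = \sum_{B \to A} \frac{PR_k(B)}{\mathrm{outdeg}(B)},$$

where the sum runs over all pages $B$ that link to $A$.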

The code for the iterative algorithm of PageRank is presented below:
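A minimal Python sketch of the basic update rule, using the adjacency-matrix representation from earlier (the 3-page example matrix and the number of steps are assumptions for illustration):

```python
def basic_pagerank(matrix, k):
    """Run k steps of the Basic PageRank Update Rule.

    matrix[i][j] == 1 means page i links to page j.
    Returns the list of PageRank values after k updates.
    """
    n = len(matrix)
    pr = [1.0 / n] * n  # every page starts with PageRank 1/n
    for _ in range(k):
        new_pr = [0.0] * n
        for i in range(n):
            out_links = [j for j in range(n) if matrix[i][j]]
            if not out_links:
                # A page with no out-going links keeps its PageRank.
                new_pr[i] += pr[i]
            else:
                share = pr[i] / len(out_links)
                for j in out_links:
                    new_pr[j] += share
        pr = new_pr
    return pr

# Hypothetical 3-page cycle: A -> B, B -> C, C -> A.
example = [[0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]]
ranks = basic_pagerank(example, 50)
```

Note that each update only moves PageRank around; the total across all pages stays equal to 1.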

We can see that the PageRank values of all webpages seem to converge to some limit, which brings us to the following definition:

Equilibrium Values of PageRank: As with hub-authority computations, one can prove that except in certain degenerate special cases the PageRank values of all nodes converge to limiting values as the number of update steps k goes to infinity.

Since the PageRank value of each page remains constant at equilibrium, one can also show that the flow of PageRank from one webpage to another remains constant.

Scaling of PageRank

Our algorithm seems to work in most cases; however, on the wrong kind of network, the basic PageRank algorithm can run into serious trouble.

Consider our network of 8 webpages. Let us make a minor change to the graph by pointing the outlinks of F and G to each other instead of to A.

We can see that the PageRank of the network slowly drains towards F and G: once PageRank reaches F or G it can never leave, since the two pages link only to each other, so in the limit they absorb all the PageRank in the network.

How can we fix that?

By using a special scaling factor, $s$, we can prevent the problem above from happening. We modify our initial algorithm to the following. At each step of the algorithm:

Scaled PageRank Update Rule: First apply the Basic PageRank Update Rule. Then scale down all PageRank values by a factor of $s$. This means that the total PageRank in the network has shrunk from 1 to $s$. We divide the residual $1 - s$ units of PageRank equally over all nodes, giving $\frac{1 - s}{n}$ to each.
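Written out for a single page $A$ (using $PR_k(A)$ for the PageRank of $A$ after $k$ steps and $\mathrm{outdeg}(B)$ for the number of out-going links of $B$; the notation is an assumption on our part), one scaled update is:

$$PR_{k+1}(A) = s \sum_{B \to A} \frac{PR_k(B)}{\mathrm{outdeg}(B)} + \frac{1 - s}{n}.$$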

Below is the implementation of the algorithm:
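A Python sketch of the scaled rule, reusing the structure of the basic update; the scaling factor $s = 0.8$ and the small example network are assumptions for illustration:

```python
def scaled_pagerank(matrix, k, s=0.8):
    """Run k steps of the Scaled PageRank Update Rule with factor s."""
    n = len(matrix)
    pr = [1.0 / n] * n  # start with equal PageRank 1/n per page
    for _ in range(k):
        new_pr = [0.0] * n
        for i in range(n):
            out_links = [j for j in range(n) if matrix[i][j]]
            if not out_links:
                # A page with no out-going links keeps its PageRank.
                new_pr[i] += pr[i]
            else:
                share = pr[i] / len(out_links)
                for j in out_links:
                    new_pr[j] += share
        # Scale everything down by s, then spread the residual
        # (1 - s) units of PageRank equally over all n nodes.
        pr = [s * v + (1.0 - s) / n for v in new_pr]
    return pr

# Hypothetical 3-page network with a two-page trap: A <-> B, C -> A.
# With scaling, A and B no longer absorb all the PageRank.
trap = [[0, 1, 0],
        [1, 0, 0],
        [1, 0, 0]]
ranks = scaled_pagerank(trap, 100)
```

The redistribution step guarantees every page keeps at least $\frac{1-s}{n}$ PageRank, which is exactly what prevents pages like F and G from soaking up the whole network.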

The scaling factor also minimizes the change in PageRank values when we add or delete nodes in our network. However, the algorithm still has some disadvantages.

The Limit of the Scaled PageRank Update Rule: For each factor $s$ we choose, we end up with a different set of equilibrium PageRank values.

PageRank and Linear Algebra