
11. Link Analysis - PageRank


Web as a Graph

We want to conceptualize what the web looks like as a structure, with web pages as nodes and hyperlinks as edges, and study properties such as its strongly connected components.

This is a little tricky when it comes to dynamic, single-page web applications. In the early internet most links were navigational, but many links now are transactional.

The web can be viewed as a directed graph, just like other information networks such as citation networks.

Web as a directed graph [Broder et al. 2000]

Given a node v, which nodes can v reach, Out(v), and which nodes can reach v, In(v)? We have two types of directed graphs: strongly connected graphs, where any node can reach any other node via a directed path, and Directed Acyclic Graphs (DAGs), where if u can reach v, then v cannot reach u. Any directed graph can be expressed in terms of these two types.

Any directed graph is a DAG on its strongly connected components. Each node is in exactly one SCC. If G is our original directed graph and we build a graph G' whose nodes are the SCCs of G, with an edge between two nodes of G' if there is an edge between the corresponding SCCs in G, then G' is a DAG. Below you can see how an example directed graph is converted into the DAG of its SCCs.

[Figure: an example directed graph and the DAG formed by its SCCs]
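As a quick illustration, here is a minimal sketch of this condensation construction, assuming the networkx library is available (the small graph is a made-up example, not the one from the figure):

```python
import networkx as nx

# A small made-up directed graph with two SCCs: {1, 2, 3} and {4, 5}
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)])

# Each strongly connected component becomes one node of the condensation G'
sccs = list(nx.strongly_connected_components(G))
Gp = nx.condensation(G, scc=sccs)

print(sccs)                               # e.g. [{1, 2, 3}, {4, 5}]
print(list(Gp.edges()))                   # one edge per pair of linked SCCs
print(nx.is_directed_acyclic_graph(Gp))   # True: the condensation is always a DAG
```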

Let's see how the 1999 web crawl, viewed as a directed graph, has its SCCs fit together as a DAG (Broder et al., AltaVista web crawl). It has 200 million URLs and 1.5 billion links. Finding the SCCs of such a large graph is made tractable as follows: for every node v, Out(v) ∩ In(v) is the SCC containing v, i.e., the intersection of the set of nodes v can reach and the set of nodes that can reach v. Do a BFS starting from v on the original graph G, do another BFS starting from v on the reversed graph (all edge directions flipped), and take the intersection of the two visited sets to get the SCC containing v.
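A plain-Python sketch of this Out(v) ∩ In(v) idea (the adjacency lists and node labels are illustrative assumptions):

```python
from collections import deque

def bfs(adj, source):
    """Set of nodes reachable from source, given adjacency lists adj."""
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def scc_containing(adj, v):
    """SCC containing v, computed as Out(v) ∩ In(v)."""
    out_v = bfs(adj, v)                      # Out(v): BFS on the original edges
    reversed_adj = {}                        # flip every edge direction
    for u, nbrs in adj.items():
        for w in nbrs:
            reversed_adj.setdefault(w, []).append(u)
    in_v = bfs(reversed_adj, v)              # In(v): BFS on the reversed graph
    return out_v & in_v

# Toy graph: a, b, c form an SCC; d is reachable from the SCC but not part of it
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
print(scc_containing(adj, "a"))   # {'a', 'b', 'c'}
```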

Broder et al. ran this procedure from random nodes and found that a given node either reaches many nodes or very few. For 50% of the nodes Out(v) ∼ 100 million, and for 50% of the nodes In(v) ∼ 100 million. The largest SCC has 56 million nodes, around 28% of all nodes. The image below shows the bowtie structure of the web that Broder et al. came up with.

[Figure: the bowtie structure of the web (Broder et al. 2000)]



PageRank

This is one of the foundational algorithms that led to Google. The idea is that not all nodes/pages are equally important, so we rank pages by importance based on the link structure of the graph. Ranking nodes in a graph based on their link structure is called link analysis. We will study the following link analysis algorithms in this report.

  • PageRank
  • Personalized PageRank
  • Random Walk with Restarts

We consider each link as a vote, and a page is more important if more links point to it. Intuitively, the number of out-links a page has matters less than the number of links leading to it, so for every node we count its in-links as votes. Links from more important pages also count more. This gives a recursive definition: more links to a page make it more important, and links from more important pages make it more important too.

If page $i$ with importance $r_i$ has $d_i$ out-links, each link gets $r_i/d_i$ votes. Page $j$'s own importance $r_j$ is the sum of the votes on its in-links.

$$r_j = \sum_{i \to j}\frac{r_i}{d_i}$$ where $d_i$ and $r_i$ are the out-degree and rank of node $i$ respectively.

For a small network of 3 nodes y, a, and m with links between them, we might get the flow equations below.

$$r_y = r_y/2 + r_a/2 \\ r_a = r_y/2 + r_m \\ r_m = r_a/2$$

From the above we have a system of 3 flow equations in 3 variables (together with the constraint that the ranks sum to 1), which we can solve by Gaussian elimination. The problem is when n is very large.
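As a concrete check, here is a minimal numpy sketch; the edge list y → {y, a}, a → {y, m}, m → {a} is my reading of the flow equations above, and the least-squares solve is just one convenient way to add the normalization constraint:

```python
import numpy as np

# Column-stochastic matrix read off the flow equations: column j holds 1/d_j for each out-link of j
#              y    a    m
M = np.array([[0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
              [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
              [0.0, 0.5, 0.0]])  # r_m = r_a/2

# Solve (M - I) r = 0 together with sum(r) = 1
A = np.vstack([M - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)   # approximately [0.4, 0.4, 0.2] for (r_y, r_a, r_m)
```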

The same flow equations can be written in matrix form. Consider the stochastic adjacency matrix $M$: if page $j$ has $d_j$ out-links and $j \to i$, then $M_{ij} = \frac{1}{d_j}$. This makes $M$ column stochastic, i.e., all its columns sum to 1. Also consider a rank vector $r$ of size $n$, where $r_i$ is the importance score of page $i$. We also make all the ranks sum to 1, hence $\sum_i r_i = 1$. Now the flow equations can be written in terms of $r$ and $M$ as follows

$$r = M \cdot r \qquad r_j = \sum_{i \to j}\frac{r_i}{d_i}$$

We can see that the PageRank vector of all the nodes is the eigenvector of the stochastic adjacency matrix $M$ corresponding to the eigenvalue 1.
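A quick numpy sanity check of that claim, reusing the toy matrix from the previous sketch (again an illustrative example, not part of the original lecture):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

eigvals, eigvecs = np.linalg.eig(M)
k = np.argmin(np.abs(eigvals - 1.0))   # locate the eigenvalue closest to 1
r = np.real(eigvecs[:, k])
r = r / r.sum()                        # rescale so the ranks sum to 1
print(np.real(eigvals[k]), r)          # ~1.0 and approximately [0.4, 0.4, 0.2]
```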

Random walk interpretation

Imagine that at any time $t$ a web surfer is on some page $i$, and at time $t+1$ the surfer follows an out-link from $i$ uniformly at random and ends up on page $j$. This process repeats indefinitely. Let $p(t)$ be a vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$, so $p(t)$ is a probability distribution over pages.

If we follow a link uniformly at random, we have $p(t+1) = M \cdot p(t)$. When the random walk reaches a state where $p(t+1) = M \cdot p(t) = p(t)$, then $p(t)$ is called the stationary distribution of the random walk. Our original rank vector $r$ satisfies the same equation, so we can conclude that the stationary distribution of this indefinite random walk is the rank of the pages. Both the random surfer intuition and the in-link voting intuition lead to the same equation.

$$r = M \cdot r$$

So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$. We can start with any distribution vector $u$; the limit of $M(M(\cdots M(M(u))))$ is the long-term distribution of the surfers. This limiting distribution is the principal eigenvector of $M$, which is exactly the PageRank of the nodes.

Power Iteration

Power iteration is the method of starting with some initial vector, iterating $r^{(t+1)} = M \cdot r^{(t)}$, and stopping when $r$ converges.

Concretely, we initialize $r^{(0)} = [1/N, 1/N, \cdots, 1/N]^T$, iterate $r^{(t+1)} = M \cdot r^{(t)}$, and stop when $|r^{(t+1)} - r^{(t)}| < \epsilon$.

In practice this process converges within about 50 iterations.
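A minimal power-iteration sketch in numpy (the toy matrix, tolerance, and iteration cap are assumptions for illustration):

```python
import numpy as np

def power_iterate(M, eps=1e-8, max_iters=100):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        r_new = M @ r
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))   # approximately [0.4, 0.4, 0.2]
```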

Problems

Does it converge? And if it converges, does it converge to reasonable results?

  • Dead ends: pages with no out-links make the random surfer get stuck and cause importance to leak out of the system. The corresponding columns are all zeros, so the matrix is not column stochastic and our initial assumptions no longer hold.

  • Spider traps: a group of nodes whose out-links all point within the group eventually absorbs all the importance. Once the random surfer enters a spider trap, it never leaves. Spider traps give a ranking we don't want, sucking all the importance out of the rest of the network.

In both cases plain power iteration fails us: dead ends leak importance out of the system, and spider traps converge to a ranking we don't want.

We can solve the spider-trap problem by letting the surfer make random jumps: with probability $\beta$ follow an out-link at random, and with probability $1-\beta$ jump to a random node/page in the network. Common values for $\beta$ are in the range 0.8-0.9. The surfer will teleport out of a spider trap within a few time steps.

We can solve the dead-end problem by always teleporting from a dead end to a random node, i.e., with total probability 1. This makes the matrix column stochastic: the surfer teleports whenever there is nowhere else to go.

Random Teleports

At each step, the random surfer follows a random out-link with probability $\beta$, and jumps to a random page with probability $1-\beta$.

PageRank Equation [Brin-Page, 98]

$$r_j = \sum_{i \to j}\beta \frac{r_i}{d_i} + (1-\beta)\frac{1}{N}$$

The above formulation assumes that M has no dead ends. We can either preprocess M to remove all dead ends, or explicitly follow random teleport links with probability 1 from dead ends.

The Google Matrix A:

$$A = \beta M + (1-\beta)\bigg[\frac{1}{N}\bigg]_{N\times N}$$ with the recursive equation $r = A \cdot r$.
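A dense sketch of the Google matrix for a tiny graph (β = 0.85 is an assumed value from the common 0.8-0.9 range; for web-scale graphs you would never materialize A densely, as the next section explains):

```python
import numpy as np

def google_matrix(M, beta=0.85):
    """A = beta * M + (1 - beta) * [1/N]_{NxN}, assuming M is column stochastic (no dead ends)."""
    n = M.shape[0]
    return beta * M + (1.0 - beta) * np.ones((n, n)) / n

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = google_matrix(M)

r = np.full(3, 1.0 / 3)
for _ in range(100):      # power iteration on A: r <- A r
    r = A @ r
print(r, r.sum())         # PageRank with teleports; the ranks still sum to 1
```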

Computing PageRank

The above formulation with the Google matrix A is computable as long as we can fit everything in main memory. But with N = 1 billion pages, A has $N^2 = 10^{18}$ entries. To make this feasible we use the sparse formulation below.

$$r = \beta M \cdot r + \bigg[\frac{1-\beta}{N}\bigg]_N$$ where $[(1-\beta)/N]_N$ is a vector whose $N$ entries are all $(1-\beta)/N$.

Now M is a sparse matrix with roughly $10N$ nonzero entries (about 10 out-links per page) instead of $N^2$.

In each iteration we compute $r^{new} = \beta M \cdot r^{old}$ and add the constant $(1-\beta)/N$ to each entry of $r^{new}$. If M contains dead ends then $\sum_j r_j^{new} < 1$, so we also renormalize $r^{new}$ to sum to 1.
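A sketch of this sparse iteration with scipy (the small matrix, β, and tolerance are illustrative; the renormalization only kicks in when M has dead ends):

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_sparse(M, beta=0.85, eps=1e-8, max_iters=100):
    """Iterate r <- beta*M*r + (1-beta)/N, renormalizing to compensate for dead-end leakage."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        r_new = beta * (M @ r) + (1.0 - beta) / n
        r_new /= r_new.sum()              # with dead ends the sum drifts below 1, so renormalize
        if np.abs(r_new - r).sum() < eps:
            break
        r = r_new
    return r_new

# Column-stochastic M stored sparsely; node 2 is a dead end (its column is all zeros)
M = csr_matrix(np.array([[0.0, 0.5, 0.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 0.5, 0.0]]))
print(pagerank_sparse(M))
```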

PageRank Algorithm

Given a graph $G$ and parameter $\beta$, we want to compute the PageRank vector $r^{new}$; here G may have both spider traps and dead ends.

Set $r_j^{old} = \frac{1}{N}$ and repeat the following until convergence, i.e., until $\sum_j |r_j^{new} - r_j^{old}| < \epsilon$ (a Python sketch of these steps follows the list):

  • $\forall j: {r'}_j^{new} = \sum_{i\to j}\beta \frac{r_i^{old}}{d_i}$, and ${r'}_j^{new} = 0$ if the in-degree of $j$ is 0
  • Reinsert the leaked PageRank: $\forall j: r_j^{new} = {r'}_j^{new} + \frac{1-S}{N}$, where $S = \sum_j {r'}_j^{new}$
  • Set $r^{old} = r^{new}$
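Here is a direct, hedged translation of those steps into Python from an adjacency-list representation (the toy graph with a dead end is made up for illustration):

```python
import numpy as np

def pagerank(out_links, beta=0.85, eps=1e-8):
    """PageRank for a graph given as {node: [out-neighbours]}; handles dead ends and spider traps."""
    nodes = sorted(out_links)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    r_old = np.full(n, 1.0 / n)
    while True:
        r_prime = np.zeros(n)                          # r'_j starts at 0 (covers in-degree-0 nodes)
        for i_node, nbrs in out_links.items():
            if nbrs:                                   # dead ends have no votes to distribute
                share = beta * r_old[idx[i_node]] / len(nbrs)
                for j_node in nbrs:
                    r_prime[idx[j_node]] += share      # r'_j += beta * r_i / d_i
        S = r_prime.sum()
        r_new = r_prime + (1.0 - S) / n                # reinsert the leaked PageRank uniformly
        if np.abs(r_new - r_old).sum() < eps:
            return dict(zip(nodes, r_new))
        r_old = r_new

# Toy graph: 'c' is a dead end, 'a' and 'b' link to each other
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
print(pagerank(graph))
```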


Random Walk with Restarts and Personalized PageRank

Consider a bipartite graph of conferences and the authors who attend them. We might want to recommend new conferences for a given author to attend, or, for a given conference, new authors to invite. Which conferences are most closely related? We need a measure of proximity between nodes.

In recommender systems, say a user-item bipartite graph: how are two items that don't have a direct link related, and how similar are they? Or how similar are two users who aren't directly linked? Recommending items based on user similarity is called collaborative filtering.

Node proximity measurements

How do we measure the proximity of two nodes that are not directly connected in a bipartite graph?

  • Shortest path
  • Common neighbours

Consider two node pairs, (C, C') and (D, D'), that have the same shortest-path distance and the same number of common neighbours. We still want a measure under which C and C' come out much more similar than D and D', because D and D' also have many other neighbours that are unrelated to each other, i.e., many more dissimilarities than similarities. We want a measure that captures this.

It turns out that PageRank obeys this kind of intuition and gives us a better proximity metric.

Personalized PageRank ranks nodes by their proximity to a set of teleport nodes S, which can be a set of nodes or a single node. Whenever the walker teleports, it jumps back into S. This makes the random walk personalized to S, giving a proximity (PageRank) score for every node with respect to S.

Given a set of query nodes, we simulate a random walk starting from these query nodes and record the visit count of every node we reach. With probability $\alpha$ we restart the walk at one of the query nodes. The nodes with the highest visit counts have the highest proximity to the query nodes. We don't even have to run power iteration: the visit counts alone identify the most proximal nodes.
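A hedged simulation sketch of this counting procedure on a tiny, made-up user-item bipartite graph (α, the number of steps, and the graph itself are assumptions for illustration):

```python
import random
from collections import Counter

def random_walk_with_restarts(adj, query_nodes, alpha=0.15, num_steps=100_000, seed=0):
    """Count node visits for a walk that restarts at a random query node with probability alpha."""
    rng = random.Random(seed)
    visits = Counter()
    current = rng.choice(query_nodes)
    for _ in range(num_steps):
        if rng.random() < alpha or not adj[current]:
            current = rng.choice(query_nodes)      # restart: teleport back into the query set
        else:
            current = rng.choice(adj[current])     # otherwise follow a random edge
        visits[current] += 1
    return visits

# Tiny user-item bipartite graph; edges stored in both directions so the walk can go back and forth
edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3"), ("u3", "i3")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

# Nodes most proximal to user u1, ranked by visit count
print(random_walk_with_restarts(adj, query_nodes=["u1"]).most_common())
```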

Normal PageRank

Teleports uniformly at random to any node; every node has the same probability of the surfer landing there.

Personalized PageRank

Also called Topic-Specific PageRank: teleports to a topic-specific set of nodes S, so different nodes have different probabilities of the surfer landing on them.

Random Walk with Restarts

Topic-Specific PageRank where the teleport is always to the same single node S.
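The only thing that changes across these three variants is the teleport distribution. A small numpy sketch of that single knob (the matrix and the teleport vectors are illustrative assumptions):

```python
import numpy as np

def pagerank_with_teleport(M, teleport, beta=0.85, iters=200):
    """Power iteration for r = beta*M*r + (1-beta)*teleport, for an arbitrary teleport distribution."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * (M @ r) + (1.0 - beta) * teleport
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

uniform     = np.full(3, 1.0 / 3)           # ordinary PageRank: teleport anywhere
topic_set   = np.array([0.5, 0.5, 0.0])     # Personalized / Topic-Specific PageRank: teleport to nodes {0, 1}
single_node = np.array([1.0, 0.0, 0.0])     # Random Walk with Restarts: always back to node 0

for teleport in (uniform, topic_set, single_node):
    print(pagerank_with_teleport(M, teleport))
```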