
6. Message Passing and Node Classification

How to make use of network structure and neighbourhood correlations to classify nodes.

Node Classification

Given a network with labels on some nodes, how do we assign labels to all the other nodes in the network? A sample use case is identifying fraudsters versus trustworthy nodes. Many data collection tasks are very expensive in practice, so we sample and label only some fraction of the nodes, and we want to be able to predict the rest.

The umbrella term for this class of techniques is collective classification: arriving at a stable label assignment for every node.

The core intuition is that we want to leverage correlations that exist in networks. In the case of online fraud detection, fraudsters might be able to fake the content of their text, but it is much harder for them to fake their network connections. We want to take advantage of this.

We will be covering these three techniques in this section.

  • Relational classification
  • Iterative classification
  • Belief propagation

Correlations in networks

There are three main types of dependencies that lead to correlation; individual behaviours are correlated in a network environment.

Homophily

People with similar characteristics are more likely to be connected. The tendency of individuals to associate and bond with similar others leads to this. It has been observed in a vast array of network studies, based on a variety of attributes like age, gender, and race.

Influence

Social connections affect individual characteristics. Example: I recommend my 'peculiar' musical preferences to my friends, until one of them grows to like my favorite genres. Individual behaviour changed because of a friendship connection.

Confounding

The environment affects both your individual characteristics and your social connections.

Classification with Network Data

[Figure: node labels in a signed network]

In the above signed network, we have positively labeled nodes and negatively labeled nodes, and our task is to identify the labels of the remaining nodes.

Similar nodes are typically close together: if a node is connected to a positively labeled node, it is likely to be positively labeled as well. For example, malicious web pages link to one another to increase visibility in search engines and to look credible.

The classification label of an object O in a network may depend on: the features of O, the labels of objects in O's neighbourhood, and the features of objects in O's neighbourhood.



Collective Classifiers

The intuition behind collective classification is that we will classify interlinked nodes simultaneously using correlations.

Relevant applications:

  • Document classification
  • Part of speech tagging
  • Link prediction
  • Optical character recognition
  • Image/3D data segmentation
  • Entity resolution in sensor network
  • Spam and fraud detection

We assume a Markov assumption for a node's behaviour given its neighbours' behaviour: the label Y_i of node i depends only on the labels of its neighbours N_i, i.e. P(Y_i | i) = P(Y_i | N_i).

Collective classification involves 3 steps:

Local Classifier

Assign initial labels based only on the node features/attributes. This is a standard classification task using basic ML concepts; this step doesn't use network information.

Relational Classifier

Capture correlations in the network between neighbouring nodes. The relational classifier learns to classify a node based on the labels/features of its neighbours. This is where we start taking advantage of the network information.

Collective Inference

Propagate correlations through the network. Iterate until the inconsistency between neighbouring labels is minimized. Network structure substantially affects the final prediction.

Exact inference in collective classification models is practical only when the network satisfies certain conditions; for arbitrary networks exact inference is NP-hard. If we represent every node as a discrete random variable with a joint mass function p of its class membership, the marginal distribution of a node is the summation of p over all the other nodes. The exact solution takes time exponential in the number of nodes, so we use inference techniques that approximate the solution by narrowing the scope of the propagation (e.g., only neighbours) and the number of variables by means of regression.
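
Concretely, writing \mathcal{L} for the set of possible states, the marginal of node i requires summing the joint mass function over every other node:

\displaystyle P(Y_i = y) = \sum_{y_1 \in \mathcal{L}} \cdots \sum_{y_{i-1} \in \mathcal{L}} \sum_{y_{i+1} \in \mathcal{L}} \cdots \sum_{y_n \in \mathcal{L}} p(y_1, \ldots, y_{i-1}, y, y_{i+1}, \ldots, y_n)

a sum over |\mathcal{L}|^{n-1} terms, which is why exact inference is exponential in the number of nodes.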



Relational Classifiers

Given a graph G with some of its nodes labeled, where node i has a feature vector f_i and label Y_i, our goal is to find P(Y_i) for the unlabeled nodes.

Probabilistic Relational Classifier

The core idea is that the class probability of Y_i is the weighted average of the class probabilities of its neighbours. For labeled nodes, we initialize P(Y_i) with the ground-truth Y labels. For unlabeled nodes, we initialize Y uniformly over all classes. Then we update all nodes in random order until convergence or until the maximum number of iterations is reached.

Repeat for each node i and label c:

\displaystyle P(Y_i = c) = \frac{1}{\sum_{(i, j) \in E} W(i, j)} \sum_{(i, j) \in E} W(i, j) \, P(Y_j = c)

We weight each neighbour's probability of having the relevant label by the corresponding edge weight, and normalize by the sum of the edge weights over all neighbours.

Convergence is not guaranteed for a probabilistic relational classifier. Also, it cannot use node features; it relies only on the class labels of neighbouring nodes.

The model is straightforward to implement.

[Figure: example network]

Consider the example network above. Green nodes [6, 7] have positive labels, blue nodes [1, 2] have negative labels, and yellow nodes [3, 4, 5, 9, 8] are unlabeled. First we initialize the positive class probability P(Y_i = 1) for all nodes: for positive nodes P(Y_i = 1) = 1, for negative nodes P(Y_i = 1) = 0, and for unlabeled nodes, as there are only two classes, P(Y_i = 1) = 0.5.

Now consider node 3. We update its label probabilities according to its neighbours' class probabilities and the edge weights. For the first iteration, using its neighbour set N_3 = [1, 2, 4], the equation above gives P(Y = 1 | N_3) = \frac{1}{3}(0 + 0 + 0.5) = 0.17.

We compute these probabilities, for every class, for all the unlabeled nodes in a single iteration. After updating the class probabilities of all unlabeled nodes, we repeat the process with the newly inferred probabilities until the node probabilities converge.
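
A minimal sketch of this update loop in Python (the full edge list is an assumption for illustration; the text only specifies N_3 = [1, 2, 4] and the labeled nodes):

```python
import random

# Hypothetical edges for the 9-node example; only node 3's neighbourhood
# {1, 2, 4} and the labels of nodes 1, 2, 6, 7 are given in the text.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (6, 7), (7, 8), (8, 9), (5, 9)]
nbrs = {i: set() for i in range(1, 10)}
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)

labeled = {1: 0.0, 2: 0.0, 6: 1.0, 7: 1.0}      # ground-truth P(Y_i = 1)
P = {i: labeled.get(i, 0.5) for i in nbrs}       # unlabeled nodes start at 0.5
unlabeled = [i for i in nbrs if i not in labeled]

for _ in range(100):
    random.shuffle(unlabeled)                    # update in random order
    max_delta = 0.0
    for i in unlabeled:
        # Unweighted graph, so W(i, j) = 1: plain average over neighbours.
        new_p = sum(P[j] for j in nbrs[i]) / len(nbrs[i])
        max_delta = max(max_delta, abs(new_p - P[i]))
        P[i] = new_p
    if max_delta < 1e-6:                         # converged
        break

print({i: round(p, 2) for i, p in sorted(P.items())})
```

If node 3 happens to be updated first, the first pass reproduces the computation above: (0 + 0 + 0.5) / 3 ≈ 0.17.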



Iterative Classifiers

Unlike relational classification, iterative classifiers also take advantage of node features. The main idea is to classify node i based on its attributes as well as the labels of its neighbour set N_i.

This is how we do it (a code sketch follows the list):

  • Create a flat vector a_i for each node i that represents its features
  • Train a classifier f(a_i) using a_i to predict the best value for Y_i
  • Update the node vector a_i based on the new node label probabilities
  • Update the label Y_i to f(a_i). This is a hard assignment.
  • Repeat the process with the new node features until convergence or the maximum number of iterations.
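
A minimal sketch of this loop, assuming a numpy feature matrix X, adjacency lists adj, an integer label array labels (meaningful only at train_idx), and logistic regression as a stand-in base classifier; all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighbour_label_counts(adj, labels, n_classes=2):
    """Per-node counts of neighbours currently assigned to each class."""
    counts = np.zeros((len(adj), n_classes))
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            counts[i, labels[j]] += 1
    return counts

def iterative_classify(X, adj, labels, train_idx, n_iter=10):
    # Step 1: bootstrap labels with a classifier on node features alone.
    base = LogisticRegression().fit(X[train_idx], labels[train_idx])
    y = labels.copy()
    known = set(train_idx)
    unknown = np.array([i for i in range(len(y)) if i not in known])
    y[unknown] = base.predict(X[unknown])

    # Steps 2-5: augment features with neighbour-label counts, retrain,
    # hard-assign labels, and repeat until the assignment stabilizes.
    for _ in range(n_iter):
        A = np.hstack([X, neighbour_label_counts(adj, y)])
        clf = LogisticRegression().fit(A[train_idx], labels[train_idx])
        new_y = y.copy()
        new_y[unknown] = clf.predict(A[unknown])   # hard assignment
        if np.array_equal(new_y, y):               # converged
            break
        y = new_y
    return y
```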

REV2 Fraudulent User Predictions in Rating Platforms

REV2 is an example of an iterative classification framework for fake reviewer/review detection. The problem it tackles is identifying review farms: a +1 increase in a product's rating increases revenue by 5-9%, so people constantly try to game e-commerce systems. There is also fake hype on social media to inflate engagement metrics, defamation, spam, and so on.

We can make use of behavioural analysis (individual features, geographic location, login times, session history, etc.) and language analysis (use of superlatives, lots of self-referencing, rate of misspellings, many agreement words). But both of these are easy to fake; a user's position in the network, the neighbourhood graph structure, is hard to fake. Graphs capture the relationships between reviewers, reviews, and stores.

The specific problem REV2 solves is fake review detection. The setup involves users and the ratings they give to various products, so the input graph G is a bipartite graph whose nodes represent users and products, with edges carrying rating scores of +1 or -1. Our goal is to find the set of users who give fake ratings.

Solution formulation

We assume users, products, and ratings each have intrinsic quality scores. Users have fairness scores: even a genuine user might have different standards for rating, and we want to capture that. Products have goodness scores: every product has an intrinsic usefulness or goodness, which we also want to capture. And the ratings users give to products have reliability scores: how reliable a particular rating is. A fraudulent user might not always give fake ratings; to game the system they might give genuine ratings to products they are not interested in, so we want to capture the reliability of each individual rating.

All these values are unknown: the fairness of a user F(u) \in [0, 1], the goodness of a product G(p) \in [-1, 1], and the reliability of a rating R(u, p) \in [0, 1]. We will now compute all these values for all nodes and edges simultaneously using iterative classification.

[Figure: bipartite user-product rating graph]

Fairness of user

We can write the fairness of a user in terms of the reliability of their ratings, like below:

\displaystyle F(u) = \frac{\sum_{(u, p) \in Out(u)} R(u, p)}{|Out(u)|}

Goodness of product

We can write the goodness of a product in terms of the reliability of its ratings and the actual rating scores, like below:

\displaystyle G(p) = \frac{\sum_{(u, p) \in In(p)} R(u, p) \cdot score(u, p)}{|In(p)|}

Reliability of ratings

We can write the reliability of a rating in terms of the fairness of the user and the goodness of the product, like below:

\displaystyle R(u, p) = \frac{1}{\gamma_1 + \gamma_2}\Bigg(\gamma_1 \cdot F(u) + \gamma_2 \cdot \bigg(1 - \frac{|score(u, p) - G(p)|}{2}\bigg)\Bigg)

The term \gamma_1 \cdot F(u) represents how fair the user giving the rating is, and the term \gamma_2 \cdot (1 - \frac{|score(u, p) - G(p)|}{2}) represents how close the rating given by user u to product p is to the goodness of product p.

In the equations above, score(u, p) is the actual rating user u gave to product p, In(p) is the set of edges into product p, and Out(u) is the set of edges going out from user u.

[Figure: initialization of fairness, goodness, and reliability scores]

We initialize every score to its best possible value, as in the image above, and then update goodness, reliability, and fairness using the actual product ratings, the network structure, and the equations above.
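
These three coupled updates can be sketched directly in Python; a minimal sketch, where the rev2 function name, the γ values, and the update order are assumptions based on the description above:

```python
import numpy as np

def rev2(ratings, n_users, n_products, g1=1.0, g2=1.0, n_iter=100, tol=1e-6):
    """ratings: list of (u, p, score) triples with score in [-1, 1].
    g1, g2 are the gamma weights; the values here are placeholders.
    Assumes every user has rated and every product has been rated."""
    F = np.ones(n_users)                      # fairness, best value = 1
    G = np.ones(n_products)                   # goodness, best value = 1
    R = {(u, p): 1.0 for u, p, _ in ratings}  # reliability, best value = 1

    by_user = {u: [] for u in range(n_users)}
    by_prod = {p: [] for p in range(n_products)}
    for u, p, s in ratings:
        by_user[u].append((p, s))
        by_prod[p].append((u, s))

    for _ in range(n_iter):
        prev = F.copy()
        for p, ins in by_prod.items():        # goodness of a product
            G[p] = sum(R[(u, p)] * s for u, s in ins) / len(ins)
        for u, p, s in ratings:               # reliability of a rating
            R[(u, p)] = (g1 * F[u] + g2 * (1 - abs(s - G[p]) / 2)) / (g1 + g2)
        for u, outs in by_user.items():       # fairness of a user
            F[u] = sum(R[(u, p)] for p, _ in outs) / len(outs)
        if np.max(np.abs(F - prev)) < tol:    # fairness scores converged
            break
    return F, G, R
```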

After convergence the scores look like the figure below; users with low fairness are the fraudsters.

[Figure: converged scores]

Properties of the REV2 solution.

REV2 is guaranteed to converge, and the number of iterations until convergence is upper-bounded; the runtime is linear in the number of edges in the graph. Note that we did not actually build or use node features here; we only used the rating edge weights/features.



Loopy Belief Propagation

Belief propagation is a dynamic programming approach to answering conditional probability queries in a graphical model. It is an iterative process in which neighbouring variables "talk" to each other by passing messages.

Every node holds beliefs about the likely states of its neighbouring nodes. We use these messages iteratively to update the state probabilities of all nodes until a consensus is reached, and then compute the final beliefs.

Consider, for example, a long chain of nodes in which each node can talk only to its two neighbours, and the task is to figure out how many nodes are in the chain. Each node passes to the next node the count of nodes on its side. If we run this message propagation from one end to the other and then in reverse, every node ends up with the total node count just by talking to its neighbours. As long as there are no cycles in the chain, this works (see the sketch below).
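
A tiny sketch of the chain-counting scheme (the chain length is arbitrary):

```python
n = 6                                  # number of nodes in the chain (arbitrary)
left = [0] * n                         # messages: count of nodes to my left
right = [0] * n                        # messages: count of nodes to my right
for i in range(1, n):                  # forward pass, left end to right end
    left[i] = left[i - 1] + 1
for i in range(n - 2, -1, -1):         # backward pass, right end to left end
    right[i] = right[i + 1] + 1
print([left[i] + right[i] + 1 for i in range(n)])   # every node knows n: [6, 6, ...]
```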

The same node-counting scheme works when the network is a tree: running it iteratively, with every node talking to all of its neighbours, each node receives reports from all the branches of the tree and can compute the total count. This works as long as the network contains no cycles; a loopy, cyclic graph breaks the argument.

Loopy Belief Propagation algorithm

The message node i sends to node j depends on what i hears from its neighbours k: each neighbour k passes to i a message with its belief about i's state.

We use the following notation in the Loopy BP algorithm:

Label-label potential matrix \psi: the dependency between a node and its neighbour. \psi(Y_i, Y_j) equals the probability of node j being in state Y_j given that it has a neighbour i in state Y_i.

Prior belief \phi: the probability \phi_i(Y_i) of node i being in state Y_i.

Message m_{i \to j}(Y_j): i's estimate of j being in state Y_j.

And finally, \mathcal{L} is the set of all possible node states.

We initialize all messages m_{i \to j} to 1 for all i, j, and repeat for each node:

\displaystyle m_{i \to j}(Y_j) = \alpha \sum_{Y_i \in \mathcal{L}} \psi(Y_i, Y_j) \, \phi_i(Y_i) \prod_{k \in \mathcal{N}_i \setminus j} m_{k \to i}(Y_i)

\displaystyle b_i(Y_i) = \alpha \, \phi_i(Y_i) \prod_{j \in \mathcal{N}_i} m_{j \to i}(Y_i), \quad \forall Y_i \in \mathcal{L}

where the product term \prod_{j \in \mathcal{N}_i} collects all messages from node i's neighbours, and b_i(Y_i) is node i's belief of being in state Y_i.
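
A minimal sketch of these two equations in Python, assuming a single pairwise potential matrix psi shared by all edges and per-node priors phi (the loopy_bp name and the synchronous update schedule are illustrative choices):

```python
import numpy as np

def loopy_bp(adj, psi, phi, n_iter=50):
    """adj: dict node -> neighbour list (symmetric); psi: (n_states, n_states)
    potential, psi[yi, yj] for an edge i->j; phi: dict node -> prior vector."""
    n_states = psi.shape[0]
    # Initialize every message m_{i->j} to all ones.
    msgs = {(i, j): np.ones(n_states) for i in adj for j in adj[i]}

    for _ in range(n_iter):
        new_msgs = {}
        for i in adj:
            for j in adj[i]:
                # Product of incoming messages from all neighbours except j.
                incoming = np.ones(n_states)
                for k in adj[i]:
                    if k != j:
                        incoming *= msgs[(k, i)]
                # m_{i->j}(Y_j) = alpha * sum_{Y_i} psi(Y_i, Y_j) phi_i(Y_i) prod(...)
                m = psi.T @ (phi[i] * incoming)
                new_msgs[(i, j)] = m / m.sum()      # alpha: normalize
        msgs = new_msgs

    # Belief: b_i(Y_i) = alpha * phi_i(Y_i) * product of all incoming messages.
    beliefs = {}
    for i in adj:
        b = phi[i].copy()
        for j in adj[i]:
            b *= msgs[(j, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```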

Messages from different subgraphs are no longer independent if our network has cycles. BP is a local algorithm, so it doesn't see the cycles, and there are extreme examples where BP fails to converge. BP treats two messages within a cycle as if they were independent, separate pieces of evidence, even though one influenced the other via the cycle. In practice, however, the cyclic influences are often weak, because cycles are long or include at least one weak correlation.

Loopy BP is easy to program and parallelise, and it applies to any graphical model with any form of potentials, even higher-order than pairwise. But convergence is not guaranteed, especially if there are many closed loops. Potential functions with parameters must be trained, and we have to use gradient-based optimization to learn these parameters.



Online Auction Fraud detection using Loopy BP

The algorithm from [NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks] uses loopy BP to detect fraud on auction websites such as eBay. Auction sites are an attractive target for fraud: auction fraud made up 63% of complaints to the US Federal Internet Crime Complaint Center in 2006, with an average loss of $385 per incident.

We could build a fraud detection classifier on users alone, using attributes like geographic location, login times, and session history. But these are easy to fake, and fraudsters constantly try to beat the system by building trust through legitimate transactions. The actual graph structure, whom they interact with for legitimate transactions and how they defraud normal users, is much harder to fake.

The main question we need to understand is how fraudsters interact with each other and with actual users. Beyond buy-sell relations, are there more complex relations?

The reputation score each user receives from other users after a transaction is the feedback mechanism we use to build our fraud detection algorithm.

From a basic understanding of how these fraudsters operate, we define three roles a user node can take: fraudster, accomplice, and honest. Fraudsters don't simply boost each other's reputation, because if one were caught, all the others would be caught too. Instead they use accomplices, who trade honestly with other users and look legitimate. A fraudster trades with accomplices and defrauds honest users. These are the roles users in an auction system adopt, and our goal is to find the probability of each node taking each of these roles.

Accomplices and fraudsters form near-bipartite cores.

We use belief propagation to find the roles of all users in the auction network. The challenge is choosing the BP parameters: the prior beliefs are unbiased when we have no prior knowledge, and the compatibility potentials are set by insight. We use the potential matrix \psi(Y_i, Y_j) below, which encodes a neighbour's understanding of what state a node can be in given the neighbour's own state.

| Neighbour state \ Node state | Fraud | Accomplice | Honest |
|---|---|---|---|
| Fraud | \epsilon_p | 1 - 2\epsilon_p | \epsilon_p |
| Accomplice | 0.5 | 2\epsilon_p | 0.5 - 2\epsilon_p |
| Honest | \epsilon_p | (1 - \epsilon_p)/2 | (1 - \epsilon_p)/2 |

Using the potential matrix above, derived from user insight, and initializing all nodes as unbiased, at each iteration we compute, for each node, the messages to its neighbours. We continue until the role probabilities of individual nodes converge; the final belief scores give the final roles.
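
As a rough sketch, this potential matrix and the unbiased priors could be wired into the loopy_bp sketch from the previous section like this (the value of \epsilon_p and the toy transaction graph are assumptions):

```python
import numpy as np

eps = 0.05                                  # epsilon_p; the value is assumed
# Rows: neighbour's state, columns: node's state (fraud, accomplice, honest).
psi = np.array([
    [eps,  1 - 2 * eps,    eps          ],  # neighbour is a fraudster
    [0.5,  2 * eps,        0.5 - 2 * eps],  # neighbour is an accomplice
    [eps,  (1 - eps) / 2,  (1 - eps) / 2],  # neighbour is honest
])

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}     # toy transaction graph (assumed)
phi = {i: np.ones(3) / 3 for i in adj}      # unbiased priors over the 3 roles
beliefs = loopy_bp(adj, psi, phi)           # reuse the sketch from above
print({i: b.round(3) for i, b in beliefs.items()})
```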