
6. Message Passing and Node Classification

How to make use of network structure and neighbourhood correlations to classify nodes.

Node Classification

Given a network with labels on some nodes, how do we assign labels to all the other nodes in the network? A sample use case is identifying fraudsters versus trustworthy nodes. Many data collection tasks are very expensive in practice, so we sample and label only some fraction of the nodes, and we want to be able to predict the rest.

The umbrella term for this class of techniques is collective classification: arriving at a stable label assignment for every node.

The core intuition is that we want to leverage correlations that exist in networks. In the case of online fraud detection, fraudsters might be able to fake the content of their text, but it is much harder for them to fake their network connections. We want to take advantage of this.

We will be covering these three techniques in this section.

  • Relational classification
  • Iterative classification
  • Belief propagation

Correlations in networks

There are three main types of dependencies that lead to correlation; individual behaviours are correlated in a network environment.

Homophily

People with similar characteristics are more likely to be connected. The tendency of individuals to associate and bond with similar others leads to this. It has been observed in a vast array of network studies, based on a variety of attributes like age, gender, and race.

Influence

Social connections affect individual characteristics. Example: I recommend my 'peculiar' musical preferences to my friends, until one of them grows to like my favorite genres. Individual behaviour changed because of a friendship connection.

Confounding

The environment affects both your individual characteristics and your social connections.

Classification with Network Data

[Figure: node labels in a signed network]

In the above signed network, we have positively labeled nodes and negatively labeled nodes, and our task is to identify the labels of the remaining nodes.

Similar nodes are typically close together: if a node is connected to a positively labeled node, it is likely to be positively labeled as well. For example, malicious web pages link to one another to increase visibility in search engines and to look credible.

The classification label of an object O in a network may depend on: the features of O, the labels of objects in O's neighbourhood, and the features of objects in O's neighbourhood.



Collective Classifiers

The intuition behind collective classification is that we will classify interlinked nodes simultaneously using correlations.

Relevant applications:

  • Document classification
  • Part of speech tagging
  • Link prediction
  • Optical character recognition
  • Image/3D data segmentation
  • Entity resolution in sensor network
  • Spam and fraud detection

We assume a Markov assumption for a node's behaviour given its neighbours' behaviour: the label Y_i of node i depends only on the labels of its neighbours N_i, i.e. P(Y_i | i) = P(Y_i | N_i).

Collective classification involves 3 steps:

Local Classifier

Assign initial labels based only on the node features/attributes. This is a standard classification task using basic ML concepts; this step doesn't use network information.

Relational Classifier

Capture correlations in the network between neighbouring nodes. The relational classifier learns to classify a node based on the labels/features of its neighbours. This is where we start taking advantage of the network information.

Collective Inference

Propagate correlations through the network. Iterate until the inconsistency between neighbouring labels is minimized. Network structure substantially affects the final prediction.

Exact inference in collective classification models is practical only when the network satisfies certain conditions; for arbitrary networks exact inference is NP-hard. If we represent every node as a discrete random variable with a joint mass function p of its class membership, the marginal distribution of a node is the summation of p over all the other nodes. The exact solution takes time exponential in the number of nodes, so we use inference techniques that approximate the solution by narrowing the scope of the propagation (e.g., only neighbours) and the number of variables by means of regression.
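
Concretely, writing \mathcal{L} for the set of possible states, the marginal of node i requires summing the joint mass function over every other node:

\displaystyle P(Y_i = y) = \sum_{y_1 \in \mathcal{L}} \cdots \sum_{y_{i-1} \in \mathcal{L}} \sum_{y_{i+1} \in \mathcal{L}} \cdots \sum_{y_n \in \mathcal{L}} p(y_1, \ldots, y_{i-1}, y, y_{i+1}, \ldots, y_n)

a sum over |\mathcal{L}|^{n-1} terms, which is why exact inference is exponential in the number of nodes.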



Relational Classifiers

Given a graph G with some of its nodes labeled, where node i has a feature vector f_i and label Y_i, our goal is to find P(Y_i) for the unlabeled nodes.

Probabilistic Relational Classifier

The core idea is that the class probability of Y_i is the weighted average of the class probabilities of its neighbours. For labeled nodes, we initialize P(Y_i) with the ground-truth Y labels. For unlabeled nodes, we initialize Y uniformly over all classes. Then we update all nodes in random order until convergence or until the maximum number of iterations is reached.

Repeat for each node i and label c:

\displaystyle P(Y_i = c) = \frac{1}{\sum_{(i, j) \in E} W(i, j)} \sum_{(i, j) \in E} W(i, j) \, P(Y_j = c)

We weight each neighbour's probability of having the relevant label by the corresponding edge weight, and normalize by the sum of the edge weights over all neighbours.

Convergence is not guaranteed for a probabilistic relational classifier. Also, it cannot use node features; it relies only on the class labels of neighbouring nodes.

The model is straightforward to implement.

[Figure: example network]

Consider the example network above. Green nodes [6, 7] have positive labels, blue nodes [1, 2] have negative labels, and yellow nodes [3, 4, 5, 9, 8] are unlabeled. First we initialize the positive class probability P(Y_i = 1) for all nodes: for positive nodes P(Y_i = 1) = 1, for negative nodes P(Y_i = 1) = 0, and for unlabeled nodes, as there are only two classes, P(Y_i = 1) = 0.5.

Now consider node 3. We update its label probabilities according to its neighbours' class probabilities and the edge weights. For the first iteration, using its neighbour set N_3 = [1, 2, 4], the equation above gives P(Y = 1 | N_3) = \frac{1}{3}(0 + 0 + 0.5) = 0.17.

We compute these probabilities, for every class, for all the unlabeled nodes in a single iteration. After updating the class probabilities of all unlabeled nodes, we repeat the process with the newly inferred probabilities until the node probabilities converge.
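
A minimal sketch of this update loop in Python (the full edge list is an assumption for illustration; the text only specifies N_3 = [1, 2, 4] and the labeled nodes):

```python
import random

# Hypothetical edges for the 9-node example; only node 3's neighbourhood
# {1, 2, 4} and the labels of nodes 1, 2, 6, 7 are given in the text.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (6, 7), (7, 8), (8, 9), (5, 9)]
nbrs = {i: set() for i in range(1, 10)}
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)

labeled = {1: 0.0, 2: 0.0, 6: 1.0, 7: 1.0}      # ground-truth P(Y_i = 1)
P = {i: labeled.get(i, 0.5) for i in nbrs}       # unlabeled nodes start at 0.5
unlabeled = [i for i in nbrs if i not in labeled]

for _ in range(100):
    random.shuffle(unlabeled)                    # update in random order
    max_delta = 0.0
    for i in unlabeled:
        # Unweighted graph, so W(i, j) = 1: plain average over neighbours.
        new_p = sum(P[j] for j in nbrs[i]) / len(nbrs[i])
        max_delta = max(max_delta, abs(new_p - P[i]))
        P[i] = new_p
    if max_delta < 1e-6:                         # converged
        break

print({i: round(p, 2) for i, p in sorted(P.items())})
```

If node 3 happens to be updated first, the first pass reproduces the computation above: (0 + 0 + 0.5) / 3 ≈ 0.17.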



Iterative Classifiers

Unlike relational classification, iterative classifiers also take advantage of node features. The main idea is to classify node i based on its attributes as well as the labels of its neighbour set N_i.

This is how we do it (a code sketch follows the list):

  • Create a flat vector a_i for each node i that represents its features
  • Train a classifier f(a_i) using a_i to predict the best value for Y_i
  • Update the node vector a_i based on the new node label probabilities
  • Update the label Y_i to f(a_i). This is a hard assignment.
  • Repeat the process with the new node features until convergence or the maximum number of iterations.
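
A minimal sketch of this loop, assuming a numpy feature matrix X, adjacency lists adj, an integer label array labels (meaningful only at train_idx), and logistic regression as a stand-in base classifier; all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighbour_label_counts(adj, labels, n_classes=2):
    """Per-node counts of neighbours currently assigned to each class."""
    counts = np.zeros((len(adj), n_classes))
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            counts[i, labels[j]] += 1
    return counts

def iterative_classify(X, adj, labels, train_idx, n_iter=10):
    # Step 1: bootstrap labels with a classifier on node features alone.
    base = LogisticRegression().fit(X[train_idx], labels[train_idx])
    y = labels.copy()
    known = set(train_idx)
    unknown = np.array([i for i in range(len(y)) if i not in known])
    y[unknown] = base.predict(X[unknown])

    # Steps 2-5: augment features with neighbour-label counts, retrain,
    # hard-assign labels, and repeat until the assignment stabilizes.
    for _ in range(n_iter):
        A = np.hstack([X, neighbour_label_counts(adj, y)])
        clf = LogisticRegression().fit(A[train_idx], labels[train_idx])
        new_y = y.copy()
        new_y[unknown] = clf.predict(A[unknown])   # hard assignment
        if np.array_equal(new_y, y):               # converged
            break
        y = new_y
    return y
```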

REV2 Fraudulent User Predictions in Rating Platforms

REV2 is an example of an iterative classification framework for fake reviewer/review detection. The problem it tackles is identifying review farms: a +1 increase in a product's rating increases revenue by 5-9%, so people constantly try to game e-commerce systems. There is also fake hype on social media to inflate engagement metrics, defamation, spam, and so on.

We can make use of behavioural analysis (individual features, geographic location, login times, session history, etc.) and language analysis (use of superlatives, lots of self-referencing, rate of misspellings, many agreement words). But both of these are easy to fake; a user's position in the network, the neighbourhood graph structure, is hard to fake. Graphs capture the relationships between reviewers, reviews, and stores.

The specific problem REV2 solves is fake review detection. The setup involves users and the ratings they give to various products, so the input graph G is a bipartite graph whose nodes represent users and products, with edges carrying rating scores of +1 or -1. Our goal is to find the set of users who give fake ratings.

Solution formulation

We assume users, products, and ratings each have intrinsic quality scores. Users have fairness scores: even a genuine user might have different standards for rating, and we want to capture that. Products have goodness scores: every product has an intrinsic usefulness or goodness, which we also want to capture. And the ratings users give to products have reliability scores: how reliable a particular rating is. A fraudulent user might not always give fake ratings; to game the system they might give genuine ratings to products they are not interested in, so we want to capture the reliability of each individual rating.

All these values are unknown: the fairness of a user F(u) \in [0, 1], the goodness of a product G(p) \in [-1, 1], and the reliability of a rating R(u, p) \in [0, 1]. We will now compute all these values for all nodes and edges simultaneously using iterative classification.

[Figure: bipartite user-product rating graph]

Fairness of user

We can write the fairness of a user in terms of the reliability of their ratings, like below:

\displaystyle F(u) = \frac{\sum_{(u, p) \in Out(u)} R(u, p)}{|Out(u)|}

Goodness of product

We can write the goodness of a product in terms of the reliability of its ratings and the actual rating scores, like below:

\displaystyle G(p) = \frac{\sum_{(u, p) \in In(p)} R(u, p) \cdot score(u, p)}{|In(p)|}

Reliability of ratings

We can write the reliability of a rating in terms of the fairness of the user and the goodness of the product, like below:

\displaystyle R(u, p) = \frac{1}{\gamma_1 + \gamma_2}\Bigg(\gamma_1 \cdot F(u) + \gamma_2 \cdot \bigg(1 - \frac{|score(u, p) - G(p)|}{2}\bigg)\Bigg)

The term \gamma_1 \cdot F(u) represents how fair the user giving the rating is, and the term \gamma_2 \cdot (1 - \frac{|score(u, p) - G(p)|}{2}) represents how close the rating given by user u to product p is to the goodness of product p.

In the equations above, score(u, p) is the actual rating user u gave to product p, In(p) is the set of edges into product p, and Out(u) is the set of edges going out from user u.

[Figure: initialization of fairness, goodness, and reliability scores]

We initialize every score to its best possible value, as in the image above, and then update goodness, reliability, and fairness using the actual product ratings, the network structure, and the equations above.
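
These three coupled updates can be sketched directly in Python; a minimal sketch, where the rev2 function name, the γ values, and the update order are assumptions based on the description above:

```python
import numpy as np

def rev2(ratings, n_users, n_products, g1=1.0, g2=1.0, n_iter=100, tol=1e-6):
    """ratings: list of (u, p, score) triples with score in [-1, 1].
    g1, g2 are the gamma weights; the values here are placeholders.
    Assumes every user has rated and every product has been rated."""
    F = np.ones(n_users)                      # fairness, best value = 1
    G = np.ones(n_products)                   # goodness, best value = 1
    R = {(u, p): 1.0 for u, p, _ in ratings}  # reliability, best value = 1

    by_user = {u: [] for u in range(n_users)}
    by_prod = {p: [] for p in range(n_products)}
    for u, p, s in ratings:
        by_user[u].append((p, s))
        by_prod[p].append((u, s))

    for _ in range(n_iter):
        prev = F.copy()
        for p, ins in by_prod.items():        # goodness of a product
            G[p] = sum(R[(u, p)] * s for u, s in ins) / len(ins)
        for u, p, s in ratings:               # reliability of a rating
            R[(u, p)] = (g1 * F[u] + g2 * (1 - abs(s - G[p]) / 2)) / (g1 + g2)
        for u, outs in by_user.items():       # fairness of a user
            F[u] = sum(R[(u, p)] for p, _ in outs) / len(outs)
        if np.max(np.abs(F - prev)) < tol:    # fairness scores converged
            break
    return F, G, R
```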

After convergence the scores look like the figure below; users with low fairness are the fraudsters.

[Figure: converged scores]

Properties of the REV2 solution.

REV2 is guaranteed to converge, and the number of iterations until convergence is upper-bounded; the runtime is linear in the number of edges in the graph. Note that we did not actually build or use node features here; we only used the rating edge weights/features.



Loopy Belief Propagation

Belief propagation is a dynamic programming approach to answering conditional probability queries in a graphical model. It is an iterative process in which neighbouring variables "talk" to each other by passing messages.

Every node holds beliefs about the likely states of its neighbouring nodes. We use these messages iteratively to update the state probabilities of all nodes until a consensus is reached, and then compute the final beliefs.

Consider, for example, a long chain of nodes in which each node can talk only to its two neighbours, and the task is to figure out how many nodes are in the chain. Each node passes to the next node the count of nodes on its side. If we run this message propagation from one end to the other and then in reverse, every node ends up with the total node count just by talking to its neighbours. As long as there are no cycles in the chain, this works (see the sketch below).
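
A tiny sketch of the chain-counting scheme (the chain length is arbitrary):

```python
n = 6                                  # number of nodes in the chain (arbitrary)
left = [0] * n                         # messages: count of nodes to my left
right = [0] * n                        # messages: count of nodes to my right
for i in range(1, n):                  # forward pass, left end to right end
    left[i] = left[i - 1] + 1
for i in range(n - 2, -1, -1):         # backward pass, right end to left end
    right[i] = right[i + 1] + 1
print([left[i] + right[i] + 1 for i in range(n)])   # every node knows n: [6, 6, ...]
```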

The same node-counting scheme works when the network is a tree: running it iteratively, with every node talking to all of its neighbours, each node receives reports from all the branches of the tree and can compute the total count. This works as long as the network contains no cycles; a loopy, cyclic graph breaks the argument.

Loopy Belief Propagation algorithm

The message node i sends to node j depends on what i hears from its neighbours k: each neighbour k passes to i a message with its belief about i's state.

We use the following notation in the Loopy BP algorithm:

Label-label potential matrix \psi: the dependency between a node and its neighbour. \psi(Y_i, Y_j) equals the probability of node j being in state Y_j given that it has a neighbour i in state Y_i.

Prior belief \phi: the probability \phi_i(Y_i) of node i being in state Y_i.

Message m_{i \to j}(Y_j): i's estimate of j being in state Y_j.

And finally, \mathcal{L} is the set of all possible node states.

We initialize all messages m_{i \to j} to 1 for all i, j, and repeat for each node:

\displaystyle m_{i \to j}(Y_j) = \alpha \sum_{Y_i \in \mathcal{L}} \psi(Y_i, Y_j) \, \phi_i(Y_i) \prod_{k \in \mathcal{N}_i \setminus j} m_{k \to i}(Y_i)

\displaystyle b_i(Y_i) = \alpha \, \phi_i(Y_i) \prod_{j \in \mathcal{N}_i} m_{j \to i}(Y_i), \quad \forall Y_i \in \mathcal{L}

where the product term \prod_{j \in \mathcal{N}_i} collects all messages from node i's neighbours, and b_i(Y_i) is node i's belief of being in state Y_i.
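
A minimal sketch of these two equations in Python, assuming a single pairwise potential matrix psi shared by all edges and per-node priors phi (the loopy_bp name and the synchronous update schedule are illustrative choices):

```python
import numpy as np

def loopy_bp(adj, psi, phi, n_iter=50):
    """adj: dict node -> neighbour list (symmetric); psi: (n_states, n_states)
    potential, psi[yi, yj] for an edge i->j; phi: dict node -> prior vector."""
    n_states = psi.shape[0]
    # Initialize every message m_{i->j} to all ones.
    msgs = {(i, j): np.ones(n_states) for i in adj for j in adj[i]}

    for _ in range(n_iter):
        new_msgs = {}
        for i in adj:
            for j in adj[i]:
                # Product of incoming messages from all neighbours except j.
                incoming = np.ones(n_states)
                for k in adj[i]:
                    if k != j:
                        incoming *= msgs[(k, i)]
                # m_{i->j}(Y_j) = alpha * sum_{Y_i} psi(Y_i, Y_j) phi_i(Y_i) prod(...)
                m = psi.T @ (phi[i] * incoming)
                new_msgs[(i, j)] = m / m.sum()      # alpha: normalize
        msgs = new_msgs

    # Belief: b_i(Y_i) = alpha * phi_i(Y_i) * product of all incoming messages.
    beliefs = {}
    for i in adj:
        b = phi[i].copy()
        for j in adj[i]:
            b *= msgs[(j, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```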

Messages from different subgraphs are no longer independent if our network has cycles. BP is a local algorithm, so it doesn't see the cycles, and there are extreme examples where BP fails to converge. BP treats two messages within a cycle as if they were independent, separate pieces of evidence, even though one influenced the other via the cycle. In practice, however, the cyclic influences are often weak, because cycles are long or include at least one weak correlation.

Loopy BP is easy to program and parallelise, and it applies to any graphical model with any form of potentials, even higher-order than pairwise. But convergence is not guaranteed, especially if there are many closed loops. Potential functions with parameters must be trained, and we have to use gradient-based optimization to learn these parameters.



Online Auction Fraud detection using Loopy BP

The algorithm from [NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks] uses loopy BP to detect fraud on auction websites such as eBay. Auction sites are an attractive target for fraud: auction fraud made up 63% of complaints to the US Federal Internet Crime Complaint Center in 2006, with an average loss of $385 per incident.

We could build a fraud detection classifier on users alone, using attributes like geographic location, login times, and session history. But these are easy to fake, and fraudsters constantly try to beat the system by building trust through legitimate transactions. The actual graph structure, whom they interact with for legitimate transactions and how they defraud normal users, is much harder to fake.

The main question we need to understand is how fraudsters interact with each other and with actual users. Beyond buy-sell relations, are there more complex relations?

The reputation score each user receives from other users after a transaction is the feedback mechanism we use to build our fraud detection algorithm.

From a basic understanding of how these fraudsters operate, we define three roles a user node can take: fraudster, accomplice, and honest. Fraudsters don't simply boost each other's reputation, because if one were caught, all the others would be caught too. Instead they use accomplices, who trade honestly with other users and look legitimate. A fraudster trades with accomplices and defrauds honest users. These are the roles users in an auction system adopt, and our goal is to find the probability of each node taking each of these roles.

Accomplices and fraudsters form near-bipartite cores.

We use belief propagation to find the roles of all users in the auction network. The challenge is choosing the BP parameters: the prior beliefs are unbiased when we have no prior knowledge, and the compatibility potentials are set by insight. We use the potential matrix \psi(Y_i, Y_j) below, which encodes a neighbour's understanding of what state a node can be in given the neighbour's own state.

| Neighbour state \ Node state | Fraud | Accomplice | Honest |
|---|---|---|---|
| Fraud | \epsilon_p | 1 - 2\epsilon_p | \epsilon_p |
| Accomplice | 0.5 | 2\epsilon_p | 0.5 - 2\epsilon_p |
| Honest | \epsilon_p | (1 - \epsilon_p)/2 | (1 - \epsilon_p)/2 |

Using the potential matrix above, derived from user insight, and initializing all nodes as unbiased, at each iteration we compute, for each node, the messages to its neighbours. We continue until the role probabilities of individual nodes converge; the final belief scores give the final roles.
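
As a rough sketch, this potential matrix and the unbiased priors could be wired into the loopy_bp sketch from the previous section like this (the value of \epsilon_p and the toy transaction graph are assumptions):

```python
import numpy as np

eps = 0.05                                  # epsilon_p; the value is assumed
# Rows: neighbour's state, columns: node's state (fraud, accomplice, honest).
psi = np.array([
    [eps,  1 - 2 * eps,    eps          ],  # neighbour is a fraudster
    [0.5,  2 * eps,        0.5 - 2 * eps],  # neighbour is an accomplice
    [eps,  (1 - eps) / 2,  (1 - eps) / 2],  # neighbour is honest
])

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}     # toy transaction graph (assumed)
phi = {i: np.ones(3) / 3 for i in adj}      # unbiased priors over the 3 roles
beliefs = loopy_bp(adj, psi, phi)           # reuse the sketch from above
print({i: b.round(3) for i, b in beliefs.items()})
```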