17. Reasoning over Knowledge Graphs
Knowledge graphs are networks that capture entities, their types, and the relationships between them. Nodes are entities, each node is labeled with its type, and edges between two nodes capture the relationship between those entities.
Bibliographic networks are one example of a knowledge graph. Node types include paper, title, author, conference, and year, and these nodes are connected by relation types such as published_Where, published_Year, has_Title, has_author, etc.
Social networks can also be represented as knowledge graphs, with node types such as account, song, post, food, and channel, and relation types such as friend, like, cook, watch, and listen.
Google's Knowledge Graph example
A sample application of knowledge graphs is serving information. Given a search query such as "latest films by the director of Titanic", instead of searching through all the unstructured data on the web, a knowledge graph of movies and directors lets us answer the query much faster.
Knowledge graphs are also useful in question answering systems and conversational agents.
Some publicly available knowledge graphs: Freebase, Wikidata, DBpedia, YAGO, NELL, etc.
Knowledge Graph Completion
Knowledge graphs are usually massive, with millions of nodes and edges, and most of them are incomplete, with many true edges missing. For such a massive knowledge graph, enumerating all possible true relationships and facts is intractable, so we need to be able to predict plausible but missing links.
Freebase Knowledge Graph
Freebase is a knowledge graph with ~50 million entities, ~38K relation types, and ~3 billion facts about relationships between entities. Even with this many relationship edges, 93% of the people in the KG have no place of birth and 78% have no nationality.
FB15k and FB15k-237 are refined subsets of the original KG built for research purposes, and they make a good starting point.
Dataset | Entities | Relations | Total Edges |
---|---|---|---|
FB15k | 14,951 | 1,345 | 592,213 |
FB15k-237 | 14,505 | 237 | 310,079 |
Given an enormous knowledge graph, our task is to complete the KG by predicting missing relations. This is hard because we have to predict both which pairs of entities are related and what kind of relationship holds between them.
KG Representation
Edges are represented as triples (h, r, t), where the head h has relation r with the tail t. The key idea is to model entities and relations in an embedding/vector space $\Re^d$.
Given a relationship triple (h, r, t) that is present in the KG, the goal is for the embedding of (h, r) to be close to the embedding of t. We need a way to embed (h, r) and a way to define closeness between (h, r) and t.
Not all relations have the same properties: some are symmetric, some are not, and so on.
- Symmetric relations, where $r(h, t) \Rightarrow r(t, h) \; \forall h, t$. Roommate is an example.
- Composition relations, where $r_1(x, y) \wedge r_2(y, z) \Rightarrow r_3(x, z) \; \forall x, y, z$, with $r_1, r_2, r_3$ being different relations. For example, one's mother's husband is one's father.
- 1-to-N and N-to-1 relations, where $r(h, t_1), r(h, t_2), \cdots, r(h, t_n)$ are all true, for example when r is the relation type "StudentsOf".
TransE
As we saw in previous sections, TransE uses the translation intuition for relations: for a triple (h, r, t), its embeddings $\bold{h}, \bold{r}, \bold{t} \in \Re^d$ should satisfy $\bold{h} + \bold{r} \approx \bold{t}$.
The score function is $f_r(h, t) = ||h + r - t||$.
The max-margin loss is given by:
$$\mathcal{L} = \sum_{(h, r, t) \in G,\, (h, r, t') \notin G} [\gamma + f_r(h, t) - f_r(h, t')]_+$$
where (h, r, t) is a valid triple, (h, r, t') is a corrupted (negative) triple, and $\gamma$ is the margin, i.e. the smallest gap the model tolerates between the score of a valid triple and that of a corrupted one.
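To make this concrete, here is a minimal PyTorch sketch of the TransE score and max-margin loss; the tensor shapes and batch size are illustrative, not taken from any reference implementation.

```python
# Minimal sketch of the TransE score f_r(h, t) = ||h + r - t|| and the
# max-margin loss, assuming batched embedding tensors (illustrative shapes).
import torch
import torch.nn.functional as F

def transe_score(h, r, t):
    # L2 norm of the translation error for each triple in the batch
    return torch.norm(h + r - t, p=2, dim=-1)

def margin_loss(h, r, t, t_neg, gamma=1.0):
    # [gamma + f_r(h, t) - f_r(h, t')]_+ summed over the batch
    pos = transe_score(h, r, t)
    neg = transe_score(h, r, t_neg)
    return F.relu(gamma + pos - neg).sum()

# Toy usage: a batch of 4 triples with embedding dimension d = 8
d = 8
h, r, t, t_neg = (torch.randn(4, d) for _ in range(4))
loss = margin_loss(h, r, t, t_neg, gamma=1.0)
```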
Composition in TransE
When we have relations that follow composition, i.e. $r_1(x, y) \wedge r_2(y, z) \Rightarrow r_3(x, z) \; \forall x, y, z$, TransE can model the composition as $r_3 = r_1 + r_2$ in the embedding space.
Symmetry in TransE
Suppose we want TransE to handle a symmetric relation r, i.e. for all h, t, r(h, t) holds whenever r(t, h) holds. This requires $||h + r - t|| = 0$ and $||t + r - h|| = 0$, which forces $r = 0$ and $h = t$. But h and t are different entities; representing them with the same embedding because of a single symmetric relation r is a problem, as the short derivation below shows.
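Written out as a short derivation (just the argument above in equation form):

```latex
\begin{align*}
  r(h, t) &\;\Rightarrow\; h + r = t \\
  r(t, h) &\;\Rightarrow\; t + r = h \\
          &\;\Rightarrow\; (h + r) + r = h
           \;\Rightarrow\; 2r = 0
           \;\Rightarrow\; r = 0
           \;\Rightarrow\; h = t
\end{align*}
```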
N-ary Relations
If $(h, r, t_1)$ and $(h, r, t_2)$ both exist in the knowledge graph, then with TransE $t_1$ and $t_2$ map to the same vector, although they are different entities.
Since $h + r = t_1 = t_2$ in TransE, TransE squashes all N entities into a single embedding whenever we have N-ary relations.
So TransE is not expressive enough for knowledge graphs that contain symmetric, N-to-1, or 1-to-N relations.
TransR
TransR models entities as vectors in the entity space $\Re^d$ and models each relation in a relation space $\Re^k$, with $M_r \in \Re^{k \times d}$ as the projection matrix. The embedding spaces for entities and relations are kept separate.
If (h, r, t) is a valid triple in the knowledge graph, then $h_\bot = M_r h$, $t_\bot = M_r t$, and $f_r(h, t) = ||h_\bot + r - t_\bot||$.
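A minimal sketch of the TransR score, assuming single (unbatched) embeddings; the dimensions d and k below are arbitrary illustrations.

```python
# Sketch of the TransR score: project h and t into the relation space with
# M_r before translating by r (illustrative, unbatched version).
import torch

def transr_score(h, r, t, M_r):
    # h, t in R^d, r in R^k, M_r in R^{k x d}
    h_proj = M_r @ h          # h_perp = M_r h
    t_proj = M_r @ t          # t_perp = M_r t
    return torch.norm(h_proj + r - t_proj, p=2)

d, k = 8, 4
h, t = torch.randn(d), torch.randn(d)
r = torch.randn(k)
M_r = torch.randn(k, d)
score = transr_score(h, r, t, M_r)
```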
Symmetry in TransR
For symmetric relation types r, where $r(h, t) \Rightarrow r(t, h) \; \forall h, t$, h and t have different embeddings in the entity space, but their projections into the relation space, $h_\bot$ and $t_\bot$, can be the same.
In TransR, $h_\bot = M_r h = M_r t = t_\bot$, so we can map h and t to the same location in the relation space.
N-ary Relations in TransR
In TransR we can learn $M_r$ so that $t_\bot = M_r t_1 = M_r t_2$ when the valid triples $(h, r, t_1)$ and $(h, r, t_2)$ are in the knowledge graph, while $t_1$ and $t_2$ keep their own embeddings in the entity space.
Composition Relations in TransR
TransR fails for relations that can be composed: each relation has its own projection space, so it is not naturally compositional across multiple relations.
Summary of TransE and TransR
Embedding | Entity | Relation | $f_r(h, t)$ |
---|---|---|---|
TransE | $h, t \in \Re^d$ | $r \in \Re^d$ | $\lVert h + r - t \rVert$ |
TransR | $h, t \in \Re^d$ | $r \in \Re^k, M_r \in \Re^{k \times d}$ | $\lVert M_r h + r - M_r t \rVert$ |
Embedding | Symmetry | Composition | One to Many |
---|---|---|---|
TransE | ❌ | ✅ | ❌ |
TransR | ✅ | ❌ | ✅ |
Queries
Our goal is to use knowledge graphs for multi-hop reasoning, i.e., answering complex queries efficiently over an incomplete, massive KG.
Example of different kinds of queries:
- One-hop Queries: Where did Hinton graduate?
- Path Queries: Where did Turing Award winners graduate?
- Conjunctive Queries: Where did Canadians with a Turing Award graduate?
- EPFO Queries: Where did Canadians with a Turing Award or a Nobel Prize graduate?
One-hop Query
This is the easiest kind of query; it is essentially a link prediction query on the knowledge graph.
In link prediction we ask whether the link (h, r, t) is true, whereas in a one-hop query we ask whether t is an answer to the query (h, r).
Path Queries
Path queries generalize one-hop queries by adding more relations along the path. A path query can be represented as $q = (v_a, r_1, \cdots, r_n)$, where $v_a$ is an anchor (constant) node.
We define the computation graph of q as the chain of relation traversals specified by the query.
In the question "Where did Turing Award winners graduate?", $v_a$ is "Turing Award" and $(r_1, r_2)$ is ("win", "graduate"). Given a knowledge graph and the computation graph representing the query below, how do we get the answer?
We can answer path queries by traversing the knowledge graph, as shown below.
We can traverse an incomplete KG and miss many answers, or we can first do link prediction and complete the KG with probabilistic edges, which makes the KG dense. The time complexity of traversing a dense KG with |V| entities to answer $(v_a, r_1, \cdots, r_n)$ of length n is $O(|V|^n)$.
Instead, we work with embedded queries, generalizing TransE to multi-hop reasoning: given a path query $q = (v_a, r_1, \cdots, r_n)$, we form $q = v_a + r_1 + \cdots + r_n$ in the embedding space. We then do a nearest-neighbor search over all entities using $f_q(v) = ||q - v||$, with time complexity $O(|V|)$.
Guu, Kelvin, John Miller, and Percy Liang. "Traversing Knowledge Graphs in Vector Space." EMNLP 2015.
Below you can see the computation graph of the path query and the embedding process for the question "Where did Turing Award winners graduate?"
We train the usual TransE model and simply reuse its embeddings to answer path queries.
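To make the embedding-based traversal concrete, here is a minimal sketch of answering a path query with TransE-style additive composition and an $O(|V|)$ nearest-neighbor search; the entity table, anchor, and relation names are made up for illustration.

```python
# Sketch of answering a path query q = (v_a, r_1, ..., r_n):
# embed q = v_a + r_1 + ... + r_n, then score every entity with f_q(v) = ||q - v||.
import torch

def embed_path_query(anchor_emb, relation_embs):
    # q = v_a + r_1 + ... + r_n in the embedding space
    return anchor_emb + torch.stack(relation_embs).sum(dim=0)

def answer_query(q, entity_matrix, k=5):
    # O(|V|) nearest-neighbour search over all entity embeddings
    dists = torch.norm(entity_matrix - q, p=2, dim=-1)
    return torch.topk(dists, k, largest=False).indices

d, num_entities = 8, 100
entity_matrix = torch.randn(num_entities, d)        # pretrained entity embeddings
v_a = entity_matrix[0]                              # anchor, e.g. "Turing Award"
r_win, r_graduate = torch.randn(d), torch.randn(d)  # relation embeddings
q = embed_path_query(v_a, [r_win, r_graduate])
top_answers = answer_query(q, entity_matrix)
```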
Conjunctive Queries
Conjunctive queries have multiple anchor nodes, with different relation paths that must be satisfied by the same end node. An example of a conjunctive query is "Where did Canadian citizens with a Turing Award graduate?". The computation graph of this conjunctive query looks like below.
If we had a complete KG, we could simply traverse it starting from each anchor node, obtaining the paths below, and then intersect the resulting entity sets to arrive at the answers.
As with path traversal, this is not tractable in the KG space, so we do it in the embedding space instead; the problem is then figuring out how to perform the intersection in the embedding space. You can see the computation graph and the embedding process of the above query in the image below.
We make use of a neural intersection operator to intersect several query embeddings in the embedding space.
For this we have several current query embeddings $q_1, \cdots, q_m$, and the output has to be the embedding of the intersection query q. The order of the queries should not matter, so the operation must be permutation invariant over the embedding vectors; that is, $\mathcal{J}$ below should be permutation invariant.
$\mathcal{J}(q_1, \cdots, q_m) = \mathcal{J}(q_{p(1)}, \cdots, q_{p(m)})$, where $[p(1), \cdots, p(m)]$ is any permutation of $[1, \cdots, m]$.
We use the DeepSets architecture to obtain the intersection embedding from the initial query embeddings.
In the above architecture, $\phi$ and $\beta$ are neural networks: $\phi$ generates features for each input query embedding, the features are aggregated with a permutation-invariant operator such as the mean, and the result is passed through $\beta$ to obtain the vector embedding of the intersection query; a sketch is given below.
We train this along with our embedding method on our KG.
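Here is a minimal sketch of a DeepSets-style neural intersection operator; the hidden size and the exact layers of $\phi$ and $\beta$ are illustrative choices, not the architecture from the lecture.

```python
# DeepSets-style neural intersection: phi featurizes each query embedding,
# a permutation-invariant mean pools them, beta maps back to a query embedding.
import torch
import torch.nn as nn

class NeuralIntersection(nn.Module):
    def __init__(self, d, hidden=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.beta = nn.Linear(hidden, d)

    def forward(self, queries):
        # queries: (m, d) -- the m query embeddings to intersect
        feats = self.phi(queries)        # per-query features, (m, hidden)
        pooled = feats.mean(dim=0)       # permutation-invariant aggregation
        return self.beta(pooled)         # embedding of the intersection query

d = 8
intersect = NeuralIntersection(d)
q1, q2 = torch.randn(d), torch.randn(d)
q_inter = intersect(torch.stack([q1, q2]))
```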
Training
Given an entity embedding v and a query embedding q, the distance is $f_q(v) = ||q - v||$. The trainable parameters are:
- Entity embeddings: d|V|
- Relation embeddings: d|R|
- Intersection operator: $\phi, \beta$
The number of parameters in the intersection operator does not depend on the graph size. We use the same training strategy as TransE:
We sample a query q, an answer v, and a negative sample v'; embed the query q; calculate the distances $f_q(v)$ and $f_q(v')$; and optimize the loss $\mathcal{L}$.
Given a test query q, we embed it, calculate $f_q(v)$ for every entity v in the KG, and return the best-scoring entity.
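Putting the pieces together, here is a sketch of the parameter set and one training step; the entity/relation counts, the sampled indices, and the margin are all made up, and the query shown is a simple one-hop query for brevity.

```python
# Sketch of the trainable parameters (d|V| entity + d|R| relation embeddings)
# and a single margin-loss optimization step on a sampled (query, answer,
# negative) example. Everything here is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_entities, num_relations, d, gamma = 1000, 50, 8, 1.0
entity_emb = nn.Embedding(num_entities, d)      # d|V| parameters
relation_emb = nn.Embedding(num_relations, d)   # d|R| parameters
# a NeuralIntersection module (see earlier sketch) would be trained jointly

params = list(entity_emb.parameters()) + list(relation_emb.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

# One step: embed the query, score the answer and a negative, apply the margin loss.
q = entity_emb(torch.tensor(0)) + relation_emb(torch.tensor(3))   # one-hop query (v_a, r)
v, v_neg = entity_emb(torch.tensor(7)), entity_emb(torch.tensor(42))
loss = F.relu(gamma + torch.norm(q - v) - torch.norm(q - v_neg))
opt.zero_grad()
loss.backward()
opt.step()
```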
Limitations
We need a better way of computing intersections for conjunctive queries. When we traverse the KG to find answers, each step produces a set of reachable entities, but we do not model these sets; we only operate on the embeddings of the relations. Can we find a more expressive geometry to embed the queries?
Box Embeddings
Instead of operating on vector embeddings, we want to properly model the sets of entities we obtain after each step of the query. For this we use box embeddings: we embed queries as hyper-rectangles (boxes)
$$q = (\text{Center}(q), \text{Offset}(q))$$
Taking the intersection of query vectors is not an intuitive operation, but the intersection of boxes is well defined. Boxes are also a powerful abstraction for modeling the sets of entities we end up with after each query step: those entities can be thought of as being enclosed by a box with a projected center and offset.
Parameters
- Entity embeddings: d|V| parameters; entities can be seen as zero-volume boxes.
- Relation embeddings: 2d|R| parameters, since each relation is modeled as an actual box with a center and an offset. Here we augment the earlier relation embeddings with an offset.
- Intersection operator $\phi, \beta$: its number of parameters does not depend on the graph size, but the key difference is that its inputs and output are now boxes.
Geometric projection operator $\mathcal{P}: \text{Box} \times \text{Relation} \rightarrow \text{Box}$:
$$\text{Center}(q') = \text{Center}(q) + \text{Center}(r)$$
$$\text{Offset}(q') = \text{Offset}(q) + \text{Offset}(r)$$
The geometric projection operator increases the size of the box as more relations are applied: an entity with zero volume becomes a box, and a projection applied to a box makes it even bigger.
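A minimal sketch of the geometric projection operator, representing a box as a (center, offset) pair; the Box container and dimensions are illustrative.

```python
# Geometric projection P: Box x Relation -> Box.
# Cen(q') = Cen(q) + Cen(r); Off(q') = Off(q) + Off(r), so the box only grows.
import torch
from collections import namedtuple

Box = namedtuple("Box", ["center", "offset"])

def project(box, rel_box):
    return Box(box.center + rel_box.center, box.offset + rel_box.offset)

d = 8
entity_box = Box(center=torch.randn(d), offset=torch.zeros(d))  # zero-volume box
rel_win = Box(center=torch.randn(d), offset=torch.rand(d))      # relation box
q_box = project(entity_box, rel_win)   # projected query box, now with volume
```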
The computation graph and the embedding space for the query discussed above, using box embeddings, look like the following.
Instead of the neural intersection operator we use a geometric intersection operator $\mathcal{J}: \text{Box} \times \cdots \times \text{Box} \rightarrow \text{Box}$.
The new center of the intersected box is a weighted average of the input centers, and the new offset shrinks: the geometric intersection operator shrinks the size of the boxes.
$$\text{Center}(q_{\text{inter}}) = \sum_i w_i \odot \text{Center}(q_i)$$
$$\text{Offset}(q_{\text{inter}}) = \min(\text{Offset}(q_1), \cdots, \text{Offset}(q_n)) \odot \sigma(\text{DeepSets}(q_1, \cdots, q_n))$$
The min guarantees shrinking, and the sigmoid squashes its output into (0, 1). DeepSets is used to identify which boxes are more important; a sketch of the operator is given below.
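A minimal sketch of the geometric intersection operator under these formulas; the tiny linear networks standing in for the attention weights and the DeepSets gate are illustrative simplifications.

```python
# Geometric intersection of boxes: attention-weighted average of centers,
# elementwise-min offset shrunk by a sigmoid gate in (0, 1).
import torch
import torch.nn as nn

def intersect_boxes(centers, offsets, att_net, gate_net):
    # centers, offsets: (m, d) stacks of the m input boxes
    w = torch.softmax(att_net(centers), dim=0)           # weights w_i over boxes
    new_center = (w * centers).sum(dim=0)                # sum_i w_i * Cen(q_i)
    gate = torch.sigmoid(gate_net(offsets.mean(dim=0)))  # stand-in for sigma(DeepSets(...))
    new_offset = offsets.min(dim=0).values * gate        # min(...) * gate: guaranteed shrink
    return new_center, new_offset

d = 8
att_net, gate_net = nn.Linear(d, d), nn.Linear(d, d)
centers, offsets = torch.randn(3, d), torch.rand(3, d)
c_inter, o_inter = intersect_boxes(centers, offsets, att_net, gate_net)
```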
Following the computation graph on the complete query with box embeddings looks like the following in the knowledge graph.
Entity to box distance
Once we have the query box q and an entity vector v, we compute the distance between the two as follows:
$$d_{\text{box}}(q, v) = d_{\text{out}}(q, v) + \alpha \cdot d_{\text{in}}(q, v), \quad 0 < \alpha < 1$$
$d_{\text{out}}$ is the distance to the box boundary if the entity is outside the box, and $d_{\text{in}}$ is the distance from the center.
We defined it this way to penalize entities outside the box while training.
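A sketch of the entity-to-box distance with this inside/outside split; the use of the L1 norm and the value of alpha are assumptions for illustration.

```python
# d_box(q, v) = d_out(q, v) + alpha * d_in(q, v):
# d_out measures how far v lies outside the box (0 if inside),
# d_in measures the distance from the center, capped at the box boundary.
import torch

def box_distance(center, offset, v, alpha=0.2):
    q_max, q_min = center + offset, center - offset
    d_out = torch.norm(torch.relu(v - q_max) + torch.relu(q_min - v), p=1)
    d_in = torch.norm(center - torch.min(q_max, torch.max(q_min, v)), p=1)
    return d_out + alpha * d_in

d = 8
center, offset, v = torch.randn(d), torch.rand(d), torch.randn(d)
dist = box_distance(center, offset, v)
```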
Training
Given a set of queries and answers, we sample a query q, a valid answer v, and a wrong answer v'. The loss is:
$$\mathcal{L} = -\log \sigma(\gamma - d_{\text{box}}(q, v)) - \log \sigma(d_{\text{box}}(q, v') - \gamma)$$
Summary of Query algorithms
From the paper
Embedding | Symmetry | Composition | One-to-many |
---|---|---|---|
TransE | ❌ | ✅ | ❌ |
TransH | ✅ | ❌ | ✅ |
Query2Box | ✅ | ✅ | ✅ |
N-ary Relations in Query2Box
Symmetric Relations in Query2Box
Composition Relations in Query2Box
EPFO Queries
Conjunctive queries plus disjunction are called Existential Positive First-Order (EPFO) queries.
A sample EPFO query is "Where did Canadians with a Turing Award or a Nobel Prize graduate?"
For the design of a disjunction operator and how to embed EPFO queries in a low-dimensional vector space, you can read this paper.