Semantic search using Microsoft's Phi-2 and Spotify Annoy
Created on January 18|Last edited on May 23
Finding relevant information online quickly and accurately can be a daunting task. Semantic search has recently gained popularity as an approach to information retrieval because it ranks results by meaning and context rather than relying on keyword matching alone.
In this article, we'll dig into semantic search, how it works, and what it's used for. We'll also walk you through a quick example near the end. Here's what we'll be covering:
Table of Contents
Key components of semantic search
Building a Semantic Search Pipeline
Microsoft Phi-2
Using W&B
Dataset
Code
Results
References
Key components of semantic search
Entity recognition
Semantic search excels at recognizing entities (specific objects, people, or concepts) within a given context. By understanding the relationships between these entities, the search engine can provide more relevant and contextually rich results.
Contextual understanding
Unlike traditional search, semantic search takes context into account. It considers the user's search history, location, and other relevant factors to tailor results based on the individual's unique context, delivering a more personalized and meaningful experience.
Natural language processing (NLP)
NLP plays a pivotal role in semantic search, enabling machines to interpret and generate human-like language. By comprehending the intent behind user queries, search engines can return results that align more closely with the user's needs.
Building a Semantic Search Pipeline
The pipeline of a semantic search system involves a series of steps that transform a user query into relevant search results based on the meaning and context of the query. Below is a typical pipeline for semantic search:
1. User Input:
The process begins with the user entering a query into the search system. This query can be a natural language sentence, a question, or any textual input seeking information.
2. Preprocessing:
The user input undergoes preprocessing to standardize and tokenize the text. This step involves breaking down the input text into smaller units, such as words or subwords, making it suitable for processing by the semantic search model.
3. Large Language Model (LLM):
The preprocessed query is then fed into a Large Language Model (LLM), such as GPT-3 or BERT. These models are pre-trained on vast amounts of textual data and have a deep understanding of language, allowing them to capture context, semantics, and relationships between words.
4. Embedding:
The LLM transforms the preprocessed query into a high-dimensional vector representation known as an embedding. This embedding encodes the semantic information of the query, capturing its meaning in a numerical format.
5. Document Embedding:
Similarly, each document in the search corpus (the collection of documents to be searched) is also transformed into embeddings using the same LLM. This step is usually performed offline, and the document embeddings are stored for efficient retrieval during the search process.
6. Semantic Matching:
The semantic matching step involves comparing the query embedding with the embeddings of all the documents in the search corpus. Various similarity metrics, such as cosine similarity, may be employed to measure the similarity between the query and each document.
7. Ranking:
The results from the semantic matching step are ranked based on their similarity scores. Documents with higher similarity scores are considered more relevant to the user query and are placed higher in the ranked list.
8. Post-Processing:
Post-processing may involve additional filtering or ranking adjustments based on specific criteria. For example, you might filter out documents that do not meet certain relevance thresholds or adjust rankings based on other contextual factors.
9. Presentation of Results:
The final step involves presenting the search results to the user in a user-friendly format. This could be a list of documents, snippets, or any other format depending on the nature of the search application.
10. User Feedback and Iteration:
Semantic search systems often incorporate mechanisms for user feedback. This feedback loop helps improve the system over time by learning from user interactions. If a user indicates that a particular result was helpful or not, the system can use this information to enhance its understanding and improve future search results.
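To make steps 2 through 8 more concrete, here is a minimal, self-contained sketch of the pipeline in plain Python. It is an illustration under stated assumptions, not the implementation used later in this article: the `embed` function is a simple bag-of-words stand-in for a real LLM embedding, and the function names, sample documents, and the `min_score` threshold are all made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Step 2: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text, vocabulary):
    """Steps 3-5 stand-in: a bag-of-words count vector, NOT a learned embedding."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(u, v):
    """Step 6: cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, documents, min_score=0.1):
    """Steps 6-8: score every document, rank by score, drop weak matches."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    # In a real system the document vectors are computed offline and stored.
    doc_vectors = [embed(doc, vocabulary) for doc in documents]
    query_vector = embed(query, vocabulary)
    scored = [
        (cosine_similarity(query_vector, vec), doc)
        for vec, doc in zip(doc_vectors, documents)
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [(score, doc) for score, doc in ranked if score >= min_score]

# Hypothetical three-document corpus, for illustration only.
documents = [
    "neural networks for solving differential equations",
    "renormalization in scalar tensor theories",
    "neuronal plasticity and error correction",
]
results = search("neural networks", documents)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Because the stand-in embedding only counts shared words, the query matches the first document and nothing else. Swapping `embed` for a model-based embedding is what would let a query match documents that use different words for the same idea.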
Microsoft Phi-2
For this tutorial, we will use Microsoft's Phi-2 model. Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.
This model hasn't been fine-tuned through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more. Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks.
Two aspects of its training stand out. First, the training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. It is further augmented with carefully selected web data that is filtered based on educational value and content quality. Second, innovative techniques are used to scale up, starting from Microsoft's 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also shows a clear boost in Phi-2 benchmark scores.
Using W&B
Create a W&B account and install the wandb package using:
pip install wandb
Then log in using:
wandb login
Dataset
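Use the following code to log the dataset as a W&B Table. It assumes df is a pandas DataFrame with the id, title, abstract, and categories columns used in the code below; the project name is a placeholder:
import wandb
run = wandb.init(project="semantic-search")  # placeholder project name
table = wandb.Table(dataframe=df)
run.log({"data": table})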
Code
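The snippet below represents each abstract by its padded token IDs (a 384-dimensional vector) and indexes those vectors with Annoy.
import torch
from annoy import AnnoyIndex
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_LEN = 384  # fixed vector length used for every abstract and for the query

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
# Padding to max_length needs a pad token; fall back to EOS if the tokenizer has none.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# The model is loaded here, but the vectors below come from the tokenizer alone.
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)

def to_vector(text):
    # Return the padded/truncated token IDs of `text` as a float vector of length MAX_LEN.
    enc = tokenizer(text, return_tensors="pt", max_length=MAX_LEN, padding="max_length", truncation=True)
    vector = torch.zeros(MAX_LEN)
    vector[:enc.input_ids.shape[1]] = enc.input_ids[0]
    return vector

vectors = {}  # paper id -> vector
archive = {}  # paper id -> [title, abstract, categories]
for _, row in df.iterrows():
    vectors[row["id"]] = to_vector(row["abstract"])
    archive[row["id"]] = [row["title"], row["abstract"], row["categories"]]

# Annoy items are keyed by integer, so keep a map back to the paper id.
item_to_id = {}
index = AnnoyIndex(MAX_LEN, "angular")  # pass the metric explicitly
for i, paper_id in enumerate(vectors):
    index.add_item(i, vectors[paper_id].tolist())
    item_to_id[i] = paper_id

# Build the index with a single tree, as in the original run.
# More trees generally improve recall at the cost of build time and index size.
index.build(1)

query = "Physics"
target_vector = to_vector(query)

k = 3
nearest_neighbors = index.get_nns_by_vector(target_vector.tolist(), k)
for rank, item in enumerate(nearest_neighbors, start=1):
    print(f"Result {rank}")
    print("*" * 10)
    for field in archive[item_to_id[item]]:
        print(field)
    print(" ")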
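Note that the code above feeds the tokenizer's integer IDs to Annoy directly, so the model's weights play no part in the vectors. One common alternative, which is not what this article's code does, is to run the text through the model and average its per-token hidden states while skipping padded positions (mean pooling). The plain-Python sketch below shows only that pooling step on made-up numbers; `mean_pool`, `hidden_states`, and `attention_mask` are illustrative names, and the values are not real model outputs.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore padded positions."""
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            kept += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if kept == 0:
        return totals  # nothing to average; return the zero vector
    return [total / kept for total in totals]

# Made-up 2-dimensional "hidden states" for a 4-token sequence.
# The last position is padding, so its mask entry is 0.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
pooled = mean_pool(hidden_states, attention_mask)
print(pooled)  # [3.0, 4.0]
```

With a real model, the pooled vector would have the model's hidden size rather than 384, and the Annoy index dimension would need to match it.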
Results
Result 1
**********
Scientific Machine Learning through Physics-Informed Neural Networks:
Where we are and What's next
Physics-Informed Neural Networks (PINN) are neural networks (NNs) that encode
model equations, like Partial Differential Equations (PDE), as a component of
the neural network itself. PINNs are nowadays used to solve PDEs, fractional
equations, integral-differential equations, and stochastic PDEs. This novel
methodology has arisen as a multi-task learning framework in which a NN must
fit observed data while reducing a PDE residual. This article provides a
comprehensive review of the literature on PINNs: while the primary goal of the
study was to characterize these networks and their related advantages and
disadvantages. The review also attempts to incorporate publications on a
broader range of collocation-based physics informed neural networks, which
stars form the vanilla PINN, as well as many other variants, such as
physics-constrained neural networks (PCNN), variational hp-VPINN, and
conservative PINN (CPINN). The study indicates that most research has focused
on customizing the PINN through different activation functions, gradient
optimization techniques, neural network structures, and loss function
structures. Despite the wide range of applications for which PINNs have been
used, by demonstrating their ability to be more feasible in some contexts than
classical numerical techniques like Finite Element Method (FEM), advancements
are still possible, most notably theoretical issues that remain unresolved.
cs.LG cs.AI cs.NA math.NA physics.data-an
Result 2
**********
Anomalies, equivalence and renormalization of cosmological frames
We study the question of whether two frames of a given physical theory are
equivalent or not in the presence of quantum corrections. By using field theory
arguments we claim that equivalence is broken in the presence of anomalous
symmetries in one of the frames. This is particularized to the case of the
relation between the Einstein and Jordan frames in scalar-tensor theories used
to describe early Universe dynamics. Although in this case a regularization
that cancels the anomaly exists, the renormalized theory always develop a
non-vanishing contribution to the S-matrix that is present only in the Jordan
frame, promoting the different frames to different physical theories that must
be UV completed in a different way.
hep-th gr-qc
Result 3
**********
Error correction and fast detectors implemented by ultra-fast neuronal
plasticity
We experimentally show that the neuron functions as a precise
time-integrator, where the accumulated changes in neuronal response latencies,
under complex and random stimulation patterns, are solely a function of a
global quantity, the average time-lag between stimulations. In contrast,
momentary leaps in the neuronal response latency follow trends of consecutive
stimulations, indicating ultra-fast neuronal plasticity. On a circuit level,
this ultra-fast neuronal plasticity phenomenon implements error-correction
mechanisms and fast detectors for misplaced stimulations. Additionally, at
moderate/high stimulation rates this phenomenon destabilizes/stabilizes a
periodic neuronal activity disrupted by misplaced stimulations.
q-bio.NC
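Annoy returns approximate nearest neighbors, so on a small corpus it can be worth cross-checking a ranking against an exact search. The sketch below is one way to do that in plain Python, assuming the angular metric, which Annoy's documentation defines as the Euclidean distance between normalized vectors, sqrt(2(1 - cos(u, v))). The function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math

def angular_distance(u, v):
    """Euclidean distance between the unit-normalized vectors: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Choice made for this sketch: treat a zero vector as orthogonal to everything.
        return math.sqrt(2.0)
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.sqrt(2.0 * (1.0 - cosine))

def exact_nearest(vectors, target, k):
    """Return the indices of the k vectors closest to target, nearest first."""
    order = sorted(range(len(vectors)), key=lambda i: angular_distance(vectors[i], target))
    return order[:k]

# Toy 2-dimensional vectors, for illustration only.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
target = [2.0, 0.1]
print(exact_nearest(vectors, target, k=3))  # [0, 2, 1]
```

If the indices returned here differ noticeably from what the Annoy index returns for the same vectors, building the index with more trees is usually the first thing to try.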
References