
YouTube Video Summarization using Mistral 7B, LangChain, and Whisper

An interactive report showcasing how to create a video summarization pipeline with three common ML tools
Created on December 30 | Last edited on January 17

Introduction

YouTube is a treasure trove of knowledge and entertainment, but it can be challenging to sift through long videos to extract the key takeaways.
Enter Whisper, LangChain, and Mistral: three tools reshaping the landscape of video summarization, enabling users to grasp the essence of lengthy videos swiftly and efficiently.
In this report, we'll walk you through how to use the three tools together to quickly and easily summarize video content.


The Process of Video Summarization using LLMs:

  • Transcription and Preprocessing: The pipeline begins by converting the video's spoken content into text, either by transcribing the audio with a speech-to-text model or by extracting an existing transcript. This textual data serves as the input for subsequent summarization.
  • Language Analysis and Summarization: Employing advanced NLP techniques, LLMs analyze the textual content, identifying key phrases, concepts, and contextually relevant information. They then generate concise summaries that encapsulate the crucial aspects of the video content.
  • Summary Generation and Output: The LLMs produce summarized versions of the video content in text format, presenting the most salient points and insights gleaned from the original video, allowing users to quickly comprehend its essence.

Understanding Whisper:

Whisper, OpenAI's open-source automatic speech recognition (ASR) model, handles the first stage of the pipeline: turning a video's spoken audio into text. Trained on a large and diverse corpus of multilingual audio, it produces accurate transcriptions even for noisy recordings and accented speech, giving the summarizer a reliable textual input to work from.
Whisper transcribes rather than summarizes: its speech-to-text output is the raw material we hand to the language model, which then comprehends the context, extracts the vital points, and condenses the essence of lengthy videos into shorter, digestible summaries.
You can read more about using Whisper with W&B in these sibling reports:


LangChain's Role in Enhancing Summarization:

LangChain, an open-source framework for building applications around large language models, is the glue between Whisper's transcripts and Mistral's summaries. It supplies the plumbing the summarization step needs: document loaders to read the transcript, text splitters to break it into chunks that fit the model's context window, and prompt templates and chains to orchestrate the calls to the LLM.
In particular, LangChain's map-reduce chains let us summarize each chunk of a long transcript independently (the map step) and then consolidate those partial summaries into one coherent summary (the reduce step). Because the prompts are templated, the same pipeline can be adapted to different languages and output styles, catering to diverse audiences across the globe.
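As a minimal sketch of what LangChain contributes, the snippet below runs its built-in map-reduce summarization helper. FakeListLLM stands in for a real model so the example runs without downloading anything; later in this report, a quantized Mistral model takes its place:
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.llms.fake import FakeListLLM

# FakeListLLM returns canned responses, standing in for a real LLM
llm = FakeListLLM(responses=["A one-sentence summary of the chunk."] * 4)

docs = [Document(page_content="First part of a long transcript..."),
        Document(page_content="Second part of a long transcript...")]

# Map-reduce: summarize each document, then combine the partial summaries
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(docs))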


Benefits of Whisper and LangChain in Video Summarization:

  • Efficiency: By condensing lengthy videos into succinct summaries, users save valuable time while still obtaining the key knowledge and insights.
  • Accessibility and Inclusivity: Whisper's multilingual transcription helps the pipeline transcend language barriers, making content accessible to a broader audience worldwide.
  • Accuracy and Reliability: Because each summary is grounded in a complete transcript of the source audio, the summarized content stays faithful to the original video.
  • Enhanced Learning and Information Retention: Concise summaries aid in better comprehension and retention of information, facilitating effective learning experiences.

Understanding the Mistral Model: A Breakthrough in Natural Language Processing

The Mistral model stands as a testament to Mistral AI's commitment to pushing the boundaries of language understanding and generation. Building on the foundations laid by earlier large language models such as GPT-3, Mistral represents a new milestone in the evolution of AI-driven language models.
1. Advanced Capabilities:
Mistral boasts a remarkable set of capabilities, significantly enhancing its performance in various language-related tasks. It excels in understanding context, generating coherent text, and handling diverse language nuances, making it a versatile tool for a wide range of applications.
2. Enhanced Contextual Understanding:
One of Mistral's standout features is its ability to understand and contextualize information more effectively. Through multi-layered architectures and sophisticated training techniques, it comprehends intricate contexts within texts, producing responses that exhibit a deeper understanding of the given input.
3. Few-shot and Zero-shot Learning:
Similar to its predecessors, Mistral demonstrates proficiency in few-shot and zero-shot learning. This means the model can perform tasks or answer questions based on minimal or no examples, showcasing its ability to generalize and adapt to new prompts and scenarios.
4. Ethical and Responsible AI:
Mistral AI has been steadfast in its commitment to responsible AI development. The Mistral model is designed with ethical considerations in mind, aiming to mitigate biases and uphold ethical standards in language generation, aligning with Mistral AI's principles of AI safety and societal benefit.
5. Applications and Impact:
The applications of the Mistral model span across various industries and domains. From aiding content creation, improving customer support systems, advancing language translation, to empowering research and development, Mistral's versatility holds the potential to revolutionize numerous fields reliant on language understanding and generation.
6. Continued Research and Advancements:
As with any AI model, research and development remain ongoing. Mistral AI continues to refine the Mistral model, exploring ways to enhance its capabilities, address limitations, and improve its overall performance across a spectrum of tasks and languages.

Using W&B

Create a W&B account and install W&B using:
pip install wandb
Then log in using:
wandb login

Dataset

In this tutorial, we fine-tune the Mistral model on the cnn_dailymail dataset and use that model for the video summarization task.
Use the following code to log the dataset as a W&B Table (the project name below is a placeholder):
import wandb

run = wandb.init(project="video-summarization")
table = wandb.Table(data=df)  # df holds the dataset as a pandas DataFrame
run.log({'data': table})
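If you don't already have the data in a DataFrame, here is a minimal sketch of pulling cnn_dailymail with the Hugging Face datasets library (one option among many; the slice size is arbitrary):
from datasets import load_dataset

# Load a small slice of the dataset and keep the article/summary columns
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1000]")
df = dataset.to_pandas()[["article", "highlights"]]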




Fine-Tuning the model

We fine-tune Mistral-7B-v0.1, which is Mistral AI's first large language model (LLM); the metric graphs for the run can be seen below. To log the metrics to W&B, we use the fastai W&B integration by simply adding WandbCallback to the Learner.
from functools import partial
from fastai.callback.wandb import WandbCallback

learn = Learner(dls,
                model,
                opt_func=ranger,
                loss_func=CrossEntropyLossFlat(),
                cbs=WandbCallback(),
                splitter=partial(blurr_seq2seq_splitter, arch=hf_arch)).to_fp16()
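With the callback attached, a typical training call streams losses and metrics to the W&B run as training proceeds. The epoch count and learning rate below are illustrative placeholders rather than values from this report:
# Illustrative hyperparameters; tune these for your own setup
learn.fit_one_cycle(1, lr_max=3e-5)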



Using Whisper and Mistral to summarize YouTube videos

To summarize a YouTube video, we will need to:
  • Download the YouTube audio file.
  • Transcribe the audio using Whisper.
  • Summarize the transcribed text using Mistral and LangChain.
Step 1: Download the YouTube audio file
You can use any video downloader to download the audio file from a YouTube video. Once you have downloaded the audio file, save it to a location on your computer.
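For example, here is a minimal sketch using the yt-dlp Python package (one downloader among many; the video URL below is a placeholder):
import yt_dlp

# Download only the audio track of the video; the URL is a placeholder
ydl_opts = {"format": "bestaudio/best", "outtmpl": "example.%(ext)s"}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])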
Step 2: Transcribe the audio using Whisper
The following code loads the saved audio file and transcribes it using Whisper:
import whisper

model = whisper.load_model("base")
result = model.transcribe("example.mp4")

transcribed_text = result['text']

# Save the transcribed text to a file
with open("text.txt", "w") as file:
    file.write(transcribed_text)

print("Transcribed text saved to text.txt")
The output of the code will be the transcribed text of the YouTube video.
Step 3: Summarize the transcribed text using Mistral and LangChain
The following code will summarize the transcribed text using Mistral and LangChain:
import os
import time

from langchain.chains import (LLMChain, MapReduceDocumentsChain,
                              ReduceDocumentsChain, StuffDocumentsChain)
from langchain.document_loaders import TextLoader
from langchain.llms import CTransformers
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter


def summarize_transcript(filename):
    # Load transcript
    loader = TextLoader(filename)
    docs = loader.load()

    # Load the quantized Mistral 7B Instruct model via ctransformers
    config = {'max_new_tokens': 4096, 'temperature': 0.7, 'context_length': 4096}
    llm = CTransformers(model="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
                        model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
                        config=config,
                        threads=os.cpu_count())

    # Map template and chain: extract the main points of each chunk
    map_template = """<s>[INST] The following is a part of a transcript:
{docs}
Based on this, please identify the main points.
Answer: [/INST] </s>"""
    map_prompt = PromptTemplate.from_template(map_template)
    map_chain = LLMChain(llm=llm, prompt=map_prompt)

    # Reduce template and chain: consolidate the per-chunk summaries
    reduce_template = """<s>[INST] The following is a set of summaries from the transcript:
{doc_summaries}
Take these and distill them into a final, consolidated summary of the main points.
Construct it as a well-organized summary of the main points; it should be between 3 and 5 paragraphs.
Answer: [/INST] </s>"""
    reduce_prompt = PromptTemplate.from_template(reduce_template)
    reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

    combine_documents_chain = StuffDocumentsChain(
        llm_chain=reduce_chain, document_variable_name="doc_summaries"
    )

    # Combines and iteratively reduces the mapped documents
    reduce_documents_chain = ReduceDocumentsChain(
        # This is the final chain that is called
        combine_documents_chain=combine_documents_chain,
        # Used if documents exceed the context for `StuffDocumentsChain`
        collapse_documents_chain=combine_documents_chain,
        # The maximum number of tokens to group documents into
        token_max=4000,
    )

    # Combine documents by mapping a chain over them, then combining the results
    map_reduce_chain = MapReduceDocumentsChain(
        # Map chain
        llm_chain=map_chain,
        # Reduce chain
        reduce_documents_chain=reduce_documents_chain,
        # The variable name in the llm_chain to put the documents in
        document_variable_name="docs",
        # Return the results of the map steps in the output
        return_intermediate_steps=True,
    )

    # Split the transcript into chunks that fit the model's context window
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=4000, chunk_overlap=0
    )
    split_docs = text_splitter.split_documents(docs)

    # Run the chain and report how long it took
    start_time = time.time()
    result = map_reduce_chain(split_docs, return_only_outputs=True)
    print(f"Time taken: {time.time() - start_time} seconds")
    return result['output_text']
The output of the code will be a summary of the transcribed text.
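With the function defined, producing a summary of the transcript saved in Step 2 takes a single call:
# Summarize the transcript produced by the Whisper step
summary = summarize_transcript("text.txt")
print(summary)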
In this tutorial, we have shown you how to use Whisper, LangChain and Mistral to summarize YouTube videos. This is a powerful tool that can help you save time and efficiently extract the key takeaways from long videos.
Example video link:



Generated Summary:
The video discusses creating a level five prompt structure for a creative assistant to generate a synthetic dataset of weight and bias user questions. The prompt should include a high-level goal, subtasks, an explanation of the output, evaluation guidelines, and examples. The system and user templates are also discussed, with the former instructing the model to create synthetic data sets of weight and bias user questions grounded in user objectives rather than documentation. The latter provides examples of real user questions and prompts the model to generate a user question and an answer while explaining the user context. The model's output will be evaluated based on the quality of the generated questions and answers. The video also demonstrates how to parse the model's output automatically to extract the context, question, and answer. Finally, the video discusses running this function in a loop to generate questions for each document in the repository to create a large dataset.
