
Video Editing Using Automatic Speech Recognition

In this article, we'll learn how to build fun GIFs and create slow-motion and time-lapse video segments, all using speech transcripts extracted with ASR.
In this article, we'll learn how to generate GIFs and short clips from a longer video, complete with subtitles generated by automatic speech recognition (ASR). And yes, you'll see those GIFs below the fold. Let's get started by defining a few general terms and then dig into actually putting the model to work.
Here's what we'll be covering:

Table of Contents

What is ASR?
What is Wav2Vec2?
Using HuggingFace ASR Pipeline
Generating GIF Images
GIF Images
SloMo and Timelapse Videos
Conclusion
More Gif Samples

What is ASR?

ASR (automatic speech recognition) is the term for converting speech to text using machine learning.
The earlier approach to speech-to-text was a hybrid one, combining a lexicon model, an acoustic model, and a language model to produce transcriptions. This approach delivered lower accuracy and was more labor-intensive to build.
That method was eventually replaced by an end-to-end deep learning approach that directly maps input speech to a sequence of words and is much more accurate.
I found this AssemblyAI post a good resource to understand speech recognition. You can also refer to the wonderful YouTube Video Series by Omar and Vaibhav to understand the basics and state-of-the-art in the Audio ML domain.

What is Wav2Vec2?

Wav2Vec2 is a popular ASR model developed by the Facebook AI Research (FAIR) team in 2020; its checkpoints are available for free download on the HuggingFace Hub, where the model can also be queried through hosted inference.
The model uses a Connectionist Temporal Classification (CTC) head, which lets it perform well even on long audio files or live audio. Under CTC, every frame of audio maps to a single character. Because of this, the model can be run with chunking and strides: you process consecutive small chunks (for example, 10 seconds each) and clip the specified stride lengths from the left and right sides of each chunk, which keeps inference fast and accurate even at chunk boundaries.
For a deeper understanding of how Wav2Vec2 uses chunking and strides, I would refer you to the brilliant blog post by Nicolas Patry on HuggingFace. You can also refer to the model card on HuggingFace to learn more about how to use this model.
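In this article we query the hosted Inference API (shown in the next section), but for reference, a roughly equivalent local setup with the transformers pipeline might look like the minimal sketch below. The model id and audio path are illustrative assumptions, not values from this project:

import \n
from transformers import pipeline

# minimal local sketch of chunked Wav2Vec2 inference (not the article's hosted-API setup)
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",  # illustrative checkpoint
)

# chunk_length_s / stride_length_s mirror the parameters sent to the hosted API below
result = asr(
    "sample_audio.wav",            # placeholder audio path
    chunk_length_s=10,
    stride_length_s=(4, 2),
    return_timestamps="char",
)
print(result["text"])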

Using HuggingFace ASR Pipeline

Now that you have some idea of what ASR is and how the Wav2Vec2 model operates, let's look at the code that queries the hosted API on HuggingFace.
In this piece of code, we specify the chunk and stride sizes; payload is the request body (containing the base64-encoded audio) that we send to the model, and json_response is the API's response:
import base64
import json

import requests

# API_URL and headers (including your HuggingFace API token) are assumed to be defined elsewhere

# calling the hosted model inference
def query_api(audio_bytes: bytes):
    """
    Query the HuggingFace Inference API for the Automatic Speech Recognition task
    """
    payload = json.dumps({
        "inputs": base64.b64encode(audio_bytes).decode("utf-8"),
        "parameters": {
            "return_timestamps": "char",
            "chunk_length_s": 10,
            "stride_length_s": [4, 2]
        },
        "options": {"use_gpu": False}
    }).encode("utf-8")

    response = requests.request("POST", API_URL, headers=headers, data=payload)
    json_response = json.loads(response.content.decode("utf-8"))
    return json_response
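For context, here is a hypothetical call to query_api; API_URL and headers (carrying your HuggingFace API token) are assumed to be defined, and the file name is just a placeholder:

# hypothetical usage of query_api; "sample_audio.wav" is a placeholder
with open("sample_audio.wav", "rb") as f:
    audio_bytes = f.read()

json_response = query_api(audio_bytes)
print(json_response["text"])        # full transcript
print(json_response["chunks"][:5])  # first few character-level timestamps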
In order to generate transcriptions from a video, we first need to extract the speech audio from it. This can be done using the awesome open-source FFmpeg tool (via its Python bindings), as shown in the code below:
import ffmpeg

def generate_transcripts(in_video):
    # convert the video to 16 kHz mono wav audio held in memory
    audio_memory, _ = ffmpeg.input(in_video).output('-', format="wav", ac=1, ar='16k').overwrite_output().global_args('-loglevel', 'quiet').run(capture_stdout=True)

    # getting transcripts using the Wav2Vec2 HuggingFace hosted accelerated inference
    # sending the audio in the request along with the stride and chunk length information
    model_response = query_api(audio_memory)

    # the model response contains both the transcript and the character timestamps (chunks)
    transcription = model_response["text"].lower()
    chnk = model_response["chunks"]

    # creating a list of [character, start, end] timestamps to consume easily downstream
    timestamps = [[chunk["text"].lower(), chunk["timestamp"][0], chunk["timestamp"][1]]
                  for chunk in chnk]

    # getting words and word-level timestamps (get_word_timestamps is defined below)
    words, words_timestamp = get_word_timestamps(timestamps)
    return transcription, words, words_timestamp
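Putting it together, a single call to generate_transcripts gives us everything we need downstream; the video path here is just a placeholder:

# example call; "sample_video.mp4" is a placeholder path
transcription, words, words_timestamp = generate_transcripts("sample_video.mp4")
print(transcription)        # full lower-cased transcript
print(words[:5])            # first few words
print(words_timestamp[:5])  # their [start, end] timestamps in seconds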
When we query the Wav2Vec2 model API, we receive two values in the response: the full transcript of the audio, and a list of characters with their associated timestamps (spaces are included too). I am providing a sample output below for your quick reference:
1. 'text': "DO IT JUST DO IT DON'T LET YOUR DREAMS BE DREAMS YESTERDAY YOU SAID TO MORROW SO JUST DO IT MAKE YOU DREAMS CAN'T YRO JUST DO IT SOME PEOPLE DREAM OF SUCCESS WHILE YOU'RE GOING TO WAKE UP AND WORK HOT ATI NOTHING IS IMPOSSIBLE YOU SHOULD GET TO THE POINT WHERE ANY ONE ELSE WOULD QUIT AND YOU'RE LUCK IN A STOP THERE NO WHAT ARE YOU WAITING FOR DO ET JOT DO IT JUST YOU CAN JUST DO IT IF YOU'RE TIRED IS STARTING OVER STOP GIVING UP"
2. 'chunks':
{'text': 'D', 'timestamp': [2.36, 2.38]},
{'text': 'O', 'timestamp': [2.52, 2.56]},
{'text': ' ', 'timestamp': [2.68, 2.72]},
{'text': 'I', 'timestamp': [2.84, 2.86]},
{'text': 'T', 'timestamp': [2.88, 2.92]},
{'text': ' ', 'timestamp': [2.94, 2.98]},
{'text': 'J', 'timestamp': [4.48, 4.52]},
{'text': 'U', 'timestamp': [4.66, 4.68]},
{'text': 'S', 'timestamp': [4.7, 4.74]},
{'text': 'T', 'timestamp': [4.76, 4.78]},
{'text': ' ', 'timestamp': [4.84, 4.88]},
... and so on
The piece of code below then takes this API output and extracts individual words with their corresponding start and end timestamps:
# getting word timestamps from their character timestamps
def get_word_timestamps(timestamps):
    words, word = [], []
    letter_timestamp, word_timestamp, words_timestamp = [], [], []
    for idx, entry in enumerate(timestamps):
        word.append(entry[0])
        letter_timestamp.append(entry[1])
        if entry[0] == ' ':
            # a space marks the end of a word: join the letters and take the
            # first letter's start time and the previous letter's end time
            words.append(''.join(word))
            word_timestamp.append(letter_timestamp[0])
            word_timestamp.append(timestamps[idx-1][2])
            words_timestamp.append(word_timestamp)
            word, word_timestamp, letter_timestamp = [], [], []

    words = [word.strip() for word in words]
    return words, words_timestamp
Below you can see a sample output with words and their respective start and end timestamps in two separate lists. The words and timestamps are in the same order of occurrence as in the original speech audio.
transcript word list is :['do', 'it', 'just', 'do', 'it', "don't", 'let', 'your', 'dreams', 'be', 'dreams', 'yesterday', 'you', 'said', 'to', 'morrow', 'so', 'just', 'do', 'it', 'make', 'you', 'dreams', "can't", 'yro', 'just', 'do', 'it', 'some', 'people', 'dream', 'of', 'success', 'while', "you're", 'going', 'to', 'wake', 'up', 'and', 'work', 'hot', 'ati', 'nothing', 'is', 'impossible', 'you', 'should', 'get', 'to', 'the', 'point', 'where', 'any', 'one', 'else', 'would', 'quit', 'and', "you're", 'luck', 'in', 'a', 'stop', 'there', 'no', 'what', 'are', 'you', 'waiting', 'for', 'do', 'et', 'jot', 'do', 'it', 'just', 'you', 'can', 'just', 'do', 'it', 'if', "you're", 'tired', 'is', 'starting', 'over', 'stop', 'giving', 'up'], type of words is :<class 'list'>
Word timestamps are :[[2.36, 2.56], [2.84, 2.92], [4.48, 4.78], [5.2, 5.36], [5.6, 5.66], [7.26, 7.5], [7.58, 7.74], [7.82, 7.96], [8.0, 8.48], [8.6, 8.7], [8.82, 9.3], [10.32, 10.86], [11.16, 11.24], [11.34, 11.48], [11.56, 11.64], [11.68, 11.96], [12.26, 12.36], [12.54, 12.88], [13.4, 13.58], [13.78, 13.84], [14.34, 14.58], [14.82, 14.9], [15.02, 15.42], [15.76, 16.0], [16.1, 16.32], [17.06, 17.3], [18.04, 18.26], [18.5, 18.56], [21.64, 21.8], [21.84, 22.08], [22.16, 22.38], [22.44, 22.48], [22.54, 23.02], [23.3, 23.44], [23.48, 23.64], [23.66, 23.76], [23.78, 23.82], [23.86, 24.06], [24.18, 24.24], [24.5, 24.56], [24.64, 24.8], [24.9, 25.18], [25.28, 25.42], [25.8, 26.16], [26.26, 26.34], [26.42, 27.1], [29.42, 29.52], [29.62, 29.82], [29.86, 30.0], [30.04, 30.12], [30.16, 30.24], [30.32, 30.58], [31.1, 31.26], [31.42, 31.56], [31.7, 31.78], [31.88, 32.04], [32.12, 32.28], [32.38, 32.6], [32.88, 32.96], [33.0, 33.22], [33.26, 33.44], [33.52, 33.56], [33.64, 33.66], [33.74, 33.98], [34.08, 34.28], [35.38, 35.56], [35.74, 35.86], [35.88, 35.94], [35.98, 36.06], [36.1, 36.42], [36.52, 36.7], [38.22, 38.38], [38.66, 38.8], [41.06, 41.34], [42.96, 43.12], [43.4, 43.5], [44.18, 44.4], [44.48, 44.58], [44.68, 44.98], [46.02, 46.32], [46.48, 46.64], [46.8, 46.86], [51.78, 51.82], [51.88, 52.06], [52.1, 52.4], [52.46, 52.5], [52.56, 52.9], [53.0, 53.2], [54.34, 54.68], [55.26, 55.64], [56.14, 56.22]]
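As a quick sanity check, running get_word_timestamps on the first few character chunks from the sample above reproduces the first two word timestamps:

# sanity check on the first few (lower-cased) character chunks from the sample output above
sample_timestamps = [
    ['d', 2.36, 2.38], ['o', 2.52, 2.56], [' ', 2.68, 2.72],
    ['i', 2.84, 2.86], ['t', 2.88, 2.92], [' ', 2.94, 2.98],
]
words, words_timestamp = get_word_timestamps(sample_timestamps)
print(words)            # ['do', 'it']
print(words_timestamp)  # [[2.36, 2.56], [2.84, 2.92]]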

Generating GIF Images

Let's dig in and understand how we can make use of the above ASR output for our use case.
The first two inputs that the code below takes are the full sample video and the transcript selected for the GIF segment; for example, if you want to create the Shia LaBeouf GIF from a larger video file, you would provide "Don't let your dreams be dreams" as the transcript for your GIF.
Note that the idea is to use the transcript as a video editing tool. The set of words we provide as a GIF transcript is just a clipped subset of the larger, complete transcript of the video. We first create smaller clips from the video using the text transcript, and then convert them into GIFs, slow-motion, or time-lapse segments.

def generate_gifs(in_video, gif_transcript, words, words_timestamp, vid_speed):
    # creating a list of words from the input 'gif transcript'
    gif = gif_transcript
    giflist = gif.split()

    # getting index values for the 'gif transcript' words from the generator
    giflist_indxs = list(list(get_gif_word_indexes(words, giflist))[0])

    # getting start and end timestamps for the gif video clip
    start_seconds, end_seconds = get_gif_timestamps(giflist_indxs, words_timestamp)
    print(f"start_seconds, end_seconds are : ({start_seconds}, {end_seconds})")

    # generating the .gif image and the speed-edited clips
    print(f"vid_speed from the slider is : {vid_speed}")
    # video_list is a module-level list that accumulates the speed-edited clips
    speededit_vids_list, concat_vid = gen_moviepy_gif(in_video, start_seconds, end_seconds, float(vid_speed), video_list)
    return concat_vid
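The two helpers called above, get_gif_word_indexes and get_gif_timestamps, are not shown in this article. As a rough sketch (assuming the GIF transcript is an exact, identically cased word-for-word slice of the full transcript), they might look something like this:

# a plausible sketch of the two helpers used by generate_gifs;
# this is not necessarily identical to the original implementation

def get_gif_word_indexes(words, giflist):
    """Yield the index ranges (into the full word list) where the gif transcript matches."""
    n = len(giflist)
    for start in range(len(words) - n + 1):
        if words[start:start + n] == giflist:
            yield list(range(start, start + n))

def get_gif_timestamps(giflist_indxs, words_timestamp):
    """Return the start/end seconds spanning the matched words."""
    start_seconds = words_timestamp[giflist_indxs[0]][0]
    end_seconds = words_timestamp[giflist_indxs[-1]][1]
    return start_seconds, end_seconds

With the sample data above, matching "don't let your dreams be dreams" against the full word list would yield indexes 5 through 10 and the timestamps 7.26 and 9.3, which agrees with the output shown later in this article.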
The last three inputs to generate_gifs, in order of occurrence, are the set of words in the full video, the timestamps for every word, and the desired playback speed (used to create the SloMo and Timelapse clips). The following piece of code creates the GIF image and writes it to disk, among other things.
# extracting the video and building and serving the .gif image, SloMo, and Timelapse clips
import moviepy.editor as mp

def gen_moviepy_gif(in_video, start_seconds, end_seconds, vid_speed, vid_list):
    video = mp.VideoFileClip(in_video)

    # the part of the video before the selected segment (rounded up to the next second), without audio
    leftover_clip_start = video.subclip(0, int(start_seconds) + float("{:.2f}".format(1 - start_seconds % 1))).without_audio()

    # the selected segment itself
    final_clip = video.subclip(start_seconds, end_seconds)

    # the part of the video after the selected segment, without audio
    tmp = int(end_seconds) + float("{:.2f}".format(1 - end_seconds % 1))
    if tmp < video.duration:
        leftover_clip_end = video.subclip(tmp).without_audio()
    else:
        leftover_clip_end = video.subclip(int(end_seconds)).without_audio()

    # slowmo or timelapse: change the playback speed of the selected segment
    speededit_clip = final_clip.fx(mp.vfx.speedx, vid_speed)
    speededit_clip = speededit_clip.without_audio()

    # concat the clips back into a larger video
    concatenated_clip = mp.concatenate_videoclips([leftover_clip_start, speededit_clip, leftover_clip_end])
    concatenated_clip.write_videofile("concat.mp4")

    # write the speed-edited clip to disk with a unique name so earlier clips aren't overwritten
    filename = f"speededit{len(vid_list)}"
    speededit_clip.write_videofile(f"{filename}.mp4")
    vid_list.append(f"{filename}.mp4")

    # writing the GIF to disk
    final_clip.write_gif("gifimage.gif")
    final_clip.close()
    return vid_list, "concat.mp4"
This piece of code uses the Moviepy library extensively to clip video files, concatenate clips, increase or decrease the playback speed, and finally write the video and GIF files to disk.

GIF Images

Full Transcript extracted from a sample video -
"do it just do it don't let your dreams be dreams yesterday you said to morrow so just do it make you dreams can't yro just do it some people dream of success while you're going to wake up and work hot ati nothing is impossible you should get to the point where any one else would quit and you're luck in a stop there no what are you waiting for do et jot do it just you can just do it if you're tired is starting over stop giving up"
The transcript for which you want to create a Gif image could be, for example -
"don't let your dreams be dreams"
Getting the corresponding timestamps for the start and end words of the Gif's transcript -
Timestamps for all words are : [[2.36, 2.56], [2.84, 2.92], [4.48, 4.78], [5.2, 5.36], [5.6, 5.66], [7.26, 7.5], [7.58, 7.74], [7.82, 7.96], [8.0, 8.48], [8.6, 8.7], [8.82, 9.3], [10.32, 10.86], [11.16, 11.24], [11.34, 11.48], [11.56, 11.64], [11.68, 11.96], [12.26, 12.36], [12.54, 12.88], [13.4, 13.58], [13.78, 13.84], [14.34, 14.58], [14.82, 14.9], [15.02, 15.42], [15.76, 16.0], [16.1, 16.32], [17.06, 17.3], [18.04, 18.26], [18.5, 18.56], [21.64, 21.8], [21.84, 22.08], [22.16, 22.38], [22.44, 22.48], [22.54, 23.02], [23.3, 23.44], [23.48, 23.64], [23.66, 23.76], [23.78, 23.82], [23.86, 24.06], [24.18, 24.24], [24.5, 24.56], [24.64, 24.8], [24.9, 25.18], [25.28, 25.42], [25.8, 26.16], [26.26, 26.34], [26.42, 27.1], [29.42, 29.52], [29.62, 29.82], [29.86, 30.0], [30.04, 30.12], [30.16, 30.24], [30.32, 30.58], [31.1, 31.26], [31.42, 31.56], [31.7, 31.78], [31.88, 32.04], [32.12, 32.28], [32.38, 32.6], [32.88, 32.96], [33.0, 33.22], [33.26, 33.44], [33.52, 33.56], [33.64, 33.66], [33.74, 33.98], [34.08, 34.28], [35.38, 35.56], [35.74, 35.86], [35.88, 35.94], [35.98, 36.06], [36.1, 36.42], [36.52, 36.7], [38.22, 38.38], [38.66, 38.8], [41.06, 41.34], [42.96, 43.12], [43.4, 43.5], [44.18, 44.4], [44.48, 44.58], [44.68, 44.98], [46.02, 46.32], [46.48, 46.64], [46.8, 46.86], [51.78, 51.82], [51.88, 52.06], [52.1, 52.4], [52.46, 52.5], [52.56, 52.9], [53.0, 53.2], [54.34, 54.68], [55.26, 55.64], [56.14, 56.22]]
Timestamps for Gif words are : [[7.26, 7.5], [7.58, 7.74], [7.82, 7.96], [8.0, 8.48], [8.6, 8.7], [8.82, 9.3]]
Start_seconds, End_seconds for GIF transcript are : 7.26,9.3
Clipping the given video based on these start and end timestamps with the code above gives us our video clip, which we then write to disk as a GIF image using the feature-rich Moviepy library.
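Stripped of the leftover-clip bookkeeping, the clip-and-write step for the timestamps above boils down to a couple of Moviepy calls; this condensed sketch uses a placeholder path:

# condensed sketch of the clip-and-write step for the timestamps found above
import moviepy.editor as mp

video = mp.VideoFileClip("sample_video.mp4")  # placeholder path
gif_clip = video.subclip(7.26, 9.3)           # start/end seconds from the GIF transcript
gif_clip.write_gif("gifimage.gif")
gif_clip.close()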

SloMo and Timelapse Videos

Picking another video as an example and extracting the transcript again (please ignore the gibberish text in places; the audio has a lot of background noise that interferes with the ASR output, but it is still a good sample for our use case) -
"hugs to get here to get on the start line simply no complaints it was absolutely platant minus point one five zero seconds and he has gone hugely hugely disappointing huse a winner of his semi final argara the opportunity of a medal snatched away by a moment of madness the athleads go to the blocks again there's just go through the line up in lane tos and beany south africa jacobs of italy lane for is vacant curly united states in five siou china in seces bacer united states in seven at a gokin igeria in eight and de grass of canada in lane nine the final of the men's one hundred meters this time they go soueas way quickly in the sunney is not so quickly away this time alongside of his curny and curn e's coing on look at tleanjakjak o i suppero"
Selecting the transcript of the part of the video we want to slow down -
"sunney is not so quickly away this time alongside of his curny and curn e's coing on look at tleanjakjak o i"
Getting the start and end timestamps for this segment using the transcript and the code above -
SloMo words list is : ['sunney', 'is', 'not', 'so', 'quickly', 'away', 'this', 'time', 'alongside', 'of', 'his', 'curny', 'and', 'curn', "e's", 'coing', 'on', 'look', 'at', 'tleanjakjak', 'o', 'i']
SloMo words timestamps are :[[66.24, 66.54], [66.9, 67.0], [67.06, 67.2], [67.24, 67.34], [67.4, 67.7], [67.76, 67.96], [68.04, 68.2], [68.24, 68.42], [68.48, 68.92], [68.98, 69.02], [69.06, 69.16], [69.22, 69.58], [69.88, 69.96], [70.04, 70.26], [70.34, 70.44], [70.48, 70.7], [70.76, 70.82], [70.88, 71.0], [71.06, 71.1], [71.12, 72.94], [74.58, 74.6], [74.98, 75.0]]
Start_seconds, End_seconds are :66.24,75.0
We can perform the same steps to create a Timelapse video as well, i.e., first specifying the transcript for the Timelapse segment as below, and then extracting the start and end timestamps for that whole segment -
"hugs to get here to get on the start line simply no complaints it was absolutely platant minus point one five zero seconds and he has gone hugely hugely disappointing huse a winner of his semi final argara the opportunity of a medal snatched away by a moment of madness the athleads go to the blocks again there's just go through the line up in lane tos and beany south africa jacobs of italy"
The Moviepy library provides another useful function, vfx.speedx(), which allows us to apply different speed effects to a video clip (for example, 0.5x, 0.9x, 1.75x, 2.0x, etc.); a short sketch of how it is used follows the list below. I have uploaded the original and edited videos to YouTube to give you a fair idea of how the results look:
  • This is the original video of a 100 m race from Tokyo 2020 - Original YouTube Video
  • This is the SloMo video the code produced at 0.5x speed - SloMo Video
  • This is the video showing both the Timelapse and SloMo effects inside the original video - Timelapse+SloMo Video
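To make the speed effect concrete, here is a minimal sketch of applying vfx.speedx at different factors; the input path and speed values are illustrative:

# minimal sketch of Moviepy speed effects; the path and factors are illustrative
import moviepy.editor as mp

clip = mp.VideoFileClip("race_segment.mp4")

slomo = clip.fx(mp.vfx.speedx, 0.5)      # half speed -> slow motion
timelapse = clip.fx(mp.vfx.speedx, 2.0)  # double speed -> time-lapse feel

slomo.without_audio().write_videofile("slomo.mp4")
timelapse.without_audio().write_videofile("timelapse.mp4")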

Conclusion

ASR is a massively useful technology that is improving rapidly, with more and more big tech companies publishing their work on speech-to-text as we speak. With HuggingFace making model hosting and serving incredibly easy, we might soon see a hockey-stick rise in both the quality of the technology and the number of use cases built on it.
Currently, there are more than 2500 ASR models available on HuggingFace Hub.
I have used both ffmpeg and moviepy to handle the video and audio files, although some of these use cases could be built with the ffmpeg library alone. I am publishing my work as Colab notebooks as well, so anyone can understand and reproduce the code themselves, and I am also providing links below to my HuggingFace Gradio Spaces where you can play with these demos. I hope you had fun learning about the technology and its use cases.
Thanks to @_ScottCondron for starting this Blogathon, and to everybody on the W&B team who worked on building this extremely easy-to-use and delightful Reports tool. The tool is brilliant, and I enjoyed writing this piece a lot. I hope to return soon with another article.

More Gif Samples

Leaving you with a bit of Harry Potter magic and Shia's motivational madness. These GIFs were created using the HuggingFace Space implementation linked above.


