Create Alternative Data for automated investing using Natural Language Processing in Python

A step-by-step beginner’s guide to using an off-the-shelf language API to generate insights directly from YouTube videos

Moez Ali
6 min read · Sep 1, 2022
Photo by Szabo Viktor on Unsplash

Introduction

The term “alternative data” refers to non-traditional data sets that investors use to drive their investment strategies. Data from social media, news, company filings, management guidance, earnings calls, satellite imagery, product reviews, weather data, and web traffic are some examples of alternative data sets.

In addition to helping investors monitor the health of a firm, industry, or economy, this data can also be used as a component of pre-trade investment research.

Increases in processing power and the proliferation of personal devices have led to a tremendous expansion in the amount of data generated over the past ten years. As a direct result, a large number of businesses known as “Alternative Data Providers” came into existence with the mission of “collecting, cleaning, analyzing, and interpreting data and providing it as a product that may inform investment decisions.”

Source: https://alternativedata.org/alternative-data/

The goal of this tutorial is to demonstrate, step by step, how to create valuable alternative data yourself directly from YouTube videos in Python. By the end of this tutorial you will have learned:

  • How to download videos from YouTube using the pytube library in Python
  • How to connect to a voice transcription API. In this tutorial I will use the AssemblyAI API.
  • How to analyze the results in Python.

pytube is a lightweight, open-source Python library for downloading YouTube videos directly from a URL. You can use it in a Jupyter Notebook, in any IDE, or even on the command line. It has no third-party dependencies and an extensively documented code base. To learn more about this library, check out the official website.
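If you don’t already have it, pytube is available on PyPI. You can install it from a terminal, or run the same command in a Notebook cell prefixed with an exclamation mark:

# install the pytube package from PyPI
pip install pytube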

Get Started

The first step is to head over to the AssemblyAI website and click on Get Started with the API to create a free account. No credit card required! You will be asked to confirm your email address before you can use the free service.

Step 1 — Copy API Key from the AssemblyAI Dashboard

Log in to your AssemblyAI Dashboard, and on the right-hand side you will see Your API Key. Click on it to copy the key to the clipboard. We will store this key in a variable in the Jupyter Notebook. If you want, you can save the key in some kind of secret store and retrieve it from there instead of pasting it manually into the Notebook, but I won’t do that for this tutorial. I will keep it simple and store it as a variable in the Notebook directly.

AssemblyAI Dashboard
https://www.assemblyai.com/

First things first: import the libraries we are going to use in this Notebook and store your API key in the variable API_KEY.

# importing libraries
import os
import sys
import time
import requests
from collections import Counter  # used in Step 5 to tally sentiment labels
from pytube import YouTube
# storing API key
API_KEY = 'PASTE-YOUR-API-KEY-HERE'
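As an aside, if you would rather not hard-code the key, a minimal sketch using an environment variable could look like the following (the variable name ASSEMBLYAI_API_KEY is my own choice for illustration, not an official convention):

# optionally read the key from an environment variable instead of hard-coding it
API_KEY = os.environ.get('ASSEMBLYAI_API_KEY', 'PASTE-YOUR-API-KEY-HERE')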

Step 2 — Download Video from YouTube

We will start by downloading the video from YouTube using the YouTube class from the pytube library that we imported in Step 1. In the example below we are downloading Federal Reserve Chairman Jerome Powell’s speech at Jackson Hole on August 26, 2022, which sent the market down by 3.37%.

Notice that we are using the method streams.get_audio_only — this means we are downloading only the audio stream of the video, because that is all the transcription API needs.

# downloading video from YouTube
video = YouTube('https://www.youtube.com/shorts/GzBYiMYUDCI')
yt = video.streams.get_audio_only()
yt.download()
# store the path of the downloaded file
mp4_file = yt.get_file_path()
print(mp4_file)
Output from print(mp4_file)

Step 3 — Upload Video on AssemblyAI

Now that we have the mp4 file available locally, we will upload it to the AssemblyAI service so that we can pass its URL to the REST API for transcription and the other analyses.

# read the local file in chunks so requests can stream the upload
def read_file(filename, chunk_size=5242880):
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# uploading downloaded video to AssemblyAI
data = read_file(mp4_file)
headers = {'authorization': API_KEY}
response = requests.post('https://api.assemblyai.com/v2/upload',
                         headers=headers,
                         data=data)
# store the upload url
audio_url = response.json()['upload_url']
print(audio_url)
Output from print(audio_url)

Step 4 — Send POST Request to API

Once the mp4 file is uploaded on AssemblyAI the final step is to call the REST API and pass the URL.

endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
    "audio_url": audio_url,
    "sentiment_analysis": True,
    "auto_chapters": True
}
headers = {
    "authorization": API_KEY,
    "content-type": "application/json"
}
transcript_input_response = requests.post(endpoint, json=json, headers=headers)

Notice that in the json dictionary I have passed "sentiment_analysis": True and "auto_chapters": True. This is because, in addition to the transcription, I also want these two audio intelligence features, whose output I will analyze in the next step.

Step 5 — Analyze Output

The final step is to retrieve the output and analyze the results.

transcript_id = transcript_input_response.json()["id"]
endpoint = f"https://api.assemblyai.com/v2/transcript/{transcript_id}"
headers = {
    "authorization": API_KEY,
}
transcript_output_response = requests.get(endpoint, headers=headers)
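One caveat: the transcription job runs asynchronously on AssemblyAI’s side, so the GET request above can come back while the job is still queued or processing. The v2 API reports a status field that eventually becomes "completed" (or "error"); a minimal polling sketch:

# poll the transcript endpoint until the job has finished processing
while transcript_output_response.json()["status"] not in ("completed", "error"):
    time.sleep(5)  # wait a few seconds between checks
    transcript_output_response = requests.get(endpoint, headers=headers)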

Once the job is completed, you can access your output with the following commands:

transcript = transcript_output_response.json()["text"]
sentiment_analysis = transcript_output_response.json()["sentiment_analysis_results"]
summary = transcript_output_response.json()["chapters"]

The first variable, transcript, contains the raw transcription (speech-to-text). The second and third variables, sentiment_analysis and summary, contain the sentiment analysis results and the chapter summaries. These are returned because we passed the two additional parameters in the json object in Step 4 (see above).

# print the first 1,000 characters of the transcript
print(transcript[:1000])
Output from print(transcript[:1000])
# check the summary object
len(summary)
>>> 2
# see the summary text
print(summary[0])
print(summary[1])
Output from print(summary[0]) and print(summary[1])
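For reference, each element of summary is a chapter object produced by the auto_chapters feature; it carries fields along these lines (the values below are made up for illustration):

# illustrative shape of one chapter object (hypothetical values)
{
    "headline": "The outlook for inflation and interest rates",
    "gist": "Inflation outlook",
    "summary": "The Chairman explains why restoring price stability will take time...",
    "start": 0,
    "end": 145000
}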

Finally, we can also see what the sentiment_analysis object contains. It is a list in which each element is a sentence, together with its start and end positions, its sentiment (POSITIVE, NEGATIVE, or NEUTRAL), a confidence level, and the speaker (populated only if we opt for speaker identification in Step 4; in this example it is None).
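For illustration, a single element looks roughly like this (the values here are made up):

# illustrative shape of one sentiment_analysis element (hypothetical values)
{
    "text": "Restoring price stability will take some time.",
    "start": 12340,
    "end": 15890,
    "sentiment": "NEUTRAL",
    "confidence": 0.82,
    "speaker": None
}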

Let’s summarize this by extracting the sentiment for each sentence.

# collect the sentiment label of every sentence and tally the counts
sentiments = []
for i in sentiment_analysis:
    sentiments.append(i['sentiment'])
Counter(sentiments).most_common()
Output from Counter(sentiments).most_common()

Out of 73 sentences in the speech, 53 are neutral, 10 positive, and 10 negative.
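If you want to collapse this into a single number you can track across speeches, one simple (admittedly crude) option is a net sentiment score: the share of positive sentences minus the share of negative ones. This is my own illustrative metric, not something the API returns:

# crude net sentiment score in [-1, 1]
counts = Counter(sentiments)
net_sentiment = (counts['POSITIVE'] - counts['NEGATIVE']) / len(sentiments)
print(net_sentiment)  # 0.0 here: 10 positive vs. 10 negative out of 73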

Conclusion

Speech-to-text APIs and related audio intelligence services are really important in this age of big data. Using audio and video data to augment your existing analytical pipelines can give you an edge over other firms. Speech-to-text is not a new invention; it has been around for a long time and currently has applications everywhere, from automated dictation to chatbots to smart assistants.

In this article we have reviewed one such company, AssemblyAI, which provides this API, but it is not the only provider.

Source: https://activewizards.com/content/blog/Comparison_of_the_Top_Speech_Processing_APIs/comparison-of-cloud-api-for-speech-table02.png

I write about data science, machine learning, and PyCaret. If you would like to be notified automatically, you can follow me on Medium, LinkedIn, and Twitter.
