Fetching YouTube Transcripts Using Python with the YouTube Transcript API
The YouTube Transcript API is a Python library for fetching the transcripts of YouTube videos. This post explains how to install and use it, and how to work around a common issue encountered in certain server environments.
1. Installing the YouTube Transcript API
First, install the youtube-transcript-api package by running the following command:
pip install youtube-transcript-api
2. Simple Usage Example
To fetch a transcript of a YouTube video, you’ll need its video ID, which is the value after v= in the URL. For example, if the URL is https://www.youtube.com/watch?v=abcdefghijk, the video ID is abcdefghijk.
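Copying the ID by hand is error-prone when you have many links. As a minimal sketch, the snippet below pulls the v parameter out of a watch URL using only the standard library. The helper name extract_video_id is made up for this post, and it only handles the watch?v= form described above; other URL shapes return None.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the value of the v= query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


# Hypothetical helper applied to the example URL from the text
print(extract_video_id("https://www.youtube.com/watch?v=abcdefghijk"))  # abcdefghijk
```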
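Here is a simple example. It uses the static list_transcripts method; newer releases of the library may expose a different interface, so check the documentation for the version you have installed:
from youtube_transcript_api import YouTubeTranscriptApi
video_id = "abcdefghijk"  # Enter your video ID here
# List the available transcripts and fetch only the first one
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
entries = next(iter(transcript_list)).fetch()
# Print each entry's start time and text
for item in entries:
    print(f'{item["start"]}s', item["text"])
This script fetches the first available transcript and prints each entry's start time alongside its text.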
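Raw start times are floating-point seconds, which are hard to scan. As a small sketch, the hypothetical helpers below turn entries into "MM:SS text" lines. They assume each entry is a dict with start and text keys, as in the example above; the sample data is invented for illustration.

```python
def format_timestamp(seconds):
    """Convert seconds to MM:SS (or H:MM:SS once past an hour)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def format_entries(entries):
    """Return one 'timestamp text' line per transcript entry."""
    return [f'{format_timestamp(e["start"])} {e["text"]}' for e in entries]


# Invented sample data in the same shape as the entries used above
sample = [
    {"start": 0.0, "text": "Hello"},
    {"start": 75.4, "text": "Welcome back"},
]
for line in format_entries(sample):
    print(line)
```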
3. Resolving Issues Using a Tor Proxy
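When you use the YouTube Transcript API from some server environments, such as AWS EC2, requests may be blocked, typically because traffic from well-known cloud-provider IP ranges is rejected. Routing requests through a Tor proxy can work around this: Tor acts as an intermediary, so your requests arrive from a different IP address.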
Installing and Setting Up Tor
Install Tor on your EC2 instance or local environment:
sudo apt update
sudo apt install tor -y
Start Tor using the following command:
tor
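Before pointing the library at Tor, it can help to confirm that something is actually listening on the SOCKS port. The following is a minimal sketch using only the standard library; the function name is_port_open is an assumption of this example, and 9050 is assumed to be Tor's SOCKS port on your machine. A successful connection only shows that the port is open, not that Tor has finished building circuits.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 9050 is assumed to be the local Tor SOCKS port
print("Tor SOCKS port reachable:", is_port_open("127.0.0.1", 9050))
```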
Integrating Tor Proxy with the YouTube Transcript API
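Tor's SOCKS proxy listens on port 9050 by default. SOCKS proxy URLs usually need extra support in the underlying HTTP library (with requests, for example, pip install requests[socks]). You can configure the proxy as follows: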
from youtube_transcript_api import YouTubeTranscriptApi
video_id = "abcdefghijk"  # Enter your video ID here
# Route both HTTP and HTTPS traffic through the local Tor SOCKS proxy
proxies = {
    "https": "socks5://127.0.0.1:9050",
    "http": "socks5://127.0.0.1:9050",
}
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies)
# Fetch only the first available transcript and print each entry
entries = next(iter(transcript_list)).fetch()
for item in entries:
    print(f'{item["start"]}s', item["text"])
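The proxy mapping can be built in one place so the host and port are not repeated. This sketch assumes the proxies argument follows requests-style proxy handling, as the example above suggests; the helper tor_proxies and its remote_dns flag are inventions of this example. With requests, the socks5h scheme asks the proxy to resolve hostnames, whereas socks5 resolves them locally.

```python
def tor_proxies(host="127.0.0.1", port=9050, remote_dns=False):
    """Build a proxies mapping that routes HTTP and HTTPS through a SOCKS5 proxy."""
    # socks5h resolves DNS through the proxy; socks5 resolves it locally
    scheme = "socks5h" if remote_dns else "socks5"
    url = f"{scheme}://{host}:{port}"
    return {"https": url, "http": url}


# Hypothetical helper; the defaults reproduce the mapping shown above
print(tor_proxies())
```

The resulting dict can be passed wherever the literal mapping was used, for example as the proxies argument in the call above.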
Notes
- The Tor network may not be suitable for large-scale API calls. Check Tor’s usage policy.
- Some Tor exit nodes may already be blocked. If needed, you can rotate Tor exit nodes to distribute API requests. Refer to this configuration file for guidance.
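One way to picture exit-node rotation is a retry loop that asks Tor for a new circuit between attempts. The sketch below is deliberately abstract: fetch and rotate are hypothetical callables you would supply (for example, rotate might send a NEWNYM signal to Tor's control port), and nothing here is part of the YouTube Transcript API itself.

```python
import time


def fetch_with_rotation(fetch, rotate, attempts=3, delay=0.0):
    """Call fetch(); on failure call rotate(), wait, and try again.

    fetch and rotate are caller-supplied callables (hypothetical here).
    Re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # broad on purpose for a sketch
            last_error = error
            if attempt < attempts - 1:
                rotate()           # ask for a different exit node
                time.sleep(delay)  # give the new circuit time to come up
    raise last_error
```

In practice a nonzero delay is sensible, since Tor may limit how often it honors requests for a new circuit.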
Example Code
You can find an example integrating the Tor proxy with the YouTube Transcript API using Docker Compose here.