Downloading PDF Files Using Python
I saw a post on Mastodon yesterday about how Tom Lehrer recently put all his music, lyrics, and sheet music into the public domain, and also made them freely downloadable from his website.
While his MP3 music can be downloaded in “complete sets” for each album as RAR files, the lyrics and sheet music are individual PDF files for each song.
That’s a lot of clicking that I didn’t want to do, so I put together a small Python script to download the PDF files for me.
While the code below is specific to this use case, you can use it as a starting point if you want to download files from other websites.
Change the “url”, “domain”, and “ext” variables to reflect the URL to download, the base domain you’re downloading from, and the file extension you want to download.
That should work for basic websites; other websites may require additional customization depending on how they are set up.
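One customization you may run into: some sites use absolute links, or links relative to the current page rather than the site root, and simply gluing the domain onto the front of those produces a broken URL. Here is a minimal sketch of one way to handle that with urljoin from the standard library. The example.com addresses and the to_absolute helper are placeholders I made up for illustration, not anything from a real site.

```python
from urllib.parse import urljoin

# Hypothetical page we scraped the links from
page_url = 'https://example.com/category/sheet-music/'

def to_absolute(page_url, href):
    """Resolve an href found on page_url into an absolute URL (illustrative helper)."""
    return urljoin(page_url, href)

# Root-relative link: resolved against the site root
print(to_absolute(page_url, '/wp-content/uploads/song.pdf'))
# Page-relative link: resolved against the page's own folder
print(to_absolute(page_url, 'song.pdf'))
# Already-absolute link: passed through unchanged
print(to_absolute(page_url, 'https://cdn.example.org/song.pdf'))
```

Passing the page URL rather than the bare domain means page-relative links resolve correctly too.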
Have fun!
You’ll need Python Requests and BeautifulSoup4 installed (pip install requests beautifulsoup4).
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Set a user-agent, otherwise we get a 403 Forbidden error
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:108.0) Gecko/20100101 Firefox/108.0'
}
url = 'https://tomlehrersongs.com/category/sheet-music/'
domain = 'https://tomlehrersongs.com'

# Set parameters for the regex search (escape the extension so "." only matches a literal dot)
ext = '.pdf'
pattern = re.compile(re.escape(ext))

# Function to get the links from the site using Requests and BeautifulSoup
def get_links(url, headers):
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Only keep <a> tags that actually have an href attribute
    links = [link['href'] for link in soup.find_all('a', href=True)]
    return links

# Call get_links function
links = get_links(url, headers)

# For loop to download the files matching the regex search
for link in links:
    # If search match found do The Thing
    if pattern.search(link):
        ###### The Thing ######
        # Change relative links to absolute links (absolute links pass through unchanged)
        file_url = urljoin(domain, link)
        # Get the filename from the URL so we can write out the file as the original name
        file_name = file_url.split('/')[-1]
        # Status Feedback
        print('Downloading: ' + file_url)
        # Request the file
        response = requests.get(file_url, headers=headers)
        # Write response content to file (the with block closes the file for us)
        with open(file_name, 'wb') as file:
            file.write(response.content)
        print('Downloaded file: ', file_name)
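Two details in the script are easy to trip over on other sites. In a regex, an unescaped dot matches any character, so a bare '.pdf' pattern also matches a link like /notapdf/index.html. And taking everything after the last slash as the filename keeps any query string and percent-encoding. Below is a rough sketch of stricter versions of both steps. The has_extension and filename_from_url helpers and the example.com URL are hypothetical names of my own, and this check is deliberately stricter than the script's (it only accepts the extension at the very end of the URL path).

```python
import posixpath
import re
from urllib.parse import unquote, urlparse

def has_extension(link, ext):
    """Return True if the URL path of link ends with ext (illustrative helper)."""
    # re.escape turns "." into a literal dot; "$" anchors the match to the end of the path
    pattern = re.compile(re.escape(ext) + r'$', re.IGNORECASE)
    return bool(pattern.search(urlparse(link).path))

def filename_from_url(file_url):
    """Return the decoded last path segment of file_url (illustrative helper)."""
    # urlparse(...).path drops the query string; unquote turns %20 back into a space
    return unquote(posixpath.basename(urlparse(file_url).path))

# The unescaped pattern matches a link that is not a PDF at all
print(bool(re.search('.pdf', '/notapdf/index.html')))            # True
print(has_extension('/notapdf/index.html', '.pdf'))              # False
print(has_extension('/files/My%20Song.PDF?download=1', '.pdf'))  # True
print(filename_from_url('https://example.com/files/My%20Song.pdf?download=1'))  # My Song.pdf
```

Whether you want the stricter end-of-path check depends on the site; some download links don't end in the file extension at all, in which case you'd need a different test.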
If you have any questions/comments please leave them below.
Thanks so much for reading ^‿^
Claire