Automatically Generate Site-Wide Meta Descriptions with Python + BART for PyTorch
A guide on how to efficiently create quality summarized content for search engine optimization (SEO)
Preface
In my previous articles, I explored several applications for natural language processing (NLP) implementations within the realm of digital marketing and e-commerce. This article is oriented toward technical search engine optimization (SEO) professionals who are comfortable with Python, the basics of NLP, and simple data mining.
The purpose of this article is to provide readers with a guide on how to quickly generate SEO meta descriptions for an entire site using a simple Python script. The script will be divided into four main sections:
- Domain URL extraction and clean-up
- Data mining of each domain URL
- NLP pipeline applied to mined data
- Standardization of meta descriptions
Project Dependencies
This program requires several libraries to function. We begin with a brief description of each library along with its respective installation instructions.
Data Organization Dependencies
#Data organization dependencies
from os import remove
import pandas as pd
- pandas is a software library written for the Python programming language for data manipulation and analysis.
Web Scraping Dependencies
#Web scraping dependencies
#!pip install beautifulsoup4
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.parse import urljoin #needed later to resolve relative links into absolute URLs
#!pip install requests
import requests
#!pip install justext
import justext
import re
- beautifulsoup4 is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
- The urllib.request module defines functions and classes which help in opening URLs, including basic and digest authentication, redirections, cookies, and more. The related urllib.parse module provides urljoin, used later to resolve relative links against the domain.
- requests is a popular open-source HTTP library that simplifies working with HTTP requests.
- jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, and it is therefore well suited for creating linguistic resources such as web corpora.
- re is a module that provides regular expression matching operations.
NLP Dependencies
#NLP dependencies
!pip install transformers
from transformers import pipeline

#Summarize textual content using BART in pytorch
bart_summarizer = pipeline("summarization")
- Hugging Face transformers is a state-of-the-art machine learning library for JAX, PyTorch, and TensorFlow. Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. transformers will be used in this program to generate a pipeline(), allowing for simple imports of any model from the Model Hub.
- To generate meta descriptions, the program will need to import a model capable of inference on a textual data task. There are many models capable of completing this task, some written in TensorFlow (Google's deep learning framework) and others in PyTorch (Meta's framework).
- In this program, a PyTorch model named BART will be used. BART is a denoising sequence-to-sequence pre-training model for natural language generation, translation, and comprehension. BART will be used to create abstracts of the textual data it ingests, so the summarization pipeline will be loaded.
- Initialize BART and the summarization pipeline with the following command (an optional variant that pins a specific model checkpoint is shown just after):
# Summarize textual content using BART in pytorch
bart_summarizer = pipeline("summarization")
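If you want to control exactly which summarization checkpoint is used, rather than relying on the pipeline's default (which may vary across transformers versions), you can pass a model name explicitly. The snippet below is a minimal sketch assuming the facebook/bart-large-cnn checkpoint from the Model Hub:
# Optional: pin an explicit BART checkpoint instead of the pipeline default
# facebook/bart-large-cnn is an assumed choice; any summarization checkpoint works
from transformers import pipeline

bart_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")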
Note: neither PyTorch nor TensorFlow ships with a standard Python installation, although either framework can be installed on a local machine. That said, Google Colab has both frameworks installed by default and is the recommended platform for running deep learning without a local installation.
This concludes the list of all required dependencies.
Domain URL Extraction and Clean-Up
For the purposes of this article, the program will be parsing the website of a shawarma restaurant named Barakat from London, Ontario, Canada.
To generate meta descriptions for all the pages of a website, a complete list of active URLs is needed for the respective domain. For this, the Request class from the urllib.request module will build a request for the domain, and the urlopen function will open the specified URL.
#Obtain list of links on a domain
domain = "https://www.barakatrestaurant.com"
req = Request(domain, headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req)
Note: 403 errors may be thrown when scraping a site's content. It is recommended to avoid these HTTP response errors by specifying a user agent in the headers parameter of the request. User agents allow servers and network peers to identify the application, operating system, vendor, and/or version of the requesting client.
The BeautifulSoup module is then used to parse the opened URL's HTML page. To parse the HTML document, the BeautifulSoup constructor is configured to use lxml's HTML parser with the following command:
soup = BeautifulSoup(html_page, "lxml")
Next, instantiate an empty list to store the domain's active links. For each link in BeautifulSoup's parse tree where an <a> tag defines a hyperlink, append to the list the link's href destination, resolved against the domain with urljoin so that relative paths become absolute URLs.
links = []
for link in soup.findAll('a'):
    links.append(urljoin(domain, link.get('href')))
Some of the appended URLs may be None, may not conform to a standard URL structure, or may be duplicates of other URLs on the domain.
Remove the None objects from the links list using a list comprehension:
#Links cleanup for None types
links = [link for link in links if link is not None]
Remove the objects in the list not conforming to a standard URL structure:
#Links cleanup for non-URL objects
#Filter with a list comprehension rather than calling remove() while iterating
links = [link for link in links if link.startswith("http")]
Remove the duplicate objects in the list to create a unique set:
#Links cleanup for duplicate URLs
links = [link for n, link in enumerate(links) if link not in links[:n]]
Lastly, remove any social links that point off-site. These will likely not be parsed correctly and only add noisy data for the summarization module to process. Feel free to add any additional social platforms that may reside on the domain under inspection.
#Links cleanup for social URLs
socials = ["instagram", "facebook", "twitter", "linkedin", "tiktok", "google", "maps", "mealsy"]
clean_links = []
for link in links:
    if not any(social in link for social in socials):
        clean_links.append(link)
Data Mining of Each Domain URL
Given the previously created list of URLs pertaining to a domain, the next step is to parse each individual HTML page for textual content which can later be abstracted into a meta description.
Begin by instantiating a new list to store all textual content in.
# Extract and clean text from each link
content = []
What follows is a nested conditional for-loop, explained here within the context of a single code block (a consolidated sketch of the loop appears after this list). The aim of the code block is as follows:
- For each link in the previously created list of links, request the URL with the requests module, invoking the requests.get function (the same caveat regarding HTTP error handling applies here).
- For each link, use the justext module to extract the textual content into a variable named paragraphs.
Note: The recommended implementation of the justext module for text data extraction is typically of the form seen below:
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)
Note (cont’d): In some instances, justext may be too aggressive in removing falsely categorized boilerplate content. As such, the following alternative is used where more textual information is captured, at the expense of boilerplate accuracy:
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
- For each link, a variable named for_processing is initialized to store processed paragraph data.
- For each link, a set of block words (block_words) is created to exclude href destinations which are not of interest. These destinations can be modified to suit your respective application's needs.
- For each link, for each extracted paragraph, convert the extracted paragraph object to text.
- For each link, for each extracted paragraph, if the length of the paragraph is under 50 characters, omit it from processing. This condition exists to provide the summarization model with paragraphs of sufficient length, since very short paragraphs are more difficult to interpret and condense further.
- For each link, for each extracted paragraph, given the first condition, omit any of the extracted paragraphs which contain any of the aforementioned block words.
- Lastly, for each link, join at most 5 paragraphs into a single entry, and append that entry to the content list.
Note: Why are only 5 paragraphs joined together for a single entry? This is done to improve the model's summarization speed and reduce computational time per summarization task. It also helps keep each entry within the summarization model's 1024-token input limit. Feel free to change this parameter to a higher value if the paragraphs returned are insufficiently rich in textual data.
- Optional: organize the links and content data into a pandas dataframe for rapid verification of results so far.
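Putting those steps together, a minimal sketch of the mining loop might look like the following. The block_words entries and the use of a space to join paragraphs are illustrative assumptions; adjust them to your own application.
# Sketch of the data-mining loop described above
content = []
block_words = ["subscribe", "cookie", "privacy"]  #hypothetical block words; tailor to your site

for link in clean_links:
    response = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

    for_processing = []
    for paragraph in paragraphs:
        text = paragraph.text                                    #convert the paragraph object to text
        if len(text) < 50:                                       #omit very short paragraphs
            continue
        if any(block_word in text for block_word in block_words):
            continue                                             #omit paragraphs containing block words
        for_processing.append(text)

    #Join at most 5 paragraphs into a single entry per link
    content.append(" ".join(for_processing[:5]))
Each link contributes exactly one entry to content, which keeps the list aligned with clean_links for the dataframe assembled at the end of the script.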
Applying the NLP Pipeline to the Mined Data
The next step is to apply the NLP pipeline to the data mined from each page's textual content. Recall that the NLP pipeline was instantiated with the following command:
# Summarize textual content using BART in pytorch
bart_summarizer = pipeline("summarization")
Proceed by creating a new empty list for all the summarized content to be stored in.
summarized_descriptions = []
Recall that all textual content was stored in a variable named content. We will iterate through each item in the content list, such that each item is processed by the NLP pipeline.
Optionally, print each original item along with its processed version for real-time viewing in the IDE.
# Summarize content for meta descriptions
summarized_descriptions = []
for item in content:
    print(item)
    summary = bart_summarizer(item, min_length = 20, max_length = 50)  #run the model once per item
    print(summary)
    summarized_descriptions.append(summary)
Note: Three parameters are passed into the bart_summarizer function: the item to be processed, the min_length, and the max_length. Please note that both the minimum and maximum lengths of the sequences to be generated are measured in tokens, not in characters. Since the number of characters per token is often variable, it is difficult to estimate the correct value for every use case.
Meta description best practices indicate a character range of 120 to 160 characters; adjust the min_length and max_length parameters accordingly to meet these best practices.
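Because the token-to-character ratio varies with the text, one practical approach is to spot-check a few generated summaries and measure their character counts before settling on values. A minimal sketch, assuming the content list and bart_summarizer defined above:
# Spot-check character counts to help calibrate min_length and max_length
for item in content[:3]:  #sample the first three entries
    summary_text = bart_summarizer(item, min_length = 20, max_length = 50)[0]["summary_text"]
    print(len(summary_text), summary_text)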
Standardization of Meta Descriptions
Currently, the summarized data from the deep learning model resides within the variable summarized_descriptions. The next step is to standardize the data for easy reading and export.
Note that each summary has been stored within summarized_descriptions as a single-element list containing a dictionary, which is the pipeline's raw output format.
Begin by unpacking the variable to extract only the summary_text values from this list of lists of dictionaries. Store the values in a variable called meta_descriptions.
#Retrieve values from list of dictionaries in summarized_descriptions
meta_descriptions = [summary[0]["summary_text"] for summary in summarized_descriptions]
The process of cleaning the data is iterative. For the sake of simplicity, each cleaning step below reassigns the same description variable within a single loop.
Begin general data cleaning by instantiating a variable named meta_descriptions_clean1 as an empty list.
#General data cleaning
meta_descriptions_clean1 = []
For each description in meta_descriptions (a consolidated version of the full loop is shown after these steps):
for description in meta_descriptions:
- Clean up leading and trailing spaces
#Clean up leading and trailing spaces
description = description.strip()
- Clean up excessive spaces
#Clean up excessive spaces
description = re.sub(' +', ' ', description)
- Clean up extra spaces before punctuation
#Clean up punctuation spaces
description = description.replace(' .', '.')
- Clean up incomplete sentences
#Clean up incomplete sentences by dropping a trailing fragment
if "." in description and not description.endswith("."):
    description = ".".join(description.split(".")[:-1])
- Append the cleaned description to meta_descriptions_clean1
meta_descriptions_clean1.append(description)
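For reference, here are the fragments above assembled into one runnable loop (the incomplete-sentence step keeps every complete sentence and drops only the trailing fragment):
#General data cleaning, assembled into a single loop
meta_descriptions_clean1 = []
for description in meta_descriptions:
    description = description.strip()                         #leading and trailing spaces
    description = re.sub(' +', ' ', description)               #excessive spaces
    description = description.replace(' .', '.')               #spaces before periods
    if "." in description and not description.endswith("."):
        description = ".".join(description.split(".")[:-1])    #drop trailing fragment
    meta_descriptions_clean1.append(description)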
Manual and Auto Truncation
This script should also ensure that meta descriptions adhere to the previously discussed SEO best practices. All descriptions should be at most approximately 160 characters and only exceed that limit if no other summary can be generated.
This program aims to output optimal truncation points for manual review, as well as auto-truncate meta descriptions wherever possible.
To enable this feature, full sentences must be present in all meta descriptions. Ensure a period is found at the end of each description with the following command, storing the results in a list called meta_descriptions_clean2:
meta_descriptions_clean2 = []
#Add a period to all sentences (if missing)
for description in meta_descriptions_clean1:
    if not description.endswith("."):
        description = description + "."
    meta_descriptions_clean2.append(description)
Now that the meta descriptions are correctly formatted, define a generator function that yields the indexes of a desired punctuation character within a string. In this case, the desired punctuation is a period signifying the end of a sentence.
#Find the index of the punctuation character desired
def find_all(string, character):
    index = string.find(character)
    while index != -1:
        yield index
        index = string.find(character, index + 1)
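As a quick illustration (the sample string here is arbitrary), the generator yields every index at which the character occurs:
#Example usage of find_all on a sample string
print(list(find_all("First sentence. Second sentence.", ".")))
#Output: [14, 31]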
To display viable truncation points for a user's manual review of each description, store the truncation point values in a list called truncation_points. For each description, this list holds the character indexes at which the description can be split so that the final 160-character limit can be adhered to.
# Store truncation points
truncation_points = []
character = "."
for description in meta_descriptions_clean2:
    indexes = list(find_all(description, character))
    truncation_points.append(indexes)
To auto-truncate the meta descriptions, initialize a new list for the auto-truncated values, meta_descriptions_clean3. For each description in meta_descriptions_clean2, if the description is longer than 160 characters and contains more than one period, shorten it by removing its last sentence. If the truncation results in a description that does not end in a period, reformat it by concatenating the punctuation. Otherwise, if the description is not longer than 160 characters, simply append it to the meta_descriptions_clean3 list.
# Auto truncate
meta_descriptions_clean3 = []
character = "."
for description in meta_descriptions_clean2:
    if len(description) > 160 and description.count(character) > 1:
        split = description.split(character)[:-2]
        description = character.join(split)
        if not description.endswith("."):
            description = description + "."
        meta_descriptions_clean3.append(description)
    else:
        meta_descriptions_clean3.append(description)
Next, verify the character counts of both non-truncated and auto-truncated meta descriptions.
#Verify length adherence for non-truncated descriptions
len_meta_descriptions = []
for description in meta_descriptions_clean2:
    len_meta_descriptions.append(len(description))

#Verify length adherence for auto-truncated descriptions
truncated_len_meta_descriptions = []
for description in meta_descriptions_clean3:
    truncated_len_meta_descriptions.append(len(description))
Lastly, organize all the relevant variables into a pandas dataframe for easy viewing.
# Organize results into dataframe
df = pd.DataFrame({'link': clean_links, 'content': content, 'meta_descriptions': meta_descriptions_clean2, 'description_length': len_meta_descriptions, 'truncation_points': truncation_points, 'truncated_descriptions': meta_descriptions_clean3, 'truncated_length': truncated_len_meta_descriptions})
df
Congratulations, you've made it to the end! With this final dataframe output, you can export the data to a .csv file, or use it to feed an API that will automatically update your website's meta descriptions.
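For example, a single pandas call writes the results to disk (the file name below is arbitrary):
#Export the final dataframe to a CSV file
df.to_csv("meta_descriptions.csv", index=False)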
It is strongly recommended that you review the generated meta descriptions, even at a cursory glance. Even with considerable programmatic cleaning of the data, there may be formatting errors or truncations that depend entirely on the URL being mined.
As always, you can find the relevant project files in both .py and .ipynb format at my GitHub repository.
If you enjoyed this content, feel free to connect on LinkedIn.