October 7, 2021

How to Manage Website ROT with Python

It’s been more than 25 years since the mainstreaming of the World Wide Web. Over the decades, most organizations have launched, relaunched and revised their Web presence multiple times, yet the battle against Redundant, Outdated and Trivial (ROT) data continues to this day.

Having a website that is well-organized with working links and current content is the cornerstone of any business’s Internet presence, and is key to attracting and retaining customers. Customers value up-to-date information, usability and functionality much more than trendy looks. In theory, it should be easy enough to check your site for ROT data, but in practice it often consumes more time and resources than you might think as sites grow and become more complex.

In general, there are two approaches to dealing with ROT:

  • Manual Validation – viable only for the smallest sites. Even moderately sized websites typically feature dozens of internal cross-links and many external reference links.
    • To put this in perspective, Google only indexes roughly the first 150 links on a single webpage, and pages with lots of links are extremely common.
  • Automated Validation –  the only approach that makes sense when even SMBs have sites with hundreds of pages, and enterprise sites can have millions of pages that potentially need to be validated. 
    • For example, Amazon has 12 million product listings, and every product has multiple pages that generate unique content. 

Without automation, it’s not practical to even attempt to check for ROT. This blog will discuss various automated methods to deal with ROT, including:

  1. Check the links for a single page
  2. Extend the Link Checker to search an entire Website
  3. Check for redundant and outdated content

All the code in this post can be found in my GitHub repo. Let’s get started.  

Before You Start: Install The Website ROT Python Environment

To follow along with the code in this article, you can download and install our pre-built Website ROT environment, which contains a version of Python 3.9 and the packages used in this post, along with already resolved dependencies!

In order to download this ready-to-use Python environment, you will need to create an ActiveState Platform account. Just use your GitHub credentials or your email address to register. Signing up is easy and it unlocks the ActiveState Platform’s many benefits for you!

Alternatively, you can use our State Tool to install this runtime environment.

For Windows users, run the following at a CMD prompt to automatically download and install our CLI, the State Tool along with the Website ROT runtime into a virtual environment:

powershell -Command "& $([scriptblock]::Create((New-Object Net.WebClient).DownloadString('https://platform.activestate.com/dl/cli/install.ps1'))) -activate-default Pizza-Team/Website-ROT"

For Linux users, run the following to automatically download and install our CLI, the State Tool along with the Website ROT runtime into a virtual environment:

sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Website-ROT

1–Automated Link Checking on a Website Page

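One of the most common (and frustrating) types of website ROT is broken or stale links. A 404 shows visitors an error page, while redirects usually go unnoticed by visitors but still indicate links that should be reviewed. The status codes a link checker will typically flag include: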

  • 404 “page not found” 
  • 301 “permanent redirect” 
  • 302 “temporary redirect”

The best practice approach to resolving these kinds of errors is typically:

  1. Bad link/error found
  2. Content manager reviews the error and originating link
  3. Link is updated or content is replaced

Automated link checkers will also turn up a number of other codes, such as 200 (OK/success status), or others that indicate unexpected errors or authentication requirements rather than bad links.
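To make the triage above concrete, here is a minimal sketch of how a link checker's results might be bucketed by status code. The function name `classify_status` and the bucket labels are hypothetical choices for illustration, not part of any package mentioned in this post.

```python
def classify_status(code: int) -> str:
    """Map an HTTP status code to a rough triage bucket for a link report."""
    if 200 <= code < 300:
        return "ok"
    if code in (301, 308):
        return "permanent-redirect"  # update the link to the new location
    if code in (302, 303, 307):
        return "temporary-redirect"  # review whether it should be permanent
    if code in (401, 403):
        return "auth-required"       # likely not a bad link
    if code in (404, 410):
        return "broken"
    if 500 <= code < 600:
        return "server-error"        # may be transient, so retry later
    return "other"


# Print the bucket for a handful of common codes.
for code in (200, 301, 302, 403, 404, 503):
    print(code, classify_status(code))
```

Only the "broken" and redirect buckets would normally go to a content manager for review; the others are better handled by a retry or by supplying credentials.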

While there are a number of Python-based programs that can check links for you (such as Dead_Link_Checker), you may prefer to write your own in Python using a combination of the Requests and BeautifulSoup packages, both of which I’ve included in the Website ROT environment. The benefit of writing your own is that you can parse and loop through the various files that are retrieved in order to provide even more automation.

If you’ve installed the Website ROT environment, my GitHub repo was automatically cloned into the virtual environment, so all you need to do is cd into the directory and run DLChecker. 

Here is example output from running it against a live site:

(dlc) Dead_Link_Checker% python3 ./src/DLChecker.py https://gifted-tesla-ec935f.netlify.app/

URL Checker is activated

PASSED [200] https://gifted-tesla-ec935f.netlify.app/ - Good


------------- Checking is done ---------------

------ The following links were working ------

| PASSED [200] https://gifted-tesla-ec935f.netlify.app/ - Good

2–Automated Link Checking of an Entire Website

The basic link checking Python script can run through a single page of a website. But you can easily extend this basic framework to include more loops that can recursively check an entire website automatically.

First, you need to retrieve the page that you want to parse from the website using urllib.request:

resp = urllib.request.urlopen("http://some.site/page.html")

Next, parse the page using BeautifulSoup:

soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))

Then, loop through the results to work with all of the href links in the page:

for link in soup.find_all('a', href=True):

Finally, request each link and record its HTTP status code so that you can see if any of the links are bad:

for link in soup.find_all('a', href=True):
    # allow_redirects=False reports 301/302 responses instead of following them
    tested = requests.get(link['href'], allow_redirects=False)
    print(tested.url + "\t" + str(tested.status_code))

The output looks like this:

(diy) % python adlc.py https://gifted-tesla-ec935f.netlify.app/random.html

http://www.craigsrandomwebsite.com/ 200

https://bit.ly/IqT6zt 301

http://www.wildmoodswings.co.uk/ 302

http://en.wikipedia.org/wiki/Special:Randompage 301

http://www.startpagerandomizer.com/ 404

https://gifted-tesla-ec935f.netlify.app/index.html 200
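The snippets above check the links on one page. Extending that to a whole site can be sketched as a breadth-first crawl that resolves relative links, checks every link once, and only parses pages on the starting host for more links. This is a rough sketch under stated assumptions rather than the code in the repo: the names `LinkExtractor`, `extract_links`, `crawl` and `fake_fetch` are hypothetical, the `example.test` URLs are made up, and the standard library's `html.parser` is swapped in for BeautifulSoup so the example runs without extra packages or network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) URLs found in html, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) must return (status_code, html_text).

    Every link is checked once, but only pages on the start URL's host
    are parsed for further links. Returns {url: status_code}.
    """
    host = urlparse(start_url).netloc
    statuses = {}
    queue = deque([start_url])
    while queue and len(statuses) < max_pages:
        url = queue.popleft()
        if url in statuses:
            continue
        status, html = fetch(url)
        statuses[url] = status
        if status == 200 and urlparse(url).netloc == host:
            for link in extract_links(html, url):
                if link not in statuses:
                    queue.append(link)
    return statuses


# Demo against an in-memory "site" (hypothetical URLs) instead of the network.
SITE = {
    "https://example.test/": (
        200, '<a href="/about.html">About</a> <a href="/gone.html">Gone</a>'),
    "https://example.test/about.html": (
        200, '<a href="/">Home</a> <a href="https://other.test/">External</a>'),
}


def fake_fetch(url):
    return SITE.get(url, (404, ""))


for url, status in crawl("https://example.test/", fake_fetch).items():
    print(url + "\t" + str(status))
```

To run this against a live site, `fetch` could wrap `requests.get(url, allow_redirects=False, timeout=10)` and return the response's `status_code` and `text`. The `max_pages` cap is there so that a misconfigured crawl cannot run indefinitely.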

3–Outdated and Redundant Content

The content on your organization’s website can become outdated or redundant without something in place to check for such things. Sometimes, a page may appear to be redundant to an external checker because of an improper use of redirects. For example, if a page keeps returning a temporary redirect, or if the redirect is handled in JavaScript, then search engines like Google may treat the content as existing in both places, which can reduce your ranking in search results since there no longer appears to be a single authoritative page. Such unintentional redundancies can be discovered through the same kind of automation that can find dead links.
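One simple way to surface content that exists in more than one place is to fingerprint each page's text and group the URLs that share a fingerprint. The sketch below assumes you have already collected the page text into a dictionary; the names `fingerprint`, `find_duplicates` and `pages`, as well as the `example.test` URLs, are hypothetical. It only catches pages that match exactly after whitespace and case are normalized, so near-duplicates would need a fuzzier comparison.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash page text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose text is identical after normalization.

    pages maps URL -> extracted page text. Returns a list of URL groups,
    each containing two or more URLs.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical sample data: two URLs serving the same text.
pages = {
    "https://example.test/pricing": "Plans and   pricing\nfor teams",
    "https://example.test/old-pricing": "plans and pricing for teams",
    "https://example.test/about": "About us",
}
print(find_duplicates(pages))
```

Each group that comes back is a candidate for a permanent redirect to a single authoritative page.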

In general, though, it’s far more difficult to automate the discovery of outdated and redundant content using a checker external to the content management system (CMS) being used to manage the website. This is mainly due to the fact that page content is (typically) being dynamically generated, so you can’t rely on the timestamp to determine how old the content is.

While you can still use Python to help automate content checking, your Python scripts will need to be tailored to your specific Web CMS. For example, the Intelligent Tools plugin for Drupal uses Python and NLP tools to help collect, analyze, and tag all of the content stored in a Drupal website, which makes the process of identifying potentially redundant content much easier.

Conclusions – Automate ROT Removal

Web content managers are busy enough optimizing the stream of new content that’s being onboarded to their company’s website. They don’t need to waste time combing through the site for old and broken content. By enabling automation based on a popular language like Python, content management teams will have more time to focus on adding value while still ensuring that existing content isn’t stale, and that bad links won’t cause a bad customer experience. Neither existing customers nor potential clients will be happy if your site is full of ROT.

Give it a try with your website.

With the ActiveState Platform, you can create your Python environment in minutes, just like the one we built for this project. Try it out for yourself or learn more about how it helps Python developers be more productive.

Vince Power

Vince Power is a Solution Architect who has a focus on cloud adoption and technology implementations using open source-based technologies. He has extensive experience with core computing and networking (IaaS), identity and access management (IAM), application platforms (PaaS), and continuous delivery.