How to do text summarization with deep learning and Python

In this blog: Learn how to do text summarization using Python. Download the Text Summarization Python environment, import the text to be summarized, build, test and run the routine to summarize the text.

Ever feel like you don’t have enough time to read everything that you want to? What if you could run a routine that summarized documents for you, whether it’s your favorite news source, academic articles, or work-related documents?

Text summarization is a Natural Language Processing (NLP) task that condenses the information in large texts for quicker consumption without losing vital information. Your favorite news aggregator (such as Google News) uses text summarization algorithms to give you enough information to decide whether an article is relevant, without you having to click the link.

This tutorial will walk you through a simple text summarization task. We’ll use Extractive Text Summarization and packages like newspaper3k and PyPDF2 to convert the text into a format that Python understands. Finally, we’ll use spaCy and its pre-trained NLP pipeline to summarize the text. Once you understand how text summarization works, you can also try the same approach on audio files that have first been transcribed to text.

There are two main text summarization methods:

  • Extractive Text Summarization – attempts to identify significant sentences and then adds them to the summary, which will contain exact sentences from the original text.
  • Abstractive Text Summarization – attempts to identify important sections, interpret the context and intelligently generate a summary.

In this article, we’ll use the extractive method on a news article. The process is straightforward:

  1. Install a Python environment that contains all of the packages that you’ll need for the task.
  2. Import the text to be summarized.
  3. Build, test and run the routine to summarize the text.

All of the code used in this article can be found on my GitLab repository. All set? Let’s go.

Step 1 – Install a Text Summarization Python Environment

To follow along with the code in this article, you can download and install our pre-built Text Summarization environment, which contains a version of Python 3.8 and the packages used in this post.

In order to download this ready-to-use Python environment, you will need to create an ActiveState Platform account. Just use your GitHub credentials or your email address to register. Signing up is easy, and it unlocks the many benefits of the ActiveState Platform!

For Linux users: run the following to automatically download and install our CLI, the State Tool, along with the Text Summarization runtime into a virtual environment:

sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Text-Summarization

Step 2 – Choose a Text Source for Abstractive Text Summarization

The quality, type, and density of the information conveyed via text vary from source to source. Textbooks tend to be low in density but high in quality, while academic articles are high in both quality and density. News articles, on the other hand, can vary significantly from source to source.

Regardless of where the text comes from, the goal here is to minimize the time you spend reading. Thus, we’ll build a tool that can easily be adapted to any number of sources.

For this example, we will use a news article on a recent global warming study from ScienceDaily as our text source. Feel free to use a different article.

To extract the text from the URL, we’ll use the newspaper3k package:

from newspaper import Article

url = 'https://www.sciencedaily.com/releases/2021/08/210811162816.htm'
article = Article(url)
article.download()  # fetch the article HTML from the URL
article.parse()     # extract the text, title, and other attributes

Downloading and parsing the article extracts its relevant attributes. From here, we can view the article text:

article.text

The full text is quite long and dense. It will serve as the input for the summarization routine that we’ll write in the next step.
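Beyond the text, the parsed Article object exposes a few other attributes that can be handy for labeling your summaries, for example:

article.title         # the article's headline
article.authors       # list of detected author names
article.publish_date  # publication date, when detected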

If your particular application requires extracting text from PDF documents, try the PyPDF2 package. Alternatively, if you have audio files that need to be transcribed to text, try the SpeechRecognition package. Once you have the text in a format that Python can understand, you can move on to summarizing it.
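For instance, here’s a minimal sketch of PDF extraction, assuming a recent version of PyPDF2 (which provides the PdfReader class) and a hypothetical local file named document.pdf:

from PyPDF2 import PdfReader

reader = PdfReader('document.pdf')  # 'document.pdf' is a hypothetical local file
# Join the text extracted from each page; extract_text() can return None for empty pages
text = '\n'.join(page.extract_text() or '' for page in reader.pages)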

Step 3 – Summarizing Text with spaCy

A human might approach the task of summarizing a document as follows:

  1. Read the full text
  2. Understand the concepts being conveyed
  3. Pick out the most important concepts
  4. Restate those concepts in a more concise manner

For a computer to perform the same task, a semantic understanding of the text is necessary. While semantic analysis is possible with current NLP algorithms, it often requires significant computational power and produces results similar in quality to other extractive techniques.

Rather than understanding the text, extractive summarization relies on quantitative metrics constructed from the text itself, without attaching any exogenous meaning. Our approach is to simply:

  1. Look at the use frequency of specific words
  2. Sum the frequencies within each sentence
  3. Rank the sentences based on this sum

Of course, our assumption is that a higher-frequency word use implies a more ‘significant’ meaning. This may seem overly simplistic, but this approach often produces surprisingly good results.
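As a toy illustration of this scoring scheme (using made-up sentences, not text from the article), note how repetition alone determines the ranking:

from collections import Counter

# Two hypothetical sentences, already tokenized into lowercase words
sentences = [['warming', 'drives', 'more', 'warming'],
             ['ice', 'sheets', 'matter']]
freq = Counter(word for sent in sentences for word in sent)        # step 1: count word use
scores = [sum(freq[word] for word in sent) for sent in sentences]  # step 2: sum per sentence
print(scores)  # [6, 3] -- the first sentence ranks higher because 'warming' repeats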

To begin, we’ll first need to import the different packages:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

We’ll use spaCy to import a pre-trained NLP pipeline that helps interpret the grammatical structure of the text. This lets us filter out the most common words (i.e. STOP_WORDS) as well as punctuation (i.e. punctuation). We’ll also use the nlargest function to extract a percentage of the most important sentences. Our algorithm will use the following steps:

  • Tokenize the text with the spaCy pipeline. This segments the text into words, punctuation, and so on, using grammatical rules specific to the English language.
  • Count the number of times a word is used (not including stop words or punctuation), then normalize the count. A word that’s used more frequently has a higher normalized count.
  • Calculate the sum of the normalized count for each sentence.
  • Extract a percentage of the highest ranked sentences. These serve as our summary.

We can write a function that performs these steps as follows:

def summarize(text, per):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # Count how often each word is used, ignoring stop words and punctuation
    word_frequencies = {}
    for word in doc:
        word_text = word.text.lower()
        if word_text not in STOP_WORDS and word_text not in punctuation:
            word_frequencies[word_text] = word_frequencies.get(word_text, 0) + 1
    # Normalize the counts so that the most frequent word scores 1.0
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_frequency
    # Score each sentence by summing the normalized frequencies of its words
    sentence_tokens = list(doc.sents)
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            word_text = word.text.lower()
            if word_text in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word_text]
    # Keep the highest-scoring sentences, up to the requested percentage
    select_length = int(len(sentence_tokens) * per)
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    return ' '.join([sent.text for sent in summary])
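Note that spacy.load('en_core_web_sm') requires spaCy’s small English model. If it isn’t already installed in your environment, you can download it with:

python -m spacy download en_core_web_sm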

Note that per is the percentage (0 to 1) of sentences you want to extract. To test it out on the ScienceDaily article, run:

summarize(article.text, 0.05)

The output is a single string of the highest-scoring sentences, which in this case amounts to roughly the top 5% of the original article.

You can read the complete article for yourself to judge how well this reflects the complete text. However, a summary is already provided by the author at the top of the article. It reads, “Global warming begets more, extreme warming, new paleoclimate study finds. Researchers observe a ‘warming bias’ over the past 66 million years that may return if ice sheets disappear.”

Pretty much spot on, right?

Conclusion – Use Text Summarization to Improve Your Productivity with Python

Maximizing your efficiency by minimizing the time you spend reading can have a dramatic impact on productivity. Whether you’re reading textbooks, reports, or academic journals, natural language processing with Python and spaCy can reduce the time you spend reading without diluting the quality of the information.


Related Blogs:

How to Build a Digital Virtual Assistant in Python

How To Build a Recommendation Engine in Python
