Basic Text Summarization in Python

How to transform complex written reports into condensed 300-word summaries using text summarization

Photo by Patrick Tomasso on Unsplash

Background

Text summarization is a subset of text mining and natural language processing that aims to transform a long corpus of text into a summary that can be read and understood quickly without losing the meaning of the original.

In particular, text summarization “tokenizes” the text (i.e. splits it into words that can be treated as data) and then assesses the importance of each word by looking at its relative frequency and other factors. The word importance scores are then aggregated into scores for sentences, with the most important sentences bubbling up to the top of the summary.
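To make the scoring idea concrete, here is a minimal sketch in plain Python. It is a deliberately simplified frequency-based approach, not the algorithm gensim uses, and the stop-word list and the names split_sentences, tokenize and frequency_summary are illustrative assumptions rather than anything from the original code.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption); real tools use far larger ones.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
              "is", "it", "of", "on", "that", "the", "to", "was", "with"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case the sentence and keep word tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def frequency_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    if not freq:
        return ""
    top = max(freq.values())

    def score(sentence):
        words = tokenize(sentence)
        # Average relative frequency, so long sentences are not favoured automatically.
        return sum(freq[w] / top for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```

Calling frequency_summary(report_text, max_sentences=5) would return the five sentences whose words occur most often across the whole text, which is the "bubbling up" described above in its simplest form.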

If you would like a more in-depth exploration of the principles, this article is a great place to start — https://towardsdatascience.com/a-quick-introduction-to-text-summarization-in-machine-learning-3d27ccf18a9f

Objective

The objective of this article is to produce a basic interactive text summarization utility to demonstrate the principles and to provide a way of generating basic summaries for complex reports.

The summarization will have been successful if the summaries impart the majority of the meaning of the source report and can be read and understood in a fraction of the time.

Preparation

To start with we will need to import the libraries that will be used …

File Operations

The building blocks for our text summarization utility handle the leg-work of allowing the user to browse the local file system to select a file to summarize.

openfile() handles showing the File | Open dialog box, selecting a file, and returning the full path of the selected file.
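A file dialog of this kind might be sketched with the standard tkinter library as follows. Whether the original openfile() uses tkinter is an assumption, and the open_file name, the FILE_TYPES list and the dialog title are hypothetical.

```python
# File types offered in the dialog (an assumption based on the formats the article mentions).
FILE_TYPES = [("Text files", "*.txt"), ("Word documents", "*.docx")]


def open_file(title="Select a file to summarize"):
    """Show a File | Open dialog and return the chosen path (empty if cancelled)."""
    # Imported lazily so this module still loads on machines without a display.
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()                    # hide the empty main window
    root.attributes("-topmost", True)  # keep the dialog above other windows
    try:
        return filedialog.askopenfilename(title=title, filetypes=FILE_TYPES)
    finally:
        root.destroy()                 # always release the hidden window
```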

Next, getText takes the path, opens the file and reads its text into a str. The code provided can read from .txt or .docx files.
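One way to read both formats using only the standard library is sketched below. This is a stand-in rather than the original getText: it assumes UTF-8 text files, relies on the fact that a .docx file is a zip archive whose body lives in word/document.xml, and the get_text name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by WordprocessingML, the markup inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_text(path):
    """Return the text of a .txt or .docx file as a single str."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")  # assumes UTF-8 encoded text
    if suffix == ".docx":
        # A .docx file is a zip archive; the document body is word/document.xml.
        with zipfile.ZipFile(path) as archive:
            root = ET.fromstring(archive.read("word/document.xml"))
        paragraphs = []
        for para in root.iter(W_NS + "p"):
            # Join the text runs (<w:t>) that make up each paragraph.
            paragraphs.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
        return "\n".join(paragraphs)
    raise ValueError(f"Unsupported file type: {suffix}")
```

A dedicated package such as python-docx handles more of the format (tables, headers, footnotes), so this sketch is best treated as a dependency-free fallback.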

Text Summarization

The main work is done in a single line of code by calling the gensim.summarization.summarize function. Note that the gensim.summarization module was removed in gensim 4.0, so this requires an earlier release. The same effect can be achieved with nltk, the Natural Language Toolkit, but that would be more involved and would require more low-level work.

The summarise function provides the implementation and follows the 3 basic steps involved in text summarization -

  1. Pre-process and prepare the text.
  2. Perform the text summarization process.
  3. Post-process and tidy the result.
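The three steps above can be sketched as a thin wrapper around whichever summarizer is available. The signature below, the whitespace clean-up and the first_sentences placeholder are all assumptions made for illustration; the placeholder simply keeps leading sentences and is not a real summarization algorithm.

```python
import re


def first_sentences(text, word_count):
    """Placeholder summarizer: keep leading sentences until the word budget is used."""
    kept, used = [], 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        n = len(sentence.split())
        if kept and used + n > word_count:
            break
        kept.append(sentence)
        used += n
    return " ".join(kept)


def summarise(text, summarizer=first_sentences, word_count=300):
    """Run the pre-process, summarize, post-process pipeline on text."""
    # 1. Pre-process: collapse line breaks and repeated whitespace.
    cleaned = re.sub(r"\s+", " ", text).strip()
    if not cleaned:
        return ""
    # 2. Summarize: delegate to the supplied summarizer callable.
    summary = summarizer(cleaned, word_count)
    # 3. Post-process: join the result into one tidy paragraph.
    return re.sub(r"\s+", " ", summary).strip()
```

With a gensim release older than 4.0 installed, the same wrapper could delegate to the library after from gensim.summarization import summarize, for example summarise(text, summarizer=lambda t, n: summarize(t, word_count=n)).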

Many serious, industrial-strength text processing engines involve complex and comprehensive implementations of these 3 steps. This is a much simpler example, but it does provide summarization that can be quite useful, as we will see.

Calling the Text Summarization

The last few helper functions are -

  • printmd which formats and prints text that includes markup to give us a nicely formatted heading.
  • openLocalFileAndSummarize which calls the functions we have defined above to select a file and summarize the contents.
  • on_button_clicked which handles the button click event to add interactivity to the Notebook and to enable several files to be selected and summarized sequentially.
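A rough sketch of how these helpers could fit together is shown below. It assumes IPython and ipywidgets are available inside the Notebook; the snake_case names, the summary_heading and make_button helpers, and the bold heading format are hypothetical. The three callables passed in stand for the file dialog, file reader and summarise functions described earlier.

```python
import os


def summary_heading(path):
    """Build the Markdown heading shown above each summary (format is an assumption)."""
    return f"**Executive Summary for {os.path.basename(path)}**"


def printmd(markdown_text):
    """Render Markdown in a notebook; fall back to plain print elsewhere."""
    try:
        from IPython.display import Markdown, display
    except ImportError:
        print(markdown_text)
    else:
        display(Markdown(markdown_text))


def open_local_file_and_summarize(choose_file, read_text, summarize):
    """Select a file, summarize its contents, print the result and return it."""
    path = choose_file()
    if not path:  # the user cancelled the dialog
        return None
    summary = summarize(read_text(path))
    printmd(summary_heading(path))
    print(summary)
    return summary


def make_button(choose_file, read_text, summarize, label="Summarize a file"):
    """Create a notebook button that runs the workflow on every click."""
    import ipywidgets as widgets  # imported lazily; only needed inside a notebook

    def on_button_clicked(_button):
        open_local_file_and_summarize(choose_file, read_text, summarize)

    button = widgets.Button(description=label)
    button.on_click(on_button_clicked)
    return button
```

Passing the functions in as arguments keeps the click handler small and lets each piece be swapped or tested on its own.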

Testing the Summarization

In order to test the text summarization I collected the text from two public web sites which have published freely available reports on the topics of online reporting and marketing.

The two web sites used to provide the test data are -

Running the Summarization

Lastly, a single line of code creates a button that can be clicked repeatedly. Each click lets the user select a .txt or .docx file from the local disk, which is then transformed into a 300-word summary and printed below.

Image by Author

When the button is clicked a File | Open dialog box is displayed to select a file -

Image by Author

Once the file has been selected, gensim.summarization.summarize does the work and the result is formatted in the cell output -

Executive Summary for Unleashing the Power of Online Reporting.txt

Conclusion

Text summarization can be a complex and involved process of pre-processing, summarization and post-processing. A real-world application that could summarize complex reports without losing the meaning of the original text would have commercial value.

However, in this article we have explored the basic concepts and quickly built a simple text summarization tool that opens .txt or .docx files and summarizes the contents into 300 words by selecting the most important sentences.

The test on two public reports showed that the basic text summarization process works and provided an example of what the summarization output looks like.

The full source code can be found here -

Thank you for reading!

If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/?

Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to data science and data analytics.

If you would like to get in touch to discuss any of these topics please look me up on LinkedIn — https://www.linkedin.com/in/grahamharrison1 or feel free to e-mail me at GHarrison@lincolncollege.ac.uk.
