Published in

Towards Data Science

Graham Harrison

May 28, 2021

5 min read

Basic Text Summarization in Python

How to transform complex written reports into 300-word condensed summaries using text summarization

Background

Text summarization is a subset of text mining and natural language processing that aims to transform a long body of text into a summary that can be read and understood quickly without losing the meaning of the original text.

In particular, text summarization “tokenizes” words (i.e. converts them into data) and then assesses the importance of each one by looking at its relative frequency and other factors. The word importance scores are then aggregated into a score for each sentence, and the most important sentences bubble up to the top of the summary.
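The scoring idea described above can be sketched in a few lines of plain Python. This is only an illustration of frequency-based sentence scoring, not the algorithm gensim uses; the stop-word list, the naive sentence splitter and the names `frequency_summary`, `split_sentences` and `tokenize` are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real tools use far longer lists).
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
              "is", "it", "of", "on", "or", "that", "the", "this", "to", "with"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case the sentence and keep the words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def frequency_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    word_counts = Counter(w for s in sentences for w in tokenize(s))
    if not word_counts:
        return ""
    top = max(word_counts.values())
    # A sentence's score is the sum of the relative frequencies of its words.
    scores = [sum(word_counts[w] / top for w in tokenize(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```

Summing word scores favours long sentences, which is one reason production summarizers add normalisation and other factors on top of raw frequency.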

If you would like a more in-depth exploration of the principles, this article is a great place to start: https://towardsdatascience.com/a-quick-introduction-to-text-summarization-in-machine-learning-3d27ccf18a9f

Objective

The objective of this article is to produce a basic interactive text summarization utility to demonstrate the principles and to provide a way of generating basic summaries for complex reports.

The summarization will have been successful if the summaries impart the majority of the meaning of the source report and can be read and understood in a fraction of the time.

Preparation

To start with we will need to import the libraries that will be used …

File Operations

The building blocks for our text summarization utility handle the leg-work of allowing the user to browse the local file system to select a file to summarize.

openfile() handles showing the File | Open dialog box, selecting a file, and returning the full path of the selected file.
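The original listing for openfile() is not reproduced here, so the following is a minimal sketch of how such a function could look using the standard-library tkinter file dialog. The `ask_for_path` parameter is a hypothetical addition that lets the dialog be replaced in tests or on machines without a display.

```python
def openfile(ask_for_path=None):
    """Show a File | Open dialog and return the path the user selected.

    ask_for_path is a hypothetical hook: pass any zero-argument callable
    to bypass the real dialog (for example in automated tests).
    """
    if ask_for_path is not None:
        return ask_for_path()

    # Imported lazily so this module still loads where Tk is not available.
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window, show only the dialog
    try:
        return filedialog.askopenfilename(
            title="Select a file to summarize",
            filetypes=[("Text files", "*.txt"), ("Word documents", "*.docx")],
        )
    finally:
        root.destroy()
```

If the user cancels the dialog, the returned value is empty, so callers should check it before trying to read the file.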

Next, getText takes the path, opens the file and reads its text into a str. The code provided can read .txt and .docx files.
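A reader along these lines could be sketched as follows. This is an assumed implementation rather than the author's listing: it supposes UTF-8 encoded text files and relies on the third-party python-docx package for .docx files.

```python
from pathlib import Path


def getText(path):
    """Return the text content of a .txt or .docx file as a single str."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        # Assumption: text files are UTF-8 encoded.
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".docx":
        # Assumption: python-docx is installed (pip install python-docx).
        from docx import Document

        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Importing python-docx inside the .docx branch keeps plain-text summarization working even when that package is not installed.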

Text Summarization

The main work is done in a single line of code by calling the gensim.summarization.summarize function. Note that the gensim.summarization module was removed in gensim 4.0, so this call requires a gensim 3.x release. The same effect can be achieved with nltk, the Natural Language Toolkit, but that approach is more involved and requires more low-level work.
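As a hedged sketch, the call can be wrapped so that a missing or too-recent gensim installation produces a clear message. The wrapper name `summarize_with_gensim` is hypothetical; `summarize(text, word_count=...)` is the gensim 3.x function named above.

```python
def summarize_with_gensim(text, word_count=300):
    """Summarize text with gensim 3.x, limiting the result to about word_count words."""
    try:
        # Only available in gensim 3.x; the module was removed in gensim 4.0.
        from gensim.summarization import summarize
    except ImportError as exc:
        raise RuntimeError(
            "gensim.summarization is unavailable; install a gensim 3.x release "
            "(for example gensim==3.8.3) to use it"
        ) from exc
    return summarize(text, word_count=word_count)
```

The `word_count` argument is what limits the output to roughly 300 words; gensim also accepts a `ratio` argument when a proportion of the original text is preferred.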

The summarise function provides the implementation and follows the three basic steps of text summarization -

Pre-process and prepare the text.
Perform the text summarization process.
Post-process and tidy the result.
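The three steps above can be sketched as a small pipeline. The pre- and post-processing shown here are assumptions chosen for illustration (whitespace clean-up and joining the selected sentences with a space), and the `summarizer` parameter is a hypothetical hook that accepts any function taking the text and a word limit.

```python
import re


def preprocess(text):
    """Step 1: collapse runs of whitespace, including stray line breaks."""
    return re.sub(r"\s+", " ", text).strip()


def postprocess(summary):
    """Step 3: join the selected sentences with a single space between them."""
    return " ".join(line.strip() for line in summary.splitlines() if line.strip())


def summarise(text, summarizer, word_count=300):
    """Run the pre-process, summarize and post-process steps in order."""
    cleaned = preprocess(text)
    raw_summary = summarizer(cleaned, word_count)  # Step 2: the actual summarization
    return postprocess(raw_summary)
```

Joining with a space rather than an empty string matters because summarizers commonly return one sentence per line, and removing the line breaks outright would run the sentences together.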

Industrial-strength text processing engines implement these three steps in far more complex and comprehensive ways, but this serves as a good example and, as we will see, the summaries it produces can be quite useful.

Calling the Text Summarization

The last few helper functions are -

printmd which formats and prints text that includes markup to give us a nicely formatted heading.
openLocalFileAndSummarize which calls the functions we have defined above to select a file and summarize the contents.
on_button_clicked which handles the button click event to add interactivity to the Notebook and to enable several files to be selected and summarised sequentially.
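A minimal sketch of how these helpers could fit together is shown below. The signatures are assumptions: the file chooser, reader and summarizer are passed in as parameters so the example stays self-contained, and `make_on_button_clicked` is a hypothetical factory that builds the click handler.

```python
import os


def printmd(markdown_text):
    """Render Markdown in a notebook; fall back to plain print elsewhere."""
    try:
        # Assumption: IPython is available when running inside a notebook.
        from IPython.display import Markdown, display
    except ImportError:
        print(markdown_text)
    else:
        display(Markdown(markdown_text))


def openLocalFileAndSummarize(choose_file, read_text, summarize, show=printmd):
    """Select a file, summarize its contents and print the result."""
    path = choose_file()
    if not path:
        return None  # the dialog was cancelled
    summary = summarize(read_text(path))
    show(f"## Executive Summary for {os.path.basename(path)}")
    print(summary)
    return summary


def make_on_button_clicked(choose_file, read_text, summarize):
    """Build a click handler; notebook buttons pass themselves as the argument."""
    def on_button_clicked(_button):
        openLocalFileAndSummarize(choose_file, read_text, summarize)
    return on_button_clicked
```

Because the handler does nothing but call openLocalFileAndSummarize, clicking the button again simply starts a new select-and-summarize cycle.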

Testing the Summarization

In order to test the text summarization I collected the text from two public web sites which have published freely available reports on the topics of online reporting and marketing.

The two web sites used to provide the test data are -

Running the Summarization

Lastly, a single line of code creates a button that can be clicked repeatedly. Each click selects a .txt or .docx file from the local disk, transforms it into a 300-word summary and prints the summary below the button.
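Assuming the notebook uses the ipywidgets package, the button could be created along these lines. The helper name `make_summarize_button` and the button label are hypothetical.

```python
def make_summarize_button(on_click, label="Summarize a file"):
    """Create a notebook button that calls on_click(button) each time it is clicked."""
    # Assumption: the third-party ipywidgets package is installed.
    import ipywidgets as widgets

    button = widgets.Button(description=label)
    button.on_click(on_click)  # the handler receives the button instance
    return button
```

In a notebook cell, leaving the returned button as the last expression, or passing it to IPython's `display`, renders it in the cell output.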

When the button is clicked a File | Open dialog box is displayed to select a file-

Once the file has been selected gensim.summarization.summarize does the work and the output is formatted in the cell output -

Executive Summary for Unleashing the Power of Online Reporting.txt

Unleashing the Power of Online Reporting Source: Sustainable Brands, 15 February 2018 Confronted with an ever-growing demand for transparency and materiality, companies need to find an adequate format to publish both financial and pre-financial information to their stakeholders in an effective way. In the real world, they are confronted with a multitude of information sources different in name, type and content, like Annual Report, CSR Report, Financial Statement, Sustainability Report, Annual Review, Corporate Citizenship Report, or Integrated Report. Online integrated reporting to the rescue Research from Message Group among Europe’s 800 largest companies shows that between 2015 and 2017 the number of businesses publishing financial and extra-financial information in one single integrated report increased by 34 percent, while the number of companies publishing separate sustainability and annual reports decreased by 30 percent. Unlike stand-alone annual or sustainability reports, online integrated reporting formats put an organization’s financial performance, business strategy and governance into the social, environmental and economic context within which it operates. Instead of directing readers from a webpage to separate PDF reports and resources located in different online places, Core & More provides all the relevant information on a single multi-layered website. Moreover, it is flexible enough to integrate multiple reporting frameworks, accounting standards and concepts, such as the International Financial Reporting Standard (IFRS) in combination with GRI, <IR>, the TCFD recommendations, or the SDGs.
On top of all this flexibility, a digital reporting format is highly interactive and customizable, allowing the reader to create charts or compile selected content into personal PDF or printed reports. Turning reporting into a powerful communications tool In view of growing demand for ESG disclosure, the related surge in sustainability reporting, and a complex reporting landscape companies are challenged to find a disclosure format for both financial and extra-financial information that has a measurable value for their key stakeholders.

Conclusion

Text summarization can be a complex and involved process of pre-processing, summarization and post-processing. A real-world application that could summarize complex reports without losing the meaning of the original text would have commercial value.

However, in this article we have explored the basic concepts and quickly built a simple text summarization tool that can open .txt or .docx files and summarize the contents into 300 words by selecting the highest-scoring sentences.

The test on two public reports showed that the basic text summarization process works and provided an example of what the summarization output looks like.

The full source code can be found in the grahamharrison68/Public-Github repository on github.com.

Thank you for reading!

If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/?

Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to data science and data analytics.

If you would like to get in touch to discuss any of these topics please look me up on LinkedIn — https://www.linkedin.com/in/grahamharrison1 or feel free to e-mail me at GHarrison@lincolncollege.ac.uk.
