
How to Summarize Text Using Machine Learning and Python?

The ability to extract significant insights from enormous volumes of textual data and summarize them efficiently is critical in today's digital world. Machine learning techniques, combined with the flexibility and power of the Python programming language, provide a practical answer to this problem.

ML algorithms can be used to create intelligent systems that analyze lengthy material and reduce it to brief summaries, saving time and improving comprehension.

This article delves into the field of text summarization and provides an overview of the essential concepts and strategies for constructing accurate summarization models.


How to Summarize Text in Python and Machine Learning?

When it comes to summarizing text using machine learning and Python, various Python libraries and machine learning models are available.

Each library and model has different code and a different accuracy ratio because of how it has been trained. So, for this guide, we'll use an advanced Python framework named Transformers and Facebook's machine learning model named BART Large CNN.

But as we’ve specified earlier, various Python libraries and machine learning models are available to automate the task of text summarization.

So, why have we chosen the Transformers and BART Large CNN? Let’s explore the answer to the ‘why’ part before getting to the steps of summarizing text using machine learning and Python.

Text summarization is an example of an advanced use case of Natural Language Processing (NLP). And advanced Python frameworks like Transformers have made it easier to deal with the advanced use cases of NLP. So, that’s the primary reason for choosing the Transformers Python framework here.

However, as far as the selection of BART Large CNN is concerned, this machine learning model has given exceptional results for text summarization when used in combination with Transformers.

So, that’s the reason for choosing the BART Large CNN machine learning model in this guide.


Steps to Summarize Text Using BART Large CNN and Transformers Python Framework

The BART Large CNN machine learning model is easy to implement for smaller datasets but becomes challenging to handle for larger problems and bigger datasets. So, let's see the best way to use this machine learning model along with the selected Python framework (Transformers).

Note: We have listed the following steps assuming you've already installed Python on your system. If you don't know whether Python is installed on your system or not, we recommend checking its installation status first.
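If you're not sure, a quick way to check is to run the following command in the 'Terminal'. It will print the installed Python version if Python is available:

python --version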

First, install the Transformers framework in Python because it is necessary to achieve advanced NLP tasks. You can do that by running the following command in the 'Terminal':

pip install transformers

Upon running the above command, your system will download and install all the files essential for the proper functioning of the Transformers library.

Once the system has downloaded and installed all the files related to the 'Transformers' library, the next step is to import the necessary classes from it. Since text summarization is an NLP task, we'll import the 'pipeline' class because it provides a high-level interface for NLP tasks.

And since we want to use the BART Large CNN model, we'll also have to import its classes. So, type the following Python code:

from transformers import pipeline, BartForConditionalGeneration, BartTokenizer


Note: If you haven't installed the necessary packages, the above line of code will give an error message when executed. So, run the 'pip install transformers torch numpy pandas' command in the 'Terminal' before trying to import the classes from the installed library.

After importing the classes, it's time to load the pre-trained BART model and tokenizer from the Hugging Face model hub. For that, we'll use the 'from_pretrained()' method.

Here is what the complete code for this step looks like:


model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

 

In the next step, we'll input the main text we want to summarize. For this example, we've selected the following paragraph about machine learning, and the code below assigns the complete paragraph to a variable in Python:

input_text = """
The field of machine learning has gained a lot of attention in recent years.
It is a type of artificial intelligence that allows computers to learn from data without being explicitly programmed. Machine learning algorithms can be trained on large datasets to identify patterns and make predictions or decisions.
There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
Each type has its own unique characteristics and applications. Some popular machine-learning algorithms include decision trees, random forests, support vector machines, and neural networks.
With the increasing availability of data and computing power, the field of machine learning is expected to continue growing in the coming years.
"""

It's time to tokenize the input text. For that, we'll use the pre-trained BART tokenizer. Here is what the code for this step looks like:

inputs = tokenizer(input_text, max_length=1024, truncation=True, return_tensors='pt')

 

Note: In the above step, we've specified the 'max_length' and 'truncation' parameters. These parameters ensure that if the text's length exceeds the specified limit of 1024 tokens, it is automatically truncated.

All the prerequisites are done, so it's time to generate the summary. For that, we'll use the 'generate()' method of the BART Large CNN model.

So, the following represents the code for this step:


summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)


The above line of code will generate the summary of the input. But we still need to decode the generated summary to convert the token IDs back into text.

And for that, we'll use the tokenizer's 'decode()' method. The following line represents the complete code for this step:


summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

 

Note: As you can see, we've set the 'skip_special_tokens' parameter to 'True'. This instruction removes any special tokens added by the tokenizer during encoding.

Last but not least, we'll print the generated summary to the screen to show the output. Here is the code for this step:

print(summary)

Thus, once you’ve performed all the above steps, Python will generate the following output:

The output of the above program

As you can see, the generated summary looks accurate and covers all the important key points of the input text. So that's how you can summarize text using Python and machine learning.
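As a side note, the 'pipeline' class we imported earlier offers an even shorter route to the same kind of result. The following sketch is not part of the step-by-step code above; it simply shows how a summarization pipeline could wrap the model and tokenizer in a single call, with illustrative values chosen for the 'max_length' and 'min_length' parameters:

from transformers import pipeline

# Build a summarization pipeline around the same BART Large CNN model
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

# Summarize the input text in one call; the length limits here are only examples
result = summarizer(input_text, max_length=100, min_length=30, do_sample=False)
print(result[0]['summary_text'])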

If you want to see how ML-based summarization works without writing any code, you can also try an online summarize tool. Such tools are trained in a similar way to these models and automatically summarize documents while keeping all the important points.


Limitations of BART Large CNN Model

According to the output of the above example, it is clear that the BART Large CNN model is pretty accurate when it comes to text summarization, and Transformers is among the most capable Python frameworks for this task.

However, there are some limitations associated with the BART Large CNN model. Let's check them out so you have a better idea of whether this machine learning model is an appropriate choice for you.

For starters, the BART Large CNN model is around 1.65 GB in size (as of now). So, depending on your internet connection speed, it can take a long time to download.

This machine learning model can summarize a text of at most 1024 tokens in one go, which is roughly 800 words, because both words and punctuation marks count as tokens.
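If your document is longer than that, one possible workaround (not part of the original steps above) is to split the text into smaller chunks, summarize each chunk separately, and join the partial summaries. Here is a minimal sketch, assuming the 'model' and 'tokenizer' objects from the steps above are already loaded and that 'long_text' is a hypothetical variable holding the long document:

# Encode the whole document into token IDs
token_ids = tokenizer.encode(long_text)

# Split into chunks of roughly 900 tokens so each chunk stays under the 1024-token limit
chunk_size = 900
chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

partial_summaries = []
for chunk in chunks:
    # Turn the token IDs back into text, then repeat the tokenize/generate/decode steps from above
    chunk_text = tokenizer.decode(chunk, skip_special_tokens=True)
    inputs = tokenizer(chunk_text, max_length=1024, truncation=True, return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
    partial_summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

# Join the partial summaries into one overall summary
combined_summary = ' '.join(partial_summaries)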

Although it performs incredibly well, this machine learning model is quite slow, especially when it runs on a CPU with integrated graphics.
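If you have access to a CUDA-capable GPU, one way to speed things up is to move the model and the tokenized inputs onto it. The following is a minimal sketch, assuming the 'model', 'tokenizer', and 'input_text' from the steps above and a PyTorch installation with CUDA support:

import torch

# Use a GPU if one is available; otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# The tokenized inputs must live on the same device as the model
inputs = tokenizer(input_text, max_length=1024, truncation=True, return_tensors='pt').to(device)
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)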

This model requires at least 1.5 GB of storage and RAM to perform text summarization.


Conclusion

Summarizing a text is a handy way of quickly extracting its key points. But this technique only proves helpful if you know the right way to do it.

So, if you don’t know how to summarize the text manually, you can use machine learning models and Python libraries to perform accurate text summarization.

As the above blog post indicates, Python libraries and machine learning models can quickly and accurately generate summaries while preserving the original context. So, you can use modern technology to generate text summaries and increase your productivity at work.

By Anil Singh
