Text Summarization using MindsDB and OpenAI GPT-4

Summarizing various BBC News Articles

Introduction

In today's information age, we are overwhelmed with an abundance of information that can be difficult to digest. As a result, text summarization has become increasingly important. Text summarization refers to the process of reducing a large amount of text to a shorter, more concise version while retaining the most essential information. This tutorial will explore how to use MindsDB Cloud Editor and OpenAI GPT-4 to summarize BBC News articles.

Data Setup

There are two ways to retrieve BBC news articles for this tutorial. We can either use the BBC News API to fetch the articles in JSON format or we can simply gather the articles from Kaggle in the form of a CSV file for our use.

In this tutorial, we will be using the BBC news dataset that is available on Kaggle. You can download the dataset here. The dataset consists of 2225 news articles, each with a headline and body text.
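The dataset gives a headline and a body for each article, while the table we query later has a single news_articles column. If your download keeps the two in separate columns, a small script can merge them before uploading. Below is a minimal Python sketch of that step. The column names title and content and the helper merge_columns are assumptions made for illustration, so check the header of your own file first.

```python
import csv
import io


def merge_columns(src, dst, headline_col="title", body_col="content"):
    """Write a one-column CSV (news_articles) from a headline/body CSV.

    src and dst are open text file objects. The default column names are
    assumptions; pass the names that appear in your own file's header.
    """
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["news_articles"])
    count = 0
    for row in reader:
        headline = row[headline_col].strip()
        body = row[body_col].strip()
        # Headline first, then the body on the next line.
        writer.writerow([f"{headline}\n{body}"])
        count += 1
    return count


# Tiny in-memory demonstration; with real files, open them with newline="".
sample = io.StringIO("title,content\nDollar gains,The dollar has hit a high.\n")
merged = io.StringIO()
rows_written = merge_columns(sample, merged)
```

With real files, the same function can be called on two handles opened with open(path, newline=""), one for reading and one for writing.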

Connecting the data

So, now we have a CSV file that contains the News articles. The next thing we want to do is to create a Table using this data. The following steps should create the table successfully.

  • Click on the Add button on the MindsDB Cloud Editor and then tap the Upload File button from the dropdown menu.

  • Now select the CSV file that we have created, provide a Table name in the Datasource Name field and then click on the Save and Continue button.

Once the table is created successfully, execute the first query written on the MindsDB Cloud Editor to fetch all available tables.

Query to fetch list of Tables

SHOW TABLES FROM files;

The result should contain the list of tables available.

List of Tables

Understanding the data

Let's explore the data in the table we just created. Run the second query, a SELECT statement, to fetch a few records from the table.

SELECT * FROM files.BBCNews LIMIT 10;

The first 10 records from the table BBCNews are returned as shown below.

First 10 records of the Table

The table only contains one column.

  • news_articles: This column contains the complete news article.

We will try to summarize the text in the news_articles column using the model we will be creating below.

Creating an OpenAI GPT-4 Model

We can now create an OpenAI model to extract summaries from news articles. However, gpt-3.5-turbo is used by default. So, when we write the query for model creation, we need to pass an extra parameter, model_name, to explicitly select gpt-4.

So, let's look at the SQL statement now.

CREATE MODEL MODEL_NAME
PREDICT Target_Column_Name
USING
    engine = 'name_of_the_engine',
    model_name = 'name_of_the_model',
    max_tokens = number_of_tokens,              
    prompt_template = 'Summarize the following text in complete sentences. text:{{news_articles}}';

Before we proceed, let's take a look at each of the placeholders in this statement.

  • MODEL_NAME: A name for the model we want to train.

  • Target_Column_Name: The name of the column that you want to predict i.e., summary here.

  • name_of_the_engine: The engine to be used by the model i.e., openai here.

  • name_of_the_model (optional): You can explicitly specify a GPT model to use i.e., gpt-4 here.

  • number_of_tokens (optional): The maximum number of tokens the model may generate in its response. Tokens are the sub-word units the model reads and writes, so this is not an exact word count. For a summary, it should be well below the number of tokens in the input text.

  • prompt_template: Specify a template that the model can use to interact with OpenAI GPT-4.
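To make the role of the {{news_articles}} placeholder concrete, here is a rough Python sketch of how such a template could be filled in from one row of the table. It only illustrates the idea and is not MindsDB's actual implementation. The helper render_prompt is hypothetical.

```python
import re

# Matches {{column_name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, row):
    """Replace each {{column}} placeholder with that column's value from row."""

    def substitute(match):
        column = match.group(1)
        if column not in row:
            raise KeyError(f"column {column!r} missing from input row")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Summarize the following text in complete sentences. "
    "text:{{news_articles}}"
)
prompt = render_prompt(
    template, {"news_articles": "Dollar gains on Greenspan speech"}
)
print(prompt)
```

The point to take away is that the name inside the double braces has to match a column in the input data, which is why the template refers to news_articles.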

The actual query will become like this after replacing the placeholders with appropriate values.

CREATE MODEL news_summarizer
PREDICT summary
USING
    engine = 'openai',
    model_name = 'gpt-4',
    max_tokens = 200,           -- can be any number based on the use case
    prompt_template = 'Summarize the following text in complete sentences. text:{{news_articles}}';

On successful execution, this should return the newly created model's record from the models table.

Model Creation

Status of the Model

The model might take a while to be ready for use. In the meantime, we can check its status with the query below.

SELECT status FROM models
WHERE name = 'news_summarizer';

Please note that we have to wait until the status is complete before we can start using the model.
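If you run this check from a script rather than by hand, the waiting can be automated with a small polling loop. The Python sketch below assumes a hypothetical fetch_status callable that runs the status query above and returns the status string. Treating error as the failure value is also an assumption about how a failed model is reported.

```python
import time


def wait_until_complete(fetch_status, interval=5.0, timeout=600.0,
                        sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns 'complete'.

    fetch_status is a hypothetical callable that runs the status query
    and returns the status as a string.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status == "complete":
            return status
        if status == "error":  # assumed failure value
            raise RuntimeError("model creation failed")
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Demonstration with a fake status source instead of a real query.
fake_statuses = iter(["generating", "training", "complete"])
result = wait_until_complete(
    lambda: next(fake_statuses), interval=0, sleep=lambda seconds: None
)
```

In real use, fetch_status would wrap whatever client you use to send the query, and the default interval keeps the loop from polling the server too often.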

Summarizing Texts

The model is now ready to condense long texts into short summaries. We will look at two ways to use it: first on a single text, then on a batch of texts.

Summarizing a Single Text

Here we will provide the model with a larger piece of text and then ask it to return us a short summary of it. Let's see how we can do that below.

SELECT news_articles, summary
FROM news_summarizer
WHERE news_articles = "Dollar gains on Greenspan speech
The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.
And Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time, said Robert Sinche, head of currency strategy at Bank of America in New York. He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next.
Worries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the time is ripe for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.";

So we have provided a long text to the model now. Once the query is executed successfully, we can get a summary as shown below.

Single Summarization

Summarizing a Batch of Texts

Now we will try to summarize multiple news articles at once. For this, we will feed the news_articles column from the BBCNews table to the model as input and return each article next to its summary in a single output table.

This can be done with the simple SQL query below.

SELECT input.news_articles, output.summary
FROM files.BBCNews AS input
JOIN news_summarizer AS output
LIMIT 5;

This should return a table of 5 rows with two columns i.e., news_articles and summary.

Batch Summarization
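Conceptually, the batch query applies the model to every input row and pairs each article with its summary. The Python sketch below mimics that shape with a stand-in summarizer. The functions batch_summarize and first_sentence are hypothetical, and first_sentence simply returns the first sentence instead of calling GPT-4.

```python
def batch_summarize(rows, summarize, limit=5):
    """Pair each article with its summary, like the two-column result table."""
    results = []
    for row in rows[:limit]:
        article = row["news_articles"]
        results.append({"news_articles": article, "summary": summarize(article)})
    return results


def first_sentence(text):
    # Stand-in for the model: keep only the first sentence of the text.
    return text.split(". ")[0].rstrip(".") + "."


articles = [{"news_articles": f"Headline {i}. Body text {i}."} for i in range(8)]
table = batch_summarize(articles, first_sentence)

for record in table:
    print(record["summary"])
```

The limit argument plays the role of LIMIT 5 in the query: only the first five articles are summarized, which also keeps the number of model calls small while experimenting.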

Conclusion

In this tutorial, we explored how to use MindsDB Cloud Editor and OpenAI GPT-4 to summarize BBC News articles. We started by downloading the BBC News dataset from Kaggle and uploading it as a table, and then created a model using MindsDB Cloud Editor and OpenAI GPT-4. We then checked the status of our model and used it to summarize a single article and a batch of articles.

Lastly, before you leave, don't forget to drop a like or a comment below if you had a great time reading it and feel free to share it on your socials too.

MindsDB has recently organized a Hackathon in collaboration with Hashnode.
