Text Summarization using MindsDB and OpenAI GPT-4
Summarizing various BBC News Articles
Introduction
In today's information age, we are overwhelmed with an abundance of information that can be difficult to digest. As a result, text summarization has become increasingly important. Text summarization refers to the process of reducing a large amount of text to a shorter, more concise version while retaining the most essential information. This tutorial will explore how to use MindsDB Cloud Editor and OpenAI GPT-4 to summarize BBC News articles.
Data Setup
There are two ways to retrieve BBC news articles for this tutorial. We can either use the BBC News API to fetch the articles in JSON format or we can simply gather the articles from Kaggle in the form of a CSV file for our use.
In this tutorial, we will be using the BBC news dataset that is available on Kaggle. You can download the dataset here. The dataset consists of 2225 news articles, each with a headline and body text.
Connecting the data
We now have a CSV file containing the news articles. Next, we create a table from this data by following the steps below.
- Click the Add button in the MindsDB Cloud Editor, then choose Upload File from the dropdown menu.
- Select the CSV file, enter a table name in the Datasource Name field, and click the Save and Continue button.
Once the table is created successfully, execute the first query written on the MindsDB Cloud Editor to fetch all available tables.
SHOW TABLES FROM files;
The result should contain the list of tables available.
Understanding the data
Let's explore the table we just created to find out more about the data. Execute the second query, a SELECT query, to fetch the records in the table.
SELECT * FROM files.BBCNews LIMIT 10;
This returns the first 10 records from the BBCNews table.
The table only contains one column.
- news_articles: This column contains the complete news article.
We will summarize the text in the news_articles column using the model created below.
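Before creating the model, it can help to confirm that the upload contains what we expect. The query below is a quick sanity check, assuming the files.BBCNews table from above and that aggregate functions are available when querying uploaded files in your MindsDB version.

```sql
-- Count the uploaded rows; the full Kaggle dataset described above has 2225 articles.
-- article_count is just an alias chosen for this example.
SELECT COUNT(*) AS article_count
FROM files.BBCNews;
```

If the count is far lower than expected, the CSV may have been truncated or split incorrectly during upload.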
Creating an OpenAI GPT-4 Model
We can now create an OpenAI model to extract summaries from news articles. By default, however, the model uses gpt-3.5-turbo. So, when we write the model-creation query, we need to pass an extra parameter, model_name, to select gpt-4 explicitly.
So, let's look at the SQL statement now.
CREATE MODEL MODEL_NAME
PREDICT Target_Column_Name
USING
engine = 'name_of_the_engine',
model_name = 'name_of_the_model',
max_tokens = number_of_tokens,
prompt_template = 'Summarize the following text in complete sentences. text:{{news_articles}}';
Before we proceed, let's take a look at the placeholders and parameters in this statement.
- MODEL_NAME: The name of the model we want to create.
- Target_Column_Name: The column that you want to predict, i.e., summary here.
- name_of_the_engine: The engine to be used by the model, i.e., openai here.
- name_of_the_model (optional): The GPT model to use, i.e., gpt-4 here.
- number_of_tokens (optional): The maximum number of tokens the model may generate in its response. Tokens are sub-word units rather than whole words. For a summary, this should be lower than the token count of the input text.
- prompt_template: The template used to build the request sent to OpenAI GPT-4. The {{news_articles}} placeholder is replaced with the value of the news_articles column.
After replacing the placeholders with appropriate values, the actual query looks like this.
CREATE MODEL news_summarizer
PREDICT summary
USING
engine = 'openai',
model_name = 'gpt-4',
max_tokens = 200, -- can be any number based on the use case
prompt_template = 'Summarize the following text in complete sentences. text:{{news_articles}}';
On successful execution, this should return the row for the newly created model from the models table.
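If you run MindsDB yourself rather than in the Cloud Editor, the openai engine will likely need your own OpenAI API key. The sketch below shows one way to set that up, to be run instead of (not in addition to) the query above. The engine name openai_engine and the key value are placeholders chosen for this example, and the openai_api_key parameter name is an assumption that may differ between MindsDB versions.

```sql
-- Assumption: a self-hosted MindsDB instance where no OpenAI key is supplied for you.
-- 'openai_engine' is a hypothetical name; replace the key placeholder with your own key.
CREATE ML_ENGINE openai_engine
FROM openai
USING
    openai_api_key = 'your-openai-api-key';

-- Create the model against that engine instead of the default 'openai' engine.
CREATE MODEL news_summarizer
PREDICT summary
USING
    engine = 'openai_engine',
    model_name = 'gpt-4',
    max_tokens = 200,
    prompt_template = 'Summarize the following text in complete sentences. text:{{news_articles}}';
```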
Status of the Model
The model might take a while to be ready for use. In the meantime, we can check its status with the query below.
SELECT status FROM models
WHERE name = 'news_summarizer';
Please note that we have to wait until the status is complete before we can start using the model.
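If the status never reaches complete, it helps to look at more than the status column. The queries below are a minimal sketch, assuming the model lives in the default mindsdb project and that the models table exposes an error column in your MindsDB version.

```sql
-- Show the status together with any error message recorded for the model.
SELECT name, status, error
FROM mindsdb.models
WHERE name = 'news_summarizer';

-- Inspect the stored details of the model.
DESCRIBE news_summarizer;
```

A status of error, together with a non-empty error column, usually points to a problem with the engine configuration or the parameters passed in the USING clause.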
Summarizing Texts
The model is now ready to summarize long texts/paragraphs into shorter versions making our reading lives easier. We will see two ways in which we can do this one after the other.
Summarizing a Single Text
Here we will provide the model with a larger piece of text and then ask it to return us a short summary of it. Let's see how we can do that below.
SELECT news_articles, summary
FROM news_summarizer
WHERE news_articles = "Dollar gains on Greenspan speech
The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.
And Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time, said Robert Sinche, head of currency strategy at Bank of America in New York. He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next.
Worries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the time is ripe for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.";
We have now passed a long text to the model. Once the query executes successfully, it returns the article together with its summary.
Summarizing a Batch of Texts
Now we will summarize multiple news articles at once. For this, we feed the news_articles column of the BBCNews table to the model as input and return each article alongside its summary in a single output table.
This can be done with the simple SQL query below.
SELECT input.news_articles, output.summary
FROM files.BBCNews AS input
JOIN news_summarizer AS output
LIMIT 5;
This should return a table of 5 rows with two columns, i.e., news_articles and summary.
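To reuse the batch query without retyping it, one option is to wrap it in a view. This is a sketch, assuming the default mindsdb project and that your MindsDB version supports CREATE VIEW over a join with a model. The view name news_summaries is chosen for this example.

```sql
-- Save the batch summarization query as a view (hypothetical name: news_summaries).
CREATE VIEW mindsdb.news_summaries AS (
    SELECT input.news_articles, output.summary
    FROM files.BBCNews AS input
    JOIN mindsdb.news_summarizer AS output
    LIMIT 5
);

-- Read the articles and their summaries through the view.
SELECT * FROM mindsdb.news_summaries;
```

Since a view is a saved query, each read is expected to call the model again, so keep the LIMIT small while experimenting to avoid unnecessary OpenAI usage.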
Conclusion
In this tutorial, we explored how to use MindsDB Cloud Editor and OpenAI GPT-4 to summarize BBC News articles. We started by downloading the BBC News dataset from Kaggle and uploading it as a table, then created a model using MindsDB Cloud Editor and OpenAI GPT-4. Finally, we checked the status of the model and used it to summarize both a single article and a batch of articles.