COVID-19 Outbreak Prediction using Machine Learning

Artificial Intelligence and Machine Lear... (27 Blogs) Become a Certified Professional

On day one, no one you know is sick. It feels like a normal day. But then one day, a few people you know are sick and suddenly, you see everyone is sick and it will feel like it happened so instantly. Everything looks fine until it isn’t. This is the paradox of pandemics. In this article, we shall analyse the outbreak of COVID-19 using Machine Learning.

Following is the outline of all you’re going to learn today:

What is COVID-19?
How does a Pandemic Work?
Case Study: Analysing the Outbreak of COVID-19 using Machine Learning
Conclusion

What is COVID-19?

The Problem

Corona Virus disease (COVID-19) is an infectious disease caused by a newly discovered virus, which emerged in Wuhan, China in December of 2019.

Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.

The COVID-19 virus spreads primarily through droplets of saliva or discharge from the nose when an infected person coughs or sneezes, so you might have heard caution to practice respiratory etiquette (for example, by coughing into a flexed elbow).

How does a Pandemic Work?

To understand this better let’s look a small riddle.

There’s a glass slide held under a microscope which consists of a specific germ. This germ has a property to double every day. So on the first day, there’s one, on the second day there are two, on the third day there four and the fourth day eight, and so on.

On the 60th day, the slide is full. So on which day is the slide half full?

Day 59. But of course, you knew that.

But on which day, is the slide 1% full?

Surprisingly, not until the 54th day!

What it means that the slide goes from being 1% full to 100% in less than a week and hence, displays a property called exponential growth. And this is also how a Pandemic works. The outbreak is fairly unnoticeable in the beginning, then, once it reaches a significant value, the growth to maxima is extremely quick.

But it cannot go on forever. The virus will eventually stop finding people to infect and ultimate will slow down the count. This is called logistic growth and the curve is known as a sigmoid.

Now every point in the curve will give you to running total of cases of the current day. But if you delve a little into statistics, you’ll discover that by plotting the slope of each day, you shall get the new cases per day. There are fewer new cases right at the beginning and at the end, with a sharp rise in the stages in between.As you can see, the peak of the curve may greatly overwhelm our healthcare systems, which is the amount of resources available to us for the care of affected individuals at any given point in time.

Since we can’t really help the total number of individuals affected by the pandemic, the best solution is to flatten the curve so as to bring down the total number of cases, at any given point in time, as close to the healthcare line as possible.

This spreads the duration of this whole process a little longer, but since the healthcare system can tend to the number of cases at any given point in time, the casualties are way lower.

The Solution

Social Distancing. The logic here is, the virus can’t infect bodies if it cannot find bodies to infect!

World Leaders in all affected countries announced quarantines and lock-downs to keep their folks safe and away from anything or anyone that could infect them, all big social events were postponed and all major sports leagues cancelled as well.

On March 24, the Indian Prime Minister announced that the country would go under a lock-down to combat the spread of the virus, until further notice. Infections are rapidly rising in Italy, France, Germany, Spain, the United Kingdom and the United States. It has has a massive impact on the global economy and stock markets

The outbreak of COVID-19 is developing into a major international crisis, and it’s starting to influence important aspects of daily life.

For example:

Travel: Complete lock-down no domestic or international flights are allowed in India until decided by Ministry of Civil Aviation.
Grocery stores: In highly affected areas, people are starting to stock up on essential goods leading to a shortage of essentials.

You can also take a look at the following tutorial on “COVID 19 Outbreak Prediction using Machine Learning” to get to know the subject in a way more comprehensive manner.

COVID – 19 Outbreak Prediction using Machine Learning | Edureka

This Edureka Session explores and analyses the spread and impact of the novel coronavirus pandemic which has taken the world by storm with its rapid growth.

Case Study: Analysing the Outbreak of COVID 19 using Machine Learning

Problem Statement

We need a strong model that predicts how the virus could spread across different countries and regions. The goal of this task is to build a model that predicts the spread of the virus in the next 7 days.

🔴NOTE: The model was built on a test dataset updated till April,’20. But you can access the source to these datasets at the ‘John Hopkins University Coronavirus Resource Centre’ which gets updated on a daily basis, so you can run this model for the date you prefer.

Tasks to be performed:

Analysing the present condition in India
Is this trend similar to Italy/South Korea/ Wuhan
Exploring the world wide data
Forecasting the worldwide COVID-19 cases using Prophet

Before we begin with the model, let’s first import the libraries that we need. Consider this a Step 0 if you may.

# importing the required libraries
import pandas as pd
# Visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium import plugins
# Manipulating the default plot size
plt.rcParams['figure.figsize'] = 10, 12
# Disable warnings
import warnings
warnings.filterwarnings('ignore')

In here we import a few important libraries that we shall use throughout the model. Pandas is an extremely fast and flexible data analysis and manipulation tool and allows you to allow you to store and manipulate tabular data. We also import visualisation libraries such as matplotlib, seaborn and plotly.

And finally, we determine the default plot size and disable warnings in our module.

Part 1: Analysing the present condition in India

So, how did it actually start in India?

The first COVID-19 case was reported on 30th January 2020 when a student arrived in Kerala, India from Wuhan, China. Just in the next 2 days, Kerela reported 2 more cases. For almost a month, no new cases were reported in India, however, on 2nd March 2020, five new cases of coronavirus were reported in Kerala again and since then the cases have only been rising.

1.1 Reading the Datasets

First, we’re going to start out by reading our datasets by creating a data frame using Pandas.

# Reading the datasets
df= pd.read_excel('/content/Covid cases in India.xlsx')
df_india = df.copy()
df

basic dataframe - covid 19 machine learning - edureka

# Coordinates of India States and Union Territories
India_coord = pd.read_excel('/content/Indian Coordinates.xlsx')
#Day by day data of India, Korea, Italy and Wuhan
dbd_India = pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name='India')
dbd_Italy = pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name="Italy")
dbd_Korea = pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name="Korea")
dbd_Wuhan = pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name="Wuhan")

1.2 Analysing COVID19 Cases in India

So, here we’re going to play around with the data frame and create a new attribute called ‘Total Case’.

This attribute is the total number of confirmed cases (Indian National + Foreign National)

df.drop(['S. No.'],axis=1,inplace=True)
df['Total cases'] = df['Total Confirmed cases (Indian National)'] + df['Total Confirmed cases ( Foreign National )']
total_cases = df['Total cases'].sum()
print('Total number of confirmed COVID 2019 cases across India till date (22nd March, 2020):', total_cases)

We are also going to highlight our data according to its geographical location in India.

df.style.background_gradient(cmap='Reds')

As you might have guessed, the redder the cell the bigger the value. So, the darker cells represent a higher number of affected cases and the lighter ones show otherwise.

1.3 Number of Active COVID-19 cases in affected State/Union Territories

#Total Active  is the Total cases - (Number of death + Cured)
df['Total Active'] = df['Total cases'] - (df['Death'] + df['Cured'])
total_active = df['Total Active'].sum()
print('Total number of active COVID 2019 cases across India:', total_active)
Tot_Cases = df.groupby('Name of State / UT')['Total Active'].sum().sort_values(ascending=False).to_frame()
Tot_Cases.style.background_gradient(cmap='Reds')

1.4 Visualising the spread geographically

Next, we shall use Folium to create a zoomable map corresponding to the number of cases in different geographies.

df_full&nbsp;=&nbsp;pd.merge(India_coord,df,on='Name&nbsp;of&nbsp;State&nbsp;/&nbsp;UT')
map&nbsp;=&nbsp;folium.Map(location=[20,&nbsp;70],&nbsp;zoom_start=4,tiles='Stamenterrain')
for&nbsp;lat,&nbsp;lon,&nbsp;value,&nbsp;name&nbsp;inzip(df_full['Latitude'],&nbsp;df_full['Longitude'],&nbsp;df_full['Total&nbsp;cases'],&nbsp;df_full['Name&nbsp;of&nbsp;State&nbsp;/&nbsp;UT']):
&nbsp;&nbsp;&nbsp;&nbsp;folium.CircleMarker([lat,&nbsp;lon],&nbsp;radius=value*0.8,&nbsp;popup&nbsp;=&nbsp;('<strong>State</strong>:&nbsp;'&nbsp;+&nbsp;str(name).capitalize()&nbsp;+&nbsp;'
''<strong>Total&nbsp;Cases</strong>:&nbsp;'&nbsp;+&nbsp;str(value)&nbsp;+&nbsp;'
'),color='red',fill_color='red',fill_opacity=0.3&nbsp;).add_to(map)
map

1.5 Confirmed vs Recovered figures

Next, we are going to use Seaborn for visualization.

f, ax = plt.subplots(figsize=(12, 8))
data = df_full[['Name of State / UT','Total cases','Cured','Death']]
data.sort_values('Total cases',ascending=False,inplace=True)
sns.set_color_codes("pastel")
sns.barplot(x="Total cases", y="Name of State / UT", data=data,label="Total", color="r")
sns.set_color_codes("muted")
sns.barplot(x="Cured", y="Name of State / UT", data=data, label="Cured", color="g")

# Add a legend and informative axis label
ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(xlim=(0, 35), ylabel="",xlabel="Cases")
sns.despine(left=True, bottom=True)

1.6 The Rise of the Coronavirus cases

Next, you’re going to use Plotly to obtain graphs depicting the trends of the rise of coronavirus cases across India.

#This cell's code is required when you are working with plotly on colab
import plotly
plotly.io.renderers.default = 'colab'

# Rise of COVID-19 cases in India
fig = go.Figure()
fig.add_trace(go.Scatter(x=dbd_India['Date'], y = dbd_India['Total Cases'], mode='lines+markers',name='Total Cases'))
fig.update_layout(title_text='Trend of Coronavirus Cases in India (Cumulative cases)',plot_bgcolor='rgb(230, 230, 230)')
fig.show()

import plotly.express as px
fig = px.bar(dbd_India, x="Date", y="New Cases", barmode='group', height=400)
fig.update_layout(title_text='Coronavirus Cases in India on daily basis',plot_bgcolor='rgb(230, 230, 230)')
fig.show()

Part 2: Is the trend Similar to Italy, Wuhan & South Korea?

At this point, India had already crossed 500 cases. It still is very important to contain the situation in the coming days. The numbers of coronavirus patients had started doubling after many countries hit the 100 marks, and almost starting increasing exponentially.

2.1 Cumulative cases in India, Italy, S.Korea, and Wuhan

# import plotly.express as px
fig = px.bar(dbd_India, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,
             title='Confirmed Cases in India', color_discrete_sequence = px.colors.cyclical.IceFire)
'''Colour Scale for plotly
https://plot.ly/python/builtin-colorscales/
'''
fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')
fig.show()
fig = px.bar(dbd_Italy, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,
             title='Confirmed Cases in Italy', color_discrete_sequence = px.colors.cyclical.IceFire)
fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')
fig.show()
fig = px.bar(dbd_Korea, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,
             title='Confirmed Cases in South Korea', color_discrete_sequence = px.colors.cyclical.IceFire)
fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')
fig.show()
fig = px.bar(dbd_Wuhan, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,
             title='Confirmed Cases in Wuhan', color_discrete_sequence = px.colors.cyclical.IceFire)
fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')
fig.show()

From the visualization above, one can infer the following:

Confirmed cases in India is rising exponentially with no fixed pattern (Very less test in India)
Confirmed cases in Italy is rising exponentially with a certain fixed pattern
Confirmed cases in S.Korea is rising gradually
There have been almost a negligible number confirmed cases in Wuhan a week.

2.2 Comparison between the rise of cases in Wuhan, S.Korea, Italy and India

# import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{}, {}],
           [{"colspan": 2}, None]],
    subplot_titles=("S.Korea","Italy", "India","Wuhan"))
fig.add_trace(go.Bar(x=dbd_Korea['Date'], y=dbd_Korea['Total Cases'],
                    marker=dict(color=dbd_Korea['Total Cases'], coloraxis="coloraxis")),1, 1)
fig.add_trace(go.Bar(x=dbd_Italy['Date'], y=dbd_Italy['Total Cases'],
                    marker=dict(color=dbd_Italy['Total Cases'], coloraxis="coloraxis")),1, 2)
fig.add_trace(go.Bar(x=dbd_India['Date'], y=dbd_India['Total Cases'],
                    marker=dict(color=dbd_India['Total Cases'], coloraxis="coloraxis")),2, 1)
# fig.add_trace(go.Bar(x=dbd_Wuhan['Date'], y=dbd_Wuhan['Total Cases'],
#                     marker=dict(color=dbd_Wuhan['Total Cases'], coloraxis="coloraxis")),2, 2)
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False,title_text="Total Confirmed cases(Cumulative)")
fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')
fig.show()

2.3 Trend after crossing 100 cases

# import plotly.graph_objects as go
title = 'Main Source for News'
labels = ['S.Korea', 'Italy', 'India']
colors = ['rgb(122,128,0)', 'rgb(255,0,0)', 'rgb(49,130,189)']
mode_size = [10, 10, 12]
line_size = [1, 1, 8]
fig = go.Figure()
fig.add_trace(go.Scatter(x=dbd_Korea['Days after surpassing 100 cases'], 
                 y=dbd_Korea['Total Cases'],mode='lines',
                 name=labels[0],
                 line=dict(color=colors[0], width=line_size[0]),            
                 connectgaps=True))
fig.add_trace(go.Scatter(x=dbd_Italy['Days after surpassing 100 cases'], 
                 y=dbd_Italy['Total Cases'],mode='lines',
                 name=labels[1],
                 line=dict(color=colors[1], width=line_size[1]),            
                 connectgaps=True))
fig.add_trace(go.Scatter(x=dbd_India['Days after surpassing 100 cases'], 
                 y=dbd_India['Total Cases'],mode='lines',
                 name=labels[2],
                 line=dict(color=colors[2], width=line_size[2]),            
                 connectgaps=True))
annotations = []
annotations.append(dict(xref='paper', yref='paper', x=0.5, y=-0.1,
                              xanchor='center', yanchor='top',
                              text='Days after crossing 100 cases ',
                              font=dict(family='Arial',
                                        size=12,
                                        color='rgb(150,150,150)'),
                              showarrow=False))
fig.update_layout(annotations=annotations,plot_bgcolor='white',yaxis_title='Cumulative cases')
fig.show()

Part 3: Exploring Worldwide Data

The following code will give you tabular data about the location and status of confirmed cases by date.

df = pd.read_csv('/content/covid_19_clean_complete.csv',parse_dates=['Date'])
df.rename(columns={'ObservationDate':'Date', 'Country/Region':'Country'}, inplace=True)
df_confirmed = pd.read_csv("/content/time_series_covid19_confirmed_global.csv")
df_recovered = pd.read_csv("/content/time_series_covid19_recovered_global.csv")
df_deaths = pd.read_csv("/content/time_series_covid19_deaths_global.csv")
df_confirmed.rename(columns={'Country/Region':'Country'}, inplace=True)
df_recovered.rename(columns={'Country/Region':'Country'}, inplace=True)
df_deaths.rename(columns={'Country/Region':'Country'}, inplace=True)
df_deaths.head()

df2 = df.groupby(["Date", "Country", "Province/State"])[['Date', 'Province/State', 'Country', 'Confirmed', 'Deaths', 'Recovered']].sum().reset_index()
df2.head()

#Overall worldwide Confirmed/ Deaths/ Recovered cases 
df.groupby('Date').sum().head()

Visualizing: Worldwide COVID-19 cases

confirmed = df.groupby('Date').sum()['Confirmed'].reset_index()
deaths = df.groupby('Date').sum()['Deaths'].reset_index()
recovered = df.groupby('Date').sum()['Recovered'].reset_index()

fig = go.Figure()
#Plotting datewise confirmed cases
fig.add_trace(go.Scatter(x=confirmed['Date'], y=confirmed['Confirmed'], mode='lines+markers', name='Confirmed',line=dict(color='blue', width=2)))
fig.add_trace(go.Scatter(x=deaths['Date'], y=deaths['Deaths'], mode='lines+markers', name='Deaths', line=dict(color='Red', width=2)))
fig.add_trace(go.Scatter(x=recovered['Date'], y=recovered['Recovered'], mode='lines+markers', name='Recovered', line=dict(color='Green', width=2)))
fig.update_layout(title='Worldwide NCOVID-19 Cases', xaxis_tickfont_size=14,yaxis=dict(title='Number of Cases'))
fig.show()

Part 4: Forecasting Total Number of Cases Worldwide

In this segment, we’re going to generate a week ahead forecast of confirmed cases of COVID-19 using Prophet, with specific prediction intervals by creating a base model both with and without tweaking of seasonality-related parameters and additional regressors.

What is Prophet?

Prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.

We use Prophet, a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

Why Prophet?

Accurate and fast: Prophet is used in many applications across Facebook for producing reliable forecasts for planning and goal setting. Facebook finds it to perform better than any other approach in the majority of cases. It fit models in Stan, so that you get forecasts in just a few seconds.
Fully automatic: Get a reasonable forecast on messy data with no manual effort. Prophet is robust to outliers, missing data, and dramatic changes in your time series.
Tunable forecasts: The Prophet procedure includes many possibilities for users to tweak and adjust forecasts. You can use human-interpretable parameters to improve your forecast by adding your domain knowledge
Available in R or Python: Facebook has implemented the Prophet procedure in R and Python. Both of them share the same underlying Stan code for fitting. You can use whatever language you’re comfortable with to get forecasts.

from fbprophet import Prophet
confirmed = df.groupby('Date').sum()['Confirmed'].reset_index()
deaths = df.groupby('Date').sum()['Deaths'].reset_index()
recovered = df.groupby('Date').sum()['Recovered'].reset_index()

The input to Prophet is always a data frame with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric and represents the measurement we wish to forecast.

confirmed.columns = ['ds','y']
#confirmed['ds'] = confirmed['ds'].dt.date
confirmed['ds'] = pd.to_datetime(confirmed['ds'])
confirmed.tail()

4.1 Forecasting Confirmed COVID-19 Cases Worldwide with Prophet (Base model)

Generating a week ahead forecast of confirmed cases of COVID-19 using Prophet, with a 95% prediction interval by creating a base model with no tweaking of seasonality-related parameters and additional regressors.

m = Prophet(interval_width=0.95) 
m.fit(confirmed) 
future = m.make_future_dataframe(periods=7) 
future.tail()

The predict method will assign each row in future a predicted value which it names yhat. If you pass on historical dates, it will provide an in-sample fit. The forecast object here is a new data-frame that includes a column yhat with the forecast, as well as columns for components and uncertainty intervals.

#predicting the future with date, and upper and lower limit of y value
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

You can plot the forecast by calling the Prophet.plot method and passing in your forecast data frame.

confirmed_forecast_plot = m.plot(forecast)

confirmed_forecast_plot =m.plot_components(forecast)

4.2 Forecasting Worldwide Deaths using Prophet (Base model)

Generating a week ahead forecast of confirmed cases of COVID-19 using the Machine Learning library – Prophet, with 95% prediction interval by creating a base model with no tweaking of seasonality-related parameters and additional regressors.

deaths.columns = ['ds','y']
deaths['ds'] = pd.to_datetime(deaths['ds'])
m = Prophet(interval_width=0.95)
m.fit(deaths)
future = m.make_future_dataframe(periods=7)
future.tail()

forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

deaths_forecast_plot = m.plot(forecast)

deaths_forecast_plot = m.plot_components(forecast)

4.3 Forecasting Worldwide Recovered Cases with Prophet (Base model)

Generating a week ahead forecast of confirmed cases of COVID-19 using Prophet, with 95% prediction interval by creating a base model with no tweaking of seasonality-related parameters and additional regressors.

recovered.columns = ['ds','y']
recovered['ds'] = pd.to_datetime(recovered['ds'])
m = Prophet(interval_width=0.95)
m.fit(recovered)
future = m.make_future_dataframe(periods=7)
future.tail()


forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()


recovered_forecast_plot = m.plot(forecast)

recovered_forecast_plot = m.plot_components(forecast)

Conclusion

This is a humble request to all our learners.

Don’t take your cough and cold lightly as you would. If you look at the data, the number of cases in India is rising just like in Italy, Wuhan, S.Korea, Spain, or the USA. We have crossed 100,000 cases already. Don’t let lower awareness and fewer test numbers ruin the health of our world.

Currently, India is a deadly and risky zone as there are very few COVID-19 test centers available. Imagine how many infected people are still around you and are infecting others unknowingly.

Let’s give a hand in fighting this pandemic at least by quarantining ourselves by staying indoors and protecting ourselves and others around us.

Take precautions, stay indoors, and utilize this time to develop your Machine Learning skillset and maybe you’ll be the one to help the world with your Machine Learning models.

Ready to unlock the power of data? Dive into Python for Data Science today! Whether you’re analyzing trends, building predictive models, or extracting valuable insights, Python’s versatile libraries like NumPy, pandas, and scikit-learn empower you to conquer any data challenge

Also, If you’re interested in gaining expertise, Machine Learning Certification Training program can help you acquire the necessary skills and knowledge.

Got a question for us? Please mention them in the comments section and we will get back to you.