Hi everyone, I hope you haven't been bored by the articles I've published previously. This time, I'll discuss data science again—more precisely, one of the essential steps that data scientists and data analysts must go through: Exploratory Data Analysis (EDA). I'll explain it through a case study on COVID-19 using real data I obtained from Kaggle.
Introduction
In this article, I'll explain how to conduct exploratory data analysis on COVID-19 cases across every province in Indonesia. Before diving into practice, I'll first explain what EDA is so you understand what we'll be practicing in this article.
EDA is an essential step in any research analysis. The primary aim with exploratory analysis is to examine the data for distribution, outliers and anomalies to direct specific testing of your hypothesis. It also provides tools for hypothesis generation by visualizing and understanding the data usually through graphical representation¹.
There are many EDA techniques, and each one is used depending on the problem at hand. For example, if we have continuous univariate data, we can use visualization techniques such as line plots and histograms; if the data is categorical, we can use descriptive statistics, and so on. Of course, this article won't implement all of these EDA techniques—perhaps I'll explain them another time.
Practice
Alright, I think that's enough theory—it's time to practice.
For the code, you can check my GitHub repository. The data used here contains COVID-19 information by province in Indonesia, which you can download from this link. The dataset includes the following variables: province_name, island, iso_code, capital_city, population, population_per_km, confirmed, deceased, released, longitude, and latitude.
First, we need to understand information about our data, such as what variables it contains, what data type each variable is, and whether there are any missing values. This information is crucial before conducting EDA, as it prevents misinterpretation of the results from the techniques we'll implement.
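As a minimal sketch of this first inspection step, assuming pandas is used: the rows below are made-up placeholders that only mimic a few of the column names listed above (they are not the real Kaggle figures), and in practice you would pass the path of the downloaded file to pd.read_csv instead of the inline text.

```python
import io

import pandas as pd

# Made-up placeholder rows that only mimic the column layout described above.
# They are NOT the real Kaggle figures; with the real file you would call
# pd.read_csv("<path to the downloaded CSV>") instead.
csv_text = """province_name,island,population,confirmed,deceased,released
Province A,Island X,1000000,50,5,3
Province B,Island X,2500000,120,10,8
Province C,Island Y,,30,2,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Column names, data type of each column, and non-null counts.
df.info()

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)
```

In this toy frame, population and released each report one missing value, which is the kind of gap worth knowing about before computing any statistics.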
Note that the object data type can be interpreted as a string. There are several cases we can explore in this dataset, such as confirmed COVID-19 patients (confirmed), recovered patients (released), and deceased patients (deceased). It's also important to obtain descriptive statistics—such as mean, median, quantiles, and so on—for every numerical variable we have.
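The descriptive statistics mentioned above could be obtained roughly as follows. This is a sketch assuming pandas; the numbers are hypothetical and chosen only so the summary is easy to verify by hand.

```python
import pandas as pd

# Hypothetical values for illustration only, not the real dataset.
df = pd.DataFrame({
    "province_name": ["Province A", "Province B", "Province C", "Province D"],
    "confirmed": [10, 20, 30, 40],
    "deceased": [1, 2, 3, 4],
})

# describe() reports count, mean, std, min, quartiles (25%, 50%, 75%) and max
# for each selected numerical column; the 50% row is the median.
summary = df[["confirmed", "deceased"]].describe()
print(summary)
```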
Next, I'll check if there are outliers in the confirmed variable using a box plot.
As the box plot shows, there is one outlier in the confirmed variable: Jakarta, which has the highest number of confirmed COVID-19 cases at 598. This is far higher than in the other provinces, which is plausible given that Jakarta is Indonesia's capital city and a major entry point for tourists. This could warrant further investigation, but let's explore something else.
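A box plot usually flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that common rule numerically, assuming pandas; the plotting library behind the original figure is not stated, and the values and province labels here are hypothetical.

```python
import pandas as pd

# Hypothetical confirmed counts, not the real dataset.
confirmed = pd.Series(
    [10, 12, 14, 15, 16, 18, 20, 100],
    index=[f"Province {c}" for c in "ABCDEFGH"],
    name="confirmed",
)

# Quartiles and interquartile range.
q1 = confirmed.quantile(0.25)
q3 = confirmed.quantile(0.75)
iqr = q3 - q1

# Common box-plot convention: fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = confirmed[(confirmed < lower) | (confirmed > upper)]
print(outliers)
```

Listing the flagged rows this way makes it easy to see which province sits behind a point on the plot.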
One of the goals of EDA is to check hypotheses we have about the data. One method that can be used for this is the Spearman correlation, which measures the strength and direction of the monotonic relationship between two variables by comparing their ranks rather than their raw values.
The relationship between the confirmed variable and population is quite strong compared to other variables, at 0.62. Additionally, the relationship between the two variables is positive, meaning that as a province's population increases, so does the number of confirmed COVID-19 cases.
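To make the method concrete, here is a small sketch assuming pandas. It computes the Spearman correlation directly and then again as a Pearson correlation on the ranks, which is how the coefficient is defined. The five rows are hypothetical and will not reproduce the 0.62 reported above.

```python
import pandas as pd

# Hypothetical values for illustration only, not the real dataset.
df = pd.DataFrame({
    "population": [1.0, 2.0, 3.0, 4.0, 5.0],
    "confirmed": [10, 40, 30, 90, 200],
})

# Spearman correlation matrix for the numerical columns.
rho_matrix = df.corr(method="spearman")
rho = rho_matrix.loc["confirmed", "population"]

# Equivalent view: Pearson correlation computed on the ranks.
rho_ranks = df["population"].rank().corr(df["confirmed"].rank())

print(rho_matrix)
print(rho, rho_ranks)
```

Because only the ranks matter, a single very large value such as Jakarta's count influences the Spearman coefficient less than it would a Pearson coefficient.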
To understand how COVID-19 was handled in DKI Jakarta, I'll visualize the proportion of COVID-19 cases—confirmed, recovered, and deceased—and compare it to West Java (Jawa Barat), the province with the second-highest number of confirmed cases.
Percentage of each case type in DKI Jakarta
Percentage of each case type in Jawa Barat
In Jakarta, 8.53% of affected people died and 5.18% recovered. In West Java, 14.3% died and 5.1% recovered. The recovery share is almost the same in both provinces, but the death share is clearly lower in Jakarta, so we can hypothesize that DKI Jakarta handled COVID-19 cases better than West Java. This hypothesis could be tested with further analysis, though I won't conduct that test in this article.
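The article does not show how these percentages were computed, so the following is only one plausible sketch, assuming pandas and assuming each share is taken relative to the confirmed count. The two rows are hypothetical and do not reproduce the figures quoted above.

```python
import pandas as pd

# Hypothetical counts for two placeholder provinces, not the real dataset.
df = pd.DataFrame({
    "province_name": ["Province A", "Province B"],
    "confirmed": [200, 50],
    "deceased": [20, 5],
    "released": [10, 10],
}).set_index("province_name")

# Assumption: shares are expressed as a percentage of confirmed cases.
pct = (
    df[["deceased", "released"]]
    .mul(100)
    .div(df["confirmed"], axis=0)
    .round(2)
)
print(pct)
```

If the original charts instead split the sum of confirmed, deceased, and released into three slices, the denominator would change accordingly.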
Conclusions
From the EDA techniques we applied, we gained several insights. First, Jakarta is the province with the most confirmed COVID-19 cases (598), approximately 500 more than West Java, which has the second-highest count. Second, confirmed cases are positively correlated with population (Spearman correlation of 0.62). Third, regarding COVID-19 case management in DKI Jakarta and West Java, we can hypothesize that DKI Jakarta handled COVID-19 cases more effectively than West Java, a hypothesis that could be tested further.