Content
This unit covers statistics at an Algebra 1 level. The three main mathematical standards covered in this unit are:
- Represent data with plots on the real number line (histograms and box plots). HSS-ID.A.1
- Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets. HSS-ID.A.2
- Interpret differences in shape, center, and spread in the context of the datasets, accounting for possible effects of extreme data points (outliers). HSS-ID.A.31
The standards will be covered through teaching students about both the 1918 and 2020 Pandemics and looking at the datasets associated with both.
Influenza and COVID:
The first human coronavirus was found by June Dalziel Almeida in 1966. There are approximately 100 known strains of coronaviruses, but only seven types are known to cause disease in humans. Four of them (HCoV-HKU1, HCoV-OC43, HCoV-NL63, and HCoV-229E) are endemic and cause the common cold. The other three have more severe and variable symptoms: SARS-CoV-1 (SARS), MERS-CoV (MERS), and SARS-CoV-2 (COVID-19)2. There are four distinct genera of Influenza viruses: A, B, C, and D. Influenza A and B cause the most illness during flu season. Influenza A is the only type that has caused influenza pandemics. Influenza C tends to have only mild effects on humans, and Influenza D is only found in cattle3.
1918 H1N1 Pandemic
The 1918 Pandemic occurred towards the end of WW1. It is often called the “Spanish Flu” because Spain was one of the only western countries reporting the illness. Spain did not fight in WW1, while the countries involved in the war did not want to acknowledge an illness that would affect the war effort. Where the Pandemic of 1918-1919 started is not known. It is theorized that it may have started in East Asia, since that is where later strains of influenza began. However, one of the first reported cases in the world came from Haskell County in Kansas, from where it spread quickly to U.S. mobilization camps and then to France. The rapid spread of the virus was most likely accelerated by the close quarters of soldiers during WW1 and by soldiers returning home after the war4.
The pandemic was caused by influenza strain H1N1, a subtype of Influenza A5. Even though we don’t know the exact origins of the virus, it likely originated in birds, which then infected a mammal that in turn infected humans. Pigs are one possible intermediate animal, since the first reported cases in Kansas were near a pig farm6. The pandemic had three distinct waves, meaning three periods in which cases increased, peaked, and then declined. Illness in the first wave was similar to the seasonal flu. The second and third waves had more severe respiratory symptoms, and many people died of a secondary bacterial infection. The estimated number of deaths from this pandemic is between 20 and 50 million7, but some sources say that number could be as high as 100 million8. Over half a million people died in the US, and the pandemic devastated many small villages. In the village of Brevig Mission in Alaska, which had a population of 80, a total of 72 residents died between November 15 and 20 of 1918. Over 50 years later, Brevig Mission would play a key role in allowing scientists to identify the type of virus, by extracting lung tissue from a deceased villager whose body had been preserved by burial in permafrost9.
The mortality rate by age for this pandemic was unique. Most influenza outbreaks have mortality curves that are “U” shaped, meaning most deaths occur in either the very young or the very old. This outbreak instead had a “W” shape, where the highest mortality rate was among 20–40-year-olds10. The exact reason for the high mortality rate in this age group is unknown; however, some theorize that it could have been due to a hyperactive immune response, in that a strong immune system may have left people more susceptible to bacterial pneumonia11.
2019 SARS-CoV-2 Pandemic
The 2019 Pandemic (COVID-19) had its first reported cases in December of 2019, and on January 10th, 2020, SARS-CoV-2 was named. The origin location was Wuhan, China. The virus likely originated in bats and then spread to another mammal; however, the exact mammal that passed it to humans is currently unknown12. COVID-19 is ongoing, and according to the World Health Organization there were over 7 million deaths associated with COVID-19 as of July 202513. The mortality curve of this pandemic was also different from the “U” shaped curve of a typical influenza outbreak, which impacts the young and old. Most COVID-19 deaths occurred in people over the age of 70, creating more of a “J” shape. The variability of the mortality rates also increased for age groups over 6014.
Comparing the Two Pandemics
When quantitatively comparing mortality in the 1918 and the 2020 Pandemics, it is important to note some key differences. The 1918 Pandemic spread rapidly, with no health interventions in place and no vaccine available, all of which contributed to a much higher mortality rate than COVID-1915. The 1918 Pandemic also had many deaths due to secondary bacterial pneumonia, whereas COVID-19 did not16. If the 2020 COVID-19 Pandemic had had the same death rate as the 1918 H1N1 Pandemic, the death toll would have been between 100 and 500 million deaths17.
Dataset Descriptions
Dataset 1: Changes in life expectancy by year (1917-1920) for 183 Countries
This dataset compares the life expectancy of 183 countries during the years 1917-1920. Life expectancy is the average number of years a newborn is expected to live if the current pattern of mortality continues throughout their life. It is important to note that this dataset accounts for border changes and applies the estimates to modern country borders as of 2020. The dataset shows how different countries were disproportionately affected by the 1918 Pandemic. The years represented cover the end of WW1 and the effect the pandemic had as soldiers returned home after the war, especially in more vulnerable countries. It is important to note the country of Samoa, where almost one quarter of newborns in 1918 died within 2 months, bringing life expectancy that year to just 1 year. This was most likely due to a passenger ship arriving from New Zealand in November of 1918. Other countries with notably low life expectancies were present-day Afghanistan, Congo, Fiji, Guatemala, Kenya, Micronesia, Serbia, Tonga and Uganda18. Students will use this dataset during Activity 6 to identify outliers using statistical software.
Dataset 2: Changes in life expectancy by year (1917-1919) for 30 Countries
This is a subset of Dataset 1. It allows students to determine the mean and median manually, as well as create histograms and boxplots, all without the need for statistical software. This dataset will be used in Activities 2 and 3 to show differences in means and medians, as well as graphical differences between boxplots and histograms, for the years 1917-19.
Dataset 3: US Mortality rates by year (1915-1919)
This dataset represents influenza mortality rates (per 100,000 people) in select US states from 1915 to 1919, a period that includes the 1918 Pandemic (Spanish Flu). Because of the high incidence of people dying from bacterial pneumonia after getting influenza, the mortality rates in this dataset include both influenza and pneumonia. It is important to note that Pennsylvania had the highest mortality rate. A rate of 880 fatalities per 100,000 people means that about 0.9 percent of the state’s population died from the 1918 pandemic. Also note the drastic increase in influenza deaths in 1918 in both California and Pennsylvania. In 1919 mortality rates decreased, but no state’s rate returned to its pre-pandemic level, and some states (California, South Carolina, and Washington) had influenza rates in 1919 that were double those in 191519. Students will use this dataset in Activities 2 and 3, as it also allows them to calculate means and medians and create boxplots and histograms without the need for statistical software. This dataset will also be used in Activity 5 to show variability through the standard deviation (SD), as the data is approximately normally distributed. This dataset also includes an outlier for 1919 and will be used in Activity 6. For Activities 5 and 6 statistical software is recommended.
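The conversion from a rate per 100,000 people to a percentage can be checked with a short calculation. The following is a minimal Python sketch; the function name `rate_to_percent` is illustrative and not part of the dataset or any activity.

```python
def rate_to_percent(deaths_per_100k):
    """Convert a mortality rate per 100,000 people into a percentage."""
    return deaths_per_100k / 100_000 * 100

# Pennsylvania's rate quoted above: 880 deaths per 100,000 people
percent = rate_to_percent(880)
print(round(percent, 1))  # 0.9
```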
Dataset 4: Excess Mortality by age group across thirteen selected countries (1918-1920)
This dataset is a chart showing the median excess mortality by age and sex for the 1918 pandemic. Note the unique distribution, in which the highest mortality rate for both sexes is between 25-29 years, unlike any previous influenza pandemic. The overall distribution for both sexes, when graphed, creates a “W” shape20. This dataset will be useful in Activity 4 to demonstrate skewness, as it is skewed right. Students will use statistical software (like GeoGebra) to determine the skewness of the data.
Dataset 5: Number of coronavirus (COVID-19) deaths in the U.S. as of June 14, 2023, by Age
This dataset represents the number of COVID-19 deaths by age in the US from the beginning of January 2020 until June 14, 2023. It is important to note that the dataset counts only deaths coded as COVID-19 deaths and does not include all deaths. Also, there is some room for error in this data due to the lag between a death and its reporting via a death certificate, so some of the actual numbers may be higher. This dataset will be useful in Activity 4 to demonstrate skewness, as it is skewed left. Students will again use statistical software (like GeoGebra) to determine the skewness of the data21.
Dataset 6: Mortality rates by sex and age per 1,000 of the US (not Hawaii)
This dataset includes deaths from both influenza and pneumonia per 1,000 people in the US (not including Hawaii), separated by age and sex for 1918 and 191922 23. The dataset will be used in Activity 5, as it is a good dataset for showing the variability of data using the IQR, since the data is skewed. Statistical software is recommended when analyzing this dataset.
Math Concepts
To understand the statistics behind the pandemic-related datasets and to draw robust conclusions from them, the following math topics will need to be emphasized.
Mean
The mean is one way to measure the center of a dataset and is best used if the data is normally distributed. For example, when looking at the following dataset:
2, 5, 8, 15
The mean of those numbers is: 7.5.
This value is obtained by taking the sum of the numbers and then dividing by how many numbers are in the dataset.
In this example the sum of 2, 5, 8, 15 is 30 and 30 divided by 4 is 7.5.
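The same calculation can be sketched in Python. This is only an illustration of the arithmetic above; the variable names are chosen for this example.

```python
from statistics import mean

data = [2, 5, 8, 15]

# Sum the values, then divide by how many values there are
manual_mean = sum(data) / len(data)

print(manual_mean)  # 7.5
print(mean(data))   # 7.5, the standard library gives the same result
```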
Median
The median is another way to measure the center of a dataset. It is useful if the dataset is not normally distributed. It is the middle number in the dataset when the values are listed in order, usually from smallest to largest. For example, when looking at the following dataset:
2, 8, 9, 11, 21
The median is 9, since there are 2 values smaller than 9 and 2 values larger than 9 in the set. For this example:
2, 8, 9, 11, 16, 21
The dataset has an even number of values, so you take the average of the two middle numbers, 9 and 11, which is 10. The median of this dataset is 10.
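Both cases, an odd and an even number of values, can be sketched in a few lines of Python. The helper name `manual_median` is hypothetical and simply mirrors the steps described above.

```python
def manual_median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)          # list the values from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

odd = [2, 8, 9, 11, 21]
even = [2, 8, 9, 11, 16, 21]

print(manual_median(odd))   # 9
print(manual_median(even))  # 10.0
```

Python's built-in `statistics.median` function returns the same values for these two lists.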
Histogram
A histogram is one way to represent a dataset visually. It also helps in seeing data distributions. Data values are grouped into ranges that cover the number line continuously, with no gaps between them. For example, if the first bar in the histogram ranges from 0-5, then the next bar would range from 5-10. The bars all have the same width, and each bar begins where the previous one ends. If a value falls on the edge of a bar (e.g. the data value 5), it is included in the higher bar (e.g. 5-10). The height of a bar shows how many data values are in that group. For example, using the dataset from before, the histogram of the data would look like the example in figure 1a. Histograms are often accompanied by a frequency table, like the one in figure 1b for the same dataset.
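The grouping rule described here, equal-width bars with an edge value counted in the higher bar, can be sketched in Python as follows. The function name `histogram_counts` and the sample values are assumptions made for illustration; they are not the dataset shown in figure 1.

```python
from collections import Counter

def histogram_counts(values, width):
    """Count how many values fall in each bar [start, start + width)."""
    counts = Counter()
    for v in values:
        start = (v // width) * width  # left edge of the bar containing v
        counts[start] += 1
    return dict(sorted(counts.items()))

data = [2, 5, 8, 9, 11, 16, 21]  # illustrative values only

# Print a simple frequency table: the value 5 is counted in the 5-10 bar
for start, count in histogram_counts(data, 5).items():
    print(f"{start}-{start + 5}: {count}")
```

For these values the 5-10 bar has a height of 3 (the values 5, 8, and 9), and every other bar has a height of 1.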
Boxplot
A boxplot is another way to represent a dataset visually. Boxplots are built around the median of the data, so they are particularly useful when looking at skewed data. A boxplot is also a good visual for seeing data spread and outliers. Using the same dataset, the boxplot of the data would look like the example in figure 1c. The boxplot is created by dividing the data into four sections. The sides of the box represent the first and third quartiles, which are the medians of the lower and upper halves of the data. A line inside the box represents the median. Lines outside the box connect to the minimum and maximum values. For the example in figure 1c, the boxplot shows a dataset with a minimum of 1 and a maximum of 21. The median is 14, the first quartile (Q1) is 5.5, and the third quartile (Q3) is 14.
Figure 1

Figure 1a is a histogram, 1b is a frequency table and 1c is a boxplot. All graphics were created using www.geogebra.org
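The five values a boxplot displays (minimum, Q1, median, Q3, maximum) can be sketched in Python using the "median of each half" rule described above. This sketch assumes that when the dataset has an odd number of values, the median itself is left out of both halves; statistical software may use a slightly different quartile rule. The function name and sample values are illustrative and are not the dataset behind figure 1.

```python
from statistics import median

def five_number_summary(values):
    """Return the five values a boxplot is drawn from."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # lower half of the data
    upper = ordered[half + n % 2:]  # upper half (skips the median when n is odd)
    return {
        "min": ordered[0],
        "Q1": median(lower),
        "median": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }

data = [2, 8, 9, 11, 16, 21]  # illustrative values only
print(five_number_summary(data))
# {'min': 2, 'Q1': 8, 'median': 10.0, 'Q3': 16, 'max': 21}
```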
Interquartile-Range (IQR)
Interquartile range is used to measure how spread out a dataset is. It is used most often when a dataset is skewed rather than normally distributed, since IQR is based on the median of the data. To find the interquartile range, subtract the first quartile (Q1, found when making a boxplot) from the third quartile (Q3). For example, using the same dataset, the IQR is 8.5, found by subtracting 5.5 (Q1) from 14 (Q3).
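A short Python sketch can show both the subtraction and why the IQR suits skewed data: one extreme value changes the overall range a great deal but leaves the IQR unchanged. The `iqr` helper and the two sample lists are hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the ordered data (an assumption; software may use another quartile rule).

```python
from statistics import median

def iqr(values):
    """Interquartile range: Q3 minus Q1, using the median of each half."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])         # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])   # median of the upper half
    return q3 - q1

# Quartiles quoted in the text: Q3 = 14 and Q1 = 5.5
print(14 - 5.5)  # 8.5

typical = [2, 8, 9, 11, 16, 21]        # illustrative values only
with_extreme = [2, 8, 9, 11, 16, 210]  # same values with one extreme point

print(iqr(typical), max(typical) - min(typical))                 # 8 19
print(iqr(with_extreme), max(with_extreme) - min(with_extreme))  # 8 208
```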
Standard Deviation (SD)
Standard deviation is also used to measure how spread out a dataset is. It is used most often when the dataset is normally distributed, since SD is based on the mean of the data. There is a formula for finding the standard deviation; however, in the Algebra 1 course SD will be found using an online statistical calculator (like GeoGebra or Desmos).
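For reference, the calculation that such a calculator performs can be sketched in Python. The text does not say whether the population or the sample version of the standard deviation is intended, so this sketch shows both; statistical tools typically offer one or both. The dataset is the one from the mean example.

```python
from math import isclose, sqrt
from statistics import mean, pstdev, stdev

data = [2, 5, 8, 15]
m = mean(data)  # 7.5

# Square each value's distance from the mean
squared_deviations = [(x - m) ** 2 for x in data]

# Population SD divides by n; sample SD divides by n - 1
population_sd = sqrt(sum(squared_deviations) / len(data))
sample_sd = sqrt(sum(squared_deviations) / (len(data) - 1))

print(round(population_sd, 2))  # 4.82
print(round(sample_sd, 2))      # 5.57

# The standard library functions agree with the manual calculation
print(isclose(population_sd, pstdev(data)))  # True
print(isclose(sample_sd, stdev(data)))       # True
```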
Normal Distribution
A dataset has a normal distribution when its values are spread symmetrically around the center in a bell shape, so the mean of the data is equal (or very close) to the median. On a graphical representation this appears as a vertical line of symmetry through the center of the visual.
Skewed Distribution
A dataset has a skewed distribution when the mean does not equal the median. On a graphical representation, one side of the distribution has more data values than the other. Skewed distributions can be skewed either left or right. A left-skewed distribution is one where most of the data points are on the right side of the distribution. Visually, the data has a tail on the left side. This also means that the mean is less than the median. A right-skewed distribution is one where most of the data points are on the left side of the distribution. Visually, the data has a tail on the right side. This also means that the mean is greater than the median.
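The mean-versus-median comparison described above can be sketched in Python. The function name `skew_direction` and the three small datasets are hypothetical examples; note that this is the rule of thumb from the text, and statistical software usually reports a skewness value computed with a different formula.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify a dataset by comparing its mean with its median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"      # tail on the right pulls the mean above the median
    if m < med:
        return "left"       # tail on the left pulls the mean below the median
    return "symmetric"      # mean equals median

print(skew_direction([1, 2, 3, 4, 20]))     # right (mean 6, median 3)
print(skew_direction([1, 17, 18, 19, 20]))  # left (mean 15, median 18)
print(skew_direction([1, 2, 3, 4, 5]))      # symmetric (mean 3, median 3)
```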