Introduction
Measures of average such as the median and mean represent the typical value for a dataset. Within the dataset the actual values usually differ from one another and from the average value itself. The extent to which the median and mean are good representatives of the values in the original dataset depends upon the variability or dispersion in the original data. Datasets are said to have high dispersion when they contain values considerably higher and lower than the mean value.
Figure 1 shows the number of tutorial groups of each size in semester 1 and semester 2. In both semesters the mean and median tutorial group size is 5 students; however, the groups in semester 2 show more dispersion (variability in size) than those in semester 1.
Dispersion within a dataset can be measured or described in several ways including the range, inter-quartile range and standard deviation.
The Range
The range is the most obvious measure of dispersion and is the difference between the lowest and highest values in a dataset. In figure 1, the size of the largest semester 1 tutorial group is 6 students and the size of the smallest group is 4 students, resulting in a range of 2 (6-4). In semester 2, the largest tutorial group size is 7 students and the smallest tutorial group contains 3 students, therefore the range is 4 (7-3).
An example of the use of the range to compare spread within datasets is provided in table 1. The scores of individual students in the examination and coursework component of a module are shown.
To find the range in marks the highest and lowest values need to be found from the table. The highest coursework mark was 48 and the lowest was 27 giving a range of 21. In the examination, the highest mark was 45 and the lowest 12 producing a range of 33. This indicates that there was wider variation in the students’ performance in the examination than in the coursework for this module.
Since the range is based solely on the two most extreme values within the dataset, if one of these is either exceptionally high or low (sometimes referred to as an outlier) it will result in a range that is not typical of the variability within the dataset. For example, imagine that in the above example one student failed to hand in any coursework and was awarded a mark of zero, but sat the exam and scored 40. The range for the coursework marks would now be 48 (48-0) rather than 21. This new range is not typical of the dataset as a whole because it is distorted by the outlier in the coursework marks. To reduce the problems caused by outliers in a dataset, the inter-quartile range is often calculated instead of the range.
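The effect of a single outlier on the range can be sketched in a few lines of Python. The marks below are hypothetical (table 1 is not reproduced here); they are chosen only so that the highest and lowest coursework marks are 48 and 27, as quoted in the text. The helper name value_range is also an assumption made for illustration.

```python
def value_range(values):
    """Return the range: the highest value minus the lowest value."""
    return max(values) - min(values)

# Hypothetical coursework marks (NOT the original table 1), chosen so that
# the highest mark is 48 and the lowest is 27, as in the text.
coursework = [27, 31, 35, 38, 40, 42, 44, 48]

# The same marks plus one student who was awarded zero.
coursework_with_zero = coursework + [0]

print(value_range(coursework))            # 21
print(value_range(coursework_with_zero))  # 48
```

One extra value more than doubles the range even though the other eight marks are unchanged.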
The Inter-quartile Range
The inter-quartile range is a measure that indicates the extent to which the central 50% of values within the dataset are dispersed. It is based upon, and related to, the median.
In the same way that the median divides a dataset into two halves, it can be further divided into quarters by identifying the upper and lower quartiles. The lower quartile is found one quarter of the way along a dataset when the values have been arranged in order of magnitude; the upper quartile is found three quarters along the dataset. Therefore, the upper quartile lies half way between the median and the highest value in the dataset whilst the lower quartile lies halfway between the median and the lowest value in the dataset. The inter-quartile range is found by subtracting the lower quartile from the upper quartile.
For example, the examination marks for 20 students following a particular module are arranged in order of magnitude.
The median lies at the mid-point between the two central values (10th and 11th) = half-way between 60 and 62 = 61
The lower quartile lies at the mid-point between the 5th and 6th values = half-way between 52 and 53 = 52.5
The upper quartile lies at the mid-point between the 15th and 16th values = half-way between 70 and 71 = 70.5
The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18 whereas the range is: 80 - 43 = 37.
The inter-quartile range provides a clearer picture of the overall dataset because it ignores the outlying values at either end.
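The by-hand method above (take the median of the lower half and the median of the upper half) can be sketched in Python as follows. The list of 20 marks is hypothetical: the original table is not reproduced here, so the values were invented to agree with the figures quoted in the text (5th and 6th values 52 and 53, 10th and 11th values 60 and 62, 15th and 16th values 70 and 71, lowest 43, highest 80). The function name quartiles is an assumption for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (lower quartile, median, upper quartile).

    Uses the by-hand method: the quartiles are the medians of the lower
    and upper halves of the ordered data. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    # For an odd number of values the middle value is left out of both halves.
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical marks, invented to match the figures quoted in the text.
marks = [43, 47, 49, 50, 52, 53, 55, 57, 58, 60,
         62, 63, 65, 66, 70, 71, 73, 75, 78, 80]

lower_q, med, upper_q = quartiles(marks)
iqr = upper_q - lower_q
full_range = max(marks) - min(marks)

print(lower_q, med, upper_q)  # 52.5 61.0 70.5
print(iqr, full_range)        # 18.0 37
```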
Like the range however, the inter-quartile range is a measure of dispersion that is based upon only two values from the dataset. Statistically, the standard deviation is a more powerful measure of dispersion because it takes into account every value in the dataset. The standard deviation is explored in the next section of this guide.
Calculating the Inter-quartile range using Excel
The method Excel uses to calculate quartiles differs from the by-hand method described above and can produce unexpected results, particularly when the dataset contains only a few values. For this reason it may be better to calculate the inter-quartile range by hand.
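To see why software results can disagree with a hand calculation, the sketch below applies three quartile conventions to one small dataset. It uses Python's standard statistics module rather than Excel, and it makes no claim about which convention a particular spreadsheet version follows; the six data values are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical small dataset, already sorted, with an even number of values.
data = [2, 4, 6, 8, 10, 12]

# Convention 1: medians of the lower and upper halves (the by-hand method).
half = len(data) // 2
q1_hand, q3_hand = median(data[:half]), median(data[half:])

# Conventions 2 and 3: the two interpolation methods offered by
# statistics.quantiles (n=4 asks for quartiles).
q1_exc, _, q3_exc = quantiles(data, n=4, method="exclusive")
q1_inc, _, q3_inc = quantiles(data, n=4, method="inclusive")

print("by hand:  ", q1_hand, q3_hand, "IQR =", q3_hand - q1_hand)  # 4 10 IQR = 6
print("exclusive:", q1_exc, q3_exc, "IQR =", q3_exc - q1_exc)      # 3.5 10.5 IQR = 7.0
print("inclusive:", q1_inc, q3_inc, "IQR =", q3_inc - q1_inc)      # 4.5 9.5 IQR = 5.0
```

With only six values the three conventions give three different inter-quartile ranges; the differences shrink as the dataset grows.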
The Standard Deviation
The standard deviation is a measure that summarises the amount by which every value within a dataset varies from the mean. Effectively it indicates how tightly the values in the dataset are bunched around the mean value. It is the most widely used measure of dispersion since, unlike the range and inter-quartile range, it takes into account every value in the dataset. When the values in a dataset are tightly bunched together the standard deviation is small. When the values are spread apart the standard deviation is relatively large. The standard deviation is usually presented in conjunction with the mean and is measured in the same units.
In many datasets the values deviate from the mean value due to chance and such datasets are said to display a normal distribution. In a dataset with a normal distribution most of the values are clustered around the mean while relatively few values tend to be extremely high or extremely low. Many natural phenomena display a normal distribution.
For datasets that have a normal distribution the standard deviation can be used to determine the proportion of values that lie within a particular range of the mean value. For such distributions approximately 68% of values are less than one standard deviation (1SD) away from the mean value, approximately 95% of values are less than two standard deviations (2SD) away from the mean and approximately 99.7% of values are less than three standard deviations (3SD) away from the mean. Figure 3 shows this concept in diagrammatical form.
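If the mean of a dataset is 25 and its standard deviation is 1.6, then, assuming a normal distribution, about 68% of the values lie between 23.4 and 26.6 (25 ± 1.6), about 95% lie between 21.8 and 28.2 (25 ± 3.2), and almost all (over 99%) lie between 20.2 and 29.8 (25 ± 4.8).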
If the dataset had the same mean of 25 but a larger standard deviation (for example, 2.3) it would indicate that the values were more dispersed. The frequency distribution for a dispersed dataset would still show a normal distribution but when plotted on a graph the shape of the curve will be flatter as in figure 4.
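The proportions and intervals discussed above can be checked with Python's statistics.NormalDist. This is a minimal sketch that assumes the data follow a normal distribution exactly; the helper names share_within and interval are made up for illustration.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    standard = NormalDist()  # standard normal: mean 0, SD 1
    return standard.cdf(k) - standard.cdf(-k)

def interval(mean, sd, k):
    """Lower and upper bounds of mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd

# Same mean (25), two different standard deviations from the text.
for sd in (1.6, 2.3):
    for k in (1, 2, 3):
        low, high = interval(25, sd, k)
        print(f"SD={sd}: about {share_within(k):.1%} of values "
              f"lie between {low:.1f} and {high:.1f}")
```

The proportions are the same for both datasets; only the width of each interval changes, which is why the curve for the larger standard deviation is flatter.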
Population and sample standard deviations
There are two different calculations for the Standard Deviation. Which formula you use depends upon whether the values in your dataset represent an entire population or whether they form a sample of a larger population. For example, if all student users of the library were asked how many books they had borrowed in the past month then the entire population has been studied since all the students have been asked. In such cases the population standard deviation should be used. Sometimes it is not possible to find information about an entire population and it might be more realistic to ask a sample of 150 students about their library borrowing and use these results to estimate library borrowing habits for the entire population of students. In such cases the sample standard deviation should be used.
Formulae for the standard deviation
Whilst it is not necessary to learn the formula for calculating the standard deviation, there may be times when you wish to include it in a report or dissertation.
The standard deviation of an entire population is known as σ (sigma) and is calculated using:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Where x represents each value in the population, μ is the mean value of the population, Σ is the summation (or total), and N is the number of values in the population.
The standard deviation of a sample is known as S and is calculated using:
$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Where x represents each value in the sample, x̄ is the mean value of the sample, Σ is the summation (or total), and n-1 is the number of values in the sample minus 1.
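The two calculations differ only in the divisor: N for a population and n-1 for a sample. The sketch below writes both out directly and compares them with statistics.pstdev and statistics.stdev from Python's standard library. The borrowing figures are hypothetical, and the function names population_sd and sample_sd are assumptions for illustration.

```python
import math
import statistics

def population_sd(values):
    """Population standard deviation: divide the squared deviations by N."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    x_bar = sum(values) / len(values)
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))

# Hypothetical numbers of books borrowed by eight students.
borrowed = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(borrowed))         # 2.0
print(round(sample_sd(borrowed), 3))   # 2.138

# The standard library functions should give the same results.
print(statistics.pstdev(borrowed), statistics.stdev(borrowed))
```

The sample figure is always slightly larger than the population figure for the same values, and the gap narrows as the number of values grows.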
Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That means that the units were also squared. To get the units back the same as the original data values, the square root must be taken.
Population Standard Deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$ (4)
Sample Standard Deviation:
$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$ (5)
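The sample standard deviation is not an unbiased estimator of the population standard deviation, even though the sample variance (with n-1 in the denominator) is an unbiased estimator of the population variance.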
Example:
Joseph's midterm grades are: Statistics 96 points, Math 90 points, English 85 points, Geography 78 points, History 92 points, Chemistry 67 points. What is the variance of Joseph's midterm grades? What is the standard deviation?
Solution:
The six grades are all of Joseph's midterm grades, so they form the whole population rather than a sample. We therefore use the population formulas, starting with the mean:
(96+90+85+78+92+67)/6 = 508/6 = 84.67
Average grade (mean= mu) = 84.67
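Therefore the variance is:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{567.33}{6} \approx 94.56$$ (6)
The standard deviation is:
$$\sigma = \sqrt{94.56} \approx 9.72$$ (7)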
The calculator does not have a variance key on it. It does have a standard deviation key. You will have to square the standard deviation to find the variance.
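The worked example can be checked with Python's statistics module. This is a sketch that treats the six grades as a whole population, as the solution does, and it mirrors the calculator advice by squaring the standard deviation to recover the variance.

```python
import statistics

# Joseph's six midterm grades, treated as a whole population.
grades = [96, 90, 85, 78, 92, 67]

mean = statistics.mean(grades)
sd = statistics.pstdev(grades)            # population SD (divides by N)
variance = statistics.pvariance(grades)   # population variance
variance_from_sd = sd ** 2                # the calculator route: square the SD

print(f"mean     = {mean:.2f}")              # 84.67
print(f"variance = {variance:.2f}")          # 94.56
print(f"SD       = {sd:.2f}")                # 9.72
print(f"SD squared = {variance_from_sd:.2f}")  # 94.56
```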
Sum of Squares (shortcuts)
The sum of the squares of the deviations from the mean is given a shortcut notation and several alternative formulas.
$$SS(x) = \sum (x - \bar{x})^2$$ (8)
A little algebraic simplification returns:
$$SS(x) = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$ (9)
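As a quick numerical check, the sketch below computes the sum of squares in two ways: directly, as the sum of squared deviations from the mean, and with the usual shortcut, the sum of the squared values minus the squared total divided by the number of values. It reuses Joseph's grades from the earlier example; the function names ss_definition and ss_shortcut are assumptions for illustration.

```python
import math

def ss_definition(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def ss_shortcut(values):
    """Shortcut form: sum of x squared minus (sum of x) squared over n."""
    return sum(x * x for x in values) - sum(values) ** 2 / len(values)

grades = [96, 90, 85, 78, 92, 67]

print(round(ss_definition(grades), 2))  # 567.33
print(round(ss_shortcut(grades), 2))    # 567.33
```

The shortcut avoids calculating the mean first, which is convenient by hand, although on a computer it can lose precision when the values are very large relative to their spread.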
Coefficient of Variation
The coefficient of variation (CV), also known as “relative variability”, equals the standard deviation divided by the mean. CV is often presented as the given ratio multiplied by 100. The CV for a single variable aims to describe the dispersion of the variable in a way that does not depend on the variable's measurement unit. The higher the CV, the greater the dispersion in the variable. The CV for a model aims to describe the model fit in terms of the relative sizes of the squared residuals and outcome values. The lower the CV, the smaller the residuals relative to the predicted value. This is suggestive of a good model fit.
$$\text{CV} = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$$ (10)
Example:
The heights and weights of 5 students are given below. Compare the dispersion of the two variables.
N = 5
Height: 172, 168, 164, 170, 176 (cm)
Weight: 62, 57, 58, 64, 64 (kg)
Since the two variables are measured in different units, we compare their dispersion by calculating the coefficient of variation for both the height and the weight.
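Coefficient of variation for the height of the 5 students:
(4.47/170) x 100% = 2.63% (11)
Coefficient of variation for the weight of the 5 students:
(3.32/61) x 100% = 5.4% (12)
Here 4.47 cm and 3.32 kg are the sample standard deviations, and 170 cm and 61 kg are the means. Relative to its mean, weight is about twice as dispersed as height in this group.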
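The comparison can be reproduced in Python. This sketch assumes the sample standard deviation (n-1 in the denominator) is intended, since that is the version that gives 4.47 for the heights; the function name coefficient_of_variation is made up for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return stdev(values) / mean(values) * 100

heights_cm = [172, 168, 164, 170, 176]
weights_kg = [62, 57, 58, 64, 64]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"height: CV = {cv_height:.2f}%")  # 2.63%
print(f"weight: CV = {cv_weight:.2f}%")  # 5.44%
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one variable is in centimetres and the other in kilograms.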