In previous sections of this guide to descriptive statistics, we introduced the fundamental concepts underlying measures of central tendency and variability, walking through their formulas and some intermediate applications. Here, we’ll expand on these topics by presenting the concept of outliers and providing some practice problems.

The Interquartile Range

In the previous sections, you were introduced to quartiles and the interquartile range, otherwise known as the IQR. To briefly recap, the interquartile range is defined as the distance between the first and third quartiles. The interval between these two quartiles contains both the median and the middle 50% of the data. Recall the image below, used as an example illustration of the IQR.

[Figure: Quartiles]

While the IQR has many applications, including ones tied to the discussion on outliers explained further on in this section, what is important to note is how the measures of central tendency play into the IQR. This is easiest to see when looking at data plotted on a boxplot.

[Figure: Boxplot 1 (left) and Boxplot 2 (right)]

Boxplots are an effective way of displaying the IQR because they show several measures of central tendency and variability at once. The mean and median can be seen in both plots: the boxplot on the left shows a distribution where the mean is greater than the median, and the boxplot on the right shows a distribution where the median and mean are equal.

The distribution, defined as how the values of a variable are spread out, can be read from the IQR. The boxplot on the left shows a distribution where the first quartile is closer to the median than the third quartile is. The boxplot on the right, on the other hand, shows a distribution where the median and mean are equidistant from quartiles 1 and 3.

These differences in where the measures lie on the boxplot are due to differences in the distributions. The distribution on the right is symmetric, which is consistent with a normal distribution, while the one on the left signals a skewed distribution. We’ll go into more detail on distributions later. For now, you can find a recap of the measures of central tendency and variability you can observe from boxplots in the table below.

| Measure | Location on Boxplot | Interpretation |
| --- | --- | --- |
| Mean | Typically located above or below the median and within the IQR, although there are exceptions | The average of the data |
| Median | Located at quartile 2 (Q2) | Half the data fall above this point and half below it (the 50% mark) |
| Minimum | Located at Q0 | The lowest value of the data set |
| Maximum | Located at Q4 | The highest value of the data set |
| Interquartile Range | Between Q1 and Q3 | Holds the centre 50% of the data set, including the median |
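
As a rough illustration of the measures a boxplot displays, the sketch below computes them for two small samples. The samples and the helper name `boxplot_summary` are hypothetical and are not taken from the figures in this guide. Quartile conventions differ between textbooks and software, so the `"inclusive"` method used here is an assumption.

```python
import statistics


def boxplot_summary(values):
    """Return the measures a boxplot displays for a list of numbers."""
    data = sorted(values)
    # Quartile conventions vary; "inclusive" is one common choice (assumption).
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],        # Q0
        "q1": q1,
        "median": q2,          # Q2
        "q3": q3,
        "max": data[-1],       # Q4
        "mean": statistics.mean(data),
        "iqr": q3 - q1,
    }


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
skewed = boxplot_summary([1, 2, 3, 4, 6, 9, 14, 22, 40])

for name, summary in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, summary)
```

In the symmetric sample the mean equals the median and both quartiles sit the same distance from it. In the skewed sample the mean is pulled above the median and Q1 lies closer to the median than Q3 does.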

Outliers

If you’ve never heard of outliers in a mathematical or statistical setting, you’re bound to have heard the term used in other disciplines. This is mainly because the definition of an outlier is broad and can therefore be applied to situations beyond mathematics.

An outlier is defined as a point that diverges from the typical pattern. In other words, an outlier is different from the rest of the data set.

Influential Observation

It’s easy to confuse outliers with influential observations. It can be easier to separate the two by thinking of outliers as a concept belonging mainly to descriptive statistics, while influential observations typically come up in inferential statistics.

An influential observation is a data point or points that have an impact on the slope of a regression line. Reserving the details of regression for our guide on inferential statistics, you can get a basic understanding of the difference between these two statistical concepts from the images below.

[Figure: Scatterplot with an outlier (left) and scatterplot with an influential observation (right)]

As you can see, the regression line on the left is not affected by the inclusion of the red point, whereas on the right, we can see that the regression line changes significantly with the inclusion of the pink point. This suggests the red point is an outlier and the pink point is an influential observation.
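
The same contrast can be sketched numerically. The data below are hypothetical and are not the points plotted in the figures, and `ols_slope` is an illustrative helper for the ordinary least-squares slope. One extra point is placed at the mean of the x values with an unusual y value, and another is placed far out on the x axis.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

base_slope = ols_slope(xs, ys)

# Unusual y value at the mean of x: the slope is unchanged,
# although the fitted line does shift upward.
slope_with_outlier = ols_slope(xs + [3], ys + [20])

# Unusual point far from the other x values: the slope changes sharply.
slope_with_influential = ols_slope(xs + [15], ys + [5])

print(base_slope, slope_with_outlier, slope_with_influential)
```

Here the point at x = 3 leaves the slope at 2, while the point at x = 15 drags the slope down to below 0.1.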

How to Identify Outliers

In statistics, there are many different ways to identify whether or not a point is an outlier. Two basic methods are summarized in the table below.

| Method | Description | Example |
| --- | --- | --- |
| Standard Deviation Method | If the data has a normal distribution, we can use the 68-95-99.7 rule to determine outliers. The limit is a matter of convention, typically set at \(3\sigma\) or more from the mean. | If we set it at \(3\sigma\), any point \(3\sigma\) or more away from the mean can be considered an outlier. |
| Interquartile Range Method | If the data doesn’t have a normal distribution, we can use the IQR as a benchmark for outliers, as it contains 50% of the data. The limits are, again, set by convention at IQR * n below the first quartile (the 25th percentile) and above the third quartile (the 75th percentile), where n is typically set at 1.5. | If Q3 is 10 and Q1 is 3, the IQR is 10 - 3 = 7. The distance from each quartile to its limit is 7 * 1.5 = 10.5. This means that any point below 3 - 10.5 = -7.5 or above 10 + 10.5 = 20.5 could be considered an outlier. |
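
Both rules reduce to computing a lower and an upper limit and checking whether a point falls outside them. A minimal sketch follows. The function names `iqr_limits`, `sigma_limits` and `is_outlier` are illustrative, and the default multipliers 1.5 and 3 are the conventional choices described above rather than fixed rules.

```python
def iqr_limits(q1, q3, n=1.5):
    """Lower and upper limits for the interquartile range method."""
    iqr = q3 - q1
    return q1 - n * iqr, q3 + n * iqr


def sigma_limits(mean, sd, k=3.0):
    """Lower and upper limits for the standard deviation method."""
    return mean - k * sd, mean + k * sd


def is_outlier(x, lower, upper):
    """True if x lies outside the interval [lower, upper]."""
    return x < lower or x > upper


# The example from the table: Q1 = 3 and Q3 = 10.
lower, upper = iqr_limits(3, 10)
print(lower, upper)                  # -7.5 20.5
print(is_outlier(25, lower, upper))  # True
print(is_outlier(12, lower, upper))  # False
```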

Practice Problem 1

Calculate the following descriptive statistics from the data given in the table below:

  • Median
  • Mean
  • Interquartile Range

| Observation | Value |
| --- | --- |
| 1 | 5 |
| 2 | 16 |
| 3 | 24 |
| 4 | 28 |
| 5 | 30 |
| 6 | 31 |
| 7 | 32 |
| 8 | 35 |
| 9 | 95 |

Problem 2

You are trying to decide whether or not you have an outlier in your data set. Use the standard deviation method in order to determine if there are any outliers in your data, given in the data table below.

| Observation | Value |
| --- | --- |
| 1 | 4 |
| 2 | 6 |
| 3 | 3 |
| 4 | 9 |
| 5 | 60 |
| Mean | 16.4 |
| Standard Deviation | 24.5 |

Problem 3

Interpret the chart below.

[Figure: Quartiles 3]

Solution Problem 1

| Observation | Value |
| --- | --- |
| 1 | 5 |
| 2 | 16 |
| 3 | 24 |
| 4 | 28 |
| 5 | 30 |
| 6 | 31 |
| 7 | 32 |
| 8 | 35 |
| 9 | 95 |
| Total | 296 |

The mean is calculated as,

    \[ \bar{x} = \dfrac{296}{9} \approx 32.9 \]

The median is the midpoint of the data set. Because our data is already ordered from least to greatest, we simply need to find the middle value. In this case, it is the 5th observation, which has a value of 30.

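The interquartile range is found by splitting the data into fourths. Here, the median is counted in both the lower and upper halves when locating Q1 and Q3. Other quartile conventions can give slightly different values. Doing this gives us the following quartiles: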

  • Q0 = 5
  • Q1 = 24
  • Q2 = 30
  • Q3 = 32
  • Q4 = 95

Next, the IQR can be calculated as,

    \[ IQR = Q3 - Q1 = 32-24 = 8 \]
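
These results can be checked with Python's standard `statistics` module. This is a sketch, not the method used to produce the solution above. The `"inclusive"` option is chosen because it reproduces the quartiles listed here, and the `"exclusive"` option is shown only to illustrate that a different convention shifts Q1 and Q3.

```python
import statistics

values = [5, 16, 24, 28, 30, 31, 32, 35, 95]

mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles under the convention that matches the solution above.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# The same data under a different quartile convention.
q1_excl, _, q3_excl = statistics.quantiles(values, n=4, method="exclusive")

print(round(mean, 1), median)              # 32.9 30
print(q1, q3, iqr)                         # 24.0 32.0 8.0
print(q1_excl, q3_excl, q3_excl - q1_excl) # 20.0 33.5 13.5
```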

Solution Problem 2

Find the step-by-step solution below.

| Observation | Value |
| --- | --- |
| 1 | 4 |
| 2 | 6 |
| 3 | 3 |
| 4 | 9 |
| 5 | 60 |
| Mean | 16.4 |
| Standard Deviation | 24.5 |

Using the standard deviation method to identify an outlier can be done by standardizing the data point. We suspect the fifth observation may be an outlier.

    \[ z_{i} = \dfrac{60-16.4}{24.5} \approx 1.78 \]

This means that the value 60 is about \(1.8\sigma\) away from the mean. While this is still well within the \(3\sigma\) normally used for finding outliers in the standard deviation method, you may want to consider setting the limit at a lower multiple of \(\sigma\) since the sample size is small.
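
The calculation can be reproduced in a few lines. This is a sketch that assumes the standard deviation in the table is the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes and which matches the 24.5 given. The last step uses a known bound on z-scores, (n - 1) / sqrt(n) when the sample standard deviation is used, to show why a \(3\sigma\) limit cannot be reached with only five observations.

```python
import math
import statistics

values = [4, 6, 3, 9, 60]
n = len(values)

mean = statistics.mean(values)   # 16.4
sd = statistics.stdev(values)    # sample standard deviation, about 24.5

# Standardize the suspected outlier.
z = (60 - mean) / sd

# Largest z-score any single point can have in a sample of size n
# when the sample standard deviation is used.
max_z = (n - 1) / math.sqrt(n)

print(round(mean, 1), round(sd, 1), round(z, 2), round(max_z, 2))
```

With n = 5 the largest attainable z-score is about 1.79, so the 1.78 found for the value 60 is close to the most extreme result this sample size allows.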

Solution Problem 3

| Quartile | Interpretation |
| --- | --- |
| Q0 | The minimum, located at 0 |
| Q1 | 25% of the data is below 35 |
| Q2 | 50% of the data is below 50 and 50% is above it |
| Q3 | 75% of the data is below 65 |
| Q4 | The maximum, located at 100 |

 


Danica

Located in Prague and studying to become a Statistician, I enjoy reading, writing, and exploring new places.
