Contents
- SPSS System Missing Values
- SPSS User Missing Values
- Setting User Missing Values
- Inspecting Missing Values per Variable
- SPSS Data Analysis with Missing Values
What are “Missing Values” in SPSS?
In SPSS, “missing values” may refer to 2 things:
- System missing values are values that are completely absent from the data. They are shown as periods in data view.
- User missing values are values that are invisible while analyzing or editing data. The SPSS user specifies which values -if any- must be excluded.
This tutorial walks you through both. We'll use bank.sav -partly shown below- throughout. You'll get the most out of this tutorial if you try the examples for yourself after downloading and opening this file.
SPSS System Missing Values
System missing values are values that are completely absent from the data.
They appear as dots in data view, as shown below.
System missing values are only found in numeric variables. String variables don't have system missing values. Data may contain system missing values for several reasons:
- some respondents weren't asked some questions due to the questionnaire routing;
- a respondent skipped some questions;
- something went wrong while converting or editing the data;
- some values weren't recorded due to equipment failure.
In some cases system missing values make perfect sense. For example, say I ask
“do you own a car?”
and somebody answers “no”. Well, then my survey software should skip the next question:
“what color is your car?”
In the data, we'll probably see system missing values on color for everyone who does not own a car. These missing values make perfect sense.
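Whether system missing values really result from routing can be checked with a quick crosstab. The sketch below is only an illustration: own_car and car_color are hypothetical variable names that are not in bank.sav, and the coding of own_car (0 = no, 1 = yes) is an assumption.

```spss
*Sketch with hypothetical variables own_car (0 = no, 1 = yes) and car_color.
*Flag cases having a system missing value on car_color.
compute color_sysmis = sysmis(car_color).
variable labels color_sysmis 'car_color is system missing (1 = yes)'.
*If the routing worked, color_sysmis = 1 should coincide with own_car = 0.
crosstabs own_car by color_sysmis.
```

If respondents who do own a car have system missing values on color too, something other than routing is probably going on.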
In other cases, however, it may not be clear why there are system missing values in your data. Something may or may not have gone wrong. Therefore, you should try to find out why some values are system missing, especially if there are many of them.
So how to detect and handle missing values in your data? We'll get to that after taking a look at the second type of missing values.
SPSS User Missing Values
User missing values are values that are excluded when analyzing or editing data.
“User” in user missing refers to the SPSS user. Hey, that's you! So it's you who may need to set some values as user missing. So which -if any- values must be excluded? Briefly,
- for categorical variables, answers such as “don't know” or “no answer” are typically excluded from analysis;
- for metric variables, unlikely values -a reaction time of 50ms or a monthly salary of € 9,999,999- are usually set as user missing.
For bank.sav, no user missing values have been set yet, as can be seen in variable view.
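Besides looking at variable view, the user missing values defined for a set of variables can also be listed with syntax. A minimal sketch, assuming we're interested in q1 to q9:

```spss
*Show dictionary information -including any user missing values- for q1 to q9.
display dictionary
/variables = q1 to q9.
```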
Let's now see if any values should be set as user missing and how to do so.
User Missing Values for Categorical Variables
A quick way for inspecting categorical variables is running frequency distributions and corresponding bar charts. Make sure the output tables show both values and value labels. The easiest way for doing so is running the syntax below.
*Show both values and value labels in output tables.
set tnumbers both.
*Basic frequency tables for q1 to q9.
frequencies q1 to q9.
Result
First note that q1 is an ordinal variable: higher values indicate higher levels of agreement. However, this does not hold for 11: “No answer” does not indicate more agreement than 10 - “Totally agree”. Therefore, only values 1 through 10 make up an ordinal variable and 11 should be excluded.
The syntax below shows the right way to do so.
*Set 11 as user missing value for q1 to q9.
missing values q1 to q9 (11).
*Rerun frequencies table.
frequencies q1 to q9.
Result
Note that 11 is shown among the missing values now. It occurs 6 times in q1 and there are also 14 system missing values. In variable view, we also see that 11 is set as a user missing value for q1 through q9.
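User missing values are only a setting: the value 11 is still in the data and the setting can be undone at any time. A minimal sketch of removing and restoring the user missing values for q1 to q9:

```spss
*Remove all user missing values from q1 to q9.
missing values q1 to q9 ().
*Set 11 as user missing again.
missing values q1 to q9 (11).
```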
User Missing Values for Metric Variables
The right way to inspect metric variables is running histograms over them. The syntax below shows the easiest way to do so.
frequencies whours
/format notable
/histogram.
Result
Some respondents report working over 150 hours per week. Perhaps these are their monthly -rather than weekly- hours. In any case, such values are not credible. We'll therefore set all values of 50 hours per week or more as user missing. After doing so, the distribution of the remaining values looks plausible.
missing values whours (50 thru hi).
*Rerun histogram.
frequencies whours
/format notable
/histogram.
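Apart from the histogram, a quick numeric check may be useful here. The sketch below assumes the MISSING VALUES command for whours has been run; if so, the maximum should be below 50 and N should be lower than before.

```spss
*Check N, minimum and maximum for whours after setting user missing values.
descriptives whours
/statistics = min max.
```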
Inspecting Missing Values per Variable
A super fast way to inspect (system and user) missing values per variable is running a basic DESCRIPTIVES table. Before doing so, make sure you don't have any WEIGHT or FILTER switched on. You can check this by running SHOW WEIGHT FILTER N. Also note that there are 464 cases in these data. So let's now inspect the descriptive statistics.
descriptives q1 to q9.
*Note: (464 - N) = number of missing values.
Result
The N column shows the number of non missing values per variable. Since we have 464 cases in total, (464 - N) is the number of missing values per variable. If any variables have high percentages of missingness, you may want to exclude them from -especially- multivariate analyses.
Importantly, note that Valid N (listwise) = 309. These are the cases without missing values on any of the variables in this table. Some procedures will use only those 309 cases -known as listwise exclusion of missing values in SPSS.
Conclusion: none of our variables -columns of cells in data view- have huge percentages of missingness. Let's now see if any cases -rows of cells in data view- have many missing values.
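The WEIGHT and FILTER check recommended at the start of this section could look as follows. This is just a sketch: switching both off is only needed if the first command shows that either is active.

```spss
*Show whether a weight or filter is active and how many cases are in the data.
show weight filter n.
*Switch off weight and filter if needed.
weight off.
filter off.
```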
Inspecting Missing Values per Case
To inspect whether any cases have many missing values, we'll create a new variable. This variable holds the number of missing values over a set of variables that we'd like to analyze together. In the example below, that'll be q1 to q9.
We'll use a short and simple variable name: mis_1 is fine. Just make sure you add a description of what's in it -the number of missing...- as a variable label.
*Count (system and user) missing values over q1 to q9 for each case.
count mis_1 = q1 to q9 (missing).
*Set description of mis_1 as variable label.
variable labels mis_1 'Missing values over q1 to q9'.
*Inspect frequency distribution missing values.
frequencies mis_1.
Result
In this table, 0 means zero missing values over q1 to q9. This holds for 309 cases. This is the Valid N (listwise) we saw in the descriptives table earlier on.
Also note that 1 case has 8 missing values out of 9 variables. We may doubt if this respondent filled out the questionnaire seriously. Perhaps we'd better exclude it from the analyses over q1 to q9. The right way to do so is using a FILTER.
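A minimal sketch of such a FILTER is shown below. The variable name filt_1 and the cutoff of 8 missing values are assumptions for illustration; choose whatever cutoff makes sense for your own data.

```spss
*Create filter variable: 1 for cases with fewer than 8 missing values over q1 to q9.
compute filt_1 = (mis_1 < 8).
variable labels filt_1 'Fewer than 8 missing values over q1 to q9'.
*Activate filter so the case with 8 missing values is excluded.
filter by filt_1.
*Analyses now skip the excluded case.
descriptives q1 to q9.
*Switch filter off when done.
filter off.
```

A filter doesn't delete anything: the excluded case stays in the data and is used again after running FILTER OFF.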
SPSS Data Analysis with Missing Values
So how does SPSS analyze data if they contain missing values? Well, in most situations, SPSS runs each analysis on all cases it can use for it.
Right now, our data contain 464 cases. However, most analyses can't use all 464 because some may drop out due to missing values. Which cases drop out depends on which analysis we run on which variables.
Therefore, an important best practice is to always inspect how many cases are actually used for each analysis you run.
This is not always what you might expect. Let's first take a look at pairwise exclusion of missing values.
Pairwise Exclusion of Missing Values
Let's inspect all (Pearson) correlations among q1 to q9. The simplest way for doing so is just running correlations q1 to q9. If we do so, we get the table shown below.
Note that each correlation is based on a different number of cases. Precisely, each correlation between a pair of variables uses all cases having valid values on these 2 variables. This is known as pairwise exclusion of missing values. Note that most correlations are based on some 410 up to 440 cases.
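Since the minimal syntax results in pairwise exclusion, this is the default for CORRELATIONS. It can also be requested explicitly, which makes the syntax a bit more self-documenting:

```spss
*Correlations with pairwise exclusion of missing values stated explicitly.
correlations q1 to q9
/missing pairwise.
```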
Listwise Exclusion of Missing Values
Let's now rerun the same correlations after adding a line to our minimal syntax:
correlations q1 to q9
/missing listwise.
After running it, we get a smaller correlation matrix as shown below. It no longer includes the number of cases per correlation.
Each correlation is based on the same 309 cases, the listwise N. These are the cases without missing values on any of the variables in the table: q1 to q9. This is known as listwise exclusion of missing values.
Obviously, listwise exclusion often uses far fewer cases than pairwise exclusion. This is why we often recommend the latter: we want to use as many cases as possible. However, if many missing values are present, pairwise exclusion may cause computational issues. In any case, make sure you know if your analysis uses listwise or pairwise exclusion of missing values.
By default, regression and factor analysis use listwise exclusion and in most cases, that's not what you want.
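For regression, pairwise exclusion can be requested with a MISSING subcommand. The sketch below is only an illustration: using q1 as the dependent variable and q2 to q9 as predictors is an arbitrary choice, not an analysis from this tutorial.

```spss
*Regression with pairwise instead of listwise exclusion of missing values.
*Note: q1 as dependent variable is an arbitrary choice for illustration.
regression
/missing pairwise
/dependent q1
/method enter q2 to q9.
```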
Exclude Missing Values Analysis by Analysis
Analyzing if 2 variables are associated is known as bivariate analysis. When doing so, SPSS can only use cases having valid values on both variables. Makes sense, right?
Now, if you run several bivariate analyses in one go, you can exclude cases analysis by analysis: each separate analysis uses all cases it can. Different analyses may use different subsets of cases.
If you don't want that, you can often choose listwise exclusion instead: each analysis uses only cases without missing values on all variables for all analyses. The figure below illustrates this for ANOVA.
Analysis by analysis: the test for q1 and educ uses all cases having valid values on q1 and educ, regardless of q2 to q4.
Listwise: all tests use only cases without missing values on q1 to q4 and educ.
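In syntax, both options could be sketched as shown below. This assumes the ANOVAs are run with ONEWAY and that educ is a numeric variable.

```spss
*Exclude cases analysis by analysis: each test uses all cases it can.
oneway q1 to q4 by educ
/missing analysis.
*Listwise exclusion: all tests use the same complete cases.
oneway q1 to q4 by educ
/missing listwise.
```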
We usually want to use as many cases as possible for each analysis. So we prefer to exclude cases analysis by analysis. But whichever you choose, make sure you know how many cases are used for each analysis. So check your output carefully. The Kolmogorov-Smirnov test is especially tricky in this respect: by default, one option excludes cases analysis by analysis and the other uses listwise exclusion.
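The text above doesn't name the two options for the Kolmogorov-Smirnov test. Assuming they are NPAR TESTS and EXAMINE, stating the exclusion method explicitly avoids having to remember either default. Running these on q1 to q4 is just an example.

```spss
*Kolmogorov-Smirnov tests via NPAR TESTS, excluding cases analysis by analysis.
npar tests
/k-s(normal) = q1 to q4
/missing analysis.
*Normality tests via EXAMINE, excluding cases pairwise rather than listwise.
examine variables = q1 to q4
/plot npplot
/missing pairwise.
```

Either way, check the N reported for each variable in the output.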
Editing Data with Missing Values
Editing data with missing values can be tricky. Different commands and functions act differently in this case. Even something as basic as computing means in SPSS can go very wrong if you're unaware of this.
The syntax below shows 3 ways for computing means that we sometimes encounter. With missing values, however, 2 of those yield incorrect results.
*Compute mean - right way.
compute mean_a = mean(q1 to q9).
*Compute mean - wrong way 1.
compute mean_b = (q1 + q2 + q3 + q4 + q5 + q6 + q7 + q8 + q9) / 9.
*Compute mean - wrong way 2.
compute mean_c = sum(q1 to q9) / 9.
*Check results.
descriptives mean_a to mean_c.
Result
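Why do 2 of these go wrong? The "+" operator returns a system missing value as soon as any of its arguments is missing, so mean_b only exists for cases without any missing values. SUM skips missing values, but dividing by 9 assumes all 9 values are valid, so mean_c is too low for cases having missing values. MEAN divides by the number of valid values.

Even MEAN may not be what you want for cases having very few valid values. A sketch of a stricter variant is shown below; the minimum of 5 valid values and the name mean_d are arbitrary choices for illustration.

```spss
*Compute mean only for cases having at least 5 valid values over q1 to q9.
compute mean_d = mean.5(q1 to q9).
*Check results.
descriptives mean_a to mean_d.
```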
Final Notes
In real world data, missing values are common. They don't usually cause a lot of trouble when analyzing or editing data but in some cases they do. A little extra care often suffices if missingness is limited. Double check your results and know what you're doing.
Thanks for reading.
THIS TUTORIAL HAS 37 COMMENTS:
By Ruben Geert van den Berg on June 1st, 2016
Hi Alaa!
First off, do your system missings indicate zeroes? If so, then RECODE them to zeroes and try again.
If system missings do not indicate zeroes, use SUM instead of "+".
I'll add a few tiny examples below. Let me know whether that solves your problem, ok?
data list free/v1 v2 v3.
begin data
5 2 7
6 '' 2
'' '' 5
'' '' ''
end data.
*Plus operator returns sysmis if any argument is missing.
compute plustotal = v1 + v2 + v3.
execute.
*SUM function returns sysmis only if all arguments are missing.
compute sumtotal = sum(v1,v2,v3).
execute.
*Recode and sum.
recode v1 to v3 (sysmis = 0).
compute recodetotal = v1 + v2 + v3.
execute.
By Kay on July 28th, 2016
If you have more than 5% missing data, would it be more realistic to delete the data versus using the mean in each missing data box?
By Ruben Geert van den Berg on July 28th, 2016
Hi Kay!
Unfortunately, it's not that simple. The first question you should ask yourself is why data are missing in the first place. Second, what are you going to do with the data? Missing values tend to be more problematic as more variables are involved in an analysis because they tend to reduce the number of complete cases in -for instance- factor analysis.
I wouldn't propose any simple rule such as > 5% or > 10% for all different scenarios. Also, replacing missing values with a variable mean may alleviate some trouble but it obviously biases results as well so try and use it sparingly, ok?
By Boushra on October 7th, 2016
Hello,
A few variables in my data have quite a lot of system missing values because the survey was designed in such a way that it asked all participants some questions and only 20% of participants another set of questions. The question is how I can deal with these system missing values, because I think that I cannot conduct expectation maximization (EM) since my data are categorical.
Thanks in advance for your help.
By Ruben Geert van den Berg on October 8th, 2016
Hi Boushra!
When analyzing variables separately or perhaps in pairs (bivariate analyses), this doesn't usually pose too much of a problem. That's obviously different for analyses involving many variables at once (such as factor analysis or regression).
There's no ideal solution.
You can perhaps treat your sample as two separate samples, with and without the system missings and analyze them separately with different sets of variables.
In some cases, you can RECODE the system missings into a valid category and treat the recoded variables as nominal variables or perhaps use dummy coding for them.
If the overall percentage of (n*k) missing data points for n cases and k variables is low, perhaps around 5%-10%, you could consider (multiple) imputation of the missing values.
These are some basic ways to handle the situation but -again- none of those are ideal. You'll have to make some sacrifices for carrying on with your analyses I'm afraid.
Best,
Ruben