Chapter 4 Missing values
4.1 Column pattern
After applying our self-defined missing value plot function here, we obtain the plot shown below.
According to the plot, the variables below have a large proportion of missing values. We give a basic description of each variable:
- PARKS_NM: Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included).
- HADEVELOPT: Name of NYCHA housing development of occurrence, if applicable.
- HOUSING_PSA: Development Level Code.
- TRANSIT_DISTRICT: Transit district in which the offense occurred.
- STATION_NAME: Transit station name.
The descriptions show that it makes sense for the first three variables to have missing values: they are all "if applicable" variables, which means they were designed with the expectation that many records would have no value.
As for the last two variables, the logic is similar: not every crime event happens near a transit station.
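The proportion of missing values per column, which the plot summarizes, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the plotting function used in this chapter; the records, the `missing_ratio` helper, and the use of `None` to mark a missing value are all assumptions made for the example.

```python
# Toy records standing in for the complaint data; None marks a missing value.
# The values are invented for illustration only.
records = [
    {"PARKS_NM": None, "STATION_NAME": None, "SUSP_SEX": "M"},
    {"PARKS_NM": None, "STATION_NAME": "125 STREET", "SUSP_SEX": None},
    {"PARKS_NM": "CENTRAL PARK", "STATION_NAME": None, "SUSP_SEX": "F"},
    {"PARKS_NM": None, "STATION_NAME": None, "SUSP_SEX": "M"},
]


def missing_ratio(rows):
    """Return {column: fraction of rows where the column is None}."""
    columns = rows[0].keys()
    n = len(rows)
    return {col: sum(row[col] is None for row in rows) / n for col in columns}


ratios = missing_ratio(records)

# Print columns from most to least missing.
for col, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col}: {ratio:.2f}")
```

In this toy data the "if applicable" style columns are mostly empty while the suspect column is mostly filled, which mirrors the pattern described above.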
The remaining variables with missing values are:
- SUSP_AGE_GROUP
- SUSP_RACE
- SUSP_SEX
- LOC_OF_OCCUR_DESC
- COMPLNT_TO_DT
- COMPLNT_TO_TM
These variables can still be used in our analysis since their proportion of missing values is small.
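One way to make the "small proportion" criterion explicit is to split columns around a cutoff. The sketch below is only an illustration: the `split_by_missing_ratio` helper, the 0.5 threshold, and the example fractions are hypothetical and were not measured from the data set.

```python
def split_by_missing_ratio(ratios, threshold):
    """Split column names into (usable, mostly_missing) around a threshold."""
    usable = sorted(c for c, r in ratios.items() if r <= threshold)
    mostly_missing = sorted(c for c, r in ratios.items() if r > threshold)
    return usable, mostly_missing


# Hypothetical missing fractions, chosen only to demonstrate the split.
example_ratios = {
    "PARKS_NM": 0.99,
    "STATION_NAME": 0.97,
    "SUSP_SEX": 0.20,
    "COMPLNT_TO_DT": 0.15,
}

# The 0.5 cutoff is an assumption; a stricter or looser value may suit the analysis.
usable, mostly_missing = split_by_missing_ratio(example_ratios, threshold=0.5)
print("usable:", usable)
print("mostly missing:", mostly_missing)
```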