Handout 05
Date: 2022-10-30
Topic: Data Wrangling
Literature
Handout
Ismay & Kim (2022) Chapter 3
Recap
Ismay & Kim Introduction Chapter 3: five named graphs
Wrangling functions from dplyr
The dplyr package comes with a large number of functions to wrangle
(transform) data in preparation for further analysis.
- select() (3.8.1); to select variables
- the %>% (pipe) operator (3.1)
- filter() (3.2); to filter rows based on one or more conditions
- summarize() (3.3); to generate summary statistics (see below)
- group_by() (3.4); to generate summary statistics per category
- mutate() (3.5); to add new variables
- arrange() (3.6); to sort the data set
- join functions (3.7; this section covers only inner_join()); see
script join_functions.R
  - inner_join()
  - left_join()
  - right_join()
  - full_join()
  - semi_join()
  - anti_join()
- transmute(); adds new variables and drops existing ones
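The verbs listed above can be sketched in one short script. This is a minimal illustration, not taken from Ismay & Kim or from join_functions.R: it assumes the built-in mtcars data set as stand-in data, and the lookup table cyl_labels is a hypothetical helper made up for the join example.

```r
library(dplyr)

# mtcars ships with R; the car names are row names, so copy them to a column
cars <- mtcars %>%
  mutate(model = rownames(mtcars))

# select(), filter(), mutate() and arrange() chained with the pipe
efficient <- cars %>%
  select(model, mpg, cyl, wt) %>%     # keep four variables
  filter(mpg > 25, cyl == 4) %>%      # rows that meet both conditions
  mutate(kpl = mpg * 0.425144) %>%    # new variable: km per litre
  arrange(desc(mpg))                  # sort, highest mpg first

# group_by() + summarize(): summary statistics per category
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg), .groups = "drop")

# transmute(): like mutate(), but keeps only the variables named in the call
ratios <- cars %>%
  transmute(model, power_to_weight = hp / wt)

# Hypothetical lookup table (an assumption for illustration only)
cyl_labels <- tibble(cyl = c(4, 6), label = c("four", "six"))

# inner_join() keeps only rows with a match in both tables;
# left_join() keeps every row of the first table and fills gaps with NA
inner <- inner_join(cars, cyl_labels, by = "cyl")
left  <- left_join(cars, cyl_labels, by = "cyl")
```

Comparing nrow(inner) with nrow(left) shows the difference between the two joins: cars with 8 cylinders have no match in cyl_labels, so they are dropped by inner_join() and kept with a missing label by left_join().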
Summary Statistics
The most common summary statistics are listed below, by type of variable.
Categorical Variable
- Number of Observations
- Frequency Table with absolute or relative frequencies per
category
- Modal Class; the class with the highest frequency
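A possible way to compute these three statistics with dplyr is sketched below. It assumes the built-in mtcars data set, with the number of cylinders (cyl) treated as a categorical variable; the object names are illustrative.

```r
library(dplyr)

# Frequency table: absolute (n) and relative (prop) frequencies per category
freq <- mtcars %>%
  count(cyl, name = "n") %>%
  mutate(prop = n / sum(n))

# Number of observations
n_obs <- sum(freq$n)

# Modal class: the category with the highest frequency
# (which.max() returns the first one if there is a tie)
modal_class <- freq$cyl[which.max(freq$n)]
```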
Numerical Variable
- Number of Observations
- Measures of Center
  - Median
  - Average or Mean
  - Trimmed Mean
- Measures of Spread
  - Range
  - IQR (Interquartile Range)
  - Variance / Standard Deviation; measures of spread around the
mean
  - Coefficient of Variation (= SD / Mean)
- Measure of Skewness
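The statistics for one numerical variable can be collected in a single summarize() call, as in the sketch below. It assumes the mpg variable of the built-in mtcars data set. Two choices here are assumptions, not prescribed by the handout: the trimmed mean drops 10% of the observations at each end, and skewness is computed with the moment-based formula, since base R has no built-in skewness function.

```r
library(dplyr)

stats <- mtcars %>%
  summarize(
    n_obs        = n(),                   # number of observations
    median_mpg   = median(mpg),
    mean_mpg     = mean(mpg),
    trimmed_mean = mean(mpg, trim = 0.1), # drops 10% at each end (assumed)
    range_mpg    = max(mpg) - min(mpg),   # width of the range
    iqr_mpg      = IQR(mpg),              # third quartile minus first quartile
    var_mpg      = var(mpg),
    sd_mpg       = sd(mpg),
    cv_mpg       = sd_mpg / mean_mpg,     # coefficient of variation
    # moment-based skewness: third central moment / (second central moment)^1.5
    skew_mpg     = mean((mpg - mean_mpg)^3) / mean((mpg - mean_mpg)^2)^1.5
  )
```

A positive value of skew_mpg points to a longer right tail, a negative value to a longer left tail.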
Two Categorical Variables
- Number of Observations
- Two-Way Table or Contingency Table
  - Absolute Frequencies
  - Relative Frequencies as proportions of the total number of
observations
  - Relative Frequencies as proportions of row totals
  - Relative Frequencies as proportions of column totals
- Measure of Association (Correlation) between the two variables
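The tables above can be built with base R's table() and prop.table(), as in this sketch. It assumes the cyl and am variables of the built-in mtcars data set. The handout does not name a specific measure of association; Cramér's V is used here as one common choice, and that choice is an assumption.

```r
# Two-way table with absolute frequencies
two_way <- table(cyl = mtcars$cyl, am = mtcars$am)

# Number of observations
n_obs <- sum(two_way)

# Relative frequencies
prop_total <- prop.table(two_way)              # proportions of the grand total
prop_row   <- prop.table(two_way, margin = 1)  # proportions of row totals
prop_col   <- prop.table(two_way, margin = 2)  # proportions of column totals

# Cramér's V (assumed measure): sqrt(chi-squared / (n * (min(rows, cols) - 1)))
# suppressWarnings() hides the small-expected-count warning for this tiny table
chi2 <- suppressWarnings(chisq.test(two_way, correct = FALSE))$statistic
cramers_v <- sqrt(as.numeric(chi2) / (n_obs * (min(dim(two_way)) - 1)))
```

Cramér's V lies between 0 (no association) and 1 (perfect association).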
Two Numerical Variables
- Number of Observations
- Two-Way Table, only if both variables take a limited number of
distinct values
- Measure of Correlation (Association) between the two variables
  - Pearson’s Correlation Coefficient (or in short: Correlation
Coefficient)
  - Spearman’s Rank Correlation Coefficient
  - Kendall’s Rank Correlation Coefficient
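All three coefficients are available through the method argument of base R's cor(). The sketch below assumes the wt (weight) and mpg variables of the built-in mtcars data set; the two-way table uses gear and carb, two numerical variables with only a few distinct values.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Number of observations
n_obs <- length(x)

# Pearson measures linear association; Spearman and Kendall are rank-based
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

# Two-way table: sensible here because gear and carb take few distinct values
gear_carb <- table(gear = mtcars$gear, carb = mtcars$carb)
```

All three coefficients are negative for this pair of variables: heavier cars tend to have a lower mpg.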
One Categorical and one Numerical Variable
- Number of Observations
- Number of Observations per Category
- Summary Statistics for the Numerical Variable per Category
- Correlation between the two variables
  - no one-size-fits-all measure
  - perform ANOVA to analyse differences between group means
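The per-category summaries and the ANOVA can be sketched as follows. The example assumes the built-in iris data set, with Species as the categorical variable and Sepal.Length as the numerical one; the object names are illustrative.

```r
library(dplyr)

# Number of observations
n_obs <- nrow(iris)

# Number of observations and summary statistics per category
per_species <- iris %>%
  group_by(Species) %>%
  summarize(
    n        = n(),
    mean_len = mean(Sepal.Length),
    sd_len   = sd(Sepal.Length),
    .groups  = "drop"
  )

# One-way ANOVA: do the mean sepal lengths differ between species?
fit <- aov(Sepal.Length ~ Species, data = iris)
anova_table <- summary(fit)[[1]]

# p-value for the Species effect (first row of the ANOVA table)
p_value <- anova_table[["Pr(>F)"]][1]
```

A small p-value suggests that at least one group mean differs from the others; it does not say which one.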
EXERCISE
Work through script: ppd_london.R
ppd stands for price paid data; the script analyses data on properties
sold in London.
Meta data can be found here.