Take Home Ex 3

Author

Daisy

Published

February 26, 2023

Overview

1 Getting Started

Installing and loading packages

pacman::p_load(DT, plotly, tidyverse, patchwork, ggiraph, ggstatsplot, performance, ggthemes)

2 The Dataset

The code chunk below imports the dataset into R using read_csv() from readr and saves it as a tibble data frame called data.

data<- read_csv("resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
head(data)
# A tibble: 6 × 11
  month   town     flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
  <chr>   <chr>    <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
1 2017-01 ANG MO … 2 ROOM  406   ANG MO… 10 TO …      44 Improv…    1979 61 yea…
2 2017-01 ANG MO … 3 ROOM  108   ANG MO… 01 TO …      67 New Ge…    1978 60 yea…
3 2017-01 ANG MO … 3 ROOM  602   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
4 2017-01 ANG MO … 3 ROOM  465   ANG MO… 04 TO …      68 New Ge…    1980 62 yea…
5 2017-01 ANG MO … 3 ROOM  601   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
6 2017-01 ANG MO … 3 ROOM  150   ANG MO… 01 TO …      68 New Ge…    1981 63 yea…
# … with 1 more variable: resale_price <dbl>, and abbreviated variable names
#   ¹​flat_type, ²​street_name, ³​storey_range, ⁴​floor_area_sqm, ⁵​flat_model,
#   ⁶​lease_commence_date, ⁷​remaining_lease

Below is a screenshot of the official website to give a clear view of the column meaning.

3 Data Cleaning

Since the analysis covers only 2022 transactions of 3-room, 4-room and 5-room flats, we filter the data accordingly. We also add a column that expresses the resale price in thousands of SGD (KSGD).

data <- data %>%
  filter(month %in% c("2022-01", "2022-02", "2022-03", "2022-04",
                      "2022-05", "2022-06", "2022-07", "2022-08",
                      "2022-09", "2022-10", "2022-11", "2022-12"),
         flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM")) %>%
  mutate("resale_price(KSGD)" = resale_price / 1000)
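Because every month value of interest shares the "2022" prefix, the same filter can be written more compactly. The following is a minimal sketch on a small made-up tibble (the name toy and all of its values are hypothetical, not taken from the dataset), assuming month is stored as "YYYY-MM" text as in the output above.

```r
library(dplyr)

# Hypothetical toy data that mimics the month, flat_type and resale_price columns
toy <- tibble(
  month        = c("2021-12", "2022-01", "2022-06", "2022-12", "2023-01"),
  flat_type    = c("3 ROOM", "2 ROOM", "4 ROOM", "5 ROOM", "3 ROOM"),
  resale_price = c(350000, 280000, 520000, 640000, 360000)
)

toy_2022 <- toy %>%
  # keep rows whose month starts with "2022" and whose flat type is 3/4/5 room
  filter(startsWith(month, "2022"),
         flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM")) %>%
  # express the price in thousands of SGD
  mutate(`resale_price(KSGD)` = resale_price / 1000)

toy_2022
```

The prefix match avoids listing all twelve months, at the cost of relying on the text format of the month column.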

4 Visualization

4.1 The Distribution of Resale Price

ks.test(data$'resale_price(KSGD)',"pnorm")

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  data$"resale_price(KSGD)"
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided

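A p-value below the significance level (0.05) indicates that the null hypothesis should be rejected. Note that ks.test(x, "pnorm") without further arguments uses the standard normal distribution (mean 0, standard deviation 1) as the reference, so the null hypothesis tested here is that the prices follow N(0, 1). Since all prices are far above 0, this explains why D = 1.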
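Because "pnorm" defaults to mean 0 and standard deviation 1, a more informative check passes the sample's own mean and standard deviation to ks.test(). The sketch below uses simulated prices (hypothetical values, not the real dataset) to contrast the two calls. One caveat from the ks.test() documentation: the one-sample test assumes the parameters are specified in advance, so the p-value is only approximate when they are estimated from the same data.

```r
set.seed(1)

# Simulated, hypothetical prices in KSGD (not the real dataset)
x <- rlnorm(500, meanlog = log(500), sdlog = 0.3)

# Reference: standard normal N(0, 1); every price is far above 0, so D is about 1
ks_standard <- ks.test(x, "pnorm")

# Reference: normal distribution with the sample's own mean and sd
ks_fitted <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

ks_standard$statistic
ks_fitted$statistic
```

On the real data, the second form would correspond to ks.test(data$'resale_price(KSGD)', "pnorm", mean = ..., sd = ...) with the mean and standard deviation of that column.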

gghistostats(
  data = data,
  x = 'resale_price(KSGD)',
  binwidth = 25,
  type = "robust",
  test.value = 500,
  xlab = "Resale Price (KSGD)",
  title = "The distribution of Resale price (2022,3/4/5 Room-unit)",
  centrality.line.args = list(color = "#D55E30", linetype = "dashed")
  ) +
  ggthemes::theme_fivethirtyeight()

The distribution is highly skewed. The robust analysis estimates the trimmed mean at 516.01 KSGD and reports it with a 95% confidence interval.
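To illustrate why a trimmed mean suits skewed data, the sketch below compares it with the ordinary mean on a few made-up prices. The values are hypothetical, and the trim fraction of 0.2 is an assumption; the tr argument of gghistostats() controls the fraction actually used in the plot.

```r
# Hypothetical prices in KSGD with one extreme value
prices <- c(400, 450, 500, 550, 1400)

# Ordinary mean: pulled upwards by the extreme value
plain_mean <- mean(prices)

# Trimmed mean: drops 20% of the observations from each end before averaging
trimmed_mean <- mean(prices, trim = 0.2)

plain_mean
trimmed_mean
```

With five values and a 20% trim, one observation is removed from each end, so the extreme price no longer affects the estimate.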

4.2 Line Graph of the Resale Price Trend in 2022

# Average resale price per flat type and month
data1 <- data %>%
  group_by(flat_type, month) %>%
  summarise(mean_resale_price_per_month = mean(resale_price),
            .groups = "drop")

p <- plot_ly(data = data1,
             x = ~month,
             y = ~mean_resale_price_per_month,
             color = ~flat_type,
             colors = "Set1")

# The title is a layout attribute, so it is set in layout() rather than plot_ly()
p %>%
  add_trace(type = "scatter", mode = "markers+lines") %>%
  layout(title = "Line graph of the resale price trend in 2022")

In this graph, users can observe the 2022 resale price trend of 3-room, 4-room and 5-room units separately. Each point represents the average price of that flat type in a single month, and hovering over a point shows the exact value. Overall, prices rose for all three flat types, but 3-room flats rose less than the other two.
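The comparison between flat types can also be checked numerically by computing the percentage change between the first and last month for each type. The following is a minimal sketch on made-up monthly averages (the toy_trend tibble and its values are hypothetical); on the real data, data1 and mean_resale_price_per_month would take their place.

```r
library(dplyr)

# Hypothetical monthly average prices for two flat types
toy_trend <- tibble(
  flat_type  = rep(c("3 ROOM", "4 ROOM"), each = 3),
  month      = rep(c("2022-01", "2022-02", "2022-03"), times = 2),
  mean_price = c(400, 410, 420, 500, 530, 560)
)

growth <- toy_trend %>%
  group_by(flat_type) %>%
  # order months within each flat type so first() and last() are meaningful
  arrange(month, .by_group = TRUE) %>%
  summarise(
    first_price = first(mean_price),
    last_price  = last(mean_price),
    pct_change  = 100 * (last_price / first_price - 1),
    .groups = "drop"
  )

growth
```

A smaller pct_change for one flat type than for the others would support the reading of the line graph.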

4.3 Violin Boxplot of the Resale Price in 2022

ggbetweenstats(
  data = data,
  x = 'flat_type',
  y = 'resale_price(KSGD)',
  plot.type = "boxviolin",
  outlier.tagging = TRUE, ## whether outliers should be flagged
  outlier.coef = 1.5, ## coefficient for Tukey's rule
  outlier.label = 'town',
  outlier.label.args = list(color = "red", size = 2),
  package = "ggsci",
  palette = "nrc_npg",
  xlab = "Flat Type",
  ylab = "Resale Price (KSGD)",
  title = "Violin boxplot of the resale price in 2022 (3/4/5 room unit)"
  )

This chart shows the price distribution for each of the three flat types. Prices of 3-room flats are the most concentrated, while the 4-room and 5-room distributions are more skewed. The 4-room type has the largest number of records and the most outliers. The outliers in the graph are labelled with their town.
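The outlier.coef = 1.5 setting refers to Tukey's rule, which flags values lying more than 1.5 interquartile ranges beyond the quartiles. A minimal sketch on made-up prices (hypothetical values, using R's default quantile method) shows how the rule works:

```r
# Hypothetical prices in KSGD with one extreme value
prices <- c(10, 12, 13, 14, 15, 16, 50)

q1  <- quantile(prices, 0.25, names = FALSE)
q3  <- quantile(prices, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey's fences with coefficient 1.5
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
is_outlier <- prices < lower_fence | prices > upper_fence
prices[is_outlier]
```

A larger coefficient widens the fences and flags fewer points.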

4.4 Relationship between Floor Area and Resale Price

ggscatterstats(
  data = data,
  x = 'floor_area_sqm',
  y = 'resale_price(KSGD)',
  xlab = "Floor area per unit (sqm)",
  ylab = "Resale Price (KSGD)",
  label.var = 'town',
  ## label.expression refers to columns of data, not to the x and y arguments
  label.expression = floor_area_sqm < 100 & `resale_price(KSGD)` > 1000,
  point.label.args = list(alpha = 0.7, size = 3, color = "grey50"),
  xfill = "#EBD3E8",
  yfill = "#C4E1E1",
  title = "Relationship between floor area and resale price"
)

plot_ly(data = data, 
  x = ~floor_area_sqm, 
  y = ~resale_price,
  text = ~paste( "<br>Town:", town),  
  color = ~flat_type,
  colors = "Set2")

The two charts above show the relationship between floor area and resale price. In the interactive chart, the colours indicate the flat types, and hovering over a point shows which town that unit is in.
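The strength of the relationship can be summarised with a correlation coefficient. The sketch below computes Pearson's r on made-up values (the vectors floor_area and price are hypothetical); it assumes a roughly linear relationship, which should be checked against the scatterplot before relying on the number.

```r
# Hypothetical floor areas (sqm) and resale prices (KSGD)
floor_area <- c(65, 70, 90, 95, 110, 120)
price      <- c(350, 420, 500, 480, 620, 700)

# Pearson correlation with a test of the null hypothesis r = 0
ct <- cor.test(floor_area, price, method = "pearson")

r <- unname(ct$estimate)
r
ct$conf.int
```

On the real data, the two vectors would be data$floor_area_sqm and data$'resale_price(KSGD)'.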

plot_ly(data = data, 
  x = ~flat_model, 
  y = ~resale_price,
  color = ~flat_type,
  colors = "Set2")