library(tidyverse)
library(rvest)
library(httr)
library(scales)
library(broom)

I was wondering what my car is currently worth. One way to get a reasonably objective value is to look at what similar cars are being offered for on the market right now. I'm going to use a simple (yet effective) web scraping technique to retrieve data from the best-known local online car marketplace.

First step - how many pages?

As a first step, we need to find out how many pages of offers there are for my type of car. Piping the root path to read_html and selecting the node “a.hide-S” tells us.

path <- "https://www.tipcars.com/skoda-rapid/benzin/"
# get max. page
max <- path %>%
  html_session() %>% 
  read_html %>% 
  html_nodes("a.hide-S") %>% 
  html_text()
max

## [1] "8"

So there are 8 pages, each listing up to 100 cars. How did I know which node to use? The answer is simple: the SelectorGadget tool. It is a pretty cool tool that lets you click any item of interest on a web page and gives you the “node” (CSS selector) for it. You can find out how to use it here.
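To see what the selector does without hitting the site, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption (the real page will look different); only the a.hide-S selector comes from the code above. The sketch also converts the text to an integer and takes the largest value, in case more than one link matches. The names page and n_pages are mine.

```r
library(rvest)

# Hypothetical markup standing in for the pagination links of a listing page.
page <- read_html(
  '<div><a class="hide-S" href="?str=2-100">2</a><a class="hide-S" href="?str=8-100">8</a></div>'
)

# Select the pagination links, read their text and turn it into numbers.
page_numbers <- page %>%
  html_nodes("a.hide-S") %>%
  html_text(trim = TRUE) %>%
  as.integer()

# The highest page number is the number of pages to scrape.
n_pages <- max(page_numbers, na.rm = TRUE)
n_pages
```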

Load all URLs

Now we can use the number of pages to create a vector of the URLs we will scrape.

url <- paste0(path,"?str=",1:max, "-100")
url

## [1] "https://www.tipcars.com/skoda-rapid/benzin/?str=1-100"
## [2] "https://www.tipcars.com/skoda-rapid/benzin/?str=2-100"
## [3] "https://www.tipcars.com/skoda-rapid/benzin/?str=3-100"
## [4] "https://www.tipcars.com/skoda-rapid/benzin/?str=4-100"
## [5] "https://www.tipcars.com/skoda-rapid/benzin/?str=5-100"
## [6] "https://www.tipcars.com/skoda-rapid/benzin/?str=6-100"
## [7] "https://www.tipcars.com/skoda-rapid/benzin/?str=7-100"
## [8] "https://www.tipcars.com/skoda-rapid/benzin/?str=8-100"
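A slightly more defensive way to build the same vector, as a sketch: the scraped page count is text, so it is converted to an integer first, and seq_len() is used instead of 1:n so that a count of zero gives an empty vector rather than c(1, 0). The names n_pages and urls are mine, and the value "8" is simply the result shown earlier.

```r
path <- "https://www.tipcars.com/skoda-rapid/benzin/"

# The page count arrives as text from html_text(); make it a number.
n_pages <- as.integer("8")

# seq_len(0) is empty, whereas 1:0 would count down from 1 to 0.
urls <- paste0(path, "?str=", seq_len(n_pages), "-100")
urls
```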

To get the HTML structure of each page in the list, map is a useful function.

html_list <- url %>% 
  map(html_session) %>% 
  map(read_html)
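When many pages are fetched in one go, a single failed request stops the whole map. One possible safeguard, sketched here with purrr's possibly(), returns NULL for a page that cannot be read so that the rest still load. To keep the sketch runnable offline, the inputs are inline HTML strings plus one deliberately missing file (all made up); with real URLs, a short Sys.sleep() between requests would also be polite.

```r
library(purrr)
library(rvest)

# Stand-ins for the page URLs: two inline HTML strings and one missing file.
inputs <- list(
  "<p class='x'>page one</p>",
  "no-such-file.html",
  "<p class='x'>page three</p>"
)

# possibly() wraps read_html so that an error yields NULL instead of aborting.
safe_read <- possibly(read_html, otherwise = NULL)

html_list <- inputs %>%
  map(safe_read) %>%
  compact()   # drop the inputs that could not be read

length(html_list)
```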

This simple function extracts text from a given html document and node (its two input parameters).

getText <- function(html, node) {
  html %>%
    html_nodes(node) %>% 
    html_text(trim = TRUE) %>% 
    # one-column tibble named "value"; replaces the deprecated as.tibble()
    enframe(name = NULL)
}

Map map map

# initiate df by extracting the car price; the first match on each page is skipped
df <- html_list %>% map(getText, ".fs-tluste") %>%  map(~.[2:nrow(.),]) %>% bind_rows()

df <- df %>%
  rename(cost = value) %>% 
  # extract engine
  mutate(model = html_list %>% map(getText, ".motor") %>% unlist) %>% 
  # extract mileage
  mutate(mileage = html_list %>% map(getText, ".najeto") %>% unlist) %>% 
  # extract year of production
  mutate(year = html_list %>% map(getText, ".rok_vyroby") %>% unlist)
df

## # A tibble: 772 x 4
##    cost       model                    mileage year 
##    <chr>      <chr>                    <chr>   <chr>
##  1 125 122 Kč benzin, 42 kW            39 tkm  1984 
##  2 129 999 Kč benzin, 1 197 ccm, 63 kW 276 tkm 2014 
##  3 134 400 Kč benzin, 1 197 ccm, 63 kW 119 tkm 2014 
##  4 149 900 Kč benzin, 1 198 ccm, 55 kW 87 tkm  2013 
##  5 155 000 Kč benzin, 1 197 ccm, 63 kW 153 tkm 2015 
##  6 159 000 Kč benzin, 1 197 ccm, 63 kW 156 tkm 2014 
##  7 159 000 Kč benzin, 1 197 ccm, 63 kW 132 tkm 2013 
##  8 160 000 Kč benzin, 1 197 ccm, 63 kW 122 tkm 2013 
##  9 164 000 Kč benzin, 1 197 ccm, 63 kW 102 tkm 2013 
## 10 168 000 Kč benzin, 1 197 ccm, 66 kW 197 tkm 2015 
## # ... with 762 more rows
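Extracting each column separately and gluing them together works as long as every listing has every field; if one car lacks, say, a mileage, the columns end up with different lengths or shift against each other. A possible alternative, sketched below, selects one node per listing first and then pulls each field with html_node(), which gives NA where a field is missing. The div.car container and the markup are assumptions for illustration (I have not checked what the real page uses); only the .fs-tluste and .najeto classes come from the code above.

```r
library(rvest)
library(tibble)

# Hypothetical markup: two listings, the second one without a mileage.
page <- read_html('
<div class="car"><span class="fs-tluste">129 999 CZK</span><span class="najeto">276 tkm</span></div>
<div class="car"><span class="fs-tluste">134 400 CZK</span></div>')

# One node per listing (the container selector is an assumption).
listings <- html_nodes(page, "div.car")

# html_node() returns exactly one result per listing, NA when nothing matches.
field <- function(nodes, css) {
  nodes %>% html_node(css) %>% html_text(trim = TRUE)
}

cars <- tibble(
  cost    = field(listings, ".fs-tluste"),
  mileage = field(listings, ".najeto")
)
cars
```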

Here we have the cars of interest (all Škoda Rapid gasoline variants). Let's break the columns down further using a few regular expressions.

final <- df %>% 
  # select numbers only
  mutate(cost = str_extract(cost, "^[0-9]+ [0-9]+")) %>%
  # replace white space
  mutate(cost = str_replace_all(cost, "\\p{WHITE_SPACE}", "")) %>%
  # conversion to double
  mutate(cost = cost %>% as.numeric()) %>%
  # extract power
  mutate(kW = str_extract(model, "[0-9]{2} kW")) %>% 
  # get rid of "kW"
  mutate(kW = str_extract(kW, "[0-9]{2}")) %>% 
  # conversion to factor
  mutate(kW = kW %>% as.factor()) %>% 
  # extract mileage 
  mutate(mileage = str_extract(mileage, "^[0-9]+")) %>%
  # conversion to double
  mutate(mileage = mileage %>% as.numeric() * 1000) %>% 
  select(model, mileage, cost, kW, year)
final

## # A tibble: 772 x 5
##    model                    mileage   cost kW    year 
##    <chr>                      <dbl>  <dbl> <fct> <chr>
##  1 benzin, 42 kW              39000 125122 42    1984 
##  2 benzin, 1 197 ccm, 63 kW  276000 129999 63    2014 
##  3 benzin, 1 197 ccm, 63 kW  119000 134400 63    2014 
##  4 benzin, 1 198 ccm, 55 kW   87000 149900 55    2013 
##  5 benzin, 1 197 ccm, 63 kW  153000 155000 63    2015 
##  6 benzin, 1 197 ccm, 63 kW  156000 159000 63    2014 
##  7 benzin, 1 197 ccm, 63 kW  132000 159000 63    2013 
##  8 benzin, 1 197 ccm, 63 kW  122000 160000 63    2013 
##  9 benzin, 1 197 ccm, 63 kW  102000 164000 63    2013 
## 10 benzin, 1 197 ccm, 66 kW  197000 168000 66    2015 
## # ... with 762 more rows
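The patterns above fit this data set, but they assume a price with exactly two digit groups and a two-digit power figure. As a sketch of a more tolerant variant, the snippet below runs on a few made-up strings in the same format: it strips every non-digit from the price and uses a lookahead to take all digits in front of " kW". The input values and the *_raw names are invented for the example.

```r
library(stringr)

# Made-up values in the same format as the scraped columns.
cost_raw    <- c("129 999 Kč", "1 250 000 Kč")
model_raw   <- c("benzin, 1 197 ccm, 63 kW", "benzin, 999 ccm, 110 kW")
mileage_raw <- c("276 tkm", "9 tkm")

# Drop everything that is not a digit, so any number of digit groups works.
cost <- cost_raw %>% str_remove_all("[^0-9]") %>% as.numeric()

# Take the digits directly in front of " kW", however many there are.
kW <- model_raw %>% str_extract("[0-9]+(?= kW)") %>% as.integer()

# "tkm" means thousands of kilometres.
mileage <- mileage_raw %>% str_extract("^[0-9]+") %>% as.numeric() * 1000
```

With the original pattern "^[0-9]+ [0-9]+", the second price would be cut to "1 250", and "[0-9]{2} kW" would read 110 kW as 10 kW.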

Now the data are ready to be plotted.

Plot plot plot

final %>% 
  ggplot(aes(x = mileage, y = cost, color = kW)) +
  geom_point() + 
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma) + 
  coord_cartesian(xlim = c(0,200000), ylim = c(100000, 400000)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(title = "Škoda Rapid (gasoline)", subtitle = "based on www.tipcars.com", x = "mileage [km]", y = "cost [Kč]")

Focused on my car model only:

final %>% 
  filter(kW %in% c("77")) %>% 
  
  ggplot(aes(x = mileage, y = cost, color = kW)) +
  geom_point() + 
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma) + 
  # coord_cartesian(xlim = c(0,200000), ylim = c(100000, 400000)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(title = "Škoda Rapid 1.2TSI 77kW", subtitle = "based on www.tipcars.com", x = "mileage [km]", y = "cost [Kč]")

So, what is the value? We need a bit of linear regression..

To get a single value, we have to calculate the regression coefficients.

coeff <- final %>% 
  filter(kW %in% c("77")) %>% 
  lm(cost~mileage, data = .) %>% 
  tidy()
coeff

## # A tibble: 2 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 275560.    8747.         31.5  2.07e-23
## 2 mileage         -0.540    0.0910     -5.94 2.16e- 6

From this we can infer that with each kilometre on the road the car loses about 0.54 Kč of its value (~2 euro cents).

So, assuming my car has a mileage of around 100,000 km, its current value is around:

my_mileage = 100000
y = coeff$estimate[1] + (my_mileage * coeff$estimate[2])
paste(round(y), "Kč")

## [1] "221534 Kč"
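The same number can also be obtained from the fitted model with predict(), which has the advantage of giving an interval around the estimate. The sketch below uses a handful of made-up listings only so that it runs on its own; the data frame, the names fit, new_car and pred, and the resulting numbers are not from the scrape above.

```r
# Made-up listings, only to make the sketch runnable; not data from the post.
cars <- data.frame(
  mileage = c(20000, 60000, 90000, 120000, 150000, 180000),
  cost    = c(265000, 245000, 228000, 210000, 196000, 176000)
)

fit <- lm(cost ~ mileage, data = cars)

# Point estimate plus a 95% prediction interval for a car with 100,000 km.
new_car <- data.frame(mileage = 100000)
pred <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)
pred

# The point estimate equals the manual intercept + slope * mileage calculation.
manual <- unname(coef(fit)[1] + coef(fit)[2] * 100000)
all.equal(unname(pred[1, "fit"]), manual)
```

The prediction interval is a reminder that individual asking prices scatter quite a bit around the regression line, so the single value is a rough guide rather than an exact price.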

Or visually:

final %>% 
  filter(kW %in% c("77")) %>% 
  
  ggplot(aes(x = mileage, y = cost, color = kW)) +
  geom_point() + 
  scale_x_continuous(labels = comma) + 
  scale_y_continuous(labels = comma) + 
  # constants go outside aes(); annotate() draws the label once instead of once per row
  geom_hline(yintercept = y) +
  geom_vline(xintercept = my_mileage) + 
  annotate("label", x = my_mileage, y = y, label = paste(round(y), "Kč"), hjust = 0, vjust = 0) + 
  # coord_cartesian(xlim = c(0,200000), ylim = c(100000, 400000)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(title = "Škoda Rapid 1.2TSI 77kW", subtitle = "based on www.tipcars.com", x = "mileage [km]", y = "cost [Kč]")

OK, that's it. Hope you enjoyed it :)

What is the current value of my car? (web scrape basics)
