Part One: Managing Data

Back to Home

Introduction

This notebook forms the first of two outputs in the analysis of the TfL London Tube network. It is primarily concerned with the transformation and manipulation of the required datasets, whereas Part Two focuses on visualisation and algorithms.

Due to the complexity of the dataset being investigated, there are a few external references to analysis files in this repository which go into greater detail on the transformations and data extraction and which do not form part of this notebook. Further, the data manipulation code shown here is wholly R; however, when it came to extracting the data in practice, it was re-written in Python for performance reasons. The R code therefore illustrates the initial extraction and data-discovery thought process rather than the most efficient (or recommended) extraction process.

APIs and Internet Data

We start the project with a reproducible way of pulling the dataset from the API download link. The results shown in this notebook relate to the API feed as at 12th December 2017.

To access it, download the zip file at the link below

# Use paths relative to RStudio Project
root.dir <- rprojroot::find_rstudio_root_file()
# Set where we want the extracted files to end up
data.dir <- file.path(root.dir, "1_data/1_1_raw_data")
dir.create(data.dir, showWarnings = FALSE)
# Load app_id and app_key variables from a local passwords file
source(file.path(root.dir, "2_analysis/r/tfl-developer-passwords.R"))
stopifnot(exists("app_id"), exists("app_key"))
dataset_url <- "http://data.tfl.gov.uk/tfl/syndication/feeds/journey-planner-timetables.zip"
dataset_url <- paste0(dataset_url, "?app_id=", app_id, "&app_key=", app_key)
# mode = "wb" ensures the zip is downloaded as a binary file on Windows
download.file(dataset_url, file.path(data.dir, "timetables.zip"), mode = "wb")

Note: You may need to register for an API key and update the file tfl-developer-passwords.R to access the zip file. In the event that you cannot gain access, there is a small demo sample here, or you can use the full extract provided with this repository.

Unpack the download once to split out the train/ferry timetables from the bus timetables and again to unpack the tube/DLR timetables.

unzip(file.path(data.dir, "timetables.zip"),
      exdir = file.path(data.dir, "timetables"))
unzip(list.files(file.path(data.dir, "timetables"),
                 pattern = "LULDLR", full.names = TRUE),
      exdir = file.path(data.dir, "timetables/data"))
data.files <- list.files(file.path(data.dir, "timetables/data"),
                         pattern = "tfl_1-[^.]+\\.xml", full.names = TRUE)
head(basename(data.files))
## [1] "tfl_1-BAK-_-y05-430200.xml"  "tfl_1-CEN-_-y05-1495234.xml"
## [3] "tfl_1-CEN-_-y05-1495235.xml" "tfl_1-CEN-_-y05-1495236.xml"
## [5] "tfl_1-CEN-_-y05-690544.xml"  "tfl_1-CIR-_-y05-465922.xml"

The result is 600-800MB of XML files containing the most up-to-date timetable information for the London Underground and DLR services. The London Underground files we require are the XML files whose names start with “tfl_1”, as the files with other prefixes are ferry, DLR or TfL Rail timetables.

Data Exploration and Transformation

In this section, we’ll investigate the structure of the API feed, ways to extract the important data, how to manipulate it into the required shape, and any necessary calculations to add. This will be demonstrated for one of the XML files in a manner which can be generalised to any other timetable.

XML Metadata

Begin by reading in the first XML file.

NB: The xml2, XML and xmltools packages are not attached via library; instead :: is used for each function call to make clear which package each function belongs to.

library(magrittr)
doc <- data.files[1] %>%
xml2::read_xml()
doc
## {xml_document}
## <TransXChange CreationDateTime="2018-01-09T11:32:22.9783722Z" ModificationDateTime="2018-01-09T11:32:22.993972Z" Modification="new" FileName="tfl_1-BAK-_-y05-430200.xml" RevisionNumber="3" SchemaVersion="2.1" lang="en" schemaLocation="http://www.transxchange.org.uk/ http://www.transxchange.org.uk/schema/2.1/TransXChange_general.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.transxchange.org.uk/">
## [1] <NptgLocalities>\n  <AnnotatedNptgLocalityRef>\n    <NptgLocalityRef ...
## [2] <StopPoints>\n  <StopPoint CreationDateTime="2013-08-29T00:00:00">\n ...
## [3] <RouteSections>\n  <RouteSection id="RS_1-BAK-_-y05-430200-O-1">\n   ...
## [4] <Routes>\n  <Route id="R_1-BAK-_-y05-430200-O-1">\n    <PrivateCode> ...
## [5] <JourneyPatternSections>\n  <JourneyPatternSection id="JPS_1-BAK-_-y ...
## [6] <Operators>\n  <Operator id="OId_LUL">\n    <OperatorCode>LUL</Opera ...
## [7] <Services>\n  <Service>\n    <ServiceCode>1-BAK-_-y05-430200</Servic ...
## [8] <VehicleJourneys>\n  <VehicleJourney>\n    <PrivateCode>tfl-1-BAK-_- ...

This metadata contains two important pieces of information:

  1. The top-level node contains eight distinct child nodes which appear to be completely disparate datasets, all relating to this particular file.

  2. The namespace of the XML file is non-standard. The schemaLocation attribute suggests that the XML structure adheres to the TransXChange schema provided by the UK Department for Transport for timetable data. We can configure this namespace for use with any future XPaths (the alternative is to strip it from the file entirely; however, this takes additional processing time).

namespace <- c(txc = "http://www.transxchange.org.uk/")

Once we set up an extraction procedure for this schema and namespace it can then be applied to any timetable that adheres to DfT standards.
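
Before moving on, we can sanity-check the namespace mapping with a quick XPath query (a sketch only; the output is not shown here):

# Count the StopPoint nodes in this file using the txc namespace prefix
doc %>%
  xml2::xml_find_all(".//txc:StopPoints/txc:StopPoint", ns = namespace) %>%
  length()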

Dataset Structure

To quickly get an idea of the structure we can traverse the document as a list:

xml_list <- doc %>%
xml2::as_list()

and inspect the structure of a given observation for each of the 8 tables along with an example of the data that each XML node (or attribute) contains.

str(xml_list$NptgLocalities$AnnotatedNptgLocalityRef)
## List of 2
##  $ NptgLocalityRef:List of 1
##   ..$ : chr "N0077657"
##  $ LocalityName   :List of 1
##   ..$ : chr "Elephant & Castle"
str(xml_list$StopPoints$StopPoint)
## List of 6
##  $ AtcoCode             :List of 1
##   ..$ : chr "9400ZZLUEAC1"
##  $ Descriptor           :List of 1
##   ..$ CommonName:List of 1
##   .. ..$ : chr "Elephant & Castle Station"
##  $ Place                :List of 2
##   ..$ NptgLocalityRef:List of 1
##   .. ..$ : chr "N0077657"
##   ..$ Location       :List of 2
##   .. ..$ Easting :List of 1
##   .. .. ..$ : chr "531920"
##   .. ..$ Northing:List of 1
##   .. .. ..$ : chr "179140"
##   .. ..- attr(*, "Precision")= chr "1m"
##  $ StopClassification   :List of 2
##   ..$ StopType :List of 1
##   .. ..$ : chr "RPL"
##   ..$ OffStreet:List of 1
##   .. ..$ Rail:List of 1
##   .. .. ..$ Platform: list()
##  $ AdministrativeAreaRef:List of 1
##   ..$ : chr "000"
##  $ Notes                :List of 1
##   ..$ : chr "Elephant & Castle Station"
##  - attr(*, "CreationDateTime")= chr "2013-08-29T00:00:00"
str(xml_list$RouteSections$RouteSection[[1]])
## List of 4
##  $ From     :List of 1
##   ..$ StopPointRef:List of 1
##   .. ..$ : chr "9400ZZLUEAC1"
##  $ To       :List of 1
##   ..$ StopPointRef:List of 1
##   .. ..$ : chr "9400ZZLULBN1"
##  $ Distance :List of 1
##   ..$ : chr "852"
##  $ Direction:List of 1
##   ..$ : chr "outbound"
##  - attr(*, "id")= chr "RL_1-BAK-_-y05-430200-O-1-1"
str(xml_list$Routes$Route)
## List of 3
##  $ PrivateCode    :List of 1
##   ..$ : chr "R_1-BAK-_-y05-430200-O-1"
##  $ Description    :List of 1
##   ..$ : chr "Elephant & Castle Station - Queen's Park Station (London)"
##  $ RouteSectionRef:List of 1
##   ..$ : chr "RS_1-BAK-_-y05-430200-O-1"
##  - attr(*, "id")= chr "R_1-BAK-_-y05-430200-O-1"
str(xml_list$JourneyPatternSections$JourneyPatternSection[[1]])
## List of 4
##  $ From        :List of 3
##   ..$ Activity    :List of 1
##   .. ..$ : chr "pickUp"
##   ..$ StopPointRef:List of 1
##   .. ..$ : chr "9400ZZLUEAC1"
##   ..$ TimingStatus:List of 1
##   .. ..$ : chr "PTP"
##   ..- attr(*, "SequenceNumber")= chr "1"
##  $ To          :List of 4
##   ..$ WaitTime    :List of 1
##   .. ..$ : chr "PT1M"
##   ..$ Activity    :List of 1
##   .. ..$ : chr "pickUpAndSetDown"
##   ..$ StopPointRef:List of 1
##   .. ..$ : chr "9400ZZLULBN1"
##   ..$ TimingStatus:List of 1
##   .. ..$ : chr "PTP"
##   ..- attr(*, "SequenceNumber")= chr "2"
##  $ RouteLinkRef:List of 1
##   ..$ : chr "RL_1-BAK-_-y05-430200-O-1-1"
##  $ RunTime     :List of 1
##   ..$ : chr "PT1M"
##  - attr(*, "id")= chr "JPL_1-BAK-_-y05-430200-1-O-1-2"
str(xml_list$Operators$Operator)
## List of 4
##  $ OperatorCode         :List of 1
##   ..$ : chr "LUL"
##  $ OperatorShortName    :List of 1
##   ..$ : chr "London Underground"
##  $ OperatorNameOnLicence:List of 1
##   ..$ : chr "London Underground"
##  $ TradingName          :List of 1
##   ..$ : chr "London Underground"
##  - attr(*, "id")= chr "OId_LUL"
str(xml_list$Services$Service$StandardService$JourneyPattern)
## List of 4
##  $ Direction                :List of 1
##   ..$ : chr "outbound"
##  $ Operational              :List of 1
##   ..$ VehicleType:List of 2
##   .. ..$ VehicleTypeCode:List of 1
##   .. .. ..$ : chr "UT"
##   .. ..$ Description    :List of 1
##   .. .. ..$ : chr "Underground Train"
##  $ RouteRef                 :List of 1
##   ..$ : chr "R_1-BAK-_-y05-430200-O-1"
##  $ JourneyPatternSectionRefs:List of 1
##   ..$ : chr "JPS_1-BAK-_-y05-430200-1-1-O"
##  - attr(*, "id")= chr "JP_1-BAK-_-y05-430200-1-O-1"
str(xml_list$VehicleJourneys$VehicleJourney)
## List of 8
##  $ PrivateCode       :List of 1
##   ..$ : chr "tfl-1-BAK-_-y05-430200-1-UP"
##  $ Operational       :List of 1
##   ..$ VehicleType:List of 2
##   .. ..$ VehicleTypeCode:List of 1
##   .. .. ..$ : chr "UT"
##   .. ..$ Description    :List of 1
##   .. .. ..$ : chr "Underground Train"
##  $ OperatingProfile  :List of 2
##   ..$ RegularDayType      :List of 1
##   .. ..$ DaysOfWeek:List of 1
##   .. .. ..$ Sunday: list()
##   ..$ BankHolidayOperation:List of 1
##   .. ..$ DaysOfNonOperation:List of 1
##   .. .. ..$ AllBankHolidays: list()
##  $ VehicleJourneyCode:List of 1
##   ..$ : chr "VJ_1-BAK-_-y05-430200-1-UP"
##  $ ServiceRef        :List of 1
##   ..$ : chr "1-BAK-_-y05-430200"
##  $ LineRef           :List of 1
##   ..$ : chr "1-BAK-_-y05-430200"
##  $ JourneyPatternRef :List of 1
##   ..$ : chr "JP_1-BAK-_-y05-430200-1-O-1"
##  $ DepartureTime     :List of 1
##   ..$ : chr "21:29:00"
# Show service data without the child node JourneyPatterns
service <- xml_list$Services$Service
service$StandardService <- NULL
str(service)
## List of 9
##  $ ServiceCode          :List of 1
##   ..$ : chr "1-BAK-_-y05-430200"
##  $ PrivateCode          :List of 1
##   ..$ : chr "1-BAK-_-y05-430200"
##  $ Lines                :List of 1
##   ..$ Line:List of 1
##   .. ..$ LineName:List of 1
##   .. .. ..$ : chr "Bakerloo"
##   .. ..- attr(*, "id")= chr "1-BAK-_-y05-430200"
##  $ OperatingPeriod      :List of 2
##   ..$ StartDate:List of 1
##   .. ..$ : chr "2018-01-06"
##   ..$ EndDate  :List of 1
##   .. ..$ : chr "2018-12-23"
##  $ OperatingProfile     :List of 3
##   ..$ RegularDayType      :List of 1
##   .. ..$ DaysOfWeek:List of 1
##   .. .. ..$ MondayToSunday: list()
##   ..$ SpecialDaysOperation:List of 1
##   .. ..$ DaysOfOperation:List of 1
##   .. .. ..$ DateRange:List of 3
##   .. .. .. ..$ StartDate:List of 1
##   .. .. .. .. ..$ : chr "2018-04-02"
##   .. .. .. ..$ EndDate  :List of 1
##   .. .. .. .. ..$ : chr "2018-04-02"
##   .. .. .. ..$ Note     :List of 1
##   .. .. .. .. ..$ : chr "East Monday"
##   ..$ BankHolidayOperation:List of 2
##   .. ..$ DaysOfOperation   :List of 4
##   .. .. ..$ GoodFriday                      : list()
##   .. .. ..$ LateSummerBankHolidayNotScotland: list()
##   .. .. ..$ MayDay                          : list()
##   .. .. ..$ SpringBank                      : list()
##   .. ..$ DaysOfNonOperation: list()
##  $ RegisteredOperatorRef:List of 1
##   ..$ : chr "OId_LUL"
##  $ StopRequirements     :List of 1
##   ..$ NoNewStopsRequired: list()
##  $ Mode                 :List of 1
##   ..$ : chr "underground"
##  $ Description          :List of 1
##   ..$ : chr "Elephant & Castle - Queen's Park - Harrow & Wealdstone"

Four of the tables contain crucial information beyond Reference fields and linking identifiers:

Stop Points

A tube station at which a given vehicle can stop.

Journey Pattern Sections

A trip link between two Stop Points (From and To) with a given RunTime and optional WaitTime. These combine to an overall Journey Pattern.

Services

Provides information about the operating rhythm of the timetable, such as its operating period and the days of the week on which it runs.

VehicleJourneys

A tube vehicle’s Journey Pattern and the time of departure from its origin station.

It’s also worth noting that the StopPoints and NptgLocalities tables contain duplicate information across every Service (XML file) for a given line, as well as across lines that share the same stops. Every other table has unique information per XML file because the routes, trip links, vehicles and journeys are all defined per Service. This means that the timetable for Christmas Day has a completely different set of trip links to that of Boxing Day (despite the trains physically running exactly the same journey patterns) purely because there is a slight variation in the frequency of Vehicle Journeys between the dates. This leads to a vast amount of duplicated information, which can be cut down by a factor of up to 40x by extracting only a subset of timetables of interest.

For a comprehensive description of the schema, or simply to appreciate the scale and complexity of the dataset, see the official DfT documentation here. Page 63 in particular details how many of the elements discussed above relate to one another, which is how the tables will be joined in a PostgreSQL database later in this notebook.

Subsetting

As seen in the previous section, the entire file shown is only relevant for one day of the year. By first filtering the files based on the Operating Period and Operating Profile, it’s possible to only read in the files that relate to, say, a Monday, New Year’s Day or 17th January.

Here is a table containing every file along with its starting and ending operating period dates:

library(data.table)
xpath.op_period <- ".//txc:Services/txc:Service/txc:OperatingPeriod/*"
dates <- sapply(data.files, function(xml_file) {
  XML::xmlParse(xml_file) %>%
    XML::xmlRoot() %>%
    XML::xpathApply(xpath.op_period, namespaces = namespace,
                    fun = function(y) XML::xmlValue(y) %>% as.Date)
}, simplify = FALSE) %>%
  {data.table(names(.), rbindlist(.))}
names(dates) <- c("File", "StartDate", "EndDate")
dates[["File"]] <- basename(dates[["File"]])
head(dates, 10)
##                            File  StartDate    EndDate
##  1:  tfl_1-BAK-_-y05-430200.xml 2018-01-06 2018-12-23
##  2: tfl_1-CEN-_-y05-1495234.xml 2018-04-02 2018-04-02
##  3: tfl_1-CEN-_-y05-1495235.xml 2018-05-07 2018-05-07
##  4: tfl_1-CEN-_-y05-1495236.xml 2018-05-28 2018-05-28
##  5:  tfl_1-CEN-_-y05-690544.xml 2018-01-06 2018-12-23
##  6:  tfl_1-CIR-_-y05-465922.xml 2018-01-06 2018-12-23
##  7: tfl_1-CIR-_-y05-9176502.xml 2018-01-20 2018-01-21
##  8: tfl_1-CIR-_-y05-9176503.xml 2018-02-17 2018-02-18
##  9: tfl_1-CIR-_-y05-9176504.xml 2018-02-11 2018-02-11
## 10: tfl_1-CIR-_-y05-9176505.xml 2018-03-10 2018-03-11

The next step is to filter this dataframe based on whether a date of interest falls within the time interval of a file:

library(lubridate)
timetable_date <- lubridate::today()
relevant_dates <- dates %$%
  # Test whether the timetable_date falls within StartDate to EndDate
  interval(StartDate, EndDate) %>%
  {timetable_date %within% .} %>%
  # Subset dates to only these records
  dates[.]
head(relevant_dates, 10)
##                            File  StartDate    EndDate
##  1:  tfl_1-BAK-_-y05-430200.xml 2018-01-06 2018-12-23
##  2:  tfl_1-CEN-_-y05-690544.xml 2018-01-06 2018-12-23
##  3:  tfl_1-CIR-_-y05-465922.xml 2018-01-06 2018-12-23
##  4: tfl_1-DIS-_-y05-1490111.xml 2018-01-06 2018-12-23
##  5:  tfl_1-HAM-_-y05-465922.xml 2018-01-06 2018-12-23
##  6:  tfl_1-JUB-_-y05-520522.xml 2018-01-06 2018-12-23
##  7: tfl_1-MET-_-y05-3400500.xml 2018-01-06 2018-12-23
##  8:  tfl_1-NTN-_-y05-560800.xml 2018-01-06 2018-01-28
##  9: tfl_1-PIC-_-y05-1705800.xml 2018-01-06 2018-02-02
## 10:  tfl_1-PIC-_-y05-580800.xml 2018-01-06 2018-12-23

Now we’ve cut down the 84 files to roughly a quarter depending on the chosen timetable_date.
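
As a quick sketch of how this subsetting feeds into the extraction step, the file list itself can be restricted to the relevant files (the data.files.relevant name is purely illustrative):

# Keep only the XML files whose operating period covers timetable_date
data.files.relevant <- data.files[basename(data.files) %in% relevant_dates[["File"]]]
length(data.files.relevant)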

The other option (to filter based on the days of the week) is unfortunately even more convoluted. It should be as simple as filtering on the tags underneath the Journey Operating Profile; however, here is the unique set of tags found across the files:

xpath.days_of_week <- ".//txc:VehicleJourneys/txc:VehicleJourney/txc:OperatingProfile/txc:RegularDayType/txc:DaysOfWeek/*"
daysofweek <- sapply(data.files, function(xml_file) {
  XML::xmlParse(xml_file) %>%
    XML::xmlRoot() %>%
    XML::xpathApply(xpath.days_of_week, namespaces = namespace, fun = XML::xmlName)
}, simplify = FALSE) %>%
  unlist %>%
  unique
daysofweek
## [1] "Sunday"         "Monday"         "Saturday"       "MondayToFriday"
## [5] "Wednesday"      "Thursday"       "Friday"         "Tuesday"
## [9] "Weekend"

The “days of the week” include aggregate values such as “MondayToFriday” and “Weekend”, as well as “MondayToSaturday” and “MondayToSunday” under the Service Operating Profile. We would therefore need a mapping between a given date and all of the possible “days of the week” groups it could fall under. As an example of how this can be accounted for, see this PostgreSQL table; a rough R sketch of the idea follows below.
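
As a rough R illustration only (the linked PostgreSQL table is what is actually used), such a mapping might look like this:

# Sketch: map a date to every "days of the week" group it could fall under
day_groups_for_date <- function(date) {
  day <- weekdays(date)  # e.g. "Wednesday" (locale-dependent)
  groups <- day
  if (day %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
    groups <- c(groups, "MondayToFriday", "MondayToSaturday")
  if (day == "Saturday")
    groups <- c(groups, "MondayToSaturday", "Weekend")
  if (day == "Sunday")
    groups <- c(groups, "Weekend")
  c(groups, "MondayToSunday")
}
day_groups_for_date(as.Date("2017-12-20"))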

Extraction

Now that we have a good understanding of what each file contains, we can attempt to extract the key information from one of them before generalising this process for all files. As we’ve already seen, the tree structure of these XML files is particularly complex, making one-size-fits-all functions such as xmlToDataFrame rather unhelpful. The level of control required is really only achievable by defining XPaths for the location of each piece of information we need and then looping through the document extracting every instance of each path. On the plus side, the files are well-formed and do not contain the kinds of errors we might find in malformed web data.

Firstly, we’ll want to retrieve the parent XPaths for each of the nodes containing useful information. This is most simply done with the helper function xml_get_paths from the xmltools package.

# Get top level nodeset for first file
doc <- data.files[1] %>%
  xml2::read_xml() %>%
  xml2::xml_ns_strip()
nodeset <- doc %>%
  xml2::xml_children()
# Get the xpaths to parents of a terminal node
terminal_xpaths <- nodeset %>%
  xmltools::xml_get_paths(only_terminal_parent = TRUE) %>%
  # Filter to unique paths
  unlist() %>%
  unique()
head(terminal_xpaths)
## [1] "/TransXChange/NptgLocalities/AnnotatedNptgLocalityRef"
## [2] "/TransXChange/StopPoints/StopPoint"
## [3] "/TransXChange/StopPoints/StopPoint/Descriptor"
## [4] "/TransXChange/StopPoints/StopPoint/Place"
## [5] "/TransXChange/StopPoints/StopPoint/Place/Location"
## [6] "/TransXChange/StopPoints/StopPoint/StopClassification"

These are the XPaths to every possible terminal branch of the tree; however, a lot of the information isn’t particularly useful. For some of the tables we’re only interested in specific sections. For example, the Operational tag isn’t interesting as it contains information about the vehicle, which in this case will always be “Underground Train”. It’s a similar story for the Operator, which is always “London Underground”.

Therefore we can define a subset of the terminal_xpaths grouped by the table to which they each relate.

required_xpaths <- list(
NptgLocalities = "/TransXChange/NptgLocalities/AnnotatedNptgLocalityRef"
,StopPoints = c("/TransXChange/StopPoints/StopPoint",
"/TransXChange/StopPoints/StopPoint/Descriptor",
"/TransXChange/StopPoints/StopPoint/Place",
"/TransXChange/StopPoints/StopPoint/Place/Location")
,RouteLinks = c("/TransXChange/RouteSections/RouteSection/RouteLink",
"/TransXChange/RouteSections/RouteSection/RouteLink/From",
"/TransXChange/RouteSections/RouteSection/RouteLink/To")
,Routes = "/TransXChange/Routes/Route"
,JourneyPatternTimingLinks = c("/TransXChange/JourneyPatternSections/JourneyPatternSection/JourneyPatternTimingLink",
"/TransXChange/JourneyPatternSections/JourneyPatternSection/JourneyPatternTimingLink/From",
"/TransXChange/JourneyPatternSections/JourneyPatternSection/JourneyPatternTimingLink/To")
,Services = "/TransXChange/Services/Service"
,JourneyPatterns = "/TransXChange/Services/Service/StandardService/JourneyPattern"
,VehicleJourneys = "/TransXChange/VehicleJourneys/VehicleJourney"
)

This way, when it comes to extracting the data, all of the information found under the three RouteLink paths of terminal_xpaths will be combined into a single RouteLinks table.

Now it is just a matter of looping over this list, finding all of the matches for each path, extracting the underlying data and combining it into a single dataset for each of the tables.

build_tfl <- function(doc, terminal_xpaths, required_terminal_xpaths) {
  purrr::map(required_terminal_xpaths, function(xpaths) {
    # Use the provided paths
    xpaths %>%
      # Find all matches of each path
      lapply(xml2::xml_find_all, x = doc) %>%
      # Extract underlying data
      purrr::map(xmltools::xml_dig_df) %>%
      # Combine extracted data into one dataframe
      purrr::map(dplyr::bind_rows) %>%
      dplyr::bind_cols()
  })
}
tfl <- build_tfl(doc, terminal_xpaths, required_xpaths)
str(tfl, 1)
## List of 8
##  $ NptgLocalities           :Classes 'tbl_df', 'tbl' and 'data.frame':   25 obs. of  2 variables:
##  $ StopPoints               :Classes 'tbl_df', 'tbl' and 'data.frame':   43 obs. of  7 variables:
##  $ RouteLinks               :Classes 'tbl_df', 'tbl' and 'data.frame':   248 obs. of  4 variables:
##  $ Routes                   :Classes 'tbl_df', 'tbl' and 'data.frame':   16 obs. of  3 variables:
##  $ JourneyPatternTimingLinks:Classes 'tbl_df', 'tbl' and 'data.frame':   4743 obs. of  9 variables:
##  $ Services                 :Classes 'tbl_df', 'tbl' and 'data.frame':   1 obs. of  5 variables:
##  $ JourneyPatterns          :Classes 'tbl_df', 'tbl' and 'data.frame':   234 obs. of  3 variables:
##  $ VehicleJourneys          :Classes 'tbl_df', 'tbl' and 'data.frame':   1947 obs. of  6 variables:

The result is the tfl list which contains the eight tables corresponding to the top-level nodes of this single XML file.

Due to the flexibility of XML and the TransXChange schema however, there’s still a lot missing…

  1. A number of important fields are found in the “id” attribute of certain tags, not in the tag’s text as we might expect.
  2. The stop sequence numbers are contained in a “SequenceNumber” attribute.
  3. Some tables have a parent which contains join information. For example, the JourneyPatterns table is defined at the level of each Journey Pattern Section; however, we are missing the foreign key Journey Pattern ID, as this sits in the parent node of each Section and therefore cannot be parsed with the same XPath.

We can define a function retrieve_additional_fields which solves these three cases and call it after build_tfl for each document. Due to its length and complexity it is not shown in full here (a brief sketch of the first case is given below), but it can be found in the complete analysis file.

library(magrittr)
tfl %<>% retrieve_additional_fields(doc)
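
To give a flavour of what it does, here is a minimal sketch of the first case only, attaching each Route’s “id” attribute to the Routes table (the RouteID column name is purely illustrative, not the one used by the full function):

# Sketch of case 1: pull the "id" attribute of each Route node
# (the namespace was stripped when doc was re-read above)
tfl$Routes$RouteID <- doc %>%
  xml2::xml_find_all(".//Routes/Route") %>%
  xml2::xml_attr("id")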

We can put the whole extraction process together by calling read_xml, build_tfl and retrieve_additional_fields for each XML file:

extract_file <- function(file){
  doc <- xml2::read_xml(file) %>%
    xml2::xml_ns_strip()
  tfl <- build_tfl(doc, terminal_xpaths, required_xpaths)
  tfl %<>% retrieve_additional_fields(doc)
  tfl
}
str(extract_file(data.files[1]), 1)
## List of 8
##  $ NptgLocalities           :Classes 'tbl_df', 'tbl' and 'data.frame':   25 obs. of  2 variables:
##  $ StopPoints               :Classes 'tbl_df', 'tbl' and 'data.frame':   43 obs. of  7 variables:
##  $ RouteLinks               :Classes 'tbl_df', 'tbl' and 'data.frame':   248 obs. of  6 variables:
##   ..- attr(*, ".internal.selfref")=<externalptr>
##  $ Routes                   :Classes 'tbl_df', 'tbl' and 'data.frame':   16 obs. of  3 variables:
##  $ JourneyPatternTimingLinks:Classes 'tbl_df', 'tbl' and 'data.frame':   4743 obs. of  13 variables:
##   ..- attr(*, ".internal.selfref")=<externalptr>
##  $ Services                 :Classes 'tbl_df', 'tbl' and 'data.frame':   1 obs. of  5 variables:
##  $ JourneyPatterns          :Classes 'tbl_df', 'tbl' and 'data.frame':   234 obs. of  4 variables:
##  $ VehicleJourneys          :Classes 'tbl_df', 'tbl' and 'data.frame':   1947 obs. of  6 variables:

and then loop this function over the entire directory of files:

tfl_all <- purrr::map(data.files, extract_file)

Unfortunately, this line of code takes a very long time to run, so don’t execute it unless you plan on waiting all night. Literally…

This code might be sped up by traversing the nested list structure directly with syntax such as doc[[c("TransXChange", "StopPoints", "StopPoint")]]; however, in its current form it is much slower than the lxml Python library. lxml is also more versatile when it comes to traversing XML trees efficiently and extracting data from nested nodes, parent nodes and attributes as we require.

To cope with this problem, the dataset was instead scraped (capturing even more fields from hard-to-reach places, and roughly 500x faster) with this file, which calls a custom-built TfL Class.

Variable Creation

There are a few crucial variables which we’ll need to add, along with correctly typecasting the current information, before any further analysis. This is best done in R before handing over to a database. Due to the potential scale of the join operations (had we not already subsetted the dataset), additional variables which are calculated across tables will be left to the database stage. This also preserves the normalised nature of the tables within the database.

Latitude and Longitude

To make any use of the station locations, it’s necessary to convert the provided coordinates to a more widely used coordinate system.

The TransXChange website confirms that the “Easting” and “Northing” fields in the StopPoints dataset are British National Grid (BNG) coordinates whereas we require the more popular latitude and longitude.

This can be achieved with rgdal (the R bindings to the Geospatial Data Abstraction Library), which converts between the two coordinate projections via the PROJ.4 library:

library(rgdal)
library(tibble)  # for tibble() and as_tibble()
modify_StopPoints <- function(df){
  # Define proj4 coordinate systems
  bng <- "+init=epsg:27700"
  latlon <- "+proj=longlat +datum=WGS84"
  # Create spatial object using BNG coordinates
  coord.bng <-
    with(df,
         SpatialPointsDataFrame(
           cbind(Place_Location_Easting, Place_Location_Northing) %>%
             as.integer() %>% matrix(ncol = 2),
           data = tibble(AtcoCode,
                         Descriptor_CommonName,
                         Place_NptgLocalityRef),
           proj4string = CRS(bng)))
  # Cast as latlon coordinates
  coord.latlon <- spTransform(coord.bng, CRS(latlon))
  # Replace dataframe with the lat/lon coordinates appended
  df <- as_tibble(coord.latlon)
  names(df) <- c("AtcoCode", "CommonName", "NptgLocalityRef", "Longitude", "Latitude")
  df
}

JourneyTime and DepartureMins

Perhaps the most significant variable is the RunTime of each journey section as it supposedly tells us the travel time between two StopPoints.

table(tfl$JourneyPatternTimingLinks$RunTime)
##
## PT0S PT1M PT2M PT3M PT4M PT5M PT6M PT7M
##  263 2315 1856  269   30    6    1    3

Unfortunately this frequency table shows that the duration is stored as a text field, rounded to the nearest minute (the “M” suffix). Judging by the “PT0S” value, a number of links would also appear to take zero seconds! It turns out, however, that every instance of a zero RunTime has at least 1 minute of WaitTime at the next station.

# Check that there aren't records with both no WaitTime and zero RunTime
tfl$JourneyPatternTimingLinks %>%
  dplyr::filter(is.na(WaitTime) & RunTime == "PT0S") %>%
  nrow() == 0
## [1] TRUE

Therefore, we can convert RunTime and WaitTime to integer minutes by extracting the third character of each value, and build a new variable JourneyTime as the total of RunTime and WaitTime, which we now know will always be at least 1 minute:

modify_JourneyPatternTimingLinks <- function(df){
  within(df, {
    From_SequenceNumber %<>% as.integer
    To_SequenceNumber %<>% as.integer
    # Extract timings from RunTime and WaitTime to create JourneyTime
    RunTime %<>% substr(3, 3) %>% as.integer
    WaitTime %<>% substr(3, 3) %>% as.integer
    JourneyTime <- RunTime + ifelse(is.na(WaitTime), 0, WaitTime)
  })
}

The other key field is the departure time of each train from its origin station, DepartureTime, which is currently stored as text.

head(tfl$VehicleJourneys$DepartureTime)
## [1] "21:29:00" "12:36:00" "13:49:00" "14:59:00" "21:37:00" "16:36:00"

So it looks like this variable is a time stamp which is also rounded to the nearest minute.

Since we are now dealing solely in minutes on a day-by-day basis, it makes the most sense to cast this DepartureTime field as “the number of minutes since midnight”. This makes it less human-readable; however, adding JourneyTime to DepartureTime becomes simple arithmetic and doesn’t rely on date-time typecasting or datediff functions. That said, if we’re using a database that handles date-times well, it doesn’t hurt to correctly cast DepartureTime and keep it in the data frame separately.

modify_VehicleJourneys <- function(df){
  within(df, {
    DepartureTime %<>% as.POSIXct(format = "%T")
    # Extract number of minutes since midnight from DepartureTime
    DepartureMins <- DepartureTime %>%
      {lubridate::minute(.) + 60 * lubridate::hour(.)}
  })
}
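
These functions can then be applied to the relevant tables in the tfl list (a sketch; it assumes the column names produced by retrieve_additional_fields match those used above):

# Apply the cleaning and typecasting functions to the extracted tables
tfl$StopPoints                <- modify_StopPoints(tfl$StopPoints)
tfl$JourneyPatternTimingLinks <- modify_JourneyPatternTimingLinks(tfl$JourneyPatternTimingLinks)
tfl$VehicleJourneys           <- modify_VehicleJourneys(tfl$VehicleJourneys)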

Normalisation and Relational Databases

As found in the previous section, the dataset is already provided in third normal form, making it very easy to push the tables to a database as-is.

DBMS Setup

Given the shape of the data, the queries to be computed in-database are quite advanced: they require cumulative calculations and row offsets by group (window functions), which at the time of writing are not available in SQLite or MySQL.

For this reason, we’ll use the open-source database PostgreSQL for its window functions, though proprietary databases such as Oracle or Microsoft SQL Server also provide this functionality. Thanks to Docker containers, provisioning a clean, lightweight Postgres database on your local machine is one line of code, with no additional setup other than installing Docker itself.

docker run --name postgres_london_tube -p 5432:5432 -d -e POSTGRES_PASSWORD=mysecretpassword postgres:alpine

The alpine image is about 1/10th the size of the official postgres container as it uses a significantly smaller distribution of Linux with fewer of PostgreSQL’s auxiliary features.

Test that the server is running and the connection works by setting up a development database for the project:

library(DBI)
# Connection details for the Docker container above; these assume the default
# "postgres" user and the password set in the docker run command.
# RPostgreSQL::PostgreSQL() would also work as the driver here.
drv <- RPostgres::Postgres()
host <- "localhost"
user <- "postgres"
password <- "mysecretpassword"
con <- dbConnect(drv, host = host, user = user, password = password,
                 dbname = "postgres")
dbSendQuery(con, "CREATE DATABASE londontube_r;")
dbDisconnect(con)
get_con <- function() {
  dbConnect(drv,
            host = host, user = user, password = password,
            dbname = "londontube_r")
}

Build database

We can write a function that loops over all eight tables, sequentially adding them to the database whilst overwriting any existing tables.

First define the primary keys of the tables:

primarykeys <- list(
NptgLocalities = '"NptgLocalityRef"',
StopPoints = '"AtcoCode"',
RouteLinks = '"RouteLinkID"',
Routes = '"PrivateCode"',
JourneyPatternTimingLinks = '"JourneyPatternTimingLinkID"',
Services = '"ServiceCode"',
JourneyPatterns = '"JourneyPatternID"',
VehicleJourneys = '"VehicleJourneyCode"'
)

and upload the data:

create_tables <- function(tablename, tabledata, primary_key_cols, con){
  # Insert data
  dbWriteTable(con, tablename, tabledata, overwrite = TRUE, row.names = FALSE)
  # Add Primary Keys
  query <- sprintf(
    "ALTER TABLE %1$s
     ADD PRIMARY KEY (%2$s);",
    tablename,                               # Current table name
    paste(primary_key_cols, collapse = ", ") # PK columns
  )
  # Run queries
  dbExecute(con, query)
}
# Build database with the same tablenames as input data
con <- get_con()
tablenames <- tolower(names(tfl))
table_creation <- purrr::pmap(list(tablenames, tfl, primarykeys),
                              create_tables, con)
dbDisconnect(con)
## [1] TRUE

Finally we can check that the tables were loaded successfully:

con <- get_con()
dbListTables(con)
## [1] "nptglocalities"            "stoppoints"
## [3] "routelinks"                "routes"
## [5] "journeypatterntiminglinks" "services"
## [7] "journeypatterns"           "vehiclejourneys"
dbDisconnect(con)
## [1] TRUE
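
Optionally, we can also confirm that a primary key was created on one of the tables by querying the Postgres system catalogs (a sketch; output not shown):

con <- get_con()
# List the primary key constraint on the vehiclejourneys table
dbGetQuery(con,
  "SELECT conname FROM pg_constraint
   WHERE conrelid = 'vehiclejourneys'::regclass AND contype = 'p';")
dbDisconnect(con)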

Joining Tables

To build out the timetable, we’ve seen that there are two crucial tables: VehicleJourneys, which gives each train’s departure time from its origin, and JourneyPatternTimingLinks, which gives the journey time over each link between stops.

What we’re after is a network-wide departures board which has the time of departure from each station, not just from the train’s origin. Therefore, it’s a matter of expanding out every Vehicle Journey by the number of link sequences in its Journey Pattern to calculate the departure time of each Vehicle Link.

The departures board table is achieved as follows:

  1. Join VehicleJourneys to JourneyPatternTimingLinks via JourneyPatterns
  2. Calculate ArrivalMins_Link (arrival time of each train into each station) as the origin departure time (DepartureMins) plus the cumulative sum of JourneyTime ordered by link sequence for each vehicle
  3. Calculate DepartureMins_Link as the preceding link’s arrival time. For the first link, it’s the origin departure time.

For analysis later on, we also want to flag whether the link is the last trip in the vehicle’s journey. This just involves checking whether the link sequence is the largest value for that vehicle.

The table-valued function departureboard is as follows:

CREATE OR REPLACE FUNCTION departureboard(IN _timetable_date text)
RETURNS TABLE (
"VehicleJourneyCode" text,
"Line" text,
"From_VehicleSequenceNumber" integer,
"From_StopPointRef" text,
"From_StopPointName" text,
"From_Longitude" double precision,
"From_Latitude" double precision,
"To_VehicleSequenceNumber" integer,
"To_StopPointRef" text,
"To_StopPointName" text,
"To_Longitude" double precision,
"To_Latitude" double precision,
"JourneyTime" double precision,
"DepartureMins_Link" double precision,
"ArrivalMins_Link" double precision,
"Flag_LastStop" boolean
)
AS
$BODY$
SELECT
"VehicleJourneyCode"
-- Extract three-letter line abbreviation
,SUBSTRING("LineRef" FROM 3 FOR 3) "Line"
,CAST("From_VehicleSequenceNumber" as int) "From_VehicleSequenceNumber"
,"From_StopPointRef"
,"From_StopPointName"
,"From_Longitude"
,"From_Latitude"
-- Re-create SequenceNumber for each vehicle
,CAST("From_VehicleSequenceNumber" + 1 as int) "To_VehicleSequenceNumber"
,"To_StopPointRef"
,"To_StopPointName"
,"To_Longitude"
,"To_Latitude"
,"JourneyTime"
-- DepartureMins_Link is the ArrivalMins_Link of the previous row
-- partitioned by vehicle and ordered by stop sequence
,LAG("ArrivalMins_Link", 1, "DepartureMins") OVER
(PARTITION BY "VehicleJourneyCode" ORDER BY "From_SequenceNumber")
AS "DepartureMins_Link"
,"ArrivalMins_Link"
,"Flag_LastStop"
FROM (
SELECT
v."VehicleJourneyCode"
,v."LineRef"
,v."DepartureMins"
,j."From_SequenceNumber"
-- Re-create SequenceNumber for each vehicle by ranking the current SequenceNumber
-- within a vehicle partition
,RANK() OVER (PARTITION BY v."VehicleJourneyCode" ORDER BY j."From_SequenceNumber")
"From_VehicleSequenceNumber"
,j."From_StopPointRef"
,p1."CommonName" "From_StopPointName"
,p1."Longitude" "From_Longitude"
,p1."Latitude" "From_Latitude"
,j."To_StopPointRef"
,p2."CommonName" "To_StopPointName"
,p2."Longitude" "To_Longitude"
,p2."Latitude" "To_Latitude"
,j."JourneyTime"
-- Create ArrivalMins_Link as DepartureMins plus the cumulative JourneyTime
,v."DepartureMins" + SUM(j."JourneyTime") OVER
(PARTITION BY v."VehicleJourneyCode" ORDER BY j."From_SequenceNumber")
AS "ArrivalMins_Link"
-- Flag whether the current To_SequenceNumber is the greatest within a
-- Vehicle partition
,j."To_SequenceNumber" = MAX(j."To_SequenceNumber") OVER
(PARTITION BY v."VehicleJourneyCode") AS "Flag_LastStop"
/* Journey and Timing Link Tables */
FROM "VehicleJourneys" v
LEFT JOIN "JourneyPatterns" s
ON s."JourneyPattern" = v."JourneyPatternRef"
LEFT JOIN "JourneyPatternTimingLinks" j
ON j."JourneyPatternSections" = s."JourneyPatternSectionRefs"
/* Stop Names */
LEFT JOIN "StopPoints" p1
ON p1."AtcoCode" = j."From_StopPointRef"
LEFT JOIN "StopPoints" p2
ON p2."AtcoCode" = j."To_StopPointRef"
/* Operating Periods and Operating Profiles */
LEFT JOIN "Services_RegularDayType_DaysOfWeek" d1
ON d1."Services" = v."ServiceRef"
LEFT JOIN "VehicleJourneys_RegularDayType_DaysOfWeek" d2
ON d2."VehicleJourneys" = v."VehicleJourneyCode"
LEFT JOIN "Services" b
ON b."ServiceCode" = v."ServiceRef"
WHERE
-- Filter to timetables that have services and journeys on _timetable_date day of the week
d1."DaysOfWeek" IN (SELECT "DayGroup" FROM DaysOfWeek_Groups
WHERE "DayIndex" = DATE_PART('ISODOW',
CAST(_timetable_date AS date)))
AND d2."DaysOfWeek" IN (SELECT "DayGroup" FROM DaysOfWeek_Groups
WHERE "DayIndex" = DATE_PART('ISODOW',
CAST(_timetable_date AS date)))
-- Filter to timetables that are operating on _timetable_date
AND CAST(_timetable_date AS date) BETWEEN b."OpPeriod_StartDate" AND b."OpPeriod_EndDate"
) arrival_calc
$BODY$ LANGUAGE sql;

As you can imagine, when pushing every timetable to the database this table blows out to 6.63M rows. As discussed earlier, the vast majority of it is redundant information: entire XML files are duplicated for special calendar days such as Bank Holidays, New Year’s and Christmas, hence the need to sample the data.

If using the Python methodology of scraping the dataset for all files, the sampling process discussed earlier instead occurs as part of the departures board query’s WHERE clause.

Essentially we filter the Operating Profile of the Services and Vehicle Journeys to one day of the week, and filter the Operating Periods to those that contain a given date. By passing in the date of Wednesday 20th December 2017, for example, the table is reduced to 244K rows.
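
For example, the departures board for a single date can then be pulled back into R with one query (a sketch, assuming the departureboard function has been created in the connected database):

con <- get_con()
departures <- dbGetQuery(con, "SELECT * FROM departureboard('2017-12-20');")
dbDisconnect(con)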

Now that we have the dataset in a cleaned and usable format, this completes the Managing Data notebook. We will go on to visualise this newly created Departure Board table in Part Two.