Files

Raw data and the source code as a compressed zip file (1.8 GB)
Precomputed CSV files with 6 commercial geolocation services and their majority vote (850 MB)
The scripts and data for generating the plots in the paper (133 MB)

Introduction

This is the public release of the code and data accompanying the paper "Why is the Internet so Slow?!", which appeared at the Passive and Active Measurement Conference (PAM) in 2017. The file PAM2017-PlanetLabMeasurements.tar.gz contains the raw data collected from PlanetLab nodes and the source code that analyzes the raw data and generates a summary CSV file for analysis. After extraction, the archive has the following structure:

	PAM2017-PlanetLabMeasurements
	├── data
	│   ├── geolocations               // Geo-location of all unique IPs from 6 providers     
	│   │   ├── dbip.txt
	│   │   ├── es.txt
	│   │   ├── GeoDatabases.txt
	│   │   ├── ip2l.txt
	│   │   ├── iplg.txt
	│   │   ├── mml.txt
	│   │   ├── mm.txt
	│   │   └── mv.txt
	│   ├── pl-nodes-16Jun2016.txt     // List of PlanetLab nodes as of Jun 16, 2016
	│   ├── PL-raw-data                // Raw data, as downloaded from PlanetLab nodes
	│   └── test-urls.txt              // All the websites and their URLs used in the tests
	├── README
	└── src                            // Source code generating a summary CSV from raw data 
	    ├── CurlData.py
	    ├── Distance.py
	    ├── Geolocation.py
	    ├── LogParser.py
	    ├── Main.py
	    ├── PingData.py
	    ├── PlNode.py
	    ├── TcpData.py
	    ├── TestData.py
	    ├── TracerouteData.py
	    └── UrlData.py

PlanetLab Data

Experiments were performed from 102 PlanetLab nodes in 81 unique locations in June 2016. Data from each PlanetLab node is in a separate folder inside the PL-raw-data folder. Folder names are the IP addresses of the PlanetLab nodes. The file pl-nodes-16Jun2016.txt lists all PlanetLab nodes and their information.

The data in each folder is presented as downloaded from the corresponding PlanetLab node, i.e. in small chunks of 1-2 MB each. The files contain measurements performed using the URL list in the file test-urls.txt. The URL file contains an identifier for each website, the final unique URL used in fetches, and the website's global Alexa rank. Some lines in the URL file have a 4th field, mooc, which marks the URLs that were also used for the MOOC-recruited end-user experiments described in Section 3.7 of the paper. The URLs were obtained from Alexa's lists of the top 500 websites for each country, crawled in June 2016. URLs that caused errors when fetched with cURL and URLs of adult websites were discarded; the latter exclusion was arbitrary and unnecessary.

Data collection in each experiment, which uses one specific URL, includes the following:

Fetching the base HTML using cURL.
Capturing the TCP traffic while fetching the HTML.
Pinging the web server 30 times.
Running a traceroute to the web server.

Each new experiment is marked with the following header:

	########################### NEW LOG BEGINS #############################################

Following the header, the destination URL, time of test and destination IP address are given:

	DEST http://www.yepi.com/
	TIME_OF_TEST 1464604288
	DESTIP 72.21.91.39
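
As a rough illustration of how these markers could be used, the sketch below splits raw log text into experiments and reads the three header lines. It is not the released LogParser.py; the function names split_experiments and parse_header are hypothetical, and treating TIME_OF_TEST as an integer Unix timestamp is an assumption based on the sample value.

```python
import re

# Matches the experiment marker regardless of how many '#' surround it.
LOG_HEADER = re.compile(r"^#+ NEW LOG BEGINS #+$")


def split_experiments(text):
    """Split raw log text into one list of lines per experiment."""
    experiments, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if LOG_HEADER.match(stripped):
            current = []
            experiments.append(current)
        elif current is not None:
            current.append(stripped)
    return experiments


def parse_header(lines):
    """Read the DEST, TIME_OF_TEST and DESTIP lines of one experiment."""
    header = {}
    for line in lines:
        key, _, value = line.partition(" ")
        if key in ("DEST", "TIME_OF_TEST", "DESTIP"):
            header[key] = value.strip()
    if "TIME_OF_TEST" in header:
        # Assumption: the value is an integer Unix timestamp.
        header["TIME_OF_TEST"] = int(header["TIME_OF_TEST"])
    return header


sample = """\
########################### NEW LOG BEGINS #############################################
DEST http://www.yepi.com/
TIME_OF_TEST 1464604288
DESTIP 72.21.91.39
"""
experiments = split_experiments(sample)
header = parse_header(experiments[0])
```

Lines that appear before the first marker are ignored, so a chunk that starts mid-experiment does not produce a partial record.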

Each HTML page is fetched twice, and the timings of each fetch are recorded with cURL. The HTML page is sometimes served from a different server in the second fetch. Pings and traceroutes are run only towards the IP address of the web server that served the HTML during the first fetch. Information obtained from cURL during each fetch is printed in the following form:

	http_code: 200
	time_namelookup:  0.112
	time_total:  0.803
	size_download: 127889
	url_effective: http://www.yepi.com/
	time_redirect: 0.000
	num_redirects: 0
	time_connect: 0.188
	time_appconnect: 0.000
	time_pretransfer: 0.481
	time_starttransfer: 0.564

For the meaning of these items and values, please consult the cURL manual at https://curl.haxx.se/docs/manpage.html, in particular the section describing the -w option, i.e. --write-out. Fetches that did not result in an HTTP 200 status code or that caused redirects were discarded, since they are not useful for the purposes of our measurements.
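
The following sketch shows one way the cURL block could be parsed and filtered. It is not the released CurlData.py, and the helper names are hypothetical. The breakdown into phases (differences between consecutive cURL timers) is an assumption made for illustration and may differ from what the released code computes.

```python
def to_number(value):
    """Convert a string to int or float when possible, else return it as-is."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_curl_fields(lines):
    """Turn 'name: value' lines into a dict with numeric values converted."""
    fields = {}
    for line in lines:
        name, sep, value = line.strip().partition(":")
        if sep:
            fields[name] = to_number(value.strip())
    return fields


def is_usable(fields):
    """Keep only fetches with HTTP 200 and no redirects."""
    return fields.get("http_code") == 200 and fields.get("num_redirects") == 0


def phases(fields):
    """Assumed phase breakdown from cURL's cumulative timers (seconds)."""
    ssl = fields["time_appconnect"] - fields["time_connect"]
    return {
        "dns": fields["time_namelookup"],
        "tcp_handshake": fields["time_connect"] - fields["time_namelookup"],
        "ssl_handshake": ssl if fields["time_appconnect"] > 0 else 0.0,
        "request_response": fields["time_starttransfer"] - fields["time_pretransfer"],
        "transfer": fields["time_total"] - fields["time_starttransfer"],
    }


sample = """\
http_code: 200
time_namelookup:  0.112
time_total:  0.803
size_download: 127889
url_effective: http://www.yepi.com/
time_redirect: 0.000
num_redirects: 0
time_connect: 0.188
time_appconnect: 0.000
time_pretransfer: 0.481
time_starttransfer: 0.564
""".splitlines()

fields = parse_curl_fields(sample)
usable = is_usable(fields)
timing = phases(fields)
```

Splitting on the first colon only keeps the URL in url_effective intact.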

The ### START TCP DATA ### header starts the section that contains information captured with tcpdump. We were only interested in detecting packet loss, so we recorded only the arrival times of each TCP byte stream along with the sequence numbers marking the beginning and end of the received window.
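
To make the idea of detecting loss from sequence numbers concrete, here is a minimal heuristic sketch. It is not the method used in the released TcpData.py. The record layout (arrival_time, seq_begin, seq_end) is a hypothetical in-memory form, not the on-disk format, and it assumes tcpdump-style ranges in which each range's end equals the next range's beginning.

```python
def suspect_loss(segments):
    """Return True if the sequence ranges show a gap or a repeated range.

    segments: iterable of (arrival_time, seq_begin, seq_end) tuples in
    arrival order (hypothetical layout). A segment that does not start
    where the previous one ended suggests a missing, reordered or
    retransmitted packet.
    """
    expected = None
    for _arrival, seq_begin, seq_end in segments:
        if expected is not None and seq_begin != expected:
            return True
        expected = seq_end
    return False


in_order = [(0.000, 1, 1449), (0.010, 1449, 2897), (0.011, 2897, 4345)]
with_hole = [(0.000, 1, 1449), (0.020, 2897, 4345), (0.250, 1449, 2897)]

clean = suspect_loss(in_order)
lossy = suspect_loss(with_hole)
```

This heuristic cannot tell loss apart from reordering, so it should be read as a conservative flag rather than a loss measurement.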

Following the TCP data, the output of the 30 pings and one traceroute is dumped, each marked with a self-explanatory section header.

For the inflation analysis presented in our paper, we need the geolocations of the PlanetLab nodes, the web servers, and the router IP addresses seen in the traceroute output. Geolocations of all the unique IP addresses seen in the data were obtained from 6 different commercial geolocation services, and their majority vote was also computed for comparison. This data is in the folder called geolocations.
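
As an illustration of how geolocations feed the inflation analysis, the sketch below computes a great-circle distance with the haversine formula and the corresponding round-trip time at the speed of light. It is not the released Distance.py. It assumes a spherical Earth with a mean radius of 6371 km and that the "minimum possible RTT" (cRtt) means a round trip along the great circle at the speed of light in vacuum; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0           # mean Earth radius (assumed spherical)
SPEED_OF_LIGHT_KM_S = 299792.458   # speed of light in vacuum


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def c_rtt_seconds(distance_km):
    """Round-trip time in seconds at the speed of light over distance_km."""
    return 2 * distance_km / SPEED_OF_LIGHT_KM_S


def inflation(measured_seconds, distance_km):
    """Ratio of a measured time to the speed-of-light round-trip time."""
    return measured_seconds / c_rtt_seconds(distance_km)


half_way_round = great_circle_km(0.0, 0.0, 0.0, 180.0)
one_light_second = c_rtt_seconds(SPEED_OF_LIGHT_KM_S)
```

An inflation of 1 would mean the measurement matches the speed-of-light bound; larger values show how far a fetch or ping is from it.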

Analysis Code

The provided Python source code analyzes all the files in the folder PL-raw-data using one geolocation service. The particular geolocation service has to be chosen in the file Main.py. The code produces a comma-separated file with the following format:

	Field 1  - Time of test
	Field 2  - PlanetLab node hostname
	Field 3  - PlanetLab node IP address
	Field 4  - Fetched page
	Field 5  - Destination server IP address
	Field 6  - Boolean indicating whether the protocol is https (True = protocol is https)
	Field 7  - Site rank
	Field 8  - Boolean indicating whether page was ALSO used in MOOC end-user measurements
	Field 9  - Distance (in kilometers) between origin and destination
	Field 10 - Estimated loss, Boolean
	Field 11 - Number of fetched bytes
	Field 12 - Name resolution time (DNS) in seconds
	Field 13 - TCP handshake time in seconds
	Field 14 - SSL handshake time in seconds, 0 if field 6 is False
	Field 15 - Request response time in seconds (time between HTTP request sent and first byte received)
	Field 16 - TCP transfer time in seconds
	Field 17 - Total fetch time in seconds
	Field 18 - Minimum ping time in seconds between origin and destination
	Field 19 - Router path latency in seconds
	Field 20 - cRtt in seconds (minimum possible RTT between origin and destination)

Precomputed CSV files for all 7 geolocation sources (the 6 services and their majority vote) are provided in the compressed archive Summary_CSV.tar.gz.
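
A summary CSV with the 20 fields listed above could be read along these lines. This is a sketch under assumptions: the column names are hypothetical labels for the documented fields, the file is assumed to have no header row, and the sample row is invented purely to exercise the code.

```python
import csv
import io

# Hypothetical names for Fields 1-20, in the documented order.
FIELD_NAMES = [
    "time_of_test", "pl_hostname", "pl_ip", "page", "dest_ip",
    "is_https", "site_rank", "in_mooc", "distance_km", "loss",
    "bytes", "dns_s", "tcp_handshake_s", "ssl_handshake_s",
    "request_response_s", "tcp_transfer_s", "total_s",
    "min_ping_s", "router_path_latency_s", "c_rtt_s",
]


def read_summary(handle):
    """Yield one dict per CSV row, keyed by FIELD_NAMES (no header assumed)."""
    for row in csv.DictReader(handle, fieldnames=FIELD_NAMES):
        yield row


def ping_inflation(row):
    """Minimum ping time divided by cRtt for one row."""
    return float(row["min_ping_s"]) / float(row["c_rtt_s"])


# Invented sample row, not taken from the released data.
sample = (
    "1464604288,node.example.org,192.0.2.1,http://www.yepi.com/,72.21.91.39,"
    "False,1200,False,1500.0,False,127889,0.112,0.076,0.0,0.083,0.239,0.803,"
    "0.030,0.025,0.010\n"
)
rows = list(read_summary(io.StringIO(sample)))
ratio = ping_inflation(rows[0])
```

For a real file, the io.StringIO object would be replaced by an open file handle, e.g. open(path, newline="").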

Contact Information

For questions and comments related to the data and the source code, please contact Ilker Nadi Bozkurt at ilker at cs dot duke dot edu.