This is the public release of the code and data accompanying the paper "Why Is the Internet so Slow?!", which appeared at the Passive and Active Measurement Conference (PAM) 2017. The file PAM2017-PlanetLabMeasurements.tar.gz contains the raw data collected from PlanetLab nodes and the source code which analyzes the raw data and generates a summary CSV file for analysis. After extracting, it will have the following structure:
PAM2017-PlanetLabMeasurements
├── data
│   ├── geolocations               // Geolocation of all unique IPs from 6 providers
│   │   ├── dbip.txt
│   │   ├── es.txt
│   │   ├── GeoDatabases.txt
│   │   ├── ip2l.txt
│   │   ├── iplg.txt
│   │   ├── mml.txt
│   │   ├── mm.txt
│   │   └── mv.txt
│   ├── pl-nodes-16Jun2016.txt     // List of PlanetLab nodes as of Jun 16, 2016
│   ├── PL-raw-data                // Raw data, as downloaded from PL nodes
│   └── test-urls.txt              // All the websites and their URLs used in the tests
├── README
└── src                            // Source code generating a summary CSV from raw data
    ├── CurlData.py
    ├── Distance.py
    ├── Geolocation.py
    ├── LogParser.py
    ├── Main.py
    ├── PingData.py
    ├── PlNode.py
    ├── TcpData.py
    ├── TestData.py
    ├── TracerouteData.py
    └── UrlData.py
Experiments were performed from 102 PlanetLab nodes in 81 unique locations in June 2016. Data from each PlanetLab node is in a separate folder inside the PL-raw-data folder. Folder names are the IP addresses of the PlanetLab nodes. The file pl-nodes-16Jun2016.txt lists all PlanetLab nodes and their information.
The data in each folder is presented as downloaded from the corresponding PlanetLab node, i.e., in small chunks of 1-2 MB each. The files contain measurements performed using the URL list in the file test-urls.txt. Each line of the URL file contains an identifier for the website, the final unique URL used in fetches, and the website's global Alexa rank. Some lines have a fourth field, mooc, which marks the URLs that were also used for the MOOC-recruited end-user experiments described in Section 3.7 of the paper. The URLs were obtained from Alexa's top 500 websites listed for each country, and the list was crawled in June 2016. URLs which caused errors when fetched with cURL, as well as URLs of adult websites, were discarded.
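As a rough illustration, a line of test-urls.txt with the layout described above could be parsed as follows. This is a sketch: the exact delimiter and field layout are assumptions, and the example line is hypothetical.

```python
def parse_url_line(line):
    # Assumed layout: <identifier> <url> <alexa rank> [mooc]
    parts = line.split()
    return {
        "id": parts[0],
        "url": parts[1],
        "alexa_rank": int(parts[2]),
        "mooc": len(parts) > 3 and parts[3] == "mooc",
    }

# Hypothetical example line:
entry = parse_url_line("yepi http://www.yepi.com/ 1234 mooc")
```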
The data collected in each experiment for a given URL includes the following:
Each new experiment is marked with the following header:
########################### NEW LOG BEGINS #############################################
Following the header, the destination URL, the time of the test, and the destination IP address are given:
DEST http://www.yepi.com/ TIME_OF_TEST 1464604288 DESTIP 22.214.171.124
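This line alternates labels and values, so it can be read into key/value pairs; a minimal sketch (the label names are taken from the example above):

```python
def parse_dest_line(line):
    # Expected shape: "DEST <url> TIME_OF_TEST <unix timestamp> DESTIP <ip>"
    tokens = line.split()
    fields = dict(zip(tokens[0::2], tokens[1::2]))  # pair labels with values
    return fields["DEST"], int(fields["TIME_OF_TEST"]), fields["DESTIP"]

url, ts, ip = parse_dest_line(
    "DEST http://www.yepi.com/ TIME_OF_TEST 1464604288 DESTIP 22.214.171.124")
```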
Each HTML page is fetched twice, and the timings of each fetch are recorded with cURL. The HTML page is sometimes served from a different server in the second fetch. Pings and traceroutes are run only toward the IP address of the web server that served the HTML during the first fetch. The information obtained from cURL during each fetch is printed in the following form:
http_code: 200
time_namelookup: 0.112
time_total: 0.803
size_download: 127889
url_effective: http://www.yepi.com/
time_redirect: 0.000
num_redirects: 0
time_connect: 0.188
time_appconnect: 0.000
time_pretransfer: 0.481
time_starttransfer: 0.564
For the meaning of these items and values, please consult the cURL manual at https://curl.haxx.se/docs/manpage.html, in particular the section describing the -w (--write-out) option. Fetches that did not result in an HTTP 200 status code or which caused redirects were discarded, since they are not useful for the purposes of our measurements.
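These statistics can be read into a dictionary and filtered as described. A sketch, assuming one `key: value` pair per line as in the example above:

```python
def parse_curl_stats(block):
    # Turn "key: value" lines into a dict of strings.
    stats = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition(":")  # split at the first colon only
        stats[key.strip()] = value.strip()
    return stats

def is_usable(stats):
    # Keep only clean fetches: HTTP 200 and no redirects.
    return stats.get("http_code") == "200" and stats.get("num_redirects") == "0"
```

Splitting at the first colon keeps values such as url_effective (which itself contains "://") intact.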
The ### START TCP DATA ### header starts the section containing information captured with tcpdump. We were only
interested in detecting packet loss, so we recorded only the arrival times of each TCP segment along with the sequence
numbers marking the beginning and end of the received window.
Following the TCP data, the output of running 30 pings and one traceroute is dumped, marked with self-explanatory section headers.
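One way to process such a log file is to first split it into per-experiment chunks at the header line shown earlier; a sketch (LogParser.py presumably does something along these lines, but this is not the actual implementation):

```python
NEW_LOG_MARKER = "NEW LOG BEGINS"

def split_experiments(text):
    # Split raw log text into per-experiment chunks at each header line.
    chunks, current = [], []
    for line in text.splitlines():
        if NEW_LOG_MARKER in line:
            if current:
                chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```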
For the inflation analysis presented in our paper, we need the geolocations of the PlanetLab nodes, the web servers, and the router IP addresses seen in the traceroute output. Geolocations of all unique IP addresses seen in the data were obtained from 6 different commercial geolocation services, and their majority vote is also provided for comparison. This data is in the folder called geolocations.
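These geolocations feed the distance and minimum-RTT computations in the summary output. The actual code lives in Distance.py; as a hedged sketch, a standard haversine great-circle distance, combined with the common assumption that signals propagate in fiber at roughly 2/3 the speed of light, would look like:

```python
import math

EARTH_RADIUS_KM = 6371.0
C_KM_PER_S = 299792.458  # speed of light in vacuum

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points given in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def c_rtt_seconds(distance_km):
    # Minimum possible RTT: a round trip at 2/3 the speed of light
    # (an assumption, not necessarily the paper's exact model).
    return 2 * distance_km / (C_KM_PER_S * 2 / 3)
```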
The provided Python source code analyzes all the files in the folder PL-raw-data using a geolocation service. The particular geolocation service has to be chosen in Main.py. The code produces a comma-separated file with the following format:
Field 1  - Time of test
Field 2  - PlanetLab node hostname
Field 3  - PlanetLab node IP address
Field 4  - Fetched page
Field 5  - Destination server IP address
Field 6  - Boolean indicating whether prot. is https (True = prot. is https)
Field 7  - Site rank
Field 8  - Boolean indicating whether page was ALSO used in MOOC end-user measurements
Field 9  - Distance (in kilometers) between origin and destination
Field 10 - Estimated loss, Boolean
Field 11 - Number of fetched bytes
Field 12 - Name resolution time (DNS) in seconds
Field 13 - TCP handshake time in seconds
Field 14 - SSL handshake time in seconds, 0 if field 6 is False
Field 15 - Request response time in seconds (time between HTTP request sent and first byte received)
Field 16 - TCP transfer time in seconds
Field 17 - Total fetch time in seconds
Field 18 - Minimum ping time in seconds between origin and destination
Field 19 - Router path latency in seconds
Field 20 - cRtt in seconds (minimum possible RTT between origin and destination)
Precomputed CSV files using all 7 geolocation sources (the 6 commercial services plus the majority vote) are provided in the zipped file Summary_CSV.tar.gz.
For questions and comments related to the data and the source code, please contact Ilker Nadi Bozkurt at ilker at cs dot duke dot edu.