One of the bigger projects I have undertaken recently was to build a custom network monitoring tool to give detailed network insight. This project was important as the current network monitoring was very basic and did not allow the network operations team to be proactive. The probes were to be placed in all 100 telephone exchanges and large customer sites within the service provider network.
The technical requirements were as follows:
- Basic Reachability Tests
- ICMP Latency
- Application Latency
- DNS Performance
- Inter-Probe measurements (TWAMP)
- Ookla Speed Tests
- Iperf Speed Tests
- Tracking Network Routing Changes
The overall solution was to provide a dashboard-based system that takes all the collected data and turns it into useful statistics. An added benefit of this project is that it provides a great troubleshooting platform within the network.
The High Level Overview
The probes themselves are small PCBs that run a minimal, lightweight Ubuntu 18.04 image. Each probe has a multitude of packages installed, including Python3, which is the base of this project. The entire library of scripts is written in Python3, mainly because it's the language I know best. Once the probes were deployed, updating the code proved to be an issue, so they now automatically pull the newest version of the code from GitLab.
When the probe runs, a report is created and sent to a 'server' that reads the results and stores them in a MySQL database. A Grafana front-end then reads the results from the DB and creates 'interesting' statistics. Alongside the probe results, a Zabbix instance monitors the probes, alerts on any issues and gives a picture of the probes' overall health. These alarms and graphs are also displayed via the Grafana dashboard.
The final feature that proved to be very useful for the NOC team was the ability to SSH and remote desktop to the devices. This gave them another troubleshooting point, one that supports more complex features that switches couldn't match.
From the start it was important to use some form of code control; otherwise, with code spread across laptops and dev probes, it becomes complicated to know which code is correct and up to date. I decided to use GitLab hosted in our internal network, as it has features that go well beyond Git, such as issue creation/tracking and a web-based code editor.
Using GitLab, code could easily be developed, pushed to a handful of development probes for testing, and then finally merged into the master branch for all probes to download.
The probes, as mentioned, are based on Ubuntu 18.04, but the actual program they run is written in pure Python3. The main program file is run every X minutes by a cron job, with a random sleep at the beginning. We found that running 100 1Gbps Iperf speed tests at the same time caused noticeable congestion, so the sleep at the start staggers the tests and prevents this from happening.
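The random start delay can be sketched in a few lines. This is a minimal illustration rather than the production code: the function name `startup_jitter` and the 300-second window are assumptions chosen for the example.

```python
import random
import time
from typing import Optional


def startup_jitter(max_delay_s: float, rng: Optional[random.Random] = None) -> float:
    """Pick a random delay in [0, max_delay_s] so probes do not all start together."""
    rng = rng or random.Random()
    return rng.uniform(0, max_delay_s)


def main() -> None:
    # Hypothetical 5-minute window; the real value depends on the cron interval.
    time.sleep(startup_jitter(300))
    # ... run the tests here ...
```

Because each probe draws its own delay, tests triggered by the same cron minute are spread across the window instead of all starting together.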
The Python code is a mixture of my own libraries and other people's code that I have pulled apart (https://github.com/nokia/twampy, https://github.com/farrokhi/dnsdiag). The main, shown below, pulls in all the custom modules written for the program and controls the execution. In most cases a set of values is passed into a function contained within the module, a task is executed, and the result is returned to the main.
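To illustrate that calling pattern (values in, result back to the main), here is a minimal sketch. It is not the production main: `tcp_latency` and `run_tests` are hypothetical names, and timing a TCP handshake is only a stand-in for one of the real test modules.

```python
import socket
import time
from typing import Any, Callable, Dict, List, Tuple

TestSpec = Tuple[str, Callable[..., Dict[str, Any]], Dict[str, Any]]


def tcp_latency(host: str, port: int = 443, timeout: float = 2.0) -> Dict[str, Any]:
    """Time a TCP handshake to host:port (hypothetical example test)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.perf_counter() - start) * 1000
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return {"target": host, "reachable": False, "latency_ms": None}
    return {"target": host, "reachable": True, "latency_ms": round(elapsed_ms, 2)}


def run_tests(tests: List[TestSpec]) -> Dict[str, Any]:
    """Call each test with its parameters and collect the results by name."""
    results: Dict[str, Any] = {}
    for name, func, params in tests:
        try:
            results[name] = func(**params)
        except Exception as exc:  # one failing test should not abort the run
            results[name] = {"error": str(exc)}
    return results
```

A main built this way would call something like `run_tests([("app_latency", tcp_latency, {"host": "example.com"})])` and hand the returned dictionary on for reporting.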
Once all tests have been executed, the results are bundled up into a JSON object and sent to the server via SCP for processing.
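The bundling and transfer step might look roughly like this. The report layout, the function names and the `/tmp` spool directory are assumptions for illustration; only the use of JSON and SCP comes from the actual design.

```python
import json
import subprocess
import time
from pathlib import Path
from typing import Any, Dict, List


def build_report(results: Dict[str, Any], probe_id: str) -> str:
    """Wrap the test results with probe metadata and serialise them to JSON."""
    report = {"probe": probe_id, "timestamp": int(time.time()), "results": results}
    return json.dumps(report, sort_keys=True)


def scp_command(local_path: str, remote: str) -> List[str]:
    """Build the scp argument list; BatchMode stops scp prompting for a password."""
    return ["scp", "-q", "-o", "BatchMode=yes", local_path, remote]


def send_report(report_json: str, remote: str, spool_dir: str = "/tmp") -> None:
    """Write the report to a local file, then copy it to the server with scp."""
    path = Path(spool_dir) / f"report-{int(time.time())}.json"
    path.write_text(report_json)
    subprocess.run(scp_command(str(path), remote), check=True, timeout=60)
```

Here `remote` is an ordinary scp destination such as `user@server:/path/`, and key-based SSH authentication is assumed so the cron job can run unattended.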
The server program is very simple, so there isn't much to see here. It unpacks the JSON object, performs some basic logic checking and inserts the results into the database.
The current MySQL database is sufficient for the number of queries that are being thrown its way.
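A rough sketch of the unpack, check and insert flow is below. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table layout and field names are assumptions. With a MySQL driver the placeholder style would typically be `%s` rather than `?`.

```python
import json
import sqlite3
from typing import Any, Dict, List, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical results table: one row per probe, per test, per run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "probe TEXT NOT NULL, ts INTEGER NOT NULL, "
        "test TEXT NOT NULL, payload TEXT NOT NULL)"
    )


def parse_report(raw: str) -> Dict[str, Any]:
    """Decode a report and apply basic sanity checks before it reaches the DB."""
    report = json.loads(raw)
    for key in ("probe", "timestamp", "results"):
        if key not in report:
            raise ValueError(f"missing field: {key}")
    if not isinstance(report["results"], dict):
        raise ValueError("results must be an object")
    return report


def to_rows(report: Dict[str, Any]) -> List[Tuple[str, int, str, str]]:
    """Flatten a report into one row per test, storing each result as JSON text."""
    return [
        (report["probe"], int(report["timestamp"]), name, json.dumps(result))
        for name, result in report["results"].items()
    ]


def store_report(conn: sqlite3.Connection, raw: str) -> int:
    """Validate and insert a report; returns the number of rows written."""
    rows = to_rows(parse_report(raw))
    with conn:  # commits on success, rolls back if the insert fails
        conn.executemany(
            "INSERT INTO results (probe, ts, test, payload) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```

Using parameterised queries rather than string formatting means a malformed value in a report cannot alter the SQL statement.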
In order to monitor the probes during their deployment a Zabbix instance was created. Each probe has the Zabbix agent installed and SNMP configured which gives basic info on the device’s health.
Grafana is an open-source, general-purpose dashboard and graph composer, which runs as a web application. It supports many different data sources such as Graphite, InfluxDB or OpenTSDB, but for this project the MySQL data source is used. This allows standard SQL queries to be used for visualisation.
Apache Guacamole is a clientless remote desktop gateway that supports standard protocols like VNC, RDP, and SSH. Thanks to HTML5, once Guacamole is installed, all you need to access your desktops is a web browser.
This allows for all the probes to be accessed from a single place, meaning the NOC team do not need to look up probe details every time they want to access them.
After following the process described above for several weeks I had a working product.
The dashboard element of the project was structured as shown below.
The overall view page does just that! It gives a nice stats page with a general overview of the entire network's health. It shows recent Iperf speed tests, average DNS latency, average Iperf and Ookla speed test results and average TWAMP statistics. The current view covers the last 24 hours but can be changed to days, weeks or years.
A Per-Probe Report
This is a more detailed look at each probe over a period of time, showing DNS, latency, reachability and speed tests. This can be useful to identify network congestion and service degradation.
The probe map at the moment is nothing but a map showing current deployments for people to look at! In the future this will be used to show outages and issues in the network.
TWAMP RFC - https://tools.ietf.org/html/rfc5357
This dashboard shows the TWAMP results for all probes and highlights the paths with the highest jitter and RTT. Every probe discovers every other probe in the network and performs a TWAMP test with each of them. This creates a full mesh of tests and can help identify which paths are under-utilised. For example, paths from the South of the UK to the North are more congested than paths from North to South (all of our peerings are down in the South), so could we peer in the North of the UK to change this?
The final part of the dashboard is Zabbix, which shows how each probe is behaving in the network, but at the moment it is still under construction.
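The full-mesh idea and the "worst paths" ranking can be sketched as follows. The function names and the `rtt_ms` / `jitter_ms` keys are hypothetical; the real TWAMP measurements come from the twampy-based module, which is not reproduced here.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Path = Tuple[str, str]  # (sender, reflector)


def full_mesh(probes: List[str]) -> List[Path]:
    """Every ordered (sender, reflector) pair, so each direction is measured separately."""
    return list(permutations(probes, 2))


def worst_paths(
    results: Dict[Path, Dict[str, float]], metric: str, top: int = 3
) -> List[Tuple[Path, float]]:
    """Return the paths with the highest value for a metric such as 'rtt_ms'."""
    ranked = sorted(results.items(), key=lambda item: item[1][metric], reverse=True)
    return [(path, stats[metric]) for path, stats in ranked[:top]]
```

Measuring ordered pairs is what makes a South-to-North versus North-to-South comparison possible. It also means the test count grows as N x (N - 1), so 100 probes would produce 9,900 directed tests per run.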
The second aim was to provide a troubleshooting tool for the NOC, so here it is! A web GUI where they can browse to check for content blocks and use the CLI to perform advanced tests.
Apache Guacamole provides a nice interface for easy connections and, when using VNC, multiple users can connect and view what someone is doing (useful when the team is in different places).
Speed Test Script
A little side project was creating a script that is installed on every probe with the alias 'hyperspeed'. This allows the engineers to quickly run Ookla and Iperf speed tests from the CLI.
Shows the Ookla details including connected server.
And Iperf shows results including CPU load.
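As a rough idea of how the Iperf half of such a script could work, the sketch below assumes the probes use `iperf3` with its JSON output flag (`-J`) and a default TCP test. The function names are hypothetical and this is not the actual 'hyperspeed' code.

```python
import json
import subprocess
from typing import Dict, List


def iperf_command(server: str, seconds: int = 10) -> List[str]:
    """iperf3 client invocation: -c server, -t duration, -J for JSON output."""
    return ["iperf3", "-c", server, "-t", str(seconds), "-J"]


def summarise_iperf(raw_json: str) -> Dict[str, float]:
    """Pull throughput and CPU load out of the 'end' section of a TCP test result."""
    end = json.loads(raw_json)["end"]
    cpu = end["cpu_utilization_percent"]
    return {
        "sent_mbps": round(end["sum_sent"]["bits_per_second"] / 1e6, 1),
        "received_mbps": round(end["sum_received"]["bits_per_second"] / 1e6, 1),
        "cpu_local_pct": round(cpu["host_total"], 1),
        "cpu_remote_pct": round(cpu["remote_total"], 1),
    }


def run_iperf(server: str) -> Dict[str, float]:
    """Run the test and return the summary; raises if iperf3 exits with an error."""
    proc = subprocess.run(
        iperf_command(server), capture_output=True, text=True, check=True
    )
    return summarise_iperf(proc.stdout)
```

Parsing the JSON output rather than scraping the human-readable text keeps the script stable if the text formatting changes between iperf3 versions.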
Moving forward these are the plans and the dreams for the project.
- Move the backend database from MySQL to Elasticsearch
- Run each test inside a Docker container, which runs the test, returns the results to the main and destroys itself. This would allow more of the program to execute asynchronously
- Support IPv6 tests - this is important for us
- Use the probe map to show speed results for each exchange, this could help us to diagnose issues with backhaul links
- Test our transit providers by sending the same traffic to the same destination via each peering
- Use this tool as a base to gather statistics from routers and switches in our network, as currently our corporate Zabbix server does not capture many of the stats the network team really needs
- Add the ability to track network route changes