From one to 1 million packets/second: scaling global Internet monitoring
- Track: DevOps, Cloud, Scalable Infrastructure
- Type: Talk (long session)
- Level: intermediate
- Duration: 45 minutes
Abstract
This is the real story of the design, benchmarking, tuning and operation of a 24x7 multi-tenant network monitoring platform that started out at only 1 packet/second and then, with the power of Python, scaled to 1 million packets/second.
It works like a high-performance version of ICMP ping and mtr/traceroute, supports dual-stack IPv4/IPv6, and runs 24x7 on a set of probing nodes distributed worldwide.
Each ICMP packet needs to be scheduled, crafted, sent, analyzed and stored for querying in near-time.
To make things harder, the probing platform has been designed and tuned to use as little CPU, memory and storage as possible, so that it can run on cheap burstable cloud instances.
Python is used everywhere: scheduling packets, generating randomized ICMP packets, sending them via raw sockets, and matching the results from probing.
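To give a flavour of what crafting and matching a probe can look like, here is a minimal sketch using only the standard library. It is not the platform's actual code: the function names (`internet_checksum`, `build_echo_request`, `parse_echo`, `send_probe`), the 16-byte random payload and the idea of matching replies on the (identifier, sequence) pair are assumptions made for illustration. It covers ICMPv4 only; ICMPv6 uses different type numbers and a pseudo-header checksum.

```python
import os
import socket
import struct

ICMP_ECHO_REQUEST = 8  # ICMPv4 echo request type (RFC 792)
ICMP_ECHO_REPLY = 0    # ICMPv4 echo reply type


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Craft an ICMPv4 echo request: 8-byte header followed by the payload."""
    # The checksum is computed with the checksum field set to zero.
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, ident, seq)
    return header + payload


def parse_echo(packet: bytes):
    """Return (type, ident, seq, payload) from an ICMP echo message."""
    icmp_type, _code, _checksum, ident, seq = struct.unpack("!BBHHH", packet[:8])
    return icmp_type, ident, seq, packet[8:]


def send_probe(dest_ip: str, packet: bytes) -> None:
    """Send one probe over a raw socket (hypothetical helper, not called here).

    Opening a raw socket usually requires root or CAP_NET_RAW.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP) as s:
        s.sendto(packet, (dest_ip, 0))


# Randomized payload, so that probes are not byte-for-byte identical.
packet = build_echo_request(ident=0x1234, seq=1, payload=os.urandom(16))

# A correctly checksummed message sums to zero when verified.
assert internet_checksum(packet) == 0
```

A reply can then be matched to its request by comparing the identifier and sequence number returned by `parse_echo`; on Linux, a raw IPv4 socket delivers the IP header first, so it would have to be stripped before parsing.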
Probing results are then pushed to ClickHouse in near-time, and visualized via Grafana.
And rest assured that every packet counts.
Some of the concepts explored during the talk:
- ICMP protocol, BGP protocol, routing table
- ping, mtr/traceroute
- raw sockets in Python, crafting ICMP packets
- tcpdump, PCAP dump file
- msgpack, zstd compression
- object storage
- Apache Parquet as optimized data storage format
- near-time batch and streaming data workflow
- performance profiling and tuning
- Python vs PyPy performance comparison
- monitoring of the monitor: probing node performance monitoring