Week 07/16/2017 - 07/23/2017 Instrumentation
I spent the week improving the instrumentation of the server.
The Java-based socket server hosts a room-based MMO game, and until now we had no transaction-flow instrumentation in place.
A single image shows how useful the result is.
![]()
Transaction flows by response over time
The tool is Datadog, a one-stop shop for instrumenting and monitoring both infrastructure and applications.
As the image shows, the dashboard covers multiple flows (the titles are partially obscured). Light blue is the number of requests received, light purple is successful responses, and dark blue is failed ones.
The good news is that for the majority of the flows, we're responding successfully. The bad news is that for one particular flow, the number of responses does not match the number of requests, meaning the handling of some requests is interrupted by exceptions before any response is sent.
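To make the request/response mismatch concrete, here is a minimal sketch of the counting idea, not the actual server code. `FlowMetrics`, `InstrumentedHandler`, the `request`/`success`/`failure` outcome names, and the fallback failure response are all hypothetical; the counters live in memory here, whereas a real setup would forward them to a metrics agent such as Datadog's.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical in-memory counters, keyed by "flow.outcome". */
class FlowMetrics {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    void increment(String flow, String outcome) {
        counters.computeIfAbsent(flow + "." + outcome, k -> new LongAdder()).increment();
    }

    long count(String flow, String outcome) {
        LongAdder adder = counters.get(flow + "." + outcome);
        return adder == null ? 0L : adder.sum();
    }
}

/** Hypothetical wrapper that records exactly one outcome per request. */
class InstrumentedHandler {
    private final FlowMetrics metrics;

    InstrumentedHandler(FlowMetrics metrics) {
        this.metrics = metrics;
    }

    <T> T handle(String flow, Supplier<T> handler, T failureResponse) {
        metrics.increment(flow, "request");
        try {
            T response = handler.get();
            metrics.increment(flow, "success");
            return response;
        } catch (RuntimeException e) {
            // Assumption: an unhandled exception becomes an explicit failed
            // response, so requests == success + failure always holds.
            metrics.increment(flow, "failure");
            return failureResponse;
        }
    }
}
```

With a wrapper like this, a gap between the request series and the sum of the two response series can only mean a code path that bypasses the wrapper, which is roughly what the dashboard exposed.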
Something else surfaced once I visualized the flows:
![]()
Unnatural traffic for failed responses
Throughout the day, during certain hours, I saw failed responses spike together with unnatural spikes in traffic. This is a sign that we're under bot attack, which is something I was not aware of before.
This pretty much covers my week. Adding instrumentation to a legacy code base is not trivial: requests are handled asynchronously, and the pieces of a particular flow are scattered all over the code base.
I ended up fixing some unhandled cases and redefining the contract in addition to adding instrumentation. I'm pretty happy with the result: I can simply look at the dashboard to see real-time metrics for the game's health.
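Because requests are handled asynchronously, one way to keep the counting in a single place is to attach it to the future that represents the pending response. The sketch below assumes the flow can be expressed as a `CompletableFuture`, which may not match how the legacy server is actually structured; `AsyncFlowTracker` and the metric names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical tracker: counts a request up front and its outcome on completion. */
class AsyncFlowTracker {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    private void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    long count(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0L : adder.sum();
    }

    <T> CompletableFuture<T> track(String flow, CompletableFuture<T> pending) {
        increment(flow + ".request");
        // whenComplete runs on both normal and exceptional completion,
        // so every tracked request ends up with exactly one outcome.
        return pending.whenComplete((response, error) ->
                increment(flow + (error == null ? ".success" : ".failure")));
    }
}
```

A call site would wrap the future it already has, for example `tracker.track("joinRoom", pendingResponse)`, and keep using the returned future as before.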

