
Microservices troubleshooting misery

(Continued from page 7) ...a high level of granularity in terms of the data that is collected, such as tracing individual requests across multiple services to identify the source of an issue, rather than just seeing a high-level metric for the entire system. It also lets developers define and track custom metrics specific to their apps and needs. Auto-instrumentation, in particular, involves the automatic injection of code into an application or service without the need for manual coding. With auto-instrumentation, developers can track and monitor their API calls and services, getting access to the necessary data without needing to manually instrument each service. Auto-instrumentation can be implemented through OpenTelemetry, which supports automatic instrumentation of APIs and services, enabling developers to gain valuable insights into their application's behavior, troubleshoot errors, and optimize their system.
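As a concrete illustration, here is a minimal sketch of OpenTelemetry auto-instrumentation for a Python service. Flask, the endpoint, and the downstream inventory-service call are assumptions made for this example, not details from the article; any framework with an OpenTelemetry instrumentation package is handled the same way. The zero-code alternative is to launch the unmodified service under the agent, for example "opentelemetry-instrument python app.py" with OTEL_SERVICE_NAME set, which injects the same instrumentation at startup.

# auto_instrumented_app.py (hypothetical example service)
from flask import Flask
import requests

from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Programmatic auto-instrumentation: every incoming Flask request and every
# outgoing `requests` call is traced automatically, with no per-endpoint code.
# (Exporter setup is omitted here; it can be configured as in the next sketch
# or handled by the opentelemetry-instrument agent.)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/orders/<order_id>")
def get_order(order_id):
    # The downstream hop to the (hypothetical) inventory service becomes a
    # child span in the same trace, without any manual instrumentation here.
    stock = requests.get(f"http://inventory-service/stock/{order_id}")
    return {"order": order_id, "stock": stock.json()}

if __name__ == "__main__":
    app.run(port=8080)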

2. Add distributed tracing to your monitoring stack. A useful best practice that has gained popularity is to add distributed tracing on top of logging and metrics. Distributed tracing refers to the process of tracking the flow of requests as they move through a distributed system.
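The sketch below shows what emitting trace data looks like with the OpenTelemetry Python SDK; the span names and the console exporter are placeholders for this example, and a real deployment would export to a tracing backend instead.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider that batches finished spans and prints them.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout():
    # Root span for the incoming API call.
    with tracer.start_as_current_span("GET /checkout") as span:
        span.set_attribute("http.method", "GET")
        # Child span for a downstream dependency; in a distributed system each
        # service contributes its own spans to the same trace.
        with tracer.start_as_current_span("charge-payment"):
            pass  # call the payment service here

handle_checkout()
provider.shutdown()  # flush buffered spans before the process exits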

With distributed tracing, developers can identify performance bottlenecks and troubleshoot errors by following a request's journey through the system, from the initial API call to the final response. By using related tools, developers can observe the flow of requests, measure latency and error rates, and gain a holistic view of their system's performance. Rather than relying on timeline views to understand the flow of requests through a system (which isn't optimal for APIs), developers can use visualization tools that help them understand the context of the data they are analyzing. Smart visualization of distributed tracing data allows developers to easily understand the context of each trace and span, as well as view rich contextual error data. It lets developers not only implement distributed tracing easily, but also make it actionable by maximizing its potential.
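To make that journey through the system concrete, the following sketch shows how trace context is propagated across a service boundary so the hops are stitched into a single trace. The service and span names are invented for the example, and in practice auto-instrumentation performs this propagation automatically.

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Calling service: attach the W3C trace context to the outgoing request.
def check_inventory(order_id: str):
    with tracer.start_as_current_span("orders: check-inventory"):
        headers = {}
        inject(headers)  # adds the `traceparent` header for the current span
        return requests.get(
            f"http://inventory-service/stock/{order_id}", headers=headers
        )

# Receiving service (inside its request handler): continue the same trace.
def handle_stock_request(incoming_headers: dict, order_id: str):
    parent_ctx = extract(incoming_headers)  # rebuild the caller's context
    # This span becomes a child of the caller's span, so the journey from the
    # initial API call to the final response shows up as one trace.
    with tracer.start_as_current_span("inventory: get-stock", context=parent_ctx):
        return {"order_id": order_id, "in_stock": True}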

3. Enrich observability with trace visualization and granular error data. Using tools that smartly visualize spans and traces and add enriching data is another best practice for achieving API observability in microservices architectures.
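For example, a span can be enriched with business context and granular error data so that the visualized trace already carries what root-cause analysis needs. The attribute names and the cart scenario below are illustrative, not taken from the article.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def update_cart(cart_id: str, item_count: int):
    with tracer.start_as_current_span("POST /cart/update") as span:
        # Business context that appears directly on the span in the trace view.
        span.set_attribute("cart.id", cart_id)
        span.set_attribute("cart.item_count", item_count)
        try:
            if item_count > 100:
                raise ValueError("cart exceeds maximum size")
            # ... normal handling ...
        except ValueError as exc:
            # Attach the exception (message plus stack trace) to the span and
            # mark the span as failed so it stands out in the visualization.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise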

4. Fight data overload with automated insights and error alerts. One of the main challenges in achieving API observability is the sheer volume of data generated by microservices architectures. With so much data being collected from various components in the system, it's easy for developers to become overwhelmed and miss important insights.

To avoid data overload, it's important to insist on automated insights and error alerts. Intelligent insights that highlight which areas of the system require attention, such as slow-performing APIs or high-error-rate services, can minimize MTTR. Important insights include:

1. An auto-generated API spec

2. A dashboard that shows API behavior over time

3. Detection of changes in API behavior

4. Anomaly detection of API latency and behavior (sketched below)
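As a rough illustration of that last item, the snippet below sketches one way such an alert could work: compare an API's recent latency with its historical baseline and flag a large deviation. This is a hypothetical example, not a description of any particular tool.

from statistics import mean, stdev

def latency_alert(history_ms: list[float], recent_ms: list[float],
                  threshold_sigmas: float = 3.0) -> bool:
    """Flag recent latency that drifts far outside the historical baseline."""
    baseline, spread = mean(history_ms), stdev(history_ms)
    return abs(mean(recent_ms) - baseline) > threshold_sigmas * max(spread, 1.0)

# Example: an endpoint that usually answers in ~120 ms now takes ~900 ms.
history = [110.0, 125.0, 118.0, 130.0, 122.0, 115.0, 128.0, 119.0]
recent = [870.0, 910.0, 905.0]
if latency_alert(history, recent):
    print("ALERT: /checkout latency deviates from its baseline")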

Parting words

Achieving API observability in microservices is painful due to lack of control, data overload, and more. But this new world calls for new ways, and developers should adopt observability approaches that include distributed tracing, helping them reduce burnout, improve developer experience, optimize their app's quality, and minimize root-cause analysis effort.

In the example described above, by using observability enriched with smart visualization and granular error data, the developer could view the outlier or unusually long spans, understand the span duration distribution, dive deep into the longest spans, and then identify and analyze the bottlenecks that caused the latency.

Eli Cohen is CEO and co-founder of Helios. Before co-founding Helios, a production-readiness platform for developers, Eli served as director of engineering, product manager, and engineering team leader at a variety of successful startups. Eli is an alumnus of the elite Israeli intelligence unit 8200, and he holds both a B.Sc. in Computer Science and an MBA from the Hebrew University of Jerusalem.