Monitoring Weekly | Issues

Issue 282

2024-10-13T00:00:00+00:00

So much great stuff this week, happy to see fresh posts from engineering teams sharing their experiences and challenges. Enjoy! 🍩☕🍂

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Tales of Performance Engineering

Insightful read about Mercado Libre’s Performance Engineering team, how they collaborate with Observability and SRE teams, and an anecdote from one of their recent wins.

Recap from eBPF Summit 2024

A summary of some of the talks from last month’s eBPF Summit. All of the streamed talks are also available for binging here.

OpenTelemetry Tracing in 200 lines of code

Really enjoyed this article and how the author broke down tracing into more approachable, digestible bits that the reader’s likely already comfortable with. Good one to share with your developer friends who might be tracing-averse.

The 4 Evolutions of Your Observability Journey

How to reason about where you might fall in your own observability journey, and what sorts of questions you’re probably trying to answer even if you aren’t explicitly aware of it.

How to use Prometheus to efficiently detect anomalies at scale

Maybe I’m just crotchety, but I’m happy to see more open source contributions for anomaly detection that don’t rely on LLMs or outsourcing private data. Gives me old Etsy and Graphite vibes, and I hope to see this framework continue to evolve.

Syncing PagerDuty Schedules to Slack Groups

I appreciate hearing how other folks solve these sorts of bespoke friction points. It’s unfortunate that much of on-call management software is still pretty rough for these sorts of workflows.

Achieving Optimal Service Reliability: Insights Into Service Level Objectives

A good primer on service levels, error budgets, and burn rates. I would’ve liked to hear more about getting buy-in from teams where SLAs originate (e.g. legal, sales, support, etc) because, in my experience, this is where SLOs generally hit a brick wall in terms of usefulness.

Grafana Alloy and Grafana Agent Flow security release: High severity fix for CVE-2024-8975 and CVE-2024-8996

Patched versions of Grafana Alloy and Grafana Agent have been released to address a high severity CVE. Note that users are encouraged to reinstall both applications, as the upgrade process will not make the necessary corrections.

Balancing Speed & Innovation with Reliability: Building a Blameless Incident Culture in Startups

Some useful tips for building a blameless incident response culture. This won’t happen overnight, but it’s a solid outline for any startup looking to improve their incident processes and posture.

Tools

grafana/faro-web-sdk

“a highly configurable web SDK for real user monitoring”

grafana/promql-anomaly-detection

“A framework for anomaly detection using Prometheus and PromQL”

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

SPECIAL EDITION: Q3 2024 Best of

2024-10-06T00:00:00+00:00

Happy to be back with you this week with our quarterly “Best Of” issue, looking back on the most popular articles from the past few months. 💕📈

P.S. In light of the recent disaster in the SE United States caused by Hurricane Helene, please consider donating to a supported cause helping those in need at this time. I would personally recommend World Central Kitchen, but there are many different ways to support these folks in need. Thank you.

From The Community

Taming Logs

A solid list of best practices and considerations for log formatting, collection, structure and more.

Automated OpenTelemetry traces for Bash!

Excellent use of OpenTelemetry that should open up a lot of possibilities for platform teams and anyone who runs software they can’t easily instrument themselves.

Is It Time To Version Observability? (Signs Point To Yes)

Some real talk from Charity Majors on the successes (and failures) of “Observability 1.0”. There’s plenty to chew on here, and while I largely agree with her points, I still sense some bias towards particular types of developers and systems. Regardless, an excellent post.

perses/perses

“Facilitates a seamless "dashboards as code" workflow by introducing an innovative and precisely defined dashboard definition model.”

Building a large-scale Observability Ecosystem

An insightful look at one company’s observability journey. This feels like a solid roadmap for anyone planning a similar transformation.

UI Improvements for Prometheus 3.0

Julius Volz offered a sneak peek of the UI improvements in the upcoming Prometheus 3.0 release. Tons of changes landing soon, check out the pre-release if you’d like to test it out and report any issues.

11 Takeaways from Observability Engineering Book

Some solid notes and highlights from “the” observability book. Unsurprisingly, there’s a bit of overlap with my own recent reading of the Learning OpenTelemetry book.

Building a cost-effective logging platform using Clickhouse for petabyte scale

It continues to impress me just how adaptable ClickHouse is to various workloads. Props to Zomato engineers for sharing their story with a useful level of detail.

Network Observability: Beyond Metrics and Logs

A reminder of the importance of network ~~monitoring~~ observability along with some good and bad examples of how it’s been done.

From Chaos to Clarity: Using Loki and Grafana to Tame Your Logs

If you haven’t already tried out Loki for yourself, this article is a solid introduction and getting started guide. Would like to see the author add a follow-up post to demonstrate querying and debugging in greater detail.

Perses is accepted as a CNCF Sandbox project

Looks like Grafana has some future competition in the Perses project. I appreciate their focus on a GitOps and CLI workflow, which has always felt like a bit of an afterthought for other dashboard projects.

Monitoring of Monitoring

Tips and considerations for anyone dealing with the traditional “who watches the watchers” conundrum.

Burn Rate Is a Better Error Rate

Helpful comparison of burn and error rates from Datadog. Props to the author for simplifying the math.

Destroy on Friday! A Chaos Engineering Experiment - Part 1

Fun post from Honeycomb describing a recent chaos engineering experiment. I wish more companies would share these types of learnings.

VictoriaLogs: an overview, run in Kubernetes, LogsQL, and Grafana

Interesting look at VictoriaLogs, how it compares with Grafana Loki, and some of the missing bits that may hold back its adoption for now.

Monitoring in Kubernetes: Best Practices

A decent collection of monitoring concerns and best practices for Kubernetes. 50/50 chance this was written by AI, but it still has some solid points. 😅

What makes a good on-call shift system for DevOps engineer?

A look at some of the primary considerations for choosing an on-call service provider and a quick comparison of four of the most popular options.

Unveiling the Power Duo: osquery and osctrl

A deep dive on two popular open source projects for system introspection and monitoring. Chances are you’re familiar with osquery, but you might not be aware of the osctrl tool for centralized management of osquery agents.

otel-tui: A TUI Tool for Viewing OpenTelemetry Traces

Fun new project for interacting with OTel traces inside the terminal. Love it!

OpenTelemetry and vendor neutrality: how to build an observability strategy with maximum flexibility

The ubiquity of OpenTelemetry has given users more power than ever before to avoid the hassles of vendor lock-in. But it’s not foolproof; there are still steps you can and should take to ensure that you’re using OTel effectively and giving yourself flexibility to adapt in the future.

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 281

2024-09-15T00:00:00+00:00

Great collection of articles this week, including a big update from Prometheus and other topics from PromCon EU. Loving the deeply technical posts from Netflix, Pinterest, and IBM too.

Also a quick note that I’ll be taking a short hiatus from the newsletter for the next couple of weeks, returning for our quarterly “best of” issue on October 6. See you then! 👋💗🚵

From The Community

Noisy Neighbor Detection with eBPF

Excellent write-up from Netflix on their use of eBPF for low-overhead instrumentation and profiling. The noisy neighbor example is awesome, but their findings on eBPF optimization are just as interesting imho.

UI Improvements for Prometheus 3.0

Prometheus 3.0 Unveiled: PromCon Highlights with Julius Volz

An interview with Julius Volz diving into the Prometheus 3.0 changes and other topics from the recent PromCon EU 2024. Really great stuff if you’re working with anything in the Prometheus ecosystem.

Improving Efficiency Of Goku Time Series Database at Pinterest (Part 3)

The latest post from Pinterest engineers on the evolution of their in-house TSDB. Even though this isn’t an open source project, I love reading about how they optimize write and read performance (and costs) in these systems.

OpenTelemetry and vendor neutrality: how to build an observability strategy with maximum flexibility

Master Observability with OVM and OpenTelemetry

I haven’t heard anyone talking about this OVM project, but I found another announcement here and a whitepaper here that provide more context. Personally, I’d like to hear more about the real-world scenarios that inspired this design.

Developing an Automated Health Check for Cloud Services and Dependencies

This feels decidedly Nagios-like, but I can see where some folks might derive value out of something like this. OTOH it feels like it might suffer from drift pretty quickly.

VictoriaLogs: an overview, run in Kubernetes, LogsQL, and Grafana

Interesting look at VictoriaLogs, how it compares with Grafana Loki, and some of the missing bits that may hold back its adoption for now.

See you soon!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 280

2024-09-08T00:00:00+00:00

Feels like everyone is squeezing the last few drops out of summer (at least here in North America), and I expect to start seeing more event announcements and project updates soon. This week I found a number of new technical guides, with an emphasis on on-call, outages, and error rates. Enjoy! 🍂📶☕

This issue is sponsored by:

Backend says: “99.999%” Frontend says: “Your mobile app sucks.”

It's time to learn what your SLOs aren't telling you about mobile. Join Embrace for a session on how to create SLOs for your mobile apps that actually measure what matters — your end user experiences.

From The Community

Burn Rate Is a Better Error Rate

Helpful comparison of burn and error rates from Datadog. Props to the author for simplifying the math.

Stanza Outage Simulator

Saw this post in my LinkedIn feed and had to include it. I love this as a general tool for approximating the impact of a potential outage. You’ll almost certainly want to click on the “Instructions” button at the bottom for more details.

What makes a good on-call shift system for DevOps engineer?

A look at some of the primary considerations for choosing an on-call service provider and a quick comparison of four of the most popular options.

CI/CD Observability using OpenTelemetry

Solid introduction to observability for CI/CD systems, an overview of OpenTelemetry, and a guide for setting it up with Jenkins.

Building a Scalable Logging Service: Akka Lightbend vs. Kafka Confluent

Comparing two different projects as the basis for a scalable logging service. IMHO this is less of a head-to-head showdown and more of a “how to” evaluate these two options if that’s something you’re already planning.

How to Set Up a Free Web App Status Page and On-Call System: A Step-by-Step Guide

A fun guide for gluing together some free service plans to handle website monitoring and paging duties. Definitely skews hard towards the “DIY” end of the spectrum; this probably isn’t a great long-term solution, especially when you factor in turnover concerns.

Monitoring in Kubernetes: Best Practices

A decent collection of monitoring concerns and best practices for Kubernetes. 50/50 chance this was written by AI, but it still has some solid points. 😅

Implementing Observability with Prometheus, VictoriaMetrics, and Tilt

Setting up Prometheus metrics collection with a VictoriaMetrics storage backend, using Tilt to manage the underlying resources on the Kubernetes cluster.

Tools

Stanza Outage Simulator

“This tool simulates the impact of various outage scenarios on a system, allowing you to adjust parameters and observe the results.”

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 279

2024-09-01T00:00:00+00:00

Great to see more hands-on technical guides and tools covered this week. And congratulations to the Perses project for being accepted to the CNCF Sandbox. Enjoy! 🌊📈🔔

This issue is sponsored by:

Backend says: “99.999%” Frontend says: “Your mobile app sucks.”

From The Community

Monitoring Inter-Pod Traffic at the AZ Level with Retina

Great example for cross-AZ network observability with Retina. Bonus points if you can grok that PromQL in fewer than three re-reads.

Meaningful availability and uptime of Wise

Excellent article from Wise Engineering on how they think about uptime vs availability and how they relate to business operations and flows.

Goroutines and OpenTelemetry: Avoiding Common Pitfalls

Some antipatterns to watch out for when adopting OpenTelemetry in Go services.

Perses is accepted as a CNCF Sandbox project

Taming Logs

A solid list of best practices and considerations for log formatting, collection, structure and more.

Always. Enable. Keepalives.

Good reminder about the importance of keepalives and the painful ways they can remind us when we least expect it. 😅

Unlocking Insights with High-Quality Dashboards at Scale

The big takeaway here is that one-pager checklist embedded in the middle of the article. Download and share it with anyone in your org who maintains shared dashboards.

How to Check Fragmentation in an Oracle Database

Some queries for identifying fragmentation in an Oracle database. These could easily be monitored and alerted on.

Apache Druid: Query Level Monitoring via Request Logging

A simple guide for setting up request level logging for Apache Druid using emitters. Mostly useful for sending the query logs to a remote monitoring service.

Tools

perses/perses

“Facilitates a seamless "dashboards as code" workflow by introducing an innovative and precisely defined dashboard definition model.”

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 278

2024-08-25T00:00:00+00:00

Awesome collection of posts this week with an emphasis on AI/LLM monitoring, adopting OpenTelemetry in Go and Rust, and some heavy feels around the state of observability and learning from our mistakes. Enjoy! 😇🦀🧠

This issue is sponsored by:

Run GitHub Actions up to 2x faster at half the cost

Blacksmith runs your GitHub Actions substantially faster by running them on modern gaming CPUs. Integrating Blacksmith is a one-line code change. 100+ companies like GitBook, Superblocks, and Slope use Blacksmith to help developers merge code faster.

From The Community

Leveraging LLM-as-a-Judge for Observability

Very interesting review of observability concerns as they pertain to LLM performance and accuracy within the context of the model itself.

Which Metrics Should You Monitor for Large Language Model Performance?

In contrast to the previous article, this one is a very no-nonsense look at metrics relevant to LLM performance (e.g. hardware, system throughput).

Advanced Monitoring with AI and Prometheus: Detecting and Mitigating Memory Leaks

A very approachable demonstration of how basic heuristic monitors can be leveled up with machine learning and a tiny bit of code.

Is It Time To Version Observability? (Signs Point To Yes)

Go and OpenTelemetry: A real-world implementation on open-source software

A unique look at the effort and techniques involved in updating an open source project to leverage OpenTelemetry tracing.

Simple OpenTelemetry logger in Rust

A nice complement to the previous post, this one takes a similar look at adopting OpenTelemetry, but for a Rust app.

What your SLOs aren’t telling you about mobile

Join Embrace September 26th at 1pm ET to learn how to craft and monitor SLOs that are specialized for mobile and connect directly to user experiences. Level up your observability practice with mobile app performance insights. (SPONSORED)

Addressing Tool Sprawl Without Falling Prey to Vendor Lock-In

The siren song of consolidation and reducing tool sprawl can be alluring, but it can also lead to vendor lock-in and a loss of flexibility without the right planning upfront.

Sampling Strategies for Monitoring (Part 1)

Comparing the tradeoffs between static and dynamic sampling of monitoring data.

Incident Management for New Engineers

I have so much empathy for this engineer and their vulnerability in learning how to fail blamelessly. This is how we learn effectively, both individually and as a team. Thank you to the author for sharing their story.

NMS Migration Made Easy: Get Stakeholders Aligned

Some important lessons here for anyone trying to land a new observability initiative. It doesn’t matter how good your technology is if the users and stakeholders aren’t invested in its success.

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 277

2024-08-18T00:00:00+00:00

Plenty of OpenTelemetry coverage this week and a look at one team’s use of ClickHouse for logging at scale. Enjoy! 🪓🌴🐢

This issue is sponsored by:

What DevOps and SREs need to know about mobile observability

A generic observability approach doesn't work for mobile. Join this webinar by Embrace CTO Fredric Newberg to learn why the mobile environment impacts how you can understand user experiences. Topics include user-focused observability, client-side network monitoring, and ecosystem limitations.

From The Community

Automated OpenTelemetry traces for Bash!

Excellent use of OpenTelemetry that should open up a lot of possibilities for platform teams and anyone who runs software they can’t easily instrument themselves.

Building a cost-effective logging platform using Clickhouse for petabyte scale

It continues to impress me just how adaptable ClickHouse is to various workloads. Props to Zomato engineers for sharing their story with a useful level of detail.

Measuring LLM Confusion

Chances are that you aren’t responsible for monitoring the confusion of your language models, but I’m guessing there are some readers here who work with engineers who’d appreciate this post.

From Chaos to Clarity: Using Loki and Grafana to Tame Your Logs

Empower Your Observability: Tail-Based Sampling for Better Tracing with OpenTelemetry

A comparison of OpenTelemetry’s sampling techniques and example of tail-based sampling. I’d recommend also reading the official docs for additional context.

Understanding CPU Utilization and Credit Usage in AWS: A DevOps Perspective

Demonstration of AWS credit use during utilization spikes. Although it feels obvious, how often do any of us really set up infrastructure cost forecasting before the CFO comes knocking?

Behind the scenes of the OpenTelemetry Governance Committee

Pulling back the curtain on the responsibilities and activities of an OpenTelemetry Governance Committee member. Nothing too shocking, but it’s always nice to see transparency around the leadership of popular OSS projects.

OpenTelemetry Tracing on Spring Boot, Java Agent vs. Micrometer Tracing

Looking at the differences between OTel tracing approaches for Java applications.

Grafana security release for CVE-2024-6837

Grafana Labs has released a new version of Grafana to address a medium severity CVE. You’ll want to upgrade to avoid exposing your Grafana to an XSS exploit through the /swagger endpoint.

Tools

plengauer/opentelemetry-bash

“This project delivers OpenTelemetry traces, metrics and logs from shell scripts (sh, ash, dash, bash, busybox, and many other POSIX compliant shells) as well as from GitHub actions…”

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 276

2024-08-11T00:00:00+00:00

An emphasis on more challenging observability tactics and problems this week. Plus you know I’m always happy to see network monitoring in the conversation. Enjoy! 🌞🐶💖

This issue is sponsored by:

What DevOps and SREs need to know about mobile observability

Observing mobile apps is vastly different from backend systems, with complexity across user behavior, devices, and network connectivities. Join this virtual event to learn why a generic observability approach doesn't work, and what you can achieve with a purpose-built solution.

From The Community

eBPF Map Metrics Prometheus Exporter

Monitoring changes in an eBPF map remains an elusive goal but I enjoyed learning how this engineer is trying to tackle the problem space.

Powerful Visibility with Rust, Lambda, Datadog, and OpenTelemetry

The title pretty much says it all. An excellent post for anyone working with Lambdas written in Rust.

Unveiling the Power Duo: osquery and osctrl

Distributed tracing in ABAP

I’m not personally experienced with supporting ABAP systems, but you’ll definitely want to check out this post (and the linked video within) for leveraging OTel with ABAP if that’s something you work with.

Network Observability: Beyond Metrics and Logs

A reminder of the importance of network ~~monitoring~~ observability along with some good and bad examples of how it’s been done.

Prometheus data source update: Redefining our big tent philosophy

An update from Grafana Labs on their latest data source developments for cloud-native managed Prometheus-compatible offerings.

Run GitHub Actions up to 2x faster at half the cost

Detecting Deadlock with Micrometer Metrics

An overview of deadlocks in Java applications and how support for monitoring them is coming in an upcoming Micrometer release.

Introducing Toto: A State-of-the-Art Time Series Forecasting Model

Interesting review of Datadog’s latest time-series foundation model, how it compares to other models, and why we should care about it.

Tools

jmpsec/osctrl

“With osctrl you can monitor all your systems running osquery, distribute its configuration fast, collect all the status and result logs and allow you to run on-demand queries.”

osquery/osquery

“SQL powered operating system instrumentation, monitoring, and analytics.”

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 275

2024-08-04T00:00:00+00:00

A little something for everyone this week, with an emphasis on OpenTelemetry and Prometheus, and a look back on the CrowdStrike outage. Enjoy! 🌞🛶⏰

This issue is sponsored by:

Run GitHub Actions up to 2x faster at half the cost

From The Community

OpenTelemetry: Python SDK Design & Fundamental Concepts

An excellent post for anyone who already has a passing knowledge of OTel but needs more detail specific to Python apps and libraries.

Engineering Resilience: Lessons from the CrowdStrike-Microsoft Incident

A sort of aggregated postmortem on the CrowdStrike incident from an outsider’s perspective. As someone who was privileged to not be impacted directly by this, it’s a fascinating look back on what I missed (whew).

Java 21 Virtual Threads - Dude, Where’s My Lock?

Less of an observability or monitoring post, but still a fascinating read on systems engineering and debugging through the eyes of a Netflix engineer.

Top 20 Linux Bandwidth Monitoring Tools in 2024

Always fun to revisit network troubleshooting and monitoring tools, mostly because it reminds me that I used to know something about networking in the Before Cloud™ days.

Manage your monitors more efficiently with Datadog Teams

I just left a company that used Datadog and I had no idea these capabilities existed for Teams. Of course, I had pretty limited permissions to start with, so it’s no great surprise I didn’t know about it. 😈

Mastering Prometheus for Robust System Monitoring

A solid post for anyone moving beyond the “ok, I installed Prometheus… what next” phase to critical planning for a successful integration within existing infrastructure. Always good to revisit the discovery options available within Prometheus.

Mezmo's telemetry pipeline streamlines data collection, profiling, transformation, routing, and analysis. Our free Telemetry Data Profiling Offer helps you understand and optimize your data to meet your observability goals. Sign up for a free trial to experience the platform first-hand. (SPONSORED)

11 Takeaways from Observability Engineering Book

Some solid notes and highlights from “the” observability book. Unsurprisingly, there’s a bit of overlap with my own recent reading of the Learning OpenTelemetry book.

Mastering Zabbix Regular Expressions: A Comprehensive Guide

I don’t run into many Zabbix shops this side of the Atlantic Ocean, but if you’re one of them you might enjoy this dive into Zabbix regex.

Don’t get blinded by your Observability tools

A reminder to be mindful of what you’re collecting; most shops can no longer afford to “monitor all the things”.

Tools

DrDroidLab/playbooks

“Runbook automation platform with deep observability integrations for SRE & On-Call Teams”

Job Opportunities

Sr. Site Reliability Engineer at Vimeo (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor

Issue 274

2024-07-28T00:00:00+00:00

Quite the spike in eBPF topics this week, including a thought from Brendan Gregg on how it could’ve saved us from the recent “Blue Friday”. Enjoy! 💙🌽📈

This issue is sponsored by:

Run GitHub Actions up to 2x faster at half the cost

From The Community

Towards Jaeger v2 💥💥💥 Moar OpenTelemetry!

An exciting update on the next major version of Jaeger. Great to see stronger alignment with OpenTelemetry (native OTLP support, OTel collector extensions!), even if it means some short-term pains, e.g. CLI configuration compatibility.

No More Blue Fridays

A different perspective on the CrowdStrike outage that affected everyone, how it can be avoided once eBPF is production-ready for Windows, and what you can do in the meantime.

Promruval v3: Validation of Loki rules and more

A look at the latest release of Promruval, an open source tool for validating Prometheus rules and expressions. Nifty tool with a good understanding of how folks use (and misuse) Prometheus.

Why Care About Exception Profiling in PHP?

I don’t use PHP, but I love reading about profilers and what we can learn from them. Coming from a very exception-happy Ruby background, I was surprised to hear how expensive they are in PHP-land.

Can eBPF Detect Redis Message Patterns Before They Become Problems?

How Anteon designed their Redis observability agent using eBPF. Some useful learnings for anyone curious about setting up your own custom eBPF hooks.

Monitoring of Monitoring

Tips and considerations for anyone dealing with the traditional “who watches the watchers” conundrum.

Axiom is your hedge against observability and security tools: ingest, store, and query 100% of your event data in Axiom and then flow exactly what you need to vendors. Use Axiom for cost-reduction, silo-busting, tool consolidation, and vendor experimentation - all without compromising a single byte of data. (SPONSORED)

Unveiling the black box with observability stack

How GovTech Singapore engineers use a fully open source stack to gain observability on their applications and infrastructure.

Scaling Prometheus with Thanos

A solid overview of Thanos and its components, but the installation guide is underwhelming imho. Thanos is one of those projects that makes it possible to scale Prometheus, but you don’t truly understand its inner workings until you’ve used it in anger.

Tools

FUSAKLA/promruval

“Tool to validate the Prometheus rules metadata and expression properties…”

Job Opportunities

Senior DevOps Engineer at NavvTrack (US Remote)

Senior Cloud Systems Engineer at Tucows (NA Remote)

Senior Site Reliability Engineer, Databases at Grafana Labs (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor