WinterFlow.io - Netdata v2.4.0 release notes

Release Summary
Acknowledgments
Contributions
Deprecation notice
Support options

Release Summary

Netdata v2.4.0 is a stability-focused release that addresses many issues that were identified thanks to the new agent reporting system introduced in v2.3.0. This release significantly improves reliability by fixing multiple crash scenarios and memory leaks throughout the codebase.

Key Highlights

Category	Improvements
Memory Optimization	• Resolved significant memory leaks in container monitoring systems, particularly affecting Kubernetes deployments
	• Fixed memory leaks across database engine components, health alarm entries, and alert pattern matching
	• Improved SQLite memory management with maximum heap limits and dynamic memory release under system pressure
Stability Improvements	• Fixed numerous crashes in Windows performance counter handling and container monitoring systems
	• Improved error handling when dbengine files reside on disks with errors
	• Enhanced journal file handling with better error logging
	• Optimized shutdown sequences to prevent resource leaks and crashes
	• Fixed ACLK synchronization issues to properly handle dynamic host configuration changes
New Features	• Windows Service Monitoring: added the ability to track the state (running, stopped, pending, paused) of Windows services through the windows.plugin/PerflibServices collector (disabled by default; requires manual activation)

Acknowledgments

@dave818 for fixing a cron job syntax error in the updater script by correcting the time format.
@ycdtosa for documenting the missing --offline-install-source option in the kickstart script usage information, adding Synology-specific user and group creation commands to the kickstart script for improved DSM compatibility, and updating the Synology installation documentation to clearly distinguish the steps required for older DSM versions.

Contributions

Collectors

Improvements

Added Windows service monitoring that tracks service states, including running, stopped, pending, and paused (windows.plugin/PerflibServices) (#19990, @thiagoftsm)
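These notes do not describe how the collector derives the four states. As a hedged illustration only: the Win32 Service Control Manager reports a service's dwCurrentState as one of seven values defined in winsvc.h, and the sketch below shows one plausible way to collapse them into the four states named above by treating every transitional *_PENDING value as "pending". The grouping and the classify/summarize helpers are hypothetical and are not taken from windows.plugin.

```python
from collections import Counter

# Win32 SERVICE_STATUS.dwCurrentState values, as defined in winsvc.h.
SERVICE_STOPPED = 0x1
SERVICE_START_PENDING = 0x2
SERVICE_STOP_PENDING = 0x3
SERVICE_RUNNING = 0x4
SERVICE_CONTINUE_PENDING = 0x5
SERVICE_PAUSE_PENDING = 0x6
SERVICE_PAUSED = 0x7

# Hypothetical grouping (an assumption, not Netdata's code):
# every transitional *_PENDING state is reported as "pending".
STATE_BUCKETS = {
    SERVICE_STOPPED: "stopped",
    SERVICE_START_PENDING: "pending",
    SERVICE_STOP_PENDING: "pending",
    SERVICE_RUNNING: "running",
    SERVICE_CONTINUE_PENDING: "pending",
    SERVICE_PAUSE_PENDING: "pending",
    SERVICE_PAUSED: "paused",
}

TRACKED_STATES = ("running", "stopped", "pending", "paused")


def classify(state: int) -> str:
    """Map a raw dwCurrentState value to one of the four tracked states."""
    try:
        return STATE_BUCKETS[state]
    except KeyError:
        raise ValueError(f"unknown service state: {state:#x}") from None


def summarize(services: dict[str, int]) -> dict[str, int]:
    """Count services per tracked state, reporting zero for states with no services."""
    counts = Counter(classify(state) for state in services.values())
    return {name: counts.get(name, 0) for name in TRACKED_STATES}
```

For example, summarize({"Spooler": SERVICE_RUNNING, "BITS": SERVICE_START_PENDING}) counts one running and one pending service and reports zero for the other two states.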

Bug fixes

Fixed Prometheus collector to use appropriate units instead of “ratio” for measurements (go.d/prometheus) (#20069, @ilyam8)
Fixed crash in Windows Hyper-V collector caused by unpopulated shared buffer values (windows.plugin/PerflibHyperV) (#20060, @thiagoftsm)
Fixed MegaCLI collector to properly handle adapter configurations with no connected drives (go.d/megacli) (#20046, @ilyam8)

Other

Added socket and remote client capabilities to OpenTelemetry journald exporter (#20038, #20033, #20121, @ilyam8)
Added hostname labels to virtual nodes in Go-based collectors (#20030, @ilyam8)
Added preliminary support for custom YAML files in SNMP collector that will be used for single metrics in future releases (go.d/snmp) (#20020, @Ancairon)

Packaging/Installation

All changes

Fixed cron job syntax error in updater script by correcting the time format (#20039, @dave818)
Added missing --offline-install-source option documentation to kickstart script usage information (#20025, @ycdtosa)
Added Synology-specific user and group creation commands to kickstart script for improved DSM compatibility (#20024, @ycdtosa)
Added a Docker tag rotation system that tracks the four most recent nightly builds with relative numeric identifiers (#19734, #20089, @Ferroin)

Documentation

All changes

Improved clarity, structure, and examples throughout the Alerts & Notifications documentation (#20085, @kanelatechnical)
Updated documentation to provide clearer guidance on transitioning to static builds for end-of-life platforms (#20075, #20110, @ralphm)
Added documentation for the remove-stale-node command in the Nodes Ephemerality guide (#20057, @ralphm)
Fixed code block formatting in Log2Journal documentation to comply with MDX 3 requirements (#20056, @Ancairon)
Simplified OIDC configuration by removing parameters no longer needed after adding Discovery support (#20053, @juacker)
Improved documentation for observability centralization, including streaming, replication, and node management, with clearer language and structure (#20052, #20073, @kanelatechnical)
Removed on-premises documentation files that were relocated to a dedicated repository (#20023, @Ancairon)
Improved Windows installer and Machine Learning documentation with simpler language and better organization (#20021, @kanelatechnical)
Improved deployment guides with clearer explanations of standalone installations and centralization options (#20004, @kanelatechnical)
Updated Synology installation documentation to clearly differentiate steps required for older DSM versions (#19989, #19993, #20010, @ycdtosa)
Improved installation documentation with more concise instructions for macOS, offline installation, IPv4 configuration, native packages, and Docker deployment (#19987, @kanelatechnical)
Improved installation documentation for Ansible, Azure, AWS, Kickstart script, and Kubernetes deployments with better organization and clarity (#19981, @kanelatechnical)
Reordered the kickstart installation guide to provide a more logical top-to-bottom reading flow (#19975, @kanelatechnical)
Updated SCIM documentation to include new Groups support functionality (#19969, @juacker)

Other Notable Changes

Bug Fixes

Fixed significant memory leaks in container monitoring systems, particularly in cgroups plugin and network interface tracking, affecting Kubernetes and container deployments (#20116, @ktsaou)
Fixed journal file handling with improved error logging and better protection during retention calculations (#20098, @stelfrag)
Fixed Windows performance counter handling to prevent crashes from malformed registry data and optimized memory usage (#20097, @ktsaou)
Fixed memory leak in journal file creation process (#20094, @stelfrag)
Fixed memory management in timer cancellation to prevent resource leaks (#20084, @stelfrag)
Fixed crash during shutdown by properly handling pending cloud messages and preventing operations after MQTT connection closure (#20080, @stelfrag)
Fixed resource leaks during shutdown by properly releasing database engine memory and semaphores (#20078, @stelfrag)
Fixed memory leaks across various components including database engine’s extent structures, health alarm entries, worker monitoring system, and alert pattern matching, with improved memory management and cleanup procedures (#20062, @ktsaou)
Fixed use-after-free issue in cgroup network device handling during device rename operations (#20048, #20050, @ktsaou)
Fixed potential crash by preventing statement cleanup after database connections are closed (#20045, @stelfrag)
Fixed ACLK synchronization to properly handle shutdown sequence and prevent crashes during handle closure (#20034, @stelfrag)
Fixed potential crash in ACLK synchronization by properly stopping timers when hosts are deleted (#20031, @stelfrag)
Fixed database file rotation to improve disk space estimation and reduce unnecessary rotation operations (#20019, @stelfrag)
Fixed out-of-memory handling when creating new journal v2 files (#19965, @stelfrag)

Other

Improved Windows operating system detection with better version identification and edition information categorization (#20117, @ktsaou)
Reordered commands in the netdatacli help output to match the typical execution sequence (#20113, @ilyam8)
Fixed ACLK node update cancellation to properly handle cases without defined timers, ensuring commands like remove-stale-node ALL_NODES complete successfully (#20111, @stelfrag)
Fixed ACLK shutdown sequence to improve coordination between synchronization and MQTT threads (#20105, @stelfrag)
Fixed incorrect backoff timeout handling in ACLK connection retry logic (#20095, @stelfrag)
Fixed resource protection system with improved error recovery and detailed diagnostics for memory access violations (#20093, @ktsaou)
Fixed journal v2 file handling to protect against memory access violations with better signal handling (#20092, @ktsaou)
Fixed dlib integration by upgrading to latest version and properly incorporating it into the build system for better compatibility and warning management (#20086, @Ferroin)
Fixed shutdown process by preventing unnecessary journal file indexing operations during termination (#20079, @stelfrag)
Fixed ACLK synchronization to properly handle host configuration changes and prevent timer issues with deleted nodes (#20077, @stelfrag)
Improved agent status reporting with comprehensive system information collection including hardware details, boot mode, and detailed metrics storage analysis (#20058, #20076, #20088, #20096, #20101, #20104, @ktsaou)
Fixed crash detection in journal file migration with improved worker job tracing and assertions (#20043, @ktsaou)
Improved agent status reporting with enhanced hardware identification and better crash diagnostics through intelligent stack trace analysis (#20037, #20041, #20044, #20047, #20051, @ktsaou)
Fixed potential null pointer issue in trim_all function by ensuring it never returns NULL (#20028, #20029, @ktsaou)
Improved logs analysis by adding field filtering without facets and improving histogram visualization options (#20027, @ktsaou)
Added system hardware information to status reporting with privacy protection for serial numbers (#20026, @ktsaou)
Added a new agent events backend system with advanced deduplication, comprehensive metrics, structured logging, and improved request handling for better reliability and observability (#20012, #20014, #20063, #20064, #20065, #20067, #20068, #20074, @ktsaou)
Made several non-functional code changes to improve maintainability including removal of commented code, better naming conventions, formatting improvements, and proper function annotations (#20006, #20022, @vkalintiris)
Improved journal v2 loading performance by utilizing multiple CPU cores during initialization (#19995, @stelfrag)
Fixed potential issues in shutdown process and datafile rotation with better synchronization and improved alert handling (#19991, @stelfrag)
Fixed memory leak in metric correlation calculations when running with address sanitizer (#19979, @stelfrag)
Fixed SQLite memory management by implementing maximum soft and hard heap limits to prevent excessive memory usage (#19963, @stelfrag)
Added support for dynamically releasing SQLite memory when under system pressure (#19952, @stelfrag)
Fixed inverted logic for enabling libbacktrace in the build system (#19936, @Ferroin)
Added widget for loading available contexts in health dynamic configuration interface (#19904, @ilyam8)
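These notes do not show how the agent applies its SQLite limits. As a hedged illustration of the underlying SQLite mechanism only: SQLite keeps a process-wide soft heap limit (advisory; SQLite tries to stay below it but exceeds it rather than fail), a process-wide hard heap limit (allocations that would exceed it fail with SQLITE_NOMEM), and a per-connection way to free cached memory. The sketch below reaches them through Python's standard sqlite3 module via the documented hard_heap_limit, soft_heap_limit, and shrink_memory pragmas, which wrap sqlite3_hard_heap_limit64(), sqlite3_soft_heap_limit64(), and sqlite3_db_release_memory(). The helper names, the 32 MiB and 64 MiB values, and calling the release step from a memory-pressure handler are assumptions; hard_heap_limit requires SQLite 3.31.0 or later.

```python
import sqlite3

MIB = 1024 * 1024


def apply_heap_limits(conn: sqlite3.Connection, soft_bytes: int, hard_bytes: int) -> tuple[int, int]:
    """Set SQLite's process-wide heap limits and return (soft, hard) as SQLite reports them."""
    # Hard limit: allocations beyond it fail with SQLITE_NOMEM.
    # Set it first: this pragma can only lower it, and it also caps the soft limit.
    (hard,) = conn.execute(f"PRAGMA hard_heap_limit = {int(hard_bytes)}").fetchone()
    # Soft limit: advisory; SQLite tries to stay below it but exceeds it rather than fail.
    (soft,) = conn.execute(f"PRAGMA soft_heap_limit = {int(soft_bytes)}").fetchone()
    return soft, hard


def release_cached_memory(conn: sqlite3.Connection) -> None:
    """Free as much memory as this connection can (wraps sqlite3_db_release_memory())."""
    conn.execute("PRAGMA shrink_memory")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    soft, hard = apply_heap_limits(conn, soft_bytes=32 * MIB, hard_bytes=64 * MIB)
    print(f"soft={soft} hard={hard}")
    # Hypothetical trigger: call this from whatever detects memory pressure on the host.
    release_cached_memory(conn)
    conn.close()
```

Because both limits are process-wide, setting them through one connection affects every connection in the process, and the hard limit can be lowered later but never raised through the pragma.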

Deprecation notice

Changed in this release

No changes.

Important Changes in Next Major Release

Deprecated Components

Component Type	Versions Being Deprecated
APIs	v1, v2

What This Means

Only the v3 API and v3 Dashboard will be supported starting with the next major release. These newer versions offer improved performance, enhanced features, and better security.

Important Changes in Next Minor Release

No changes are expected.

Support options

As we grow, we stay committed to providing the best support ever seen from an open-source solution. Should you encounter an issue with any of the changes made in this release or any feature in the Netdata Agent, feel free to contact us through one of the following channels:

Premium Support: Customers who wish to have a direct channel with Netdata and prioritized support with defined SLAs can contact us.
Netdata Learn: Find documentation, guides, and reference material for monitoring and troubleshooting your systems with Netdata.
GitHub Issues: Use the Netdata repository to report bugs or open a new feature request.
GitHub Discussions: Join the conversation around the Netdata development process and be a part of it.
Community Forums: Visit the Community Forums and contribute to the collaborative knowledge base.
Discord Server: Jump into the Netdata Discord and hang out with like-minded sysadmins, DevOps, SREs, and other troubleshooters. More than 2000 engineers are already using it!
