I run several strategies as background processes, and I got burned:


The process was clearly running and its data was fresh, yet PM2 showed it as stopped.
If I trusted PM2 and restarted right away, I ended up interrupting processes that were still working.
Later I understood: PM2, launchd, and pid files are just registration layers. Whether they have a record of the process is one thing; whether the process is actually running is another.
What you really need to look at is the health file the process itself writes: last update within a few minutes + process count matches = alive.
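A minimal sketch of that liveness rule in Python. The function names, the 300-second threshold, and the idea of passing in an expected process count are assumptions made for illustration; the post does not say how its own check is implemented or what "a few minutes" means exactly.

```python
import os
import time


def health_age_seconds(path, now=None):
    """Seconds since the health file was last modified, or None if it is missing."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    now = time.time() if now is None else now
    # Clamp at zero in case the file's mtime is slightly ahead of `now`.
    return max(0.0, now - mtime)


def is_alive(path, running_count, expected_count, max_age=300.0, now=None):
    """Alive = health file updated recently AND the process count matches.

    max_age=300.0 is an assumed stand-in for "a few minutes".
    """
    age = health_age_seconds(path, now)
    return age is not None and age <= max_age and running_count == expected_count
```

Note that neither PM2 nor launchd appears anywhere in this check: the verdict rests only on what the process itself wrote and on how many copies are running.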
I wrote a monitoring script that reports 4 values for each process at the same time:
- Is the process running (checked with ps)
- Is it registered with PM2 / launchd
- How long ago was the health file last updated
- Whether the three signals above agree with each other
As long as the health file is recent, the process is not treated as dead.
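The four-value report could be sketched roughly as below. This is not the author's script. The names `ProcessReport`, `build_report`, `count_in_ps`, `pm2_online`, and `collect` are hypothetical, as are the 300-second threshold and the "check" verdict for a running process with stale output. It assumes a POSIX `ps` and that `pm2 jlist` prints a JSON array whose entries carry `name` and `pm2_env.status`; a launchd lookup is left out.

```python
import json
import os
import subprocess
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessReport:
    name: str
    running: bool                  # seen in ps output
    registered: bool               # supervisor reports it as online
    health_age: Optional[float]    # seconds since health file update, None if missing
    consistent: bool               # do the three signals agree?
    verdict: str                   # "alive", "dead", or "check" (assumed labels)


def build_report(name, running, registered, health_age, max_age=300.0):
    """Combine the three signals; a recent health file is never treated as dead."""
    fresh = health_age is not None and health_age <= max_age
    consistent = running == registered == fresh
    if fresh:
        verdict = "alive"    # the process's own output outranks the registry
    elif running:
        verdict = "check"    # process exists but its output is stale (assumption)
    else:
        verdict = "dead"
    return ProcessReport(name, running, registered, health_age, consistent, verdict)


def count_in_ps(pattern):
    """Count processes whose command line contains `pattern`, excluding ourselves."""
    out = subprocess.run(["ps", "-A", "-o", "pid=,args="],
                         capture_output=True, text=True, check=False).stdout
    own_pid = str(os.getpid())
    count = 0
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] != own_pid and pattern in parts[1]:
            count += 1
    return count


def pm2_online(name):
    """True if PM2 lists `name` as online; False if PM2 is missing or unreadable."""
    try:
        out = subprocess.run(["pm2", "jlist"],
                             capture_output=True, text=True, check=False).stdout
        procs = json.loads(out)
    except (OSError, ValueError):
        return False
    return any(p.get("name") == name
               and p.get("pm2_env", {}).get("status") == "online"
               for p in procs)


def collect(name, pattern, health_path):
    """Gather all three signals for one process and build its report."""
    try:
        age = time.time() - os.path.getmtime(health_path)
    except OSError:
        age = None
    return build_report(name, count_in_ps(pattern) > 0, pm2_online(name), age)
```

The situation from the start of the post (running, data fresh, PM2 says stopped) corresponds to `build_report("strategy", True, False, 120.0)`: the verdict is "alive" with `consistent` set to False, so the mismatch is flagged without triggering a restart.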
Engineering lesson: to judge whether a system is alive, don't rely on what the monitoring layer says; look at whether the system's own output is recent.