EzDevInfo.com

watchdog

Python library and shell utilities to monitor filesystem events. Watchdog — watchdog 0.8.2 documentation

How can I verify if a Windows Service is running

I have an application in C# (2.0 running on XP embedded) that is communicating with a 'watchdog' that is implemented as a Windows Service. When the device boots, this service typically takes some time to start. I'd like to check, from my code, if the service is running. How can I accomplish this?


Source: (StackOverflow)

Writing a file with vim doesn't fire a file change event on OS X

I am using watchdog to monitor .less file change events on OS X. If I change the contents of a .less file with TextMate or Sublime Text the modification event is captured. However, if I edit the content with vim no file modification event is fired (but file creation events for files created with vim are captured). I have seen the same behaviour with FSEvents and kqueue (both of which I have practically zero knowledge of).

I wonder can anybody explain this behaviour?
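One plausible explanation, offered here as an assumption rather than something stated in the question: depending on its `backupcopy` setting, vim may save by writing a new file and moving it into place instead of overwriting the original. A watcher then sees a creation or a move for that path, not a modification. The sketch below imitates both save styles with the standard library (Python 3) and compares inode numbers; `save_in_place`, `save_by_rename` and `inode` are hypothetical helper names.

```python
import os
import tempfile


def save_in_place(path, text):
    """Overwrite the existing file; the path keeps the same inode."""
    with open(path, "w") as fh:
        fh.write(text)


def save_by_rename(path, text):
    """Write a new file next to `path`, then rename it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    os.replace(tmp, path)  # the path now refers to a different inode


def inode(path):
    """Return the inode number the path currently points at."""
    return os.stat(path).st_ino
```

If the inode changes on save, a watcher that tracks the original file (as kqueue does through an open descriptor) has nothing left to report modifications on, which would match the behaviour described.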


Source: (StackOverflow)

Foolproof cross-platform process kill daemon

I have some python automation, which spawns telnet sessions that I log with the linux script command; there are two script process IDs (a parent and child) for each logging session.

I need to solve a problem where if the python automation script dies, the script sessions never close on their own; for some reason this is much harder than it should be.

So far, I have implemented watchdog.py (see bottom of the question), which daemonizes itself, and polls the python automation script's PID in a loop. When it sees the python automation PID disappear from the server's process table, it attempts to kill the script sessions.

My problem is:

  • script sessions always spawn two separate processes, one of the script sessions is the parent of the other script session.
  • watchdog.py will not kill the child script sessions, if I start script sessions from the automation script (see AUTOMATION EXAMPLE, below)

AUTOMATION EXAMPLE (reproduce_bug.py)

import pexpect as px
from subprocess import Popen
import code
import time
import sys
import os

def read_pid_and_telnet(_child, addr):
    time.sleep(0.1) # Give the OS time to write the PIDFILE
    # Read the PID in the PIDFILE
    fh = open('PIDFILE', 'r')
    pid = int(''.join(fh.readlines()))
    fh.close()
    time.sleep(0.1)
    # Clean up the PIDFILE
    os.remove('PIDFILE')
    _child.expect(['#', '\$'], timeout=3)
    _child.sendline('telnet %s' % addr)
    return str(pid)

pidlist = list()
child1 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE1.txt'""")
pidlist.append(read_pid_and_telnet(child1, '10.1.1.1'))

child2 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE2.txt'""")
pidlist.append(read_pid_and_telnet(child2, '10.1.1.2'))

cmd = "python watchdog.py -o %s -k %s" % (os.getpid(), ','.join(pidlist))
Popen(cmd.split(' '))
print "I started the watchdog with:\n   %s" % cmd

time.sleep(0.5)
raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"

Now the example of what happens when I run the automation above... note that PID 30017 spawns 30018 and PID 30020 spawns 30021. All the aforementioned PIDs are script sessions.

[mpenning@Hotcoffee Network]$ python reproduce_bug.py 
I started the watchdog with:
   python watchdog.py -o 30016 -k 30017,30020
Traceback (most recent call last):
  File "reproduce_bug.py", line 35, in <module>
    raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"
RuntimeError: Simulated script crash.  Note that script child sessions are hung
[mpenning@Hotcoffee Network]$

After I run the automation above, all the child script sessions are still running.

[mpenning@Hotcoffee Models]$ ps auxw | grep script
mpenning 30018  0.0  0.0  15832   508 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE1.txt
mpenning 30021  0.0  0.0  15832   516 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE2.txt
mpenning 30050  0.0  0.0   7548   880 pts/8    S+   12:08   0:00 grep script
[mpenning@Hotcoffee Models]$

I am running the automation under Python 2.6.6, on a Debian Squeeze linux system (uname -a: Linux Hotcoffee 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64 GNU/Linux).

QUESTION:

It seems that the daemon doesn't survive the spawning-process crash. How can I fix watchdog.py to close all the script sessions if the automation dies (as shown in the example above)?

A watchdog.py log that illustrates the problem (sadly, the PIDs do not coincide with the original question)...

[mpenning@Hotcoffee ~]$ cat watchdog.log 
2012-02-22,15:17:20.356313 Start watchdog.watch_process
2012-02-22,15:17:20.356541     observe pid = 31339
2012-02-22,15:17:20.356643     kill pids = 31352,31356
2012-02-22,15:17:20.356730     seconds = 2
[mpenning@Hotcoffee ~]$

Resolution

The problem essentially was a race condition. When I tried to kill the "parent" script processes, they had already died coincidental with the automation event...

To solve the problem... first, the watchdog daemon needed to identify the entire list of children to be killed before polling the observed PID (my original script attempted to identify children after the observed PID crashed). Next, I had to modify my watchdog daemon to allow for the possibility that some script processes might die with the observed PID.
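The second half of that fix, tolerating PIDs that have already disappeared, can be sketched as follows. This is a minimal illustration assuming Python 3 on a POSIX system and using only the standard library rather than psutil; `pid_alive` and `terminate_all` are hypothetical names, not part of the original script.

```python
import os
import signal


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_all(pids):
    """Send SIGTERM to each PID, skipping any that have already exited."""
    killed, gone = [], []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
            killed.append(pid)
        except ProcessLookupError:
            gone.append(pid)  # died together with the observed process
    return killed, gone
```

The list passed to `terminate_all` should be collected while the observed process is still alive, as described above; the function then only has to cope with entries that vanished in the meantime.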


watchdog.py:

#!/usr/bin/python
"""
Implement a cross-platform watchdog daemon, which observes a PID and kills 
other PIDs if the observed PID dies.

Example:
--------

watchdog.py -o 29322 -k 29345,29346,29348 -s 2

The command checks PID 29322 every 2 seconds and kills PIDs 29345, 29346, 29348 
and their children, if PID 29322 dies.

Requires:
----------

 * https://github.com/giampaolo/psutil
 * http://pypi.python.org/pypi/python-daemon
"""
from optparse import OptionParser
import datetime as dt
import signal
import daemon
import logging
import psutil
import time
import sys
import os

class MyFormatter(logging.Formatter):
    converter=dt.datetime.fromtimestamp
    def formatTime(self, record, datefmt=None):
        ct = self.converter(record.created)
        if datefmt:
            s = ct.strftime(datefmt)
        else:
            t = ct.strftime("%Y-%m-%d %H:%M:%S")
            s = "%s,%03d" % (t, record.msecs)
        return s

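def check_pid(pid):
    """ Check for the existence of a pid (POSIX only)."""
    # Note: on Windows, os.kill() with signal 0 terminates the process
    # instead of probing it, so this check is not cross-platform as written.
    try:
        os.kill(pid, 0)   # Kill 0 raises OSError, if pid isn't there...
    except OSError:
        return False
    else:
        return True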

def kill_process(logger, pid):
    try:
        psu_proc = psutil.Process(pid)
    except Exception, e:
        logger.debug('Caught Exception ["%s"] while looking up PID %s' % (e, pid))
        return False

    logger.debug('Sending SIGTERM to %s' % repr(psu_proc))
    psu_proc.send_signal(signal.SIGTERM)
    psu_proc.wait(timeout=None)
    return True

def watch_process(observe, kill, seconds=2):
    """Kill the process IDs listed in 'kill', when 'observe' dies."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    logfile = logging.FileHandler('%s/watchdog.log' % os.getcwd())
    logger.addHandler(logfile)
    formatter = MyFormatter(fmt='%(asctime)s %(message)s',datefmt='%Y-%m-%d,%H:%M:%S.%f')
    logfile.setFormatter(formatter)


    logger.debug('Start watchdog.watch_process')
    logger.debug('    observe pid = %s' % observe)
    logger.debug('    kill pids = %s' % kill)
    logger.debug('    seconds = %s' % seconds)
    children = list()

    # Get PIDs of all child processes...
    for childpid in kill.split(','):
        children.append(childpid)
        p = psutil.Process(int(childpid))
        for subpsu in p.get_children():
            children.append(str(subpsu.pid))

    # Poll observed PID...
    while check_pid(int(observe)):
        logger.debug('Poll PID: %s is alive.' % observe)
        time.sleep(seconds)
    logger.debug('Poll PID: %s is *dead*, starting kills of %s' % (observe, ', '.join(children)))

    for pid in children:
        # kill all child processes...
        kill_process(logger, int(pid))
    sys.exit(0) # Exit gracefully

def run(observe, kill, seconds):

    with daemon.DaemonContext(detach_process=True, 
        stdout=sys.stdout,
        working_directory=os.getcwd()):
        watch_process(observe=observe, kill=kill, seconds=seconds)

if __name__=='__main__':
    parser = OptionParser()
    parser.add_option("-o", "--observe", dest="observe", type="int",
                      help="PID to be observed", metavar="INT")
    parser.add_option("-k", "--kill", dest="kill",
                      help="Comma separated list of PIDs to be killed", 
                      metavar="TEXT")
    parser.add_option("-s", "--seconds", dest="seconds", default=2, type="int",
                      help="Seconds to wait between observations (default = 2)", 
                      metavar="INT")
    (options, args) = parser.parse_args()
    run(options.observe, options.kill, options.seconds)

Source: (StackOverflow)

Run cron job only if it isn't already running

So I'm trying to set up a cron job as a sort of watchdog for a daemon that I've created. If the daemon errors out and fails, I want the cron job to periodically restart it... I'm not sure how possible this is, but I read through a couple of cron tutorials and couldn't find anything that would do what I'm looking for...

My daemon gets started from a shell script, so I'm really just looking for a way to run a cron job ONLY if the previous run of that job isn't still running.

I found this post, which did provide a solution for what I'm trying to do using lock files, but I'm not sure if there is a better way to do it...
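The lock-file idea can be sketched with an advisory lock, assuming Python 3 on a POSIX system. The kernel releases the lock when the holding process exits, so a crashed run cannot leave a stale lock behind. `acquire_lock` and the lock path are hypothetical names chosen for illustration.

```python
import fcntl
import os


def acquire_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns an open file descriptor on success, or None if another
    holder already has the lock. Keep the descriptor open for as long
    as the job runs; closing it (or exiting) releases the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A cron-launched wrapper would call `acquire_lock("/tmp/mydaemon.lock")` first and exit immediately when it returns None, so that only one copy of the job runs at a time.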

Thanks for your help.


Source: (StackOverflow)

What's the best way to watchdog a desktop application?

I need some way to monitor a desktop application and restart it if it dies.

Initially I assumed the best way would be to monitor/restart the process from a Windows service, until I found out that, since Vista, Windows services should not interact with the desktop.

I've seen several questions dealing with this issue, but every answer I've seen involved some kind of hack that is discouraged by Microsoft and will likely stop working in future OS updates.

So, a Windows service is probably not an option anymore. I could probably just create a different desktop/console application to do this, but that kind of defeats its purpose.

Which would be the most elegant way to achieve this, in your opinion?

EDIT: This is neither malware nor virus. The app that needs monitoring is a media player that will run on an embedded system, and even though I'm trying to cover all possible crash scenarios, I can't risk having it crash over an unexpected error (s**t happens). This watchdog would be just a safeguard in case everything else goes wrong. Also, since the player would be showing 3rd party flash content, an added plus would be for example to monitor for resource usage, and restart the player if say, some crappy flash movie starts leaking memory.

EDIT 2: I forgot to mention, the application I would like to monitor/restart has absolutely no need to run on either the LocalSystem account nor with any administrative privileges at all. Actually, I'd prefer it to run using the currently logged user credentials.


Source: (StackOverflow)

How does Linux nmi watchdog work?

I have run into a problem with the Linux NMI watchdog. I want to use it to detect and recover from OS hangs, so I added "nmi_watchdog=1" to grub.cfg. Checking /proc/interrupts showed that NMIs were being triggered once per second. But after I loaded a module containing a deadlock (a double-acquired spinlock), the system hung completely and nothing happened (it never panicked). It looks like the NMI watchdog did not work!

Then I read Documentation/nmi_watchdog.txt, which says:

Be aware that when using local APIC, the frequency of NMI interrupts it generates, depends on the system load. The local APIC NMI watchdog, lacking a better source, uses the "cycles unhalted" event.

What's the "cycles unhalted" event?

It adds:

but if your system locks up on anything but the "hlt" processor instruction, the watchdog will trigger very soon as the "cycles unhalted" event will happen every clock tick...If it locks up on "hlt", then you are out of luck -- the event will not happen at all and the watchdog won't trigger.

It seems that the watchdog won't trigger if the processor executes the "hlt" instruction, so I searched for "hlt" in the "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A", which describes it as follows:

Stops instruction execution and places the processor in a HALT state. An enabled interrupt (including NMI and SMI), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal will resume execution.

Then I got lost...

My question is:

  • How does the Linux NMI watchdog work?
  • Who triggers the NMI?

My OS is Ubuntu 10.04 LTS, Linux-2.6.32.21, CPU Pentium 4 Dual-core 3.20 GHz.

I haven't read the whole NMI watchdog source code (no time). If I can't understand how the NMI watchdog works, I want to use the performance monitoring counter interrupt and inter-processor interrupts (provided by the APIC) to send NMIs instead of relying on the NMI watchdog.

Could anybody help me? Thanks.


Source: (StackOverflow)

Linux software watchdog

I am writing a system monitor for Linux and want to include some watchdog functionality. In the kernel, you can configure the watchdog to keep going even if /dev/watchdog is closed. In other words, if my daemon exits normally and closes /dev/watchdog, the system would still re-boot 59 seconds later. That may or may not be desirable behavior for the user.

I need to make my daemon aware of this setting because it will influence how I handle SIGINT. If the setting is on, my daemon would need to (preferably) start an orderly shutdown on exit or (at least) warn the user that the system is going to reboot shortly.

Does anyone know of a method to obtain this setting from user space? I don't see anything in sysconf() to get the value. Likewise, I need to be able to tell if the software watchdog is enabled to begin with.

Edit:

Linux provides a very simple watchdog interface. A process can open /dev/watchdog. Once the device is opened, the kernel begins a 60-second countdown to reboot unless some data is written to that file, in which case the clock resets.

Depending on how the kernel is configured, closing that file may or may not stop the countdown. From the documentation:

The watchdog can be stopped without causing a reboot if the device /dev/watchdog is closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
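The interface described above can be sketched as a small helper, assuming Python 3. `WatchdogFeeder` is a hypothetical name, and the device path is a parameter so the class can be exercised against an ordinary file. Closing "correctly" is assumed here to mean the kernel's "magic close" convention of writing the character 'V' just before closing, which applies only to drivers that support it. Opening the real /dev/watchdog arms the timer, so this should not be run against the real device casually.

```python
import os


class WatchdogFeeder:
    """Keep a Linux watchdog device from firing (illustrative sketch)."""

    def __init__(self, path="/dev/watchdog"):
        # Opening the device starts the countdown.
        self.fd = os.open(path, os.O_WRONLY)

    def pet(self):
        """Reset the countdown; any written data counts as a keepalive."""
        os.write(self.fd, b"\0")

    def close(self, disarm=True):
        """Close the device, optionally asking the driver to stop the timer."""
        if disarm:
            # Magic close: 'V' written immediately before close().
            # With CONFIG_WATCHDOG_NOWAYOUT the timer keeps running anyway.
            os.write(self.fd, b"V")
        os.close(self.fd)
```

A daemon would call `pet()` from its main loop at an interval shorter than the timeout and call `close()` from its shutdown handler.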

I need to be able to tell if CONFIG_WATCHDOG_NOWAYOUT was set from within a user space daemon, so that I can handle the shutdown of said daemon differently. In other words, if that setting is high, a simple:

# /etc/init.d/mydaemon stop

... would reboot the system in 59 seconds, because nothing is writing to /dev/watchdog any longer. So, if it's set high, my handler for SIGINT needs to do additional things (i.e. warn the user at the least).

I can not find a way of obtaining this setting from user space :( Any help is appreciated.


Source: (StackOverflow)

Who is refreshing hardware watchdog in Linux?

I have an AT91SAM9G20 processor running a 2.6 kernel. The watchdog is enabled at bootstrap level and configured for 16 seconds. The watchdog mode register can be configured only once. When code hangs in the bootstrap, the bootloader or the kernel, the board reboots. But once the kernel comes up, even though none of the applications refreshes the watchdog, the board is reset after 15 minutes rather than after 16 seconds.

Who is refreshing the watchdog?

In our case, the watchdog should be influenced by applications, so that the board can reset if our application hangs.

These are the running processes:

1 root     init
2 root     [kthreadd]
3 root     [ksoftirqd/0]
4 root     [watchdog/0]
5 root     [events/0]
6 root     [khelper]
63 root     [kblockd/0]
72 root     [ksuspend_usbd]
78 root     [khubd]
85 root     [kmmcd]
107 root     [pdflush]
108 root     [pdflush]
109 root     [kswapd0]
110 root     [aio/0]
740 root     [mtdblockd]
828 root     [rpciod/0]
982 root     [jffs2_gcd_mtd10]
1003 root     /sbin/udevd -d
1145 daemon   portmap
1158 dbus     dbus-daemon --system
1178 root     /usr/sbin/ifplugd -i eth0 -fwI -u0 -d5 -l -q
1190 root     /usr/sbin/ifplugd -i eth1 -fwI -u0 -d5 -l -q
1221 default  avahi-daemon: running [SP14.local]
1226 root     /usr/sbin/dropbear
1246 root     /root/bin/host_app
1254 root     /root/bin/mini_httpd -c *.cgi -d /root/bin -u root -E /root/bin/
1256 root     -sh
1257 root     /sbin/syslogd -n -m 0
1258 root     /sbin/klogd -n
1259 root     /usr/bin/tail -f /var/log/messages
1265 root     ps -e

We are using the watchdog for soft lockups available in kernel-2.6.25-ts.at91sam9g20/kernel/softlockup.c


Source: (StackOverflow)

Can I create a software watchdog timer thread in C++ using Boost Signals2 and Threads?

I am currently running function Foo from somebody else's library in a single-threaded application. Most of the time, I make a call to Foo and it's really quick; sometimes, I make a call to Foo and it takes forever. I am not a patient man: if Foo is going to take forever, I want to stop execution of Foo and not call it with those arguments.

What is the best way to call Foo in a controlled manner (my current environment is POSIX/C++) such that I can stop execution after a certain number of seconds? I feel like the right thing to do here is to create a second thread to call Foo, while in my main thread I create a timer function that will eventually signal the second thread if it runs out of time.

Is there another, more apt model (and solution)? If not, would Boost's Signals2 library and Threads do the trick?
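One alternative model, shown here in Python 3 rather than C++ and swapping the second thread for a child process: run the slow call in a separate process and kill that process when a deadline passes. A process can be stopped from outside, whereas a thread generally cannot be killed safely. `run_with_deadline` is a hypothetical helper, and the command it runs stands in for whatever wraps the call to Foo.

```python
import subprocess


def run_with_deadline(argv, seconds):
    """Run `argv` as a child process, killing it after `seconds`.

    Returns the child's exit code, or None if the deadline was hit.
    """
    try:
        return subprocess.run(argv, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return None
```

The caller can record the arguments that produced a None result and avoid retrying them.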


Source: (StackOverflow)

python watchdog monitoring file for changes

Folks, I have a need to watch a log file for changes. After looking through stackoverflow questions, I see people recommending 'watchdog'. So I'm trying to test, and am not sure where to add the code for when files change:

#!/usr/bin/python
import time
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

if __name__ == "__main__":
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path='.', recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
        else:
            print "got it"
    except KeyboardInterrupt:
        observer.stop()
    observer.join()

Where do I add the "got it", in the while loop if the files have been added/changed?
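With watchdog, the reaction to a change normally lives in an event handler rather than in the `while` loop, which only keeps the main thread alive. A minimal sketch, assuming the `FileSystemEventHandler` base class from `watchdog.events` and its `on_created`/`on_modified` hooks; `got_it`, `seen` and `watch` are hypothetical names. The watchdog imports are kept inside `watch` only so that the callback can be tried without the library installed.

```python
import time

seen = []


def got_it(path):
    """React to a change; replace this with real log processing."""
    seen.append(path)
    print("got it: %s" % path)


def watch(directory, callback):
    """Call `callback(path)` for every file created or modified in `directory`."""
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class ChangeHandler(FileSystemEventHandler):
        def on_created(self, event):
            if not event.is_directory:
                callback(event.src_path)

        def on_modified(self, event):
            if not event.is_directory:
                callback(event.src_path)

    observer = Observer()
    observer.schedule(ChangeHandler(), path=directory, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)  # the handler runs on the observer's thread
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Calling `watch('.', got_it)` blocks until interrupted with Ctrl-C and prints a line for each change in the current directory.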

Thanks!


Source: (StackOverflow)

how to use linux software watchdog

Hi, can anybody tell me how to use the software watchdog in Linux? I have a program "SampleApplication" which runs continuously, and I need to restart it if it hangs or closes unexpectedly.

I was googling about this and found that Linux has a watchdog at /dev/watchdog, but I don't know how to use it. Could someone help me with an example?

My question is: where do I specify my application name and the delay interval before restart? As I am new to Linux, please explain briefly with a sample if possible. Thanks


Source: (StackOverflow)

What does "variable|variable" mean in C++?

I was looking into this ITE8712 watchdog timer demo code when I saw this:

void InitWD(char cSetWatchDogUnit, char cSetTriggerSignal)
{
    OpenIoConfig();     //open super IO of configuration for Super I/O

    SelectIoDevice(0x07);   //select device7

    //set watch dog counter of unit
    WriteIoCR(0x72, cSetWatchDogUnit|cSetTriggerSignal);

    //CloseIoConfig();      //close super IO of configuration for Super I/O
}

and, I wonder what is meant by this line:

cSetWatchDogUnit|cSetTriggerSignal

because the WriteIoCR function looks like this:

void WriteIoCR(char cIndex, char cData)
{
    //super IO of index port for Super I/O
    //select super IO of index register for Super I/O
    outportb(equIndexPort,cIndex);

    //super IO of data for Super I/O
    //write data to data register
    outportb(equDataPort,cData);
}

So cIndex should be 0x72, but what about the cData? I really don't get the "|" thing as I've only used it for OR ("||") in a conditional statement.
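A single `|` is the bitwise OR operator: it combines two integers bit by bit, so the value passed as cData carries the bits of both arguments. The same operator exists in Python, which makes for a quick illustration. The flag values below are hypothetical and are not taken from the ITE8712 datasheet.

```python
UNIT_FLAG = 0b10000000     # hypothetical bit selecting the counter unit
TRIGGER_FLAG = 0b01000000  # hypothetical bit selecting the trigger signal


def combine(unit, trigger):
    """Bitwise OR: a result bit is 1 if it is 1 in either operand."""
    return unit | trigger


value = combine(UNIT_FLAG, TRIGGER_FLAG)
print(format(value, "#010b"))  # 0b11000000
```

By contrast, the logical `||` in C++ would collapse both operands to true or false and yield a single boolean, losing the individual bits.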


Source: (StackOverflow)

How to secure methods in java (overflow and so on)

I have to write a "WatchDog" in Java that ensures threads don't run for too long. Initializing the objects is no problem: I made a class that calls the WatchDog and the constructor via reflection in the run() method.

A thread is easy to stop, but how can I guard the normal methods of objects? For example, if I call a method of an object and that method runs an endless loop, how would you handle that?

thanks


Source: (StackOverflow)

How to debug a watchdog timeout

I have a watchdog in my microcontroller that, if it is not kicked, will reset the processor. My application runs fine for a while but will eventually reset because the watchdog did not get kicked. If I step through the program it works fine.

What are some ways to debug this?

EDIT: Conclusion: The way I found my bug was the watchdog breadcrumbs.

I am using a PIC that has a high and low ISR vector. The high vector was supposed to handle the LED matrix and the low vector was to handle the timer tick. But I put both ISR handlers in the high vector. So when I disabled the LED matrix ISR and the timer tick ISR needed service, the processor would be stuck in the low ISR to handle the timer tick, but the timer tick handler was not there.

The breadcrumbs narrowed my search down to the function that handled the LED matrix, and specifically to disabling the LED matrix interrupt.


Source: (StackOverflow)

Detect when parent process exits

I will have a parent process that is used to handle webserver restarts. It will signal the child to stop listening for new requests, the child will signal the parent that it has stopped listening, then the parent will signal the new child that it can start listening. In this way, we can accomplish less than 100ms down time for a restart of that level (I have a zero-downtime grandchild restart also, but that is not always enough of a restart).

The service manager will kill the parent when it is time for shutdown. How can the child detect that the parent has ended?

The signals are sent using stdin and stdout of the child process. Perhaps I can detect the end of an stdin stream? I am hoping to avoid a polling interval. Also, I would like this to be a really quick detection if possible.
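Detecting the end of the stdin stream can be sketched like this, assuming Python 3 and line-based messages from the parent. A blocking read returns an empty result once every write end of the pipe has been closed, which happens when the parent exits, so no polling interval is needed. `wait_for_parent_exit` is a hypothetical helper name.

```python
def wait_for_parent_exit(stream, on_message=None):
    """Block until `stream` reaches EOF, handing each line to `on_message`."""
    while True:
        line = stream.readline()
        if not line:  # an empty read means EOF: all writers are gone
            return
        if on_message is not None:
            on_message(line)
```

In the child this would be called as `wait_for_parent_exit(sys.stdin, handle_signal)`, typically on a dedicated thread. One caveat: EOF only arrives once all copies of the pipe's write end are closed, so it is delayed if another process has inherited that descriptor.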


Source: (StackOverflow)