Exciting new feature for g95

From: Andy Vaught (andy_at_firstinter.net)
Date: Wed, 20 Oct 2004 09:43:03 -0700


  I've added a new feature to g95 that I've wanted for a long time.
I live on the northern edge of the Sonoran desert, where seasonal
monsoons roll through. Sometimes they get rather close. In case
you've never heard it, thunder gets a special crackle to it when the
bolt is within a couple hundred meters. I usually shut my machine
down about then, losing any progress a running job has made.

  The new feature is a mechanism for resuming jobs. Being a
reasonably lazy person, I use the existing unix core file, which the
operating system already writes, as the saved state. A good way of
getting a core file is to use the QUIT signal, which is usually bound
to the control-backslash key. The behaviour of QUIT is the same as
that of the interrupt signal (usually the control-c or delete key),
except that it dumps a core file if your ulimit allows it.
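
  As a rough sketch of the ulimit check mentioned above, the following
assumes a Bourne-style shell such as bash (csh-family shells use
"limit coredumpsize" instead). The function name core_dumps_enabled is
made up for illustration and is not part of g95.

```shell
#!/bin/sh
# Hypothetical helper (not part of g95): succeed if this shell's soft
# limit on core file size is non-zero, i.e. QUIT can leave a core file.
core_dumps_enabled() {
    [ "$(ulimit -c)" != "0" ]
}

# Try to raise the limit for this shell and its children. This can fail
# if the hard limit is lower, so the result is checked afterwards.
ulimit -c unlimited 2>/dev/null

if core_dumps_enabled; then
    echo "core dumps enabled (limit: $(ulimit -c))"
else
    echo "core dumps disabled; QUIT will not leave a core file"
fi
```

  From another terminal, "kill -QUIT <pid>" sends the same signal as
control-backslash.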

  It's now possible to point a g95-compiled program to a core file and
have it load the state of the running program and resume execution.
For example:

-----------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90

b = 0.0
do i=1, 10
   do j=1, 3000000
      call random_number(a)
      a = 2.0*a - 1.0
      b = b + sin(sin(sin(a)))
   enddo
   print *, i, b
enddo
end

andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
 1 -464.5689
 2 -38.27584
 3 -652.6890
 4 -597.2142
 5 -150.8911
 6 -376.1212
Quit (core dumped)
andy@fulcrum:~/g95/g95 % a.out --resume core
 7 -1078.404
 8 -1444.724
 9 -372.3247
 10 -934.3513
andy@fulcrum:~/g95/g95 %
-----------------------------------

  Open files are reopened:

-----------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90

b = 0.0
do i=1, 10
   do j=1, 3000000
      call random_number(a)
      a = 2.0*a - 1.0
      b = b + sin(sin(sin(a)))
   enddo
   print *, i, b
   write(10,*) i, b
enddo
end
andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
 1 -464.5689
 2 -38.27584
 3 -652.6890
 4 -597.2142
 5 -150.8911
Quit (core dumped)
andy@fulcrum:~/g95/g95 % cat fort.10
 1 -464.5689
andy@fulcrum:~/g95/g95 % a.out --resume core
 6 -376.1212
 7 -1078.404
 8 -1444.724
 9 -372.3247
 10 -934.3513
andy@fulcrum:~/g95/g95 % cat fort.10
 1 -464.5689
 2 -38.27584
 3 -652.6890
 4 -597.2142
 5 -150.8911
 6 -376.1212
 7 -1078.404
 8 -1444.724
 9 -372.3247
 10 -934.3513
andy@fulcrum:~/g95/g95 %
-----------------------------------

  The fort.10 file isn't up to date when the core is dumped, but the
unwritten data is still buffered inside the core file. After
resuming, the data is correctly flushed to the disk.

  This feature has a few limitations. You need to resume from the
same binary that you quit from. Open files need to be in the same
place and untouched. If you interface with another language, all bets
are off. The feature is only available on x86-based Linux systems at
the moment, with support for other systems planned. A further
constraint is that resumption must be on a processor that supports
the same floating point registers as the one the core was dumped on,
i.e. SSE registers. Other than that it should just work.
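
  One way to guard against the "same binary" limitation is to record
a checksum of the executable before the job starts and compare it
before resuming. This is only a sketch in Bourne shell; the helper
names and the a.out.cksum file are invented for illustration, and g95
itself may or may not perform a similar check.

```shell
#!/bin/sh
# Hypothetical helpers (not part of g95) for the "same binary" rule.

# Record a checksum of the executable ($1) in a stamp file ($2).
stamp_binary() {
    cksum < "$1" > "$2"
}

# Succeed only if the executable ($1) still matches the stamp file ($2).
same_binary() {
    [ "$(cksum < "$1")" = "$(cat "$2")" ]
}

# Example use (a.out.cksum is a placeholder name):
#   stamp_binary a.out a.out.cksum        # before starting the job
#   same_binary a.out a.out.cksum && a.out --resume core
```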

  For those who are wondering:

-------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90

integer, pointer :: p => NULL()
  p = 1
end

andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
Segmentation fault (core dumped)
andy@fulcrum:~/g95/g95 % a.out --resume core
Segmentation fault (core dumped)
andy@fulcrum:~/g95/g95 %

-------------------------------

  Which works perfectly. The saved state is right before the fault,
so when the program resumes it faults again.

  We're pretty excited about this because it opens up a wide range of
possible uses. Without writing any special code, you can force a
short job through a long queue. Another possibility is moving a
running process to another machine. Take your work home. Move to a
faster machine. Free up a fast machine.
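
  For the queue case, a job script could start the program fresh on
the first submission and resume on later ones. The sketch below
assumes a Bourne-style shell; run_or_resume is a made-up name, and the
only g95-specific piece is the --resume option shown earlier in this
post. It does not handle sending QUIT before the queue's time limit
expires, which a real job script would also need.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of g95): resume from a core file if one
# exists, otherwise start the program from the beginning.
# Usage: run_or_resume PROGRAM [COREFILE]
run_or_resume() {
    prog=$1
    core=${2:-core}
    if [ -f "$core" ]; then
        "$prog" --resume "$core"   # --resume is the option shown in this post
    else
        "$prog"
    fi
}

# Example use in a job script:
#   run_or_resume ./a.out core
```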

  We're in the process of writing the "G95 Power User" page, so if you
can think of something really cool that you can do with this, let us
know and we'll put it where everyone can see it.

  I have one more large innovation planned for g95 that will make the
corefile resume look like the -r option to 'ls'. That is going to
have to wait for a while, though.

  Thanks go to the testers, Michael Richmond, Doug Cox, Harald Anlauf,
Charles Rendleman and Joost Vandevondele. The two most interesting
comments were "I'm telling my sysadmin to start backing up core files"
and "People are really going to like this (after first distrusting it
because this can't work)".

  Try it for yourself and let us know how it works: http://www.g95.org

        Andy

---------------
mail: domain=firstinter.net address=andyv


