Re: Exciting new feature for g95

From: Madhusudan Singh (spammers-go-here_at_spam.invalid)
Date: 10/20/04


Date: Wed, 20 Oct 2004 13:40:29 -0400

Andy Vaught wrote:

>
> I've added a new feature to g95 that I've wanted for a long time.
> Living on the northern edge of the Sonoran desert, there are seasonal
> monsoons that roll through. Sometimes they get rather close. In case
> you've never heard it, thunder gets a special crackle to it when the
> bolt is within a couple hundred meters. I usually shut my
> machine down about then, losing any progress a running job has made.
>
> The new feature is a mechanism for resuming jobs. Being a
> reasonably lazy person, I use the existing unix core file for this,
> which the operating system already writes. A good way of
> getting a core file is to use the QUIT signal, which is usually bound
> to the control-backslash key. The behaviour of QUIT is the same as
> the interrupt signal (usually control-c or delete keys) except that it
> dumps a core file if your ulimit allows it.
>
> It's now possible to point a g95-compiled program to a core file and
> have it load the state of the running program and resume execution.
> For example:
>
> -----------------------------------
> andy@fulcrum:~/g95/g95 % cat tst.f90
>
> b = 0.0
> do i=1, 10
> do j=1, 3000000
> call random_number(a)
> a = 2.0*a - 1.0
> b = b + sin(sin(sin(a)))
> enddo
> print *, i, b
> enddo
> end
>
> andy@fulcrum:~/g95/g95 % g95 tst.f90
> andy@fulcrum:~/g95/g95 % a.out
> 1 -464.5689
> 2 -38.27584
> 3 -652.6890
> 4 -597.2142
> 5 -150.8911
> 6 -376.1212
> Quit (core dumped)
> andy@fulcrum:~/g95/g95 % a.out --resume core
> 7 -1078.404
> 8 -1444.724
> 9 -372.3247
> 10 -934.3513
> andy@fulcrum:~/g95/g95 %
> -----------------------------------
>
> Open files are reopened:
>
>
> -----------------------------------
> andy@fulcrum:~/g95/g95 % cat tst.f90
>
> b = 0.0
> do i=1, 10
> do j=1, 3000000
> call random_number(a)
> a = 2.0*a - 1.0
> b = b + sin(sin(sin(a)))
> enddo
> print *, i, b
> write(10,*) i, b
> enddo
> end
> andy@fulcrum:~/g95/g95 % g95 tst.f90
> andy@fulcrum:~/g95/g95 % a.out
> 1 -464.5689
> 2 -38.27584
> 3 -652.6890
> 4 -597.2142
> 5 -150.8911
> Quit (core dumped)
> andy@fulcrum:~/g95/g95 % cat fort.10
> 1 -464.5689
> andy@fulcrum:~/g95/g95 % a.out --resume core
> 6 -376.1212
> 7 -1078.404
> 8 -1444.724
> 9 -372.3247
> 10 -934.3513
> andy@fulcrum:~/g95/g95 % cat fort.10
> 1 -464.5689
> 2 -38.27584
> 3 -652.6890
> 4 -597.2142
> 5 -150.8911
> 6 -376.1212
> 7 -1078.404
> 8 -1444.724
> 9 -372.3247
> 10 -934.3513
> andy@fulcrum:~/g95/g95 %
> -----------------------------------
>
> The fort.10 file isn't up to date when the core is dumped, but the
> pending output is still buffered inside the core file. After resuming,
> the data is correctly flushed to the disk.
>
> This feature has a couple of limitations: you need to resume from the
> same binary that you quit from. Open files need to be in the same
> place and untouched. If you interface with another language, all bets
> are off. This feature is only available on x86-based Linux systems at
> the moment, with support for other systems in the future. A further
> constraint is that resumption must be from a processor that supports
> the same floating point registers as the core was dumped on, i.e. SSE
> registers. Other than that it should just work.
>
> For those who are wondering:
>
> -------------------------------
> andy@fulcrum:~/g95/g95 % cat tst.f90
>
> integer, pointer :: p => NULL()
> p = 1
> end
>
> andy@fulcrum:~/g95/g95 % g95 tst.f90
> andy@fulcrum:~/g95/g95 % a.out
> Segmentation fault (core dumped)
> andy@fulcrum:~/g95/g95 % a.out --resume core
> Segmentation fault (core dumped)
> andy@fulcrum:~/g95/g95 %
>
> -------------------------------
>
> Which works perfectly. The saved state is right before the fault,
> so when the program resumes it faults again.
>
> We're pretty excited about this because it opens up a wide range of
> possible uses. Without writing any special code, you can force a
> short job through a long queue. Another possibility is moving a
> running process to another machine. Take your work home. Move to a
> faster machine. Free up a fast machine.
>
> We're in the process of writing the "G95 Power User" page, so if you
> can think of something really cool that you can do with this, let us
> know and we'll put it where everyone can see it.
>
> I have one more large innovation planned for g95 that will make the
> corefile resume look like the -r option to 'ls'. That is going to
> have to wait for a while, though.
>
> Thanks go to the testers, Michael Richmond, Doug Cox, Harald Anlauf,
> Charles Rendleman and Joost Vandevondele. The two most interesting
> comments were "I'm telling my sysadmin to start backing up core files"
> and "People are really going to like this (after first distrusting it
> because this can't work)".
>
> Try it for yourself and let us know how it works: http://www.g95.org
>
> Andy
>
> ---------------
> mail: domain=firstinter.net address=andyv

Thanks! This is indeed important for many of the simulations I run. I have
written a fairly involved backup and restore mechanism for my program
variables.
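
For comparison, a hand-written checkpoint for the tst.f90 loop quoted above
might look roughly like the following. This is only a minimal sketch, not the
mechanism mentioned above: the program name, the file name ckpt.dat and unit 20
are hypothetical choices, and it assumes the only state worth saving is the
outer loop index, the running sum and the random number generator state.

```fortran
program checkpoint_demo
  implicit none
  ! Hypothetical names: ckpt.dat and unit 20 are illustrative choices.
  character(len=*), parameter :: ckpt_file = 'ckpt.dat'
  integer, parameter :: ckpt_unit = 20
  integer :: i, j, i_start, n
  integer, allocatable :: seed(:)
  real :: a, b
  logical :: have_ckpt

  ! The generator state has to be saved too, otherwise a resumed
  ! run would draw different random numbers.
  call random_seed(size=n)
  allocate(seed(n))

  b = 0.0
  i_start = 1

  ! Resume from the checkpoint if one exists.
  inquire(file=ckpt_file, exist=have_ckpt)
  if (have_ckpt) then
     open(ckpt_unit, file=ckpt_file, form='unformatted', status='old')
     read(ckpt_unit) i_start, b, seed
     close(ckpt_unit)
     call random_seed(put=seed)
     i_start = i_start + 1      ! the saved iteration was already completed
  end if

  do i = i_start, 10
     do j = 1, 3000000
        call random_number(a)
        a = 2.0*a - 1.0
        b = b + sin(sin(sin(a)))
     end do
     print *, i, b

     ! Save the state after each completed outer iteration.
     call random_seed(get=seed)
     open(ckpt_unit, file=ckpt_file, form='unformatted', status='replace')
     write(ckpt_unit) i, b, seed
     close(ckpt_unit)
  end do

  ! Finished: delete the checkpoint so the next run starts fresh.
  open(ckpt_unit, file=ckpt_file, form='unformatted', status='old')
  close(ckpt_unit, status='delete')
end program checkpoint_demo
```

Unlike the core-file resume, this only restarts at the boundary of an outer
iteration, so any work done in an interrupted inner loop is repeated, and every
new piece of program state has to be added to the read and write statements by
hand.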

I just checked the docs and found:

-std=f2003 Strict fortran 2003 checking

Does g95 support f2003 constructs, or is the above just for checking?


