system call to mpirun

Eide on 3 Oct 2011
Hi all,
I have a MATLAB file that includes a call to the MATLAB system command. The command looks like:
'mpirun -machinefile node.list -np 4 /pathtoexecutable/stg > /home/eide/log.txt'
The problem is that stg does not run properly. It crashes with the following error: rank 3 in job 1 name_of_clusternode caused collective abort of all ranks exit status of rank 3: killed by signal 9
If I run the same command from the console, it works perfectly. The paths to mpirun and the executable are listed in the $PATH variable, so MATLAB knows their location. I also tried absolute paths, which did not work either. Does anyone have a suggestion? The fact that the program runs in the console but not via the MATLAB system call makes me wonder whether the shell that MATLAB opens internally is missing some information, even though the path is known. I am running MATLAB R2010b on a Red Hat Enterprise Linux 4 cluster with tcsh.
regards
Eide

Answers (4)

Jason Ross on 3 Oct 2011
Compare the output of system('env') in MATLAB with the output of 'env' in the shell. You might need to configure the environment further with setenv.
Also, make sure that you are running the same shell (sh, csh, etc.) when you shell out as when you run your bare shell command.
You might also investigate which mpirun is being called: run system('which mpirun') in MATLAB and 'which mpirun' in the shell.
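A minimal shell sketch of the comparison suggested above, assuming both environments have been dumped to files first. The file names shell_env.txt and matlab_env.txt and the function name env_diff are hypothetical and only serve as an illustration.

```shell
# Sketch: list the variables that differ between two environment dumps.
# Assumed way to create the dumps (file names are hypothetical):
#   env > shell_env.txt                 (typed in the terminal)
#   system('env > matlab_env.txt')      (typed inside MATLAB)
# Usage:
#   env_diff shell_env.txt matlab_env.txt
env_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Sort both dumps so that ordering differences are not reported.
    sort "$1" > "$a"
    sort "$2" > "$b"
    # Keep only the changed lines: "<" = only in the first dump,
    # ">" = only in the second dump. "|| true" covers identical dumps.
    diff "$a" "$b" | grep '^[<>]' || true
    rm -f "$a" "$b"
}
```

Variables that appear on only one side, or with different values on each side, are the first candidates to adjust with setenv before the system call.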
  1 Comment
Eide on 4 Oct 2011
Please see my response below.


Walter Roberson on 3 Oct 2011
Sounds like either the PATH is incomplete or the LD_LIBRARY_PATH is incomplete.
You should be able to debug the problem using a combination of ldd and strace.
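As a rough sketch of how a system call trace could be narrowed down, the function below counts the failed calls in a saved trace by error name. It assumes the trace was written to a file with strace's -o option; the file names console.trace and matlab.trace and the function name failed_calls are hypothetical.

```shell
# Sketch: summarise failed system calls in a saved strace log.
# Assumed way to create the logs (file names are hypothetical):
#   strace -f -o console.trace /pathtoexecutable/stg     (in the terminal)
#   system('strace -f -o matlab.trace /pathtoexecutable/stg')   (in MATLAB)
# Usage:
#   failed_calls console.trace
#   failed_calls matlab.trace
failed_calls() {
    # Failed calls end in "= -1 ENAME (description)"; keep only ENAME,
    # then count how often each error name occurs.
    grep -E '\) = -1 E[A-Z0-9]+' "$1" \
        | sed -E 's/.*= -1 (E[A-Z0-9]+).*/\1/' \
        | sort | uniq -c
}
```

Comparing the two summaries shows quickly whether one run produces kinds of failures that the other does not, which tells you where in the full trace to look.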
  1 Comment
Eide on 4 Oct 2011
Please see my response below.


Eide on 4 Oct 2011

Hi,

@Jason

system('which mpirun') and 'which mpirun' show the same result.

I tested which shell is used with system('ps -p $$') and 'ps -p $$'; both are tcsh.

I did the same with PATH and LD_LIBRARY_PATH. They are the same, except for some extra paths and libraries when MATLAB is running.

@Walter ldd prints:

/test/New 30> ldd /private/eide/proc/.../bin/stg2
        libdl.so.2 => /lib64/libdl.so.2 (0x00000033ae400000)
        libfftw3.so.3 => /prog/sdpsoft/fftw-3.2.1/lib/libfftw3.so.3 (0x00002b3b8117c000)
        libm.so.6 => /lib64/libm.so.6 (0x00000033ae000000)
        libifcore.so.5 => /prog/Intel/intel_fc_11/lib/intel64/libifcore.so.5 (0x00002b3b813ff000)
        libifcoremt.so.5 => /prog/Intel/intel_fc_11/lib/intel64/libifcoremt.so.5 (0x00002b3b81657000)
        libimf.so => /prog/Intel/intel_fc_11/lib/intel64/libimf.so (0x00002b3b818de000)
        libmpigc4.so.4 => /prog/Intel/intel_mpi_4.0/lib64/libmpigc4.so.4 (0x00002b3b81c34000)
        libmpi_dbg.so.4 => /prog/Intel/intel_mpi_4.0/lib64/libmpi_dbg.so.4 (0x00002b3b81d58000)
        libmpigf.so.4 => /prog/Intel/intel_mpi_4.0/lib64/libmpigf.so.4 (0x00002b3b822b7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000033ae800000)
        librt.so.1 => /lib64/librt.so.1 (0x00000033af000000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000033bf600000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000033b9e00000)
        libc.so.6 => /lib64/libc.so.6 (0x00000033adc00000)
        libintlc.so.5 => /prog/Intel/intel_fc_11/lib/intel64/libintlc.so.5 (0x00002b3b823e6000)
        /lib64/ld-linux-x86-64.so.2 (0x00000033ad800000)
        libsvml.so => /prog/Intel/intel_fc_11/lib/intel64/libsvml.so (0x00002b3b82523000)

When I run system('ldd /private/eide/proc/.../bin/stg2') from MATLAB, more libraries are listed. Is it possible that the MATLAB libraries contain functions with the same names as those provided by Intel or FFTW, and that this causes the crash?

Thank you both for your help.

regards

Eide

  1 Comment
Walter Roberson on 4 Oct 2011
Your hypothesis is plausible, but I would not expect _more_ libraries to be listed unless there was at least one library in the above list that was loaded from a different place and that different place happened to reference additional libraries.
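One way to check for a library that is loaded from a different place is to compare the two ldd listings after stripping the load addresses, which change from run to run. The sketch below assumes both listings were saved to files; the file names ldd_console.txt and ldd_matlab.txt and the function names are hypothetical.

```shell
# Sketch: show libraries that resolve to different files in two ldd listings.
# Assumed way to create the listings (file names are hypothetical; the
# path is elided here exactly as it is in the thread):
#   ldd /private/eide/proc/.../bin/stg2 > ldd_console.txt
#   system('ldd /private/eide/proc/.../bin/stg2 > ldd_matlab.txt')
# Usage:
#   ldd_diff ldd_console.txt ldd_matlab.txt

# Reduce "name => /path (0xaddress)" lines to "name /path".
# Lines without "=>" (the dynamic loader) and "not found" entries are skipped.
ldd_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $1, $3 }' "$1" | sort
}

ldd_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    ldd_paths "$1" > "$a"
    ldd_paths "$2" > "$b"
    # "<" = resolution in the first listing, ">" = in the second listing.
    diff "$a" "$b" | grep '^[<>]' || true
    rm -f "$a" "$b"
}
```

A library name that shows up on both a "<" and a ">" line with different paths is resolved differently in the two environments; a name that shows up only on a ">" line is an extra dependency in the second listing.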


Eide on 5 Oct 2011
Hi
@ Walter
As you suggested, I traced the program execution with strace, both from the console and from the MATLAB system call. The traces are the same except for one detail. Please see:
open("/usr/share/locale/locale.alias", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ade6d8c9000
read(3, "# Locale name alias data base.\n#"..., 4096) = 2528
read(3, "", 4096) = 0
close(3) = 0
munmap(0x2ade6d8c9000, 4096) = 0
open("/usr/lib/locale/LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_CTYPE", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC_CTYPE=en_US.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC_CTYPE=en_US/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_CTYPE", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_COLLATE", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC_CTYPE=en_US.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_COLLATE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC_CTYPE=en_US/LC_COLLATE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_COLLATE", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_COLLATE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC/LC_COLLATE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_MESSAGES", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC_CTYPE=en_US.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_MESSAGES", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC_CTYPE=en_US/LC_MESSAGES", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_MESSAGES", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_MESSAGES", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC/LC_MESSAGES", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_TIME", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC_CTYPE=en_US.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_TIME", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC_CTYPE=en_US/LC_TIME", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8/LC_TIME", O_RDONLY) = -1 ENAMETOOLONG (File name too long)
open("/usr/lib/locale/LC.utf8lcnumericclctimeenusutf8lccollateenusutf8lcmonetaryenusutf8lcmessagesenusutf8lcpaperenusutf8lcnameenusutf8lcaddressenusutf8lctelephoneenusutf8lcmeasurementenusutf8lcidentificationenusutf8/LC_TIME", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/LC/LC_TIME", O_RDONLY) = -1 ENOENT (No such file or directory)
There are 24 attempts to open a file called LC_CTYPE, LC_COLLATE, LC_MESSAGES, or LC_TIME under /usr/lib/locale, and all of them fail.
I googled it, and it seems MATLAB is trying to find a locale definition file, which finally fails.
If the time locale cannot be set correctly, maybe that is why execution fails in MATLAB but not from the console, where these open calls do not appear.
Does anyone have an idea how to resolve the problem?
Thanks for your help. I appreciate it a lot.
Eide
  1 Comment
Walter Roberson on 5 Oct 2011
The default for those things is to fall back to LANG=C, and there is only an error if it cannot find the definition for that. Otherwise it is a normal search procedure looking from the most specific versions of your current settings towards less and less specific until it finds a definition file.
You might want to try setting LANG=C in your shell environment before starting up MATLAB. I don't think it will make any substantial difference.
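In the trace above, the directory name being searched is a whole list of the form "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;...", which suggests (the thread does not confirm this) that one locale variable holds a composite string instead of a single locale name. The sketch below flags such values; the function names are hypothetical, and the choice of variables to inspect is an assumption.

```shell
# Sketch: flag locale variables whose value looks like a composite
# "LC_X=...;LC_Y=..." string rather than a single locale name.
# Usage (in the terminal, and via system(...) from MATLAB for comparison):
#   check_locale_env

# $1 = variable name, $2 = its value
check_locale_value() {
    case "$2" in
        '')       echo "$1 unset or empty" ;;
        *\;*|*=*) echo "$1 looks malformed: $2" ;;
        *)        echo "$1 ok: $2" ;;
    esac
}

check_locale_env() {
    for name in LANG LC_ALL LC_CTYPE LC_COLLATE LC_MESSAGES LC_TIME; do
        # Indirect lookup of the variable named in $name (empty if unset).
        eval "value=\${$name-}"
        check_locale_value "$name" "$value"
    done
}
```

If a variable is reported as malformed only in the MATLAB session, setting it to a plain value such as C before the system call is a cheap experiment, in line with the LANG=C suggestion above.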
