Error with eclm when using more than 16 nodes on jureca #111

@AGonzalezNicolas

Description

When running eCLM on more than 16 nodes on JURECA, the job fails.

Working directory:

/p/scratch/cslts/gonzalez5/TestCases/DE009_tboas/JR_DE009_testing

Test case (Theresa Boas @tboas):
https://icg4geo.icg.kfa-juelich.de/Configurations/CLM/de09_eclm/-/tree/main?ref_type=heads

Error log:

<PSP:r0000115:ucp:psucp_con_connect() : ucp_ep_create() : Input/output error>
<PSP:r0000110:connection to (10.14.1.197,37600,0x6f,r0002066) (type:demand,state:closed) : connect : Protocol not available>
<PSP:r0000069:connection to (10.14.1.197,37743,0x46,r0002111) (type:demand,state:closed) : connect : Protocol not available>
<PSP:r0000071:ucp:psucp_con_connect() : ucp_ep_create() : Input/output error>
<PSP:r0000083:connection to (10.14.1.197,37669,0x54,r0002090) (type:demand,state:closed) : connect : Protocol not available>
Fatal error in internal_Send: Other MPI error, error stack:
internal_Send(120).: MPI_Send(buf=0x7ffdb2c3233c, count=1, MPI_INTEGER, 2111, 4415, comm=0xc4000001) failed
mpid_isend_done(38): write to socket failed - request state:send(pde)done
Fatal error in internal_Send: Other MPI error, error stack:
internal_Send(120).: MPI_Send(buf=0x7ffef52178dc, count=1, MPI_INTEGER, 69, 2373, comm=0xc4000001) failed
mpid_isend_done(38): write to socket failed - request state:send(pde)done
Fatal error in internal_Send: Other MPI error, error stack:
internal_Send(120).: MPI_Send(buf=0x7ffdda981e3c, count=1, MPI_INTEGER, 2066, 4370, comm=0xc4000001) failed
mpid_isend_done(38): write to socket failed - request state:send(pde)done
Fatal error in internal_Send: Other MPI error, error stack:
internal_Send(120).: MPI_Send(buf=0x7fff041e367c, count=1, MPI_INTEGER, 2090, 4394, comm=0xc4000001) failed
mpid_isend_done(38): write to socket failed - request state:send(pde)done
<PSP:r0000105:ucp:psucp_con_connect() : ucp_ep_create() : Input/output error>
<PSP:r0000001:ucp:psucp_con_connect() : ucp_ep_create() : Input/output error>
<PSP:r0000081:connection to (10.14.1.197,37679,0x52,r0002088) (type:demand,state:closed) : connect : Protocol not available>
<PSP:r0000082:connection to (10.14.1.197,37673,0x53,r0002091) (type:demand,state:closed) : connect : Protocol not available>
Fatal error in internal_Send: Other MPI error, error stack:
internal_Send(120).: MPI_Send(buf=0x7ffe5721fc5c, count=1, MPI_INTEGER, 2088, 4392, comm=0xc4000001) failed
mpid_isend_done(38): write to socket failed - request state:send(pde)done
pspmix_service_abort: on users request from rank 69: Fatal error in internal_Send: Other MPI error, error stack:
internal_Send(120).: MPI_Send(buf=0x7ffdb2c3233c, count=1, MPI_INTEGER, 2111, 4415, comm=0xc4000001) failed
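
To check whether the failed connections cluster on a particular node, the PSP connection lines can be summarized with a small script. This is only a minimal sketch and not part of eCLM or ParaStation MPI: the function name `summarize_failed_connections` and the regular expression are assumptions based solely on the line format visible in the log above, and the trailing `r…` field is assumed to be the destination rank.

```python
import re
from collections import Counter

# Assumed format of a PSP connection-failure line, inferred from the log above:
# <PSP:rSRC:connection to (HOST,PORT,0xNN,rDST) (...) : connect : REASON>
PSP_CONNECT = re.compile(
    r"<PSP:r(?P<src>\d+):connection to "
    r"\((?P<host>[\d.]+),(?P<port>\d+),0x[0-9a-f]+,r(?P<dst>\d+)\)"
    r".*?: connect : (?P<reason>[^>]+)>"
)


def summarize_failed_connections(log_text):
    """Count failed connections per destination host and list destination ranks."""
    hosts = Counter()
    dst_ranks = set()
    for match in PSP_CONNECT.finditer(log_text):
        hosts[match.group("host")] += 1
        dst_ranks.add(int(match.group("dst")))
    return hosts, sorted(dst_ranks)


# Three lines copied from the log above, used as sample input.
sample = """\
<PSP:r0000110:connection to (10.14.1.197,37600,0x6f,r0002066) (type:demand,state:closed) : connect : Protocol not available>
<PSP:r0000069:connection to (10.14.1.197,37743,0x46,r0002111) (type:demand,state:closed) : connect : Protocol not available>
<PSP:r0000083:connection to (10.14.1.197,37669,0x54,r0002090) (type:demand,state:closed) : connect : Protocol not available>
"""

hosts, ranks = summarize_failed_connections(sample)
print(hosts)  # failed connections per destination host
print(ranks)  # destination ranks that could not be reached
```

In the excerpt above, every failed connection targets the same address (10.14.1.197), which may point to a problem on a single node rather than in eCLM itself.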

Metadata

Assignees

No one assigned

Labels

weird-runtime-error: eCLM failed or hanged with unclear root cause
