yv4-sd: Fix CPU thermal trip SEL handling#2739

Closed
Wiwynn-Evan wants to merge 1 commit into
facebook:mainfrom
Wiwynn:Mandy/yv4/Modify_Thermaltrip_SEL

@Wiwynn-Evan Wiwynn-Evan commented Jun 26, 2026

Contributor

[Issue Description]

On yv4-sd, a CPU thermal trip may be reported as an incorrect SEL event. When FM_CPU_BIC_THERMTRIP_N is asserted, the BIC correctly enters ISR_SOC_THMALTRIP. However, the CPU thermal trip uses event type 0x00, which the current SEL handling flow may treat as an invalid event. As a result, the CPU thermal trip SEL may not be sent to the BMC correctly.

[Root Cause]

The CPU thermal trip uses event type 0x00, which is a valid SEL event type. When deciding whether a work item belongs to the normal GPIO SEL path or the wrapper SEL path, the current SEL handling flow may treat 0x00 as an invalid or uninitialized event. As a result, a CPU thermal trip may fall through to the wrapper path and be reported incorrectly.
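
The pitfall described above can be illustrated with a minimal sketch. This is not the actual yv4-sd code: the struct, enum, and function names are hypothetical, and the only detail taken from this PR is that CPU thermal trip uses event type 0x00. It assumes the path decision is made by checking whether the event type is non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* From the PR description: CPU thermal trip uses event type 0x00. */
#define EVENT_TYPE_CPU_THERMAL_TRIP 0x00

/* Hypothetical names for illustration only; not the real OpenBIC types. */
typedef enum { PATH_GPIO_SEL, PATH_WRAPPER_SEL } sel_path_t;

typedef struct {
	uint8_t event_type; /* zero-initialized storage also reads as 0x00 */
	bool assert_event;
} gpio_sel_work_t;

/*
 * Assumed problematic pattern: zero is read as "no GPIO event was filled
 * in", so a work item with event type 0x00 is routed to the wrapper path
 * even though 0x00 is a valid event type.
 */
sel_path_t pick_path_by_sentinel(const gpio_sel_work_t *work)
{
	return (work->event_type != 0) ? PATH_GPIO_SEL : PATH_WRAPPER_SEL;
}
```

With a check of this shape, a work item carrying `EVENT_TYPE_CPU_THERMAL_TRIP` cannot be told apart from an unfilled one, which matches the fall-through behavior described in the root cause.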

[Solution]

Keep addsel_work_handler() for normal GPIO SEL events, so that the CPU thermal trip is sent correctly with event type 0x00.

Add addsel_wrapper_work_handler() for the FAST_PROCHOT and SYS_THROTTLE wrapper SEL events. The wrapper handler skips incomplete wrapper SEL data so that it is not reported as a CPU thermal trip.

Update the FAST_PROCHOT and SYS_THROTTLE work initialization to use the wrapper SEL handler.
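
The split into two handlers can be sketched as follows. This is a simplified, hypothetical model rather than the code in this PR: it does not use the Zephyr work queue API, and every name in it (`sel_work_t`, `gpio_sel_handler`, `wrapper_sel_handler`, `wrapper_data_ready`, `sent_count`) is an assumption made for illustration. The point it shows is that the handler is bound at initialization, so the event type value never has to double as a path selector.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not the real OpenBIC API. */
typedef struct sel_work sel_work_t;
typedef void (*sel_handler_t)(sel_work_t *work);

struct sel_work {
	sel_handler_t handler;   /* chosen once at init, no runtime guessing */
	uint8_t event_type;      /* 0x00 is a valid value on the GPIO path */
	bool wrapper_data_ready; /* only meaningful on the wrapper path */
	int sent_count;          /* stand-in for "SEL sent to BMC" */
};

/* Normal GPIO SEL path: sends every event type, including 0x00. */
void gpio_sel_handler(sel_work_t *work)
{
	work->sent_count++;
}

/* Wrapper SEL path: skip incomplete data instead of sending it. */
void wrapper_sel_handler(sel_work_t *work)
{
	if (!work->wrapper_data_ready) {
		return;
	}
	work->sent_count++;
}

/* Bind the handler when the work item is initialized. */
void sel_work_init(sel_work_t *work, sel_handler_t handler)
{
	work->handler = handler;
	work->event_type = 0;
	work->wrapper_data_ready = false;
	work->sent_count = 0;
}

/* Stand-in for the work queue invoking the bound handler. */
void sel_work_run(sel_work_t *work)
{
	work->handler(work);
}
```

In this model, a thermal trip work item initialized with `gpio_sel_handler` is sent even though its event type is 0x00, while a FAST_PROCHOT or SYS_THROTTLE item initialized with `wrapper_sel_handler` is dropped until its wrapper data is complete.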

[Test Log]

"580": {
    "additional_data": {
        "DEVICE": "/xyz/openbmc_project/State/Thermal/host6/cpu0",
        "FAILURE_DATA": "CPU Thermal Trip",
        "_CODE_FILE": "/usr/src/debug/pldm/1.0+git/oem/meta/libpldmresponder/file_io_type_event_log.cpp",
        "_CODE_FUNC": "void pldm::responder::oem_meta::record::commit(const std::string&, pldm::responder::oem_meta::EventAssert, const std::string&) [with const char* TypeLabel = (& label); AssertType = sdbusplus::error::xyz::openbmc_project::state::Thermal::DeviceOverOperatingTemperatureFault; DeassertType = sdbusplus::event::xyz::openbmc_project::state::Thermal::DeviceOperatingNormalTemperature; std::string = std::__cxx11::basic_string<char>]",
        "_CODE_LINE": "201",
        "_PID": "575"
    },
    "event_id": "",
    "message": "xyz.openbmc_project.State.Thermal.DeviceOverOperatingTemperatureFault",
    "redfish": {
        "args": [
            "/xyz/openbmc_project/State/Thermal/host6/cpu0"
        ],
        "id": "OpenBMC_StateThermal.DeviceOverOperatingTemperatureFault",
        "message": "Device /xyz/openbmc_project/State/Thermal/host6/cpu0 is significantly over safe operating temperature and may have been powered off."
    },
    "resolution": "",
    "resolved": false,
    "severity": "xyz.openbmc_project.Logging.Entry.Level.Critical",
    "timestamp": "2026-06-26T09:18:39.875000000Z",
    "updated_timestamp": "2026-06-26T09:18:39.875000000Z"
}

@meta-cla meta-cla Bot added the CLA Signed label Jun 26, 2026
meta-codesync Bot commented Jun 26, 2026

Contributor

This pull request has been imported. If you are a Meta employee, you can view this in D109810978. (Because this pull request was imported automatically, there will not be any future comments.)

@MandyMCHung MandyMCHung force-pushed the Mandy/yv4/Modify_Thermaltrip_SEL branch from 3dfaf95 to f69a380 June 29, 2026 01:19
@facebook-github-tools


@Wiwynn-Evan has updated the pull request. You must reimport the pull request before landing.

@meta-codesync meta-codesync Bot closed this in d986f8e Jun 29, 2026
@meta-codesync meta-codesync Bot added the Merged label Jun 29, 2026
meta-codesync Bot commented Jun 29, 2026

Contributor

This pull request has been merged in d986f8e.

Labels

CLA Signed, Merged

2 participants