
Commit 9527192

Chandrakanth Patil committed
scsi: mpt3sas: Correctly handle ATA device errors
JIRA: https://issues.redhat.com/browse/RHEL-101342

With the ATA error model, an NCQ command failure always triggers an abort (termination) of all NCQ commands queued on the device. In such a case, the SAT or the host must handle the failed command according to the command sense data and immediately retry all other NCQ commands that were aborted due to the failed NCQ command.

For SAS HBAs controlled by the mpt3sas driver, NCQ command aborts are not handled by the HBA SAT and are sent back to the host, with an ioc log information equal to 0x31080000 (IOC_LOGINFO_PREFIX_PL with the PL code PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR). The function _scsih_io_done() always forces a retry of commands terminated with the status MPI2_IOCSTATUS_SCSI_IOC_TERMINATED using the SCSI result DID_SOFT_ERROR, regardless of the log_info for the command. This correctly forces the retry of collateral NCQ abort commands, but with the retry counter for the command being incremented. If a command to an ATA device is subject to too many retries due to other NCQ commands failing (e.g. read commands trying to access unreadable sectors), the collateral NCQ abort commands may be terminated with an error as they run out of retries. This violates the SAT specification and causes hard-to-debug command errors.

Solve this issue by modifying the handling of the MPI2_IOCSTATUS_SCSI_IOC_TERMINATED status to check if a command is for an ATA device and if the command loginfo indicates an NCQ collateral abort. If that is the case, force the command retry using the SCSI result DID_IMM_RETRY to avoid incrementing the command retry count.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250606052747.742998-3-dlemoal@kernel.org
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 15592a1)
Signed-off-by: Chandrakanth Patil <chanpati@redhat.com>
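The retry accounting described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `enum host_byte`, `struct cmd_model`, `classify_terminated()`, `requeue()` and `aborts_survived()` are hypothetical names invented for illustration, not kernel APIs, and the model reduces the SCSI mid-layer behavior to "a soft error consumes one retry, an immediate retry consumes none".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Combined log info value stated in the commit. */
#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR 0x31080000u

/* Hypothetical stand-ins for DID_SOFT_ERROR and DID_IMM_RETRY. */
enum host_byte { HOST_SOFT_ERROR, HOST_IMM_RETRY };

/* Hypothetical, simplified view of a command's retry state. */
struct cmd_model {
	int retries;	/* retries consumed so far */
	int allowed;	/* retry budget */
};

/* Mirrors the check the patch adds for terminated commands. */
static enum host_byte classify_terminated(uint32_t log_info)
{
	if (log_info == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR)
		return HOST_IMM_RETRY;
	return HOST_SOFT_ERROR;
}

/* Returns true if the command is requeued, false if it fails. */
static bool requeue(struct cmd_model *cmd, enum host_byte hb)
{
	if (hb == HOST_IMM_RETRY)
		return true;		/* retry counter untouched */
	return ++cmd->retries <= cmd->allowed;
}

/* Counts consecutive terminations a command survives, capped at limit. */
static int aborts_survived(uint32_t log_info, int allowed, int limit)
{
	struct cmd_model cmd = { .retries = 0, .allowed = allowed };
	int n = 0;

	assert(allowed >= 0 && limit >= 0);
	while (n < limit && requeue(&cmd, classify_terminated(log_info)))
		n++;
	return n;
}
```

In this model, a command with a budget of 5 retries survives only 5 terminations classified as soft errors, while terminations carrying the NCQ collateral abort log info never exhaust the budget.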
1 parent 4114714 commit 9527192

1 file changed: +19 -0 lines changed

drivers/scsi/mpt3sas/mpt3sas_scsih.c

Lines changed: 19 additions & 0 deletions
@@ -195,6 +195,14 @@ struct sense_info {
 #define MPT3SAS_PORT_ENABLE_COMPLETE (0xFFFD)
 #define MPT3SAS_ABRT_TASK_SET (0xFFFE)
 #define MPT3SAS_REMOVE_UNRESPONDING_DEVICES (0xFFFF)
+
+/*
+ * SAS Log info code for a NCQ collateral abort after an NCQ error:
+ * IOC_LOGINFO_PREFIX_PL | PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR
+ * See: drivers/message/fusion/lsi/mpi_log_sas.h
+ */
+#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR 0x31080000
+
 /**
  * struct fw_event_work - firmware event struct
  * @list: link list framework
@@ -5828,6 +5836,17 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
 			scmd->result = DID_TRANSPORT_DISRUPTED << 16;
 			goto out;
 		}
+		if (log_info == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) {
+			/*
+			 * This is a ATA NCQ command aborted due to another NCQ
+			 * command failure. We must retry this command
+			 * immediately but without incrementing its retry
+			 * counter.
+			 */
+			WARN_ON_ONCE(xfer_cnt != 0);
+			scmd->result = DID_IMM_RETRY << 16;
+			break;
+		}
 		if (log_info == 0x31110630) {
 			if (scmd->retries > 2) {
 				scmd->result = DID_NO_CONNECT << 16;
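
As a reading aid for the new constant, the sketch below shows how 0x31080000 could be composed and decoded. Only the combined value and the two macro names come from the commit. The individual values of `IOC_LOGINFO_PREFIX_PL` and `PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR` and the bit layout (bus type, originator, code, subcode) are assumptions to be checked against drivers/message/fusion/lsi/mpi_log_sas.h, and `struct loginfo_fields`, `decode_loginfo()` and `is_ncq_collateral_abort()` are hypothetical helpers, not driver functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed component values; only their OR (0x31080000) is in the commit. */
#define IOC_LOGINFO_PREFIX_PL 0x31000000u
#define PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR 0x00080000u

/* Value defined by the patch. */
#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR 0x31080000u

/* Compile-time check that the assumed parts combine to the patch value. */
static_assert((IOC_LOGINFO_PREFIX_PL |
	       PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR) ==
	      IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR,
	      "prefix | code must equal the combined log info value");

/* Hypothetical decoded view; the field layout is an assumption. */
struct loginfo_fields {
	unsigned int bus_type;		/* bits 31:28 */
	unsigned int originator;	/* bits 27:24 */
	unsigned int code;		/* bits 23:16 */
	unsigned int subcode;		/* bits 15:0  */
};

static struct loginfo_fields decode_loginfo(uint32_t log_info)
{
	struct loginfo_fields f = {
		.bus_type = (log_info >> 28) & 0xFu,
		.originator = (log_info >> 24) & 0xFu,
		.code = (log_info >> 16) & 0xFFu,
		.subcode = log_info & 0xFFFFu,
	};
	return f;
}

/* Same exact-match comparison the patch performs on log_info. */
static bool is_ncq_collateral_abort(uint32_t log_info)
{
	return log_info == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR;
}
```

Note that the patch compares the whole 32-bit value, so under this assumed layout a log info with the same code but a non-zero subcode would not take the DID_IMM_RETRY path.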
