Provide intrinsics for speculative loads #179642

Open
fhahn wants to merge 14 commits into llvm:main from fhahn:speculative-load-intrinsics

@fhahn fhahn commented Feb 4, 2026

Introduce two new intrinsics to enable vectorization of loops with early exits that have potentially faulting loads.

  1. @llvm.speculative.load (name subject to change) - perform a load that may access memory beyond the allocated object. It must be used in combination with @llvm.can.load.speculatively to ensure the load is guaranteed to not trap.

  2. @llvm.can.load.speculatively - returns true if it is safe to speculatively load a given number of bytes from a pointer. The semantics are target-dependent. Some targets may only need to check that the access does not cross a page boundary; others need stricter checks, for example AArch64 with MTE, which limits the access size to 16 bytes.

@llvm.speculative.load is lowered to a regular load in SelectionDAG without MODereferenceable. I am not sure whether we need to be more careful than this, i.e., whether SelectionDAG could still reason about the resulting load to infer dereferenceability of the pointer.

@llvm.can.load.speculatively is lowered to regular IR in PreISel lowering, using a target-lowering hook. By default, it conservatively expands to false.
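
To make the kind of check a target might emit more concrete, here is a minimal C++ sketch of the alignment-based test that the AArch64 hook in this patch expands to: `(ptr & (size - 1)) == 0` with a 16-byte cap. The names `canLoadSpeculativelyModel`, `crossesPage` and `checkAcceptedLoadsStayInPage` are hypothetical and not LLVM APIs. The 4096-byte page size is an assumption for illustration, and the model simply returns false for non-power-of-2 sizes, where the LangRef text specifies poison.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an alignment-based safety check (not an LLVM API).
// Accepts a load only if `size` is a power of 2, at most 16 bytes, and the
// address is aligned to `size`.
constexpr bool canLoadSpeculativelyModel(std::uint64_t addr,
                                         std::uint64_t size) {
  const bool isPow2 = size != 0 && (size & (size - 1)) == 0;
  return isPow2 && size <= 16 && (addr & (size - 1)) == 0;
}

// True if the byte range [addr, addr + size) touches two different pages.
// ASSUMPTION: 4096-byte pages by default; real page sizes are target-specific.
constexpr bool crossesPage(std::uint64_t addr, std::uint64_t size,
                           std::uint64_t pageSize = 4096) {
  return addr / pageSize != (addr + size - 1) / pageSize;
}

// Walks every address in two consecutive pages and asserts that any access
// the model accepts stays within a single page.
inline void checkAcceptedLoadsStayInPage() {
  for (std::uint64_t addr = 0; addr < 2 * 4096; ++addr)
    for (std::uint64_t size = 1; size <= 16; size *= 2)
      if (canLoadSpeculativelyModel(addr, size))
        assert(!crossesPage(addr, size));
}
```

The argument behind the check is that a naturally aligned power-of-2 access lies inside one size-aligned block, and the page size is a multiple of that block size, so the access cannot straddle a page boundary.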

These intrinsics should allow the loop vectorizer to vectorize early-exit loops with potentially non-dereferenceable loads.
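
As an illustration of the loop shape this targets, the following C++ sketch contrasts a scalar early-exit search with a blocked variant that guards each 16-byte read with an alignment check. It is only a model of the control flow, not what the vectorizer would emit: `findByteScalar` and `findByteBlocked` are hypothetical names, and because plain C++ has no equivalent of a load that may extend past the object, the sketch assumes the buffer is padded so that every aligned 16-byte block it reads lies inside the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar reference: index of the first byte equal to `key`. The caller
// guarantees such a byte exists; nothing is known about the bytes after it,
// which is what makes a wide load potentially faulting.
inline std::size_t findByteScalar(const unsigned char *p, unsigned char key) {
  assert(p != nullptr);
  std::size_t i = 0;
  while (p[i] != key)
    ++i;
  return i;
}

// Blocked variant. ASSUMPTION: the buffer is padded so that each aligned
// 16-byte block read below is fully inside the allocation.
inline std::size_t findByteBlocked(const unsigned char *p, unsigned char key) {
  assert(p != nullptr);
  constexpr std::size_t kBlock = 16;
  std::size_t i = 0;
  for (;;) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p + i);
    if ((addr & (kBlock - 1)) == 0) {
      // Stand-in for the speculative path: the check succeeded, so read a
      // whole block at once.
      unsigned char block[kBlock];
      std::memcpy(block, p + i, kBlock);
      for (std::size_t lane = 0; lane < kBlock; ++lane)
        if (block[lane] == key)
          return i + lane; // early exit: later lanes are never inspected
      i += kBlock;
    } else {
      // Stand-in for the safe path: a single scalar iteration.
      if (p[i] == key)
        return i;
      ++i;
    }
  }
}
```

Both functions return the same index for the same input; the blocked one just reaches it in fewer iterations once the address becomes 16-byte aligned.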

This has previously been discussed in #120603 and is similar to @nikic's https://hackmd.io/@nikic/S1O4QWYZkx, with the major difference being that there is no %defined_size argument; instead, the load returns the stored values for the bytes within bounds and undef otherwise. I don't think we can easily compute the defined size because it may depend on the loaded values (i.e., on the lane at which the early exit is taken).

RFC on Discourse: https://discourse.llvm.org/t/rfc-provide-intrinsics-for-speculative-loads/89692
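
The point about the defined size depending on the loaded values can be sketched with a small C++ model. This is a hypothetical illustration, not part of the patch: `speculativeLoadModel` and `firstMatchingLane` are made-up names, the object is modelled as a `std::vector`, and `std::nullopt` stands in for undef bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::size_t kLanes = 16;
using Lanes = std::array<std::optional<std::uint8_t>, kLanes>;

// Hypothetical model of the proposed load semantics: bytes inside `object`
// yield the stored value; bytes past its end are std::nullopt (standing in
// for undef). The first accessed byte must be in bounds.
inline Lanes speculativeLoadModel(const std::vector<std::uint8_t> &object,
                                  std::size_t offset) {
  assert(offset < object.size() && "first accessed byte must be in bounds");
  Lanes result{}; // all lanes start out as nullopt
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    if (offset + lane < object.size())
      result[lane] = object[offset + lane];
  return result;
}

// Returns the first lane equal to `key`. Returns nullopt if an undefined
// lane is reached, or the lanes run out, before any match.
inline std::optional<std::size_t> firstMatchingLane(const Lanes &lanes,
                                                    std::uint8_t key) {
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    if (!lanes[lane])
      return std::nullopt;
    if (*lanes[lane] == key)
      return lane;
  }
  return std::nullopt;
}
```

In this model, how many lanes the consumer actually relies on is only known after inspecting the loaded bytes, which is why a size supplied up front is hard to compute.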

llvmbot commented Feb 4, 2026

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-ir

Author: Florian Hahn (fhahn)

Changes

Patch is 33.86 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/179642.diff

16 Files Affected:

  • (modified) llvm/docs/LangRef.rst (+113)
  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+13)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+14)
  • (modified) llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp (+39)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+30)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h (+1)
  • (modified) llvm/lib/IR/Verifier.cpp (+18)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+49)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.h (+2)
  • (added) llvm/test/CodeGen/AArch64/can-load-speculatively.ll (+78)
  • (added) llvm/test/CodeGen/AArch64/speculative-load-intrinsic-sve.ll (+66)
  • (added) llvm/test/CodeGen/AArch64/speculative-load-intrinsic.ll (+117)
  • (added) llvm/test/CodeGen/X86/can-load-speculatively.ll (+32)
  • (added) llvm/test/CodeGen/X86/speculative-load-intrinsic.ll (+146)
  • (added) llvm/test/Verifier/can-load-speculatively.ll (+19)
  • (added) llvm/test/Verifier/speculative-load.ll (+18)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 9c3ffb396649b..5f36dcfb714c2 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -27666,6 +27666,119 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
 Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
 
 
+Speculative Load Intrinsics
+---------------------------
+
+LLVM provides intrinsics for speculatively loading memory that may be
+out-of-bounds. These intrinsics enable optimizations like early-exit loop
+vectorization where the vectorized loop may read beyond the end of an array,
+provided the access is guaranteed to not trap by target-specific checks.
+
+.. _int_speculative_load:
+
+'``llvm.speculative.load``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <4 x float>  @llvm.speculative.load.v4f32.p0(ptr <ptr>)
+      declare <8 x i32>    @llvm.speculative.load.v8i32.p0(ptr <ptr>)
+      declare i64          @llvm.speculative.load.i64.p0(ptr <ptr>)
+
+Overview:
+"""""""""
+
+The '``llvm.speculative.load``' intrinsic loads a value from memory. Unlike a
+regular load, the memory access may
+extend beyond the bounds of the allocated object, provided the pointer has been
+verified by :ref:`llvm.can.load.speculatively <int_can_load_speculatively>` to
+ensure the access cannot fault.
+
+Arguments:
+""""""""""
+
+The argument is a pointer to the memory location to load from. The return type
+must have a power-of-2 size in bytes.
+
+Semantics:
+""""""""""
+
+The '``llvm.speculative.load``' intrinsic performs a load that may access
+memory beyond the allocated object. It must be used in combination with
+:ref:`llvm.can.load.speculatively <int_can_load_speculatively>` to ensure
+the access cannot fault.
+
+For bytes that are within the bounds of the allocated object, the intrinsic
+returns the stored value. For bytes that are beyond the bounds of the
+allocated object, the intrinsic returns ``undef`` for those bytes. At least the
+first accessed byte must be within the bounds of an allocated object the pointer is
+based on.
+
+The behavior is undefined if this intrinsic is used to load from a pointer
+for which ``llvm.can.load.speculatively`` returns false.
+
+.. _int_can_load_speculatively:
+
+'``llvm.can.load.speculatively``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare i1 @llvm.can.load.speculatively.p0(ptr <ptr>, i64 <num_bytes>)
+      declare i1 @llvm.can.load.speculatively.p1(ptr addrspace(1) <ptr>, i64 <num_bytes>)
+
+Overview:
+"""""""""
+
+The '``llvm.can.load.speculatively``' intrinsic returns true if it is safe
+to speculatively load ``num_bytes`` bytes starting from ``ptr``,
+even if the memory may be beyond the bounds of an allocated object.
+
+Arguments:
+""""""""""
+
+The first argument is a pointer to the memory location.
+
+The second argument is an i64 specifying the size in bytes of the load.
+The size must be a positive power of 2.  If the size is not a power-of-2, the
+result is ``poison``.
+
+Semantics:
+""""""""""
+
+This intrinsic has **target-dependent** semantics. It returns ``true`` if
+loading ``num_bytes`` bytes from ``ptr`` is guaranteed not to trap,
+even if the memory is beyond the bounds of an allocated object. It returns
+``false`` otherwise.
+
+The specific conditions under which this intrinsic returns ``true`` are
+determined by the target. For example, a target may check whether the pointer
+alignment guarantees the load cannot cross a page boundary.
+
+.. code-block:: llvm
+
+    ; Check if we can safely load 16 bytes from %ptr
+    %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 16)
+    br i1 %can_load, label %speculative_path, label %safe_path
+
+    speculative_path:
+      ; Safe to speculatively load from %ptr
+      %vec = call <4 x i32> @llvm.speculative.load.v4i32.p0(ptr %ptr)
+      ...
+
+    safe_path:
+      ; Fall back to masked load or scalar operations
+      ...
+
+
 Memory Use Markers
 ------------------
 
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index ada4ffd3bcc89..ebc6b64590dea 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -2292,6 +2292,19 @@ class LLVM_ABI TargetLoweringBase {
     llvm_unreachable("Store conditional unimplemented on this target");
   }
 
+  /// Emit code to check if a speculative load of the given size from Ptr is
+  /// safe. Returns a Value* representing the check result (i1), or nullptr
+  /// to use the default lowering (which returns false). Targets can override
+  /// to provide their own safety check (e.g., alignment-based page boundary
+  /// check).
+  /// \param Builder IRBuilder positioned at the intrinsic call site
+  /// \param Ptr the pointer operand
+  /// \param Size the size in bytes (constant or runtime value for scalable)
+  virtual Value *emitCanLoadSpeculatively(IRBuilderBase &Builder, Value *Ptr,
+                                          Value *Size) const {
+    return nullptr;
+  }
+
   /// Perform a masked atomicrmw using a target-specific intrinsic. This
   /// represents the core LL/SC loop which will be lowered at a late stage by
   /// the backend. The target-specific intrinsic returns the loaded value and
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index ea6bd59c5aeca..2576c03b184ef 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2601,6 +2601,20 @@ def int_experimental_vector_compress:
               [LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
               [IntrNoMem]>;
 
+// Speculatively load a value from memory; lowers to a regular aligned load.
+// The loaded type must have a power-of-2 size.
+def int_speculative_load:
+  DefaultAttrsIntrinsic<[llvm_any_ty],
+            [llvm_anyptr_ty],
+            [IntrArgMemOnly, IntrWillReturn, NoCapture<ArgIndex<0>>]>;
+
+// Returns true if it's safe to speculatively load 'num_bytes' from 'ptr'.
+// The size can be a runtime value to support scalable vectors.
+def int_can_load_speculatively:
+  DefaultAttrsIntrinsic<[llvm_i1_ty],
+            [llvm_anyptr_ty, llvm_i64_ty],
+            [IntrNoMem, IntrSpeculatable, IntrWillReturn]>;
+
 // Test whether a pointer is associated with a type metadata identifier.
 def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_metadata_ty],
                               [IntrNoMem, IntrSpeculatable]>;
diff --git a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
index 490f014aaf220..d5ec88aa589c9 100644
--- a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
+++ b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
@@ -131,6 +131,42 @@ static bool lowerLoadRelative(Function &F) {
   return Changed;
 }
 
+/// Lower @llvm.can.load.speculatively using target-specific expansion.
+/// Each target provides its own expansion via
+/// TargetLowering::emitCanLoadSpeculatively.
+/// The default expansion returns false (conservative).
+static bool lowerCanLoadSpeculatively(Function &F, const TargetMachine *TM) {
+  if (F.use_empty())
+    return false;
+
+  bool Changed = false;
+
+  for (Use &U : llvm::make_early_inc_range(F.uses())) {
+    auto *CI = dyn_cast<CallInst>(U.getUser());
+    if (!CI || CI->getCalledOperand() != &F)
+      continue;
+
+    Function *ParentFunc = CI->getFunction();
+    const TargetLowering *TLI =
+        TM->getSubtargetImpl(*ParentFunc)->getTargetLowering();
+
+    IRBuilder<> Builder(CI);
+    Value *Ptr = CI->getArgOperand(0);
+    Value *Size = CI->getArgOperand(1);
+
+    // Ask target for expansion; nullptr means use default (return false)
+    Value *Result = TLI->emitCanLoadSpeculatively(Builder, Ptr, Size);
+    if (!Result)
+      Result = Builder.getFalse();
+
+    CI->replaceAllUsesWith(Result);
+    CI->eraseFromParent();
+    Changed = true;
+  }
+
+  return Changed;
+}
+
 // ObjCARC has knowledge about whether an obj-c runtime function needs to be
 // always tail-called or never tail-called.
 static CallInst::TailCallKind getOverridingTailCallKind(const Function &F) {
@@ -630,6 +666,9 @@ bool PreISelIntrinsicLowering::lowerIntrinsics(Module &M) const {
     case Intrinsic::load_relative:
       Changed |= lowerLoadRelative(F);
       break;
+    case Intrinsic::can_load_speculatively:
+      Changed |= lowerCanLoadSpeculatively(F, TM);
+      break;
     case Intrinsic::is_constant:
     case Intrinsic::objectsize:
       Changed |= forEachCall(F, [&](CallInst *CI) {
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 6045b55130925..12401a04ebb63 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -5122,6 +5122,33 @@ void SelectionDAGBuilder::visitMaskedLoad(const CallInst &I, bool IsExpanding) {
   setValue(&I, Res);
 }
 
+void SelectionDAGBuilder::visitSpeculativeLoad(const CallInst &I) {
+  SDLoc sdl = getCurSDLoc();
+  Value *PtrOperand = I.getArgOperand(0);
+  SDValue Ptr = getValue(PtrOperand);
+
+  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+  EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType());
+  Align Alignment = I.getParamAlign(0).valueOrOne();
+  AAMDNodes AAInfo = I.getAAMetadata();
+  TypeSize StoreSize = VT.getStoreSize();
+
+  SDValue InChain = DAG.getRoot();
+
+  // Use MOLoad but NOT MODereferenceable - the memory may not be
+  // fully dereferenceable.
+  MachineMemOperand::Flags MMOFlags = MachineMemOperand::MOLoad;
+  LocationSize LocSize = StoreSize.isScalable()
+                             ? LocationSize::beforeOrAfterPointer()
+                             : LocationSize::precise(StoreSize);
+  MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
+      MachinePointerInfo(PtrOperand), MMOFlags, LocSize, Alignment, AAInfo);
+
+  SDValue Load = DAG.getLoad(VT, sdl, InChain, Ptr, MMO);
+  PendingLoads.push_back(Load.getValue(1));
+  setValue(&I, Load);
+}
+
 void SelectionDAGBuilder::visitMaskedGather(const CallInst &I) {
   SDLoc sdl = getCurSDLoc();
 
@@ -6873,6 +6900,9 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::masked_compressstore:
     visitMaskedStore(I, true /* IsCompressing */);
     return;
+  case Intrinsic::speculative_load:
+    visitSpeculativeLoad(I);
+    return;
   case Intrinsic::powi:
     setValue(&I, ExpandPowI(sdl, getValue(I.getArgOperand(0)),
                             getValue(I.getArgOperand(1)), DAG));
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
index f8aecea25b3d6..dad406f48b77b 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
@@ -619,6 +619,7 @@ class SelectionDAGBuilder {
   void visitStore(const StoreInst &I);
   void visitMaskedLoad(const CallInst &I, bool IsExpanding = false);
   void visitMaskedStore(const CallInst &I, bool IsCompressing = false);
+  void visitSpeculativeLoad(const CallInst &I);
   void visitMaskedGather(const CallInst &I);
   void visitMaskedScatter(const CallInst &I);
   void visitAtomicCmpXchg(const AtomicCmpXchgInst &I);
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index 3d44d1317ecc7..f850706e16ab2 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -6749,6 +6749,24 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
           &Call);
     break;
   }
+  case Intrinsic::speculative_load: {
+    Type *LoadTy = Call.getType();
+    TypeSize Size = DL.getTypeStoreSize(LoadTy);
+    // For scalable vectors, check the known minimum size is a power of 2.
+    Check(Size.getKnownMinValue() > 0 && isPowerOf2_64(Size.getKnownMinValue()),
+          "llvm.speculative.load type must have a power-of-2 size", &Call);
+    break;
+  }
+  case Intrinsic::can_load_speculatively: {
+    // If size is a constant, verify it's a positive power of 2.
+    if (auto *SizeCI = dyn_cast<ConstantInt>(Call.getArgOperand(1))) {
+      uint64_t Size = SizeCI->getZExtValue();
+      Check(Size > 0 && isPowerOf2_64(Size),
+            "llvm.can.load.speculatively size must be a positive power of 2",
+            &Call);
+    }
+    break;
+  }
   case Intrinsic::vector_insert: {
     Value *Vec = Call.getArgOperand(0);
     Value *SubVec = Call.getArgOperand(1);
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 840298ff965e1..2289a8ad48973 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -30174,6 +30174,55 @@ Value *AArch64TargetLowering::emitStoreConditional(IRBuilderBase &Builder,
   return CI;
 }
 
+Value *AArch64TargetLowering::emitCanLoadSpeculatively(IRBuilderBase &Builder,
+                                                       Value *Ptr,
+                                                       Value *Size) const {
+  // For power-of-2 sizes <= 16, emit alignment check: (ptr & (size - 1)) == 0.
+  // If the pointer is aligned to at least 'size' bytes, loading 'size' bytes
+  // cannot cross a page boundary, so it's safe to speculate.
+  // The 16-byte limit ensures correctness with MTE (memory tagging), since
+  // MTE uses 16-byte tag granules.
+  //
+  // The alignment check only works for power-of-2 sizes. For non-power-of-2
+  // sizes, we conservatively return false.
+  const DataLayout &DL =
+      Builder.GetInsertBlock()->getModule()->getDataLayout();
+
+  if (auto *CI = dyn_cast<ConstantInt>(Size)) {
+    uint64_t SizeVal = CI->getZExtValue();
+    assert(isPowerOf2_64(SizeVal) && "size must be power-of-two");
+    // For constant sizes > 16, return nullptr (default false).
+    if (SizeVal > 16)
+      return nullptr;
+
+    // Power-of-2 constant size <= 16: use fast alignment check.
+    unsigned PtrBits = DL.getPointerSizeInBits();
+    Type *IntPtrTy = Builder.getIntNTy(PtrBits);
+    Value *PtrInt = Builder.CreatePtrToInt(Ptr, IntPtrTy);
+    Value *Mask = ConstantInt::get(IntPtrTy, SizeVal - 1);
+    Value *Masked = Builder.CreateAnd(PtrInt, Mask);
+    return Builder.CreateICmpEQ(Masked, ConstantInt::get(IntPtrTy, 0));
+  }
+
+  // Check power-of-2 size <= 16 and alignment.
+  unsigned PtrBits = DL.getPointerSizeInBits();
+  Type *IntPtrTy = Builder.getIntNTy(PtrBits);
+  Value *PtrInt = Builder.CreatePtrToInt(Ptr, IntPtrTy);
+  Value *SizeExt = Builder.CreateZExtOrTrunc(Size, IntPtrTy);
+
+  Value *SizeLE16 =
+      Builder.CreateICmpULE(SizeExt, ConstantInt::get(IntPtrTy, 16));
+
+  // alignment check: (ptr & (size - 1)) == 0
+  Value *SizeMinusOne =
+      Builder.CreateSub(SizeExt, ConstantInt::get(IntPtrTy, 1));
+  Value *Masked = Builder.CreateAnd(PtrInt, SizeMinusOne);
+  Value *AlignCheck =
+      Builder.CreateICmpEQ(Masked, ConstantInt::get(IntPtrTy, 0));
+
+  return Builder.CreateAnd(SizeLE16, AlignCheck);
+}
+
 bool AArch64TargetLowering::functionArgumentNeedsConsecutiveRegisters(
     Type *Ty, CallingConv::ID CallConv, bool isVarArg,
     const DataLayout &DL) const {
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 89a8858550ca2..884e072eaa925 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -350,6 +350,8 @@ class AArch64TargetLowering : public TargetLowering {
                         AtomicOrdering Ord) const override;
   Value *emitStoreConditional(IRBuilderBase &Builder, Value *Val, Value *Addr,
                               AtomicOrdering Ord) const override;
+  Value *emitCanLoadSpeculatively(IRBuilderBase &Builder, Value *Ptr,
+                                  Value *Size) const override;
 
   void emitAtomicCmpXchgNoStoreLLBalance(IRBuilderBase &Builder) const override;
 
diff --git a/llvm/test/CodeGen/AArch64/can-load-speculatively.ll b/llvm/test/CodeGen/AArch64/can-load-speculatively.ll
new file mode 100644
index 0000000000000..7916f2e4d340f
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/can-load-speculatively.ll
@@ -0,0 +1,78 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -mtriple=aarch64-unknown-linux-gnu -passes=pre-isel-intrinsic-lowering -S < %s | FileCheck %s
+
+; Test that @llvm.can.load.speculatively is lowered to an alignment check
+; for power-of-2 sizes <= 16 bytes on AArch64, and returns false for larger sizes.
+; The 16-byte limit ensures correctness with MTE (memory tagging).
+; Note: non-power-of-2 constant sizes are rejected by the verifier.
+
+define i1 @can_load_speculatively_16(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_16(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = and i64 [[TMP1]], 15
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i64 [[TMP2]], 0
+; CHECK-NEXT:    ret i1 [[TMP3]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 16)
+  ret i1 %can_load
+}
+
+; Size > 16 - returns false (may cross MTE tag granule boundary)
+define i1 @can_load_speculatively_32(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_32(
+; CHECK-NEXT:    ret i1 false
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 32)
+  ret i1 %can_load
+}
+
+; Size > 16 - returns false (may cross MTE tag granule boundary)
+define i1 @can_load_speculatively_64(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_64(
+; CHECK-NEXT:    ret i1 false
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 64)
+  ret i1 %can_load
+}
+
+; Test with address space
+define i1 @can_load_speculatively_addrspace1(ptr addrspace(1) %ptr) {
+; CHECK-LABEL: @can_load_speculatively_addrspace1(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr addrspace(1) [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = and i64 [[TMP1]], 15
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i64 [[TMP2]], 0
+; CHECK-NEXT:    ret i1 [[TMP3]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p1(ptr addrspace(1) %ptr, i64 16)
+  ret i1 %can_load
+}
+
+; Test size 8 (within limit, power-of-2)
+define i1 @can_load_speculatively_8(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_8(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = and i64 [[TMP1]], 7
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i64 [[TMP2]], 0
+; CHECK-NEXT:    ret i1 [[TMP3]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 8)
+  ret i1 %can_load
+}
+
+; Test with runtime size - checks size <= 16 and alignment
+define i1 @can_load_speculatively_runtime(ptr %ptr, i64 %size) {
+; CHECK-LABEL: @can_load_speculatively_runtime(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp ule i64 [[SIZE:%.*]], 16
+; CHECK-NEXT:    [[TMP3:%.*]] = sub i64 [[SIZE]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = and i64 [[TMP1]], [[TMP3]]
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[TMP4]], 0
+; CHECK-NEXT:    [[TMP6:%.*]] = and i1 [[TMP2]], [[TMP5]]
+; CHECK-NEXT:    ret i1 [[TMP6]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 %size)
+  ret i1 %can_load
+}
+
+declare i1 @llvm.can.load.speculatively.p0(ptr, i64)
+declare i1 @llvm.can.load.speculatively.p1(ptr addrspace(1), i64)
diff --git a/llvm/test/CodeGen/AArch64/speculative-load-intrinsic-sve.ll b/llvm/test/CodeGen/AArch64/speculative-load-intrinsic-sve.ll
new file mode 100644
index 0000000000000..78a56f3539d11
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/speculative-load-intrinsic-sve.ll
@@ -0,0 +1,66 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=aarch64-unknown-linux-gnu -mattr=+sve < %s | FileCheck %s
+
+; Test that @llvm.speculative.load with scalable vectors is lowered to a
+; regular load in SelectionDAG.
+
+define <vscale x 4 x i32> @speculative_load_nxv4i32(ptr %ptr) {
+; CHECK-LABEL: speculative_load_nxv4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    r...
[truncated]

llvmbot commented Feb 4, 2026

@llvm/pr-subscribers-llvm-selectiondag

Author: Florian Hahn (fhahn)

Changes

diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index ada4ffd3bcc89..ebc6b64590dea 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -2292,6 +2292,19 @@ class LLVM_ABI TargetLoweringBase {
     llvm_unreachable("Store conditional unimplemented on this target");
   }
 
+  /// Emit code to check if a speculative load of the given size from Ptr is
+  /// safe. Returns a Value* representing the check result (i1), or nullptr
+  /// to use the default lowering (which returns false). Targets can override
+  /// to provide their own safety check (e.g., alignment-based page boundary
+  /// check).
+  /// \param Builder IRBuilder positioned at the intrinsic call site
+  /// \param Ptr the pointer operand
+  /// \param Size the size in bytes (constant or runtime value for scalable)
+  virtual Value *emitCanLoadSpeculatively(IRBuilderBase &Builder, Value *Ptr,
+                                          Value *Size) const {
+    return nullptr;
+  }
+
   /// Perform a masked atomicrmw using a target-specific intrinsic. This
   /// represents the core LL/SC loop which will be lowered at a late stage by
   /// the backend. The target-specific intrinsic returns the loaded value and
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index ea6bd59c5aeca..2576c03b184ef 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2601,6 +2601,20 @@ def int_experimental_vector_compress:
               [LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
               [IntrNoMem]>;
 
+// Speculatively load a value from memory; lowers to a regular aligned load.
+// The loaded type must have a power-of-2 size.
+def int_speculative_load:
+  DefaultAttrsIntrinsic<[llvm_any_ty],
+            [llvm_anyptr_ty],
+            [IntrArgMemOnly, IntrWillReturn, NoCapture<ArgIndex<0>>]>;
+
+// Returns true if it's safe to speculatively load 'num_bytes' from 'ptr'.
+// The size can be a runtime value to support scalable vectors.
+def int_can_load_speculatively:
+  DefaultAttrsIntrinsic<[llvm_i1_ty],
+            [llvm_anyptr_ty, llvm_i64_ty],
+            [IntrNoMem, IntrSpeculatable, IntrWillReturn]>;
+
 // Test whether a pointer is associated with a type metadata identifier.
 def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_metadata_ty],
                               [IntrNoMem, IntrSpeculatable]>;
diff --git a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
index 490f014aaf220..d5ec88aa589c9 100644
--- a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
+++ b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
@@ -131,6 +131,42 @@ static bool lowerLoadRelative(Function &F) {
   return Changed;
 }
 
+/// Lower @llvm.can.load.speculatively using target-specific expansion.
+/// Each target provides its own expansion via
+/// TargetLowering::emitCanLoadSpeculatively.
+/// The default expansion returns false (conservative).
+static bool lowerCanLoadSpeculatively(Function &F, const TargetMachine *TM) {
+  if (F.use_empty())
+    return false;
+
+  bool Changed = false;
+
+  for (Use &U : llvm::make_early_inc_range(F.uses())) {
+    auto *CI = dyn_cast<CallInst>(U.getUser());
+    if (!CI || CI->getCalledOperand() != &F)
+      continue;
+
+    Function *ParentFunc = CI->getFunction();
+    const TargetLowering *TLI =
+        TM->getSubtargetImpl(*ParentFunc)->getTargetLowering();
+
+    IRBuilder<> Builder(CI);
+    Value *Ptr = CI->getArgOperand(0);
+    Value *Size = CI->getArgOperand(1);
+
+    // Ask target for expansion; nullptr means use default (return false)
+    Value *Result = TLI->emitCanLoadSpeculatively(Builder, Ptr, Size);
+    if (!Result)
+      Result = Builder.getFalse();
+
+    CI->replaceAllUsesWith(Result);
+    CI->eraseFromParent();
+    Changed = true;
+  }
+
+  return Changed;
+}
+
 // ObjCARC has knowledge about whether an obj-c runtime function needs to be
 // always tail-called or never tail-called.
 static CallInst::TailCallKind getOverridingTailCallKind(const Function &F) {
@@ -630,6 +666,9 @@ bool PreISelIntrinsicLowering::lowerIntrinsics(Module &M) const {
     case Intrinsic::load_relative:
       Changed |= lowerLoadRelative(F);
       break;
+    case Intrinsic::can_load_speculatively:
+      Changed |= lowerCanLoadSpeculatively(F, TM);
+      break;
     case Intrinsic::is_constant:
     case Intrinsic::objectsize:
       Changed |= forEachCall(F, [&](CallInst *CI) {
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 6045b55130925..12401a04ebb63 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -5122,6 +5122,33 @@ void SelectionDAGBuilder::visitMaskedLoad(const CallInst &I, bool IsExpanding) {
   setValue(&I, Res);
 }
 
+void SelectionDAGBuilder::visitSpeculativeLoad(const CallInst &I) {
+  SDLoc sdl = getCurSDLoc();
+  Value *PtrOperand = I.getArgOperand(0);
+  SDValue Ptr = getValue(PtrOperand);
+
+  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+  EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType());
+  Align Alignment = I.getParamAlign(0).valueOrOne();
+  AAMDNodes AAInfo = I.getAAMetadata();
+  TypeSize StoreSize = VT.getStoreSize();
+
+  SDValue InChain = DAG.getRoot();
+
+  // Use MOLoad but NOT MODereferenceable - the memory may not be
+  // fully dereferenceable.
+  MachineMemOperand::Flags MMOFlags = MachineMemOperand::MOLoad;
+  LocationSize LocSize = StoreSize.isScalable()
+                             ? LocationSize::beforeOrAfterPointer()
+                             : LocationSize::precise(StoreSize);
+  MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
+      MachinePointerInfo(PtrOperand), MMOFlags, LocSize, Alignment, AAInfo);
+
+  SDValue Load = DAG.getLoad(VT, sdl, InChain, Ptr, MMO);
+  PendingLoads.push_back(Load.getValue(1));
+  setValue(&I, Load);
+}
+
 void SelectionDAGBuilder::visitMaskedGather(const CallInst &I) {
   SDLoc sdl = getCurSDLoc();
 
@@ -6873,6 +6900,9 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::masked_compressstore:
     visitMaskedStore(I, true /* IsCompressing */);
     return;
+  case Intrinsic::speculative_load:
+    visitSpeculativeLoad(I);
+    return;
   case Intrinsic::powi:
     setValue(&I, ExpandPowI(sdl, getValue(I.getArgOperand(0)),
                             getValue(I.getArgOperand(1)), DAG));
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
index f8aecea25b3d6..dad406f48b77b 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
@@ -619,6 +619,7 @@ class SelectionDAGBuilder {
   void visitStore(const StoreInst &I);
   void visitMaskedLoad(const CallInst &I, bool IsExpanding = false);
   void visitMaskedStore(const CallInst &I, bool IsCompressing = false);
+  void visitSpeculativeLoad(const CallInst &I);
   void visitMaskedGather(const CallInst &I);
   void visitMaskedScatter(const CallInst &I);
   void visitAtomicCmpXchg(const AtomicCmpXchgInst &I);
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index 3d44d1317ecc7..f850706e16ab2 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -6749,6 +6749,24 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
           &Call);
     break;
   }
+  case Intrinsic::speculative_load: {
+    Type *LoadTy = Call.getType();
+    TypeSize Size = DL.getTypeStoreSize(LoadTy);
+    // For scalable vectors, check the known minimum size is a power of 2.
+    Check(Size.getKnownMinValue() > 0 && isPowerOf2_64(Size.getKnownMinValue()),
+          "llvm.speculative.load type must have a power-of-2 size", &Call);
+    break;
+  }
+  case Intrinsic::can_load_speculatively: {
+    // If size is a constant, verify it's a positive power of 2.
+    if (auto *SizeCI = dyn_cast<ConstantInt>(Call.getArgOperand(1))) {
+      uint64_t Size = SizeCI->getZExtValue();
+      Check(Size > 0 && isPowerOf2_64(Size),
+            "llvm.can.load.speculatively size must be a positive power of 2",
+            &Call);
+    }
+    break;
+  }
   case Intrinsic::vector_insert: {
     Value *Vec = Call.getArgOperand(0);
     Value *SubVec = Call.getArgOperand(1);
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 840298ff965e1..2289a8ad48973 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -30174,6 +30174,55 @@ Value *AArch64TargetLowering::emitStoreConditional(IRBuilderBase &Builder,
   return CI;
 }
 
+Value *AArch64TargetLowering::emitCanLoadSpeculatively(IRBuilderBase &Builder,
+                                                       Value *Ptr,
+                                                       Value *Size) const {
+  // For power-of-2 sizes <= 16, emit alignment check: (ptr & (size - 1)) == 0.
+  // If the pointer is aligned to at least 'size' bytes, loading 'size' bytes
+  // cannot cross a page boundary, so it's safe to speculate.
+  // The 16-byte limit ensures correctness with MTE (memory tagging), since
+  // MTE uses 16-byte tag granules.
+  //
+  // The alignment check only works for power-of-2 sizes. For non-power-of-2
+  // sizes, we conservatively return false.
+  const DataLayout &DL =
+      Builder.GetInsertBlock()->getModule()->getDataLayout();
+
+  if (auto *CI = dyn_cast<ConstantInt>(Size)) {
+    uint64_t SizeVal = CI->getZExtValue();
+    assert(isPowerOf2_64(SizeVal) && "size must be power-of-two");
+    // For constant sizes > 16, return nullptr (default false).
+    if (SizeVal > 16)
+      return nullptr;
+
+    // Power-of-2 constant size <= 16: use fast alignment check.
+    unsigned PtrBits = DL.getPointerSizeInBits();
+    Type *IntPtrTy = Builder.getIntNTy(PtrBits);
+    Value *PtrInt = Builder.CreatePtrToInt(Ptr, IntPtrTy);
+    Value *Mask = ConstantInt::get(IntPtrTy, SizeVal - 1);
+    Value *Masked = Builder.CreateAnd(PtrInt, Mask);
+    return Builder.CreateICmpEQ(Masked, ConstantInt::get(IntPtrTy, 0));
+  }
+
+  // Check power-of-2 size <= 16 and alignment.
+  unsigned PtrBits = DL.getPointerSizeInBits();
+  Type *IntPtrTy = Builder.getIntNTy(PtrBits);
+  Value *PtrInt = Builder.CreatePtrToInt(Ptr, IntPtrTy);
+  Value *SizeExt = Builder.CreateZExtOrTrunc(Size, IntPtrTy);
+
+  Value *SizeLE16 =
+      Builder.CreateICmpULE(SizeExt, ConstantInt::get(IntPtrTy, 16));
+
+  // alignment check: (ptr & (size - 1)) == 0
+  Value *SizeMinusOne =
+      Builder.CreateSub(SizeExt, ConstantInt::get(IntPtrTy, 1));
+  Value *Masked = Builder.CreateAnd(PtrInt, SizeMinusOne);
+  Value *AlignCheck =
+      Builder.CreateICmpEQ(Masked, ConstantInt::get(IntPtrTy, 0));
+
+  return Builder.CreateAnd(SizeLE16, AlignCheck);
+}
+
 bool AArch64TargetLowering::functionArgumentNeedsConsecutiveRegisters(
     Type *Ty, CallingConv::ID CallConv, bool isVarArg,
     const DataLayout &DL) const {
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 89a8858550ca2..884e072eaa925 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -350,6 +350,8 @@ class AArch64TargetLowering : public TargetLowering {
                         AtomicOrdering Ord) const override;
   Value *emitStoreConditional(IRBuilderBase &Builder, Value *Val, Value *Addr,
                               AtomicOrdering Ord) const override;
+  Value *emitCanLoadSpeculatively(IRBuilderBase &Builder, Value *Ptr,
+                                  Value *Size) const override;
 
   void emitAtomicCmpXchgNoStoreLLBalance(IRBuilderBase &Builder) const override;
 
diff --git a/llvm/test/CodeGen/AArch64/can-load-speculatively.ll b/llvm/test/CodeGen/AArch64/can-load-speculatively.ll
new file mode 100644
index 0000000000000..7916f2e4d340f
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/can-load-speculatively.ll
@@ -0,0 +1,78 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -mtriple=aarch64-unknown-linux-gnu -passes=pre-isel-intrinsic-lowering -S < %s | FileCheck %s
+
+; Test that @llvm.can.load.speculatively is lowered to an alignment check
+; for power-of-2 sizes <= 16 bytes on AArch64, and returns false for larger sizes.
+; The 16-byte limit ensures correctness with MTE (memory tagging).
+; Note: non-power-of-2 constant sizes are rejected by the verifier.
+
+define i1 @can_load_speculatively_16(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_16(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = and i64 [[TMP1]], 15
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i64 [[TMP2]], 0
+; CHECK-NEXT:    ret i1 [[TMP3]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 16)
+  ret i1 %can_load
+}
+
+; Size > 16 - returns false (may cross MTE tag granule boundary)
+define i1 @can_load_speculatively_32(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_32(
+; CHECK-NEXT:    ret i1 false
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 32)
+  ret i1 %can_load
+}
+
+; Size > 16 - returns false (may cross MTE tag granule boundary)
+define i1 @can_load_speculatively_64(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_64(
+; CHECK-NEXT:    ret i1 false
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 64)
+  ret i1 %can_load
+}
+
+; Test with address space
+define i1 @can_load_speculatively_addrspace1(ptr addrspace(1) %ptr) {
+; CHECK-LABEL: @can_load_speculatively_addrspace1(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr addrspace(1) [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = and i64 [[TMP1]], 15
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i64 [[TMP2]], 0
+; CHECK-NEXT:    ret i1 [[TMP3]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p1(ptr addrspace(1) %ptr, i64 16)
+  ret i1 %can_load
+}
+
+; Test size 8 (within limit, power-of-2)
+define i1 @can_load_speculatively_8(ptr %ptr) {
+; CHECK-LABEL: @can_load_speculatively_8(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = and i64 [[TMP1]], 7
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i64 [[TMP2]], 0
+; CHECK-NEXT:    ret i1 [[TMP3]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 8)
+  ret i1 %can_load
+}
+
+; Test with runtime size - checks size <= 16 and alignment
+define i1 @can_load_speculatively_runtime(ptr %ptr, i64 %size) {
+; CHECK-LABEL: @can_load_speculatively_runtime(
+; CHECK-NEXT:    [[TMP1:%.*]] = ptrtoint ptr [[PTR:%.*]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp ule i64 [[SIZE:%.*]], 16
+; CHECK-NEXT:    [[TMP3:%.*]] = sub i64 [[SIZE]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = and i64 [[TMP1]], [[TMP3]]
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[TMP4]], 0
+; CHECK-NEXT:    [[TMP6:%.*]] = and i1 [[TMP2]], [[TMP5]]
+; CHECK-NEXT:    ret i1 [[TMP6]]
+;
+  %can_load = call i1 @llvm.can.load.speculatively.p0(ptr %ptr, i64 %size)
+  ret i1 %can_load
+}
+
+declare i1 @llvm.can.load.speculatively.p0(ptr, i64)
+declare i1 @llvm.can.load.speculatively.p1(ptr addrspace(1), i64)
diff --git a/llvm/test/CodeGen/AArch64/speculative-load-intrinsic-sve.ll b/llvm/test/CodeGen/AArch64/speculative-load-intrinsic-sve.ll
new file mode 100644
index 0000000000000..78a56f3539d11
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/speculative-load-intrinsic-sve.ll
@@ -0,0 +1,66 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=aarch64-unknown-linux-gnu -mattr=+sve < %s | FileCheck %s
+
+; Test that @llvm.speculative.load with scalable vectors is lowered to a
+; regular load in SelectionDAG.
+
+define <vscale x 4 x i32> @speculative_load_nxv4i32(ptr %ptr) {
+; CHECK-LABEL: speculative_load_nxv4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    r...
[truncated]

@github-actions

github-actions Bot commented Feb 4, 2026


✅ With the latest revision this PR passed the C/C++ code formatter.

@fhahn

fhahn commented Feb 4, 2026

Contributor Author

Comment on lines +139 to +141
if (F.use_empty())
return false;

Contributor

Don't need this

Contributor Author

removed thanks

}

// Check power-of-2 size <= 16 and alignment.
unsigned PtrBits = DL.getPointerSizeInBits();

Contributor

Don't use default address space

Contributor Author

Updated to use pointer address space, thanks!

@github-actions

github-actions Bot commented Feb 4, 2026


🐧 Linux x64 Test Results

  • 196528 tests passed
  • 5338 tests skipped

✅ The build succeeded and all tests passed.

@github-actions

github-actions Bot commented Feb 4, 2026


🪟 Windows x64 Test Results

  • 135665 tests passed
  • 3394 tests skipped

✅ The build succeeded and all tests passed.

Comment thread llvm/docs/LangRef.rst Outdated

For bytes that are within the bounds of the allocated object, the intrinsic
returns the stored value. For bytes that are beyond the bounds of the
allocated object, the intrinsic returns ``undef`` for those bytes. At least the

Contributor

Maybe consider using poison here? Not sure.

Contributor Author

I think that should work. In the vectorizer, we need to perform a horizontal reduction on a condition based on the load, but we already have to freeze the condition there before.

Comment thread llvm/docs/LangRef.rst Outdated
based on.

The behavior is undefined if this intrinsic is used to load from a pointer
for which ``llvm.can.load.speculatively`` returns false.

Contributor

Suggested change
for which ``llvm.can.load.speculatively`` returns false.
for which ``llvm.can.load.speculatively`` would return false.

Contributor Author

updated thanks

Comment thread llvm/docs/LangRef.rst Outdated
""""""""""

This intrinsic has **target-dependent** semantics. It returns ``true`` if
loading ``num_bytes`` bytes from ``ptr`` is guaranteed not to trap,

Contributor

I prefer not to use "trap" here; on embedded targets, out-of-bounds loads can have unexpected effects which aren't a literal trap.

Contributor Author

Updated to "``num_bytes`` bytes starting at ``ptr`` can be loaded speculatively, even ...". Not sure if there's a better way to phrase this more generally?

// MTE uses 16-byte tag granules.
//
// The alignment check only works for power-of-2 sizes. For non-power-of-2
// sizes, we conservatively return false.

Contributor

I think we may need some sort of "userspace" check. Doing this sort of speculation is dangerous when you could potentially read MMIO registers.

Contributor Author

Do you have anything in particular in mind?

Contributor

It's not a property we currently track, really... the closest thing is -mno-unaligned-access. Which is arguably related, I guess, since the usage I've seen involves non-volatile accesses to MMIO.

I guess we could just make it the user's fault if they do a non-volatile load within a 16-byte granule that contains an MMIO register with side-effects. And add a clang flag to disable speculative loads, if the user can't control that for some reason.

@fhahn fhahn force-pushed the speculative-load-intrinsics branch 2 times, most recently from 15073d9 to 412ea86 Compare February 5, 2026 21:15
}

; Test with runtime size - checks size <= 16 and alignment
define i1 @can_load_speculatively_runtime(ptr %ptr, i64 %size) {

Contributor

Is it worth having a test where the alignment is known? For example, an alignment attribute on the pointer. In such cases we should know the answer at compile time, or is there some reason why an attribute is not good enough?

Contributor Author

Yep, added a number of tests with known alignments. Currently they don't get simplified by the TLI hook.

The generated IR would get constant folded by InstCombine, but at this point we are too late, and it seems the backend passes won't simplify it; not sure what the best solution would be, adding the fold directly to TLI doesn't seem great.

Contributor

Perhaps after the loop vectoriser creates the intrinsic something might propagate the alignment attribute directly to the call argument if that's possible? Then it's much more trivial for the lowering code to pick it up.

@fhahn fhahn force-pushed the speculative-load-intrinsics branch from 412ea86 to cc4b37b Compare February 9, 2026 18:45
@fhahn

fhahn commented Apr 20, 2026

Contributor Author

ping

@fhahn

fhahn commented Apr 30, 2026

Contributor Author

ping

@fhahn fhahn force-pushed the speculative-load-intrinsics branch from 672eb89 to 425fd74 Compare May 5, 2026 13:47
@fhahn

fhahn commented May 5, 2026

Contributor Author

ping :)

Comment thread llvm/docs/LangRef.rst Outdated
""""""""""

The first argument is a pointer to the memory location to load from. The return
type must be a vector type with a power-of-2 size in bytes. The second argument

Contributor

Is the vector type requirement here because we can't return partial poison otherwise?

I think the operation is generally useful for non-vector types as well (e.g. for a GPR memcmp).

So I wonder whether we should drop this limit, and make the result undef instead of poison? (Or nondet once undef is removed.)

Contributor

Or we make it return a byte typed value, so the user can then deal with partial poison in whatever way fits their needs?

Contributor Author

Yep the vector requirement was to avoid poison propagation through integer types.

Using the new byte type seems a good fit. Probably best to drop the vector return variants and require bitcasts from users?

Contributor Author

I updated the latest version to also support the new byte type. I kept the support for vector types so that scalable vectors remain supported, which would not be possible with just the byte type.
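
To make the per-byte result discussed in this thread concrete, here is a minimal standalone C++ sketch. It is only a model, not code from the patch or any LLVM API: `Byte`, `modelSpeculativeLoad` and `firstMatch` are hypothetical names, and `std::nullopt` stands in for an unspecified (undef/poison) byte beyond the end of the object. It assumes the first accessed byte is in bounds, as the LangRef text in this PR requires.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// One loaded byte: a value if it was inside the object, std::nullopt to
// model an unspecified byte beyond the object's end (assumption of this
// sketch, not a statement about the final intrinsic semantics).
using Byte = std::optional<std::uint8_t>;

// Hypothetical model of a speculative load of `width` bytes starting at
// `offset` into `object`. Precondition: offset < object.size().
std::vector<Byte> modelSpeculativeLoad(const std::vector<std::uint8_t> &object,
                                       std::size_t offset, std::size_t width) {
  std::vector<Byte> result(width); // every byte starts out unspecified
  for (std::size_t i = 0; i < width; ++i)
    if (offset + i < object.size())
      result[i] = object[offset + i];
  return result;
}

// Early-exit search over the loaded bytes: returns the index of the first
// byte equal to `needle`. Returns std::nullopt if no match is found before
// the first unspecified byte, i.e. the answer cannot be taken from this load.
std::optional<std::size_t> firstMatch(const std::vector<Byte> &loaded,
                                      std::uint8_t needle) {
  for (std::size_t i = 0; i < loaded.size(); ++i) {
    if (!loaded[i].has_value())
      return std::nullopt; // ran into out-of-bounds bytes
    if (*loaded[i] == needle)
      return i;
  }
  return std::nullopt;
}
```

In this model a consumer that exits at the first match never inspects the unspecified tail, which is the property the early-exit vectorization use case relies on.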

Comment thread llvm/docs/LangRef.rst Outdated
``num_bytes`` bytes starting at ``ptr + I * num_bytes``, for all non-negative
integers ``I`` where the computed address does not wrap around the address
space, can be loaded speculatively, even if the memory is beyond the bounds of
an allocated object. It returns ``false`` otherwise.

Contributor

The comment here is still open. I agree that this needs to be defined in terms of "if at least one byte is dereferenceable, can we speculatively load the whole range".

Contributor Author

Do we need this requirement in llvm.can.load.speculatively? I think in practice, at the point where we will be generating ``llvm.can.load.speculatively`` calls, we may not know that at least one byte is dereferenceable.

With @llvm.speculative.load now requiring passing the accessed bytes, I think we should allow calling @llvm.speculative.load with size 0 (and llvm.can.load.speculatively should return true).

Contributor

The "at least one byte dereferenceable" is not a precondition for calling the intrinsic, it's a precondition for the result to be meaningful. If at least one byte is not dereferenceable, both true and false are valid results.

Contributor

You effectively want can.load.speculatively to be just a property of the size and the alignment. E.g. the AArch64 implementation will return true for size <= 16, align >= 16. But it's not the case that you can do a 16 byte speculative load at any 16 aligned address, as the memory may be unmapped. You only know that if at least one byte can be accessed, all of them can be, because that's the granularity the memory protection works at.
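
As an illustration of the size-and-alignment property described above, here is a minimal standalone C++ sketch. It is not the code from the patch: `canLoadSpeculatively`, `staysInOneGranule`, `granuleInvariantHolds` and `kGranule` are hypothetical names, and the 16-byte granule is taken from the AArch64/MTE limit mentioned in this PR. It only shows that a size-aligned power-of-2 access of at most 16 bytes cannot straddle a 16-byte granule; it says nothing about whether the memory is mapped at all.

```cpp
#include <cassert>
#include <cstdint>

// Granule size in bytes (assumption: the MTE tag granule from this PR).
constexpr std::uint64_t kGranule = 16;

constexpr bool isPowerOf2(std::uint64_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// Hypothetical model of the check: a pure function of address and size.
// True only for power-of-2 sizes <= 16 whose address is size-aligned.
constexpr bool canLoadSpeculatively(std::uint64_t addr, std::uint64_t size) {
  return isPowerOf2(size) && size <= kGranule && (addr & (size - 1)) == 0;
}

// True if [addr, addr + size) lies inside a single 16-byte granule.
// Precondition: size > 0.
constexpr bool staysInOneGranule(std::uint64_t addr, std::uint64_t size) {
  return addr / kGranule == (addr + size - 1) / kGranule;
}

// Exhaustively confirms, for small addresses, that a passing check implies
// the access does not straddle a granule boundary.
inline bool granuleInvariantHolds() {
  for (std::uint64_t addr = 0; addr < 256; ++addr)
    for (std::uint64_t size = 1; size <= 64; size *= 2)
      if (canLoadSpeculatively(addr, size) && !staysInOneGranule(addr, size))
        return false;
  return true;
}
```

Because the check depends only on the address and the size, a `true` result is meaningful only when at least one byte of the range is already known to be accessible, which is the point made in the comment above.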

Contributor Author

Sounds good, tried to include this:

 The first byte of each access must be part of the same underlying object as ``ptr`` in order to speculatively load the whole range.

Contributor

What about speculatively loading before the beginning of an allocation, e.g. for a reverse search? Also, with the semantics of checking all later addresses, basically any CPU in kernel mode or any embedded target would have to always return false, since there's probably some memory-mapped device later in the address space that you can't load from without causing undesired side effects. So I agree it should just check size/alignment, and you have to know you can load at least one byte non-speculatively in a speculative load for it to not be UB.

Contributor Author

I'd prefer to keep the initial version of the intrinsic to just loading speculatively from the end, and leave speculatively loading from the front to a follow-up extension, to limit the scope

Contributor Author

Sorry for the confusion, the load intrinsic in the latest version actually supports skipping bits from the end and from the start depending on an argument. I updated the wording of the new sentence to say first or last.

Comment thread llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp Outdated
Comment thread llvm/lib/IR/Verifier.cpp Outdated
Comment thread llvm/lib/IR/Verifier.cpp Outdated
return nullptr;

// Power-of-2 constant size <= 16: use fast alignment check.
Value *PtrInt = Builder.CreatePtrToInt(Ptr, IntPtrTy);

Contributor

PtrToAddr.

Contributor

I also wonder whether we should be trying to avoid generating the check here, or do we expect it to be optimized later if the pointer is known aligned? (This seems awkward esp. in conjunction with runtime check cost modeling?)

Contributor Author

Updated to use PtrToAddr, thanks.

I am not sure if we could do any better with the intrinsic? In terms of cost-modeling, TTI would need to know how the intrinsic gets expanded, but this is already true for most intrinsics, I think.

Contributor

Sorry, what I meant here is whether we should be avoiding the runtime check if we know the pointer is already appropriately aligned (and just directly return true in that case).

Contributor Author

Yeah for that there's currently no good way to get those simplifications unfortunately, other than maybe expand them earlier (before instcombine/simplifycfg)? In terms of cost modeling, TTI should be able to return zero cost if it is free on the platform with a given alignment on the argument for example
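
The idea of skipping the runtime check when the alignment is already known could look roughly like the following standalone C++ sketch. `foldCanLoadSpeculatively` and its parameters are hypothetical and not part of the patch or of any LLVM API; the default limit of 16 bytes mirrors the AArch64 expansion in this PR, and both `size` and `knownAlign` are assumed to be positive powers of 2.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: decide the check statically when possible.
//   knownAlign: alignment in bytes already proven for the pointer
//   size:       access size in bytes (positive power of 2)
//   maxSize:    per-target limit (assumption: 16, as on AArch64 here)
// Returns true/false when the answer is known without inspecting the
// address, and std::nullopt when a runtime alignment check is still needed.
std::optional<bool> foldCanLoadSpeculatively(std::uint64_t knownAlign,
                                             std::uint64_t size,
                                             std::uint64_t maxSize = 16) {
  if (size > maxSize)
    return false; // too large for the target limit: always false
  if (knownAlign >= size)
    return true; // alignment to knownAlign implies (ptr & (size - 1)) == 0
  return std::nullopt; // fall back to testing the address at run time
}
```

A lowering hook could consult such a helper before emitting the `and`/`icmp` sequence, though where the known alignment would come from (for example an `align` attribute on the call argument) is left open in the discussion above.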

Comment thread llvm/docs/LangRef.rst Outdated
The '``llvm.speculative.load``' intrinsic loads a value from memory. Unlike a
regular load, the memory access may extend beyond the bounds of the allocated
object, provided the pointer has been verified by
:ref:`llvm.can.load.speculatively <int_can_load_speculatively>` to ensure the

Contributor

I'd like the wording to be a bit clearer wrt can.load.speculatively, which is a sufficient condition, but not a necessary one. The necessary condition is just that the load does not trap, based on target-specific guarantees. Using llvm.speculative.load should be fine even if the target does not happen to provide a can.load.speculatively implementation.

Contributor Author

Agreed that we should be able to introduce llvm.speculative.load without introducing llvm.can.load.speculatively if we can prove it's safe statically through other means.

I think @efriedma-quic earlier raised a concern about wording this in terms of a trap, as on embedded targets out-of-bounds loads can have unexpected effects which aren't a literal trap. (#179642 (comment))

Not sure if there's a more generic term to use instead of trap?

Contributor

I guess you could say something abstract like "can be safely accessed on the underlying hardware"?

Contributor Author

Updated, thanks.

@fhahn fhahn force-pushed the speculative-load-intrinsics branch 3 times, most recently from e83078d to 05da42e Compare May 21, 2026 12:44
@fhahn

fhahn commented May 21, 2026

Contributor Author

ping

Comment thread llvm/docs/LangRef.rst Outdated
@fhahn fhahn force-pushed the speculative-load-intrinsics branch 2 times, most recently from 810b447 to 1a9c3a6 Compare June 1, 2026 11:15

@fhahn fhahn left a comment

Contributor Author

ping

fhahn added 14 commits June 9, 2026 21:51
Introduce two new intrinsics to enable vectorization of loops with early
exits that have potentially faulting loads.

This has previously been discussed in
llvm#120603 and is similar to
@nikic's https://hackmd.io/@nikic/S1O4QWYZkx, with the major difference
being that there is no `%defined_size` argument and instead the load
returns the stored values for the bytes within bounds and undef
otherwise. I don't think we can easily compute the defined size because
it may depend on the loaded values (i.e. at what lane the early exit has
been taken).
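
To make the difference from the `%defined_size` design concrete, the described result can be modeled in C++. This is only a sketch under stated assumptions: `modelSpeculativeLoad` and `firstMatch` are hypothetical helpers, the object is modeled as a byte vector, and an empty `std::optional` merely stands in for `undef` (a real `undef` lane may be observed as any value).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of the described semantics: lanes that fall inside the
// object carry the stored byte; lanes past the end are left empty, standing
// in for undef.
template <std::size_t N>
std::array<std::optional<std::uint8_t>, N>
modelSpeculativeLoad(const std::vector<std::uint8_t> &Object,
                     std::size_t Offset) {
  std::array<std::optional<std::uint8_t>, N> Lanes{};
  for (std::size_t I = 0; I < N; ++I)
    if (Offset + I < Object.size())
      Lanes[I] = Object[Offset + I];
  return Lanes;
}

// Early-exit search over the loaded lanes. Returns the index of the first
// lane holding Needle, or N if there is none. Empty lanes never match.
template <std::size_t N>
std::size_t firstMatch(const std::array<std::optional<std::uint8_t>, N> &Lanes,
                       std::uint8_t Needle) {
  for (std::size_t I = 0; I < N; ++I)
    if (Lanes[I] && *Lanes[I] == Needle)
      return I;
  return N;
}
```

In this model, how many lanes actually matter is only known after comparing the loaded data against `Needle`, which is the reason given above for not passing a defined size up front.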

1. `@llvm.speculative.load` (name subject to change) - perform a load that
   may access memory beyond the allocated object. It must be used in
   combination with `@llvm.can.load.speculatively` to ensure the load is
   guaranteed to not trap.

2. `@llvm.can.load.speculatively` - Returns true if it's safe to speculatively
   load a given number of bytes from a pointer. The semantics are
   target-dependent. On some targets, this may check that the access
   does not cross page boundaries, or stricter checks for example on
   AArch64 with MTE, which limits the access size to 16 bytes.
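
As an illustration of the page-boundary variant mentioned in item 2, the check could look roughly like the following C++ model. It is a sketch, not the lowering in this PR: `staysWithinPage` is a hypothetical name and the 4 KiB `kPageSize` is an assumption, since the real check is target-specific. The idea is that, where protection is page-granular and the first byte is known to be accessible, an access that stays within the same page should not fault.

```cpp
#include <cstdint>

// Assumption for illustration only: 4 KiB pages.
constexpr std::uint64_t kPageSize = 4096;

// Hypothetical model: true if the byte range [Addr, Addr + Size) lies
// entirely within a single page.
inline bool staysWithinPage(std::uint64_t Addr, std::uint64_t Size) {
  if (Size == 0)
    return true; // an empty access touches no page
  if (Size > kPageSize)
    return false; // cannot fit in one page
  std::uint64_t Offset = Addr & (kPageSize - 1); // offset inside the page
  return Offset + Size <= kPageSize;
}
```

A stricter target, such as the MTE case above, would replace `kPageSize` with its own granule size in a model like this.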

`@llvm.speculative.load` is lowered to a regular load in SelectionDAG
without MODereferenceable. I am not sure if we need to be more careful
than this, i.e. if we could still reason about SelectionDAG loads to
infer dereferenceability for the pointer.

`@llvm.can.load.speculatively` is lowered to regular IR in PreISel
lowering, using a target-lowering hook. By default, it conservatively
expands to false.

These intrinsics should allow the loop vectorizer to vectorize early-exit
loops with potentially non-dereferenceable loads.
@fhahn fhahn force-pushed the speculative-load-intrinsics branch from 1a9c3a6 to 4a1e10a Compare June 9, 2026 19:51

@fhahn fhahn left a comment

Contributor Author

ping

@nikic nikic left a comment

Contributor

LGTM

10 participants