
Commit cdc2d28

ezyang authored and facebook-github-bot committed
Structured kernel definitions (#45277)
Summary: Pull Request resolved: #45277

Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework. The general structure of this diff:

- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model.
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model.
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by the actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they are a bit different, as they call your meta functions and make reference to the actual implementations in out.
- Then there's a port of `upsample_nearest1d`; it is easiest to review by just looking at what the final code looks like.

Missing pieces:

- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

This PR improves instruction counts on `upsample_nearest1d` because it eliminates an extra redispatch. Testing `at::upsample_nearest1d(x, {10});`:

- Functional: before 1314105, after 1150705
- Out: before 915705, after 838405

These numbers may be jittered by up to +-16400 (which is the difference when I tested against an unaffected operator, `at::upsample_linear1d`), though that may also be because unrelated changes affected all operators globally.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D24253555

Test Plan: Imported from OSS

Reviewed By: smessmer

Pulled By: ezyang

fbshipit-source-id: 4ef58dd911991060f13576864c8171f9cc614456
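The delegation pattern described above (meta function does shape checking and output description; the out kernel only computes; the functional variant allocates from the meta and delegates) can be sketched in Python. This is an illustrative model only, with dicts standing in for tensors; none of these names are the actual generated code.

```python
# Sketch of the structured-kernel wrapper pattern. Shape checking lives in
# one place (the meta function); the functional variant is defined entirely
# in terms of the out variant, eliminating an extra redispatch.

def upsample_nearest1d_meta(input_sizes, output_size):
    # Meta function: validate inputs and describe the output tensor.
    assert len(output_size) == 1, "expected output_size of length 1"
    nbatch, channels, _input_width = input_sizes
    return (nbatch, channels, output_size[0])  # stand-in for TensorMeta

def upsample_nearest1d_out(out, inp, output_size):
    # Out variant: caller supplies storage; meta derives the expected shape.
    meta = upsample_nearest1d_meta(inp["sizes"], output_size)
    assert tuple(out["sizes"]) == meta, "out tensor has wrong shape"
    out["computed"] = True  # stand-in for the actual kernel launch
    return out

def upsample_nearest1d(inp, output_size):
    # Functional variant: allocate from the meta, then delegate to out.
    meta = upsample_nearest1d_meta(inp["sizes"], output_size)
    out = {"sizes": list(meta), "computed": False}
    return upsample_nearest1d_out(out, inp, output_size)

x = {"sizes": [2, 3, 5], "computed": False}
y = upsample_nearest1d(x, [10])
print(y["sizes"])  # [2, 3, 10]
```

The point of the split is that the functional and out variants share one shape-checking path, rather than each re-validating (and redispatching) on its own.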
1 parent d7e8384 commit cdc2d28

10 files changed

Lines changed: 412 additions & 144 deletions

File tree

BUILD.bazel

Lines changed: 1 addition & 0 deletions
@@ -136,6 +136,7 @@ genrule(
     "aten/src/ATen/Functions.h",
     "aten/src/ATen/Functions.cpp",
     "aten/src/ATen/NativeFunctions.h",
+    "aten/src/ATen/MetaFunctions.h",
     "aten/src/ATen/core/TensorBody.h",
     "aten/src/ATen/core/TensorMethods.cpp",
     "aten/src/ATen/core/ATenOpList.cpp",

aten/src/ATen/TensorMeta.h

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+#pragma once
+
+#include <ATen/ATen.h> // TODO: improve
+// #include <ATen/NativeFunctions.h>
+
+namespace at {
+
+struct TensorMeta {
+  DimVector sizes;
+  // TODO: DimVector strides;
+  TensorOptions options;
+
+  TensorMeta(IntArrayRef _sizes, TensorOptions _options)
+    : sizes(_sizes), options(_options) {}
+};
+
+inline Tensor tensor_from_meta(const TensorMeta& meta) {
+  // TODO: eliminate indirection
+  return at::empty(meta.sizes, meta.options);
+}
+
+// Analogous to self.new_empty(sizes)
+inline TensorMeta new_meta(const Tensor& self, IntArrayRef sizes) {
+  return TensorMeta(sizes, self.options());
+}
+
+} // namespace at
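The new header separates "describe the output" from "materialize the output". A Python analogue of that split (illustrative only; a dataclass and a dict stand in for the C++ struct and `at::empty`):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TensorMeta:
    # Carries sizes plus allocation options; no storage yet.
    sizes: List[int]
    options: Dict[str, str] = field(default_factory=dict)

def new_meta(self_sizes, self_options, sizes):
    # Analogous to self.new_empty(sizes): inherit options, take new sizes.
    return TensorMeta(list(sizes), dict(self_options))

def tensor_from_meta(meta):
    # Materialization step: a flat zero buffer stands in for at::empty.
    n = 1
    for s in meta.sizes:
        n *= s
    return {"sizes": meta.sizes, "data": [0.0] * n, **meta.options}

m = new_meta([2, 3, 5], {"dtype": "float"}, [2, 3, 10])
t = tensor_from_meta(m)
print(len(t["data"]))  # 60
```

Keeping the meta object allocation-free is what lets the same shape logic eventually back meta tensors and shape inference, not just eager allocation.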
Lines changed: 46 additions & 80 deletions
@@ -1,47 +1,12 @@
 #include <ATen/ATen.h>
 #include <ATen/NativeFunctions.h>
 #include <ATen/native/UpSample.h>
+#include <ATen/MetaFunctions.h>
 
 namespace at {
-namespace native {
-namespace {
-
-static void upsample_nearest1d_out_cpu_template(
-    Tensor& output,
-    const Tensor& input,
-    IntArrayRef output_size,
-    c10::optional<double> scales) {
-  TORCH_CHECK(
-      output_size.size() == 1,
-      "It is expected output_size equals to 1, but got size ",
-      output_size.size());
-
-  int64_t output_width = output_size[0];
-
-  int64_t nbatch = input.size(0);
-  int64_t channels = input.size(1);
-  int64_t input_width = input.size(2);
-
-  upsample_1d_shape_check(
-      input,
-      Tensor(),
-      nbatch,
-      channels,
-      input_width,
-      output_width);
+namespace meta {
 
-  output.resize_({nbatch, channels, output_width});
-
-  AT_ASSERT(input_width > 0 && output_width > 0);
-  upsample_nearest1d_kernel(kCPU, output, input, scales);
-}
-
-static void upsample_nearest1d_backward_out_cpu_template(
-    Tensor& grad_input,
-    const Tensor& grad_output,
-    IntArrayRef output_size,
-    IntArrayRef input_size,
-    c10::optional<double> scales) {
+static std::array<int64_t, 3> upsample_nearest1d_common_check(IntArrayRef input_size, IntArrayRef output_size) {
   TORCH_CHECK(
       output_size.size() == 1,
      "It is expected output_size equals to 1, but got size ",
@@ -58,36 +23,50 @@ static void upsample_nearest1d_backward_out_cpu_template(
   int64_t channels = input_size[1];
   int64_t input_width = input_size[2];
 
-  upsample_1d_shape_check(
-      Tensor(),
-      grad_output,
-      nbatch,
-      channels,
+  TORCH_CHECK(
+      input_width > 0 && output_width > 0,
+      "Input and output sizes should be greater than 0, but got input (W: ",
       input_width,
-      output_width);
+      ") and output (W: ",
+      output_width,
+      ")");
 
-  grad_input.resize_({nbatch, channels, input_width});
-  grad_input.zero_();
+  return {nbatch, channels, output_width};
+}
 
-  upsample_nearest1d_backward_kernel(kCPU, grad_input, grad_output, scales);
+TensorMeta upsample_nearest1d(const Tensor& input, IntArrayRef output_size, c10::optional<double> scales) {
+  auto full_output_size = upsample_nearest1d_common_check(input.sizes(), output_size);
+
+  // Allow for empty batch size but not other dimensions
+  TORCH_CHECK(
+      (input.size(1) != 0 && input.size(2) != 0) && input.dim() == 3,
+      "Non-empty 3D data tensor expected but got a tensor with sizes ",
+      input.sizes());
+
+  return new_meta(input, full_output_size);
 }
-} // namespace
 
-Tensor& upsample_nearest1d_out_cpu(
-    Tensor& output,
-    const Tensor& input,
-    IntArrayRef output_size,
-    c10::optional<double> scales) {
-  upsample_nearest1d_out_cpu_template(output, input, output_size, scales);
-  return output;
+TensorMeta upsample_nearest1d_backward(const Tensor& grad_output, IntArrayRef output_size, IntArrayRef input_size, c10::optional<double> scales) {
+  auto full_output_size = upsample_nearest1d_common_check(input_size, output_size);
+
+  check_dim_size(grad_output, 3, 0, full_output_size[0]);
+  check_dim_size(grad_output, 3, 1, full_output_size[1]);
+  check_dim_size(grad_output, 3, 2, full_output_size[2]);
+
+  return new_meta(grad_output, input_size);
 }
 
-Tensor upsample_nearest1d_cpu(
+} // namespace meta
+
+
+namespace native {
+
+Tensor& upsample_nearest1d_out_cpu(
+    Tensor& output,
     const Tensor& input,
     IntArrayRef output_size,
     c10::optional<double> scales) {
-  auto output = at::empty({0}, input.options());
-  upsample_nearest1d_out_cpu_template(output, input, output_size, scales);
+  upsample_nearest1d_kernel(kCPU, output, input, scales);
   return output;
 }
 
@@ -97,51 +76,38 @@ Tensor& upsample_nearest1d_backward_out_cpu(
     IntArrayRef output_size,
     IntArrayRef input_size,
     c10::optional<double> scales) {
-  upsample_nearest1d_backward_out_cpu_template(
-      grad_input, grad_output, output_size, input_size, scales);
-  return grad_input;
-}
-
-Tensor upsample_nearest1d_backward_cpu(
-    const Tensor& grad_output,
-    IntArrayRef output_size,
-    IntArrayRef input_size,
-    c10::optional<double> scales) {
-  auto grad_input = at::zeros(input_size, grad_output.options());
-  upsample_nearest1d_backward_out_cpu_template(
-      grad_input, grad_output, output_size, input_size, scales);
+  grad_input.zero_();
+  upsample_nearest1d_backward_kernel(kCPU, grad_input, grad_output, scales);
   return grad_input;
 }
 
 using at::native::upsample::compute_output_size;
 using at::native::upsample::get_scale_value;
 
-Tensor upsample_nearest1d_cpu(
+// vec variants
+
+Tensor upsample_nearest1d(
     const Tensor& input,
     c10::optional<IntArrayRef> output_size,
     c10::optional<ArrayRef<double>> scale_factors) {
-  auto output = at::empty({0}, input.options());
   auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
   auto scale_w = get_scale_value(scale_factors, 0);
-  upsample_nearest1d_out_cpu_template(output, input, osize, scale_w);
-  return output;
+  return at::upsample_nearest1d(input, osize, scale_w);
 }
 
-Tensor upsample_nearest1d_backward_cpu(
+Tensor upsample_nearest1d_backward(
     const Tensor& grad_output,
     c10::optional<IntArrayRef> output_size,
     IntArrayRef input_size,
     c10::optional<ArrayRef<double>> scale_factors) {
   auto osize = compute_output_size(input_size, output_size, scale_factors);
   auto scale_w = get_scale_value(scale_factors, 0);
-  auto grad_input = at::zeros(input_size, grad_output.options());
-  upsample_nearest1d_backward_out_cpu_template(
-      grad_input, grad_output, osize, input_size, scale_w);
+  return at::upsample_nearest1d_backward(grad_output, osize, input_size, scale_w);
 }
 
 DEFINE_DISPATCH(upsample_nearest1d_kernel);
 DEFINE_DISPATCH(upsample_nearest1d_backward_kernel);
 
 } // namespace native
+
 } // namespace at
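The shape logic factored into `upsample_nearest1d_common_check` above is small enough to restate directly; a hedged Python rendering of the same checks, for reference:

```python
def upsample_nearest1d_common_check(input_size, output_size):
    # Mirrors the C++ checks above: output_size must have exactly one
    # element, widths must be positive, and the result is the full
    # [nbatch, channels, output_width] output shape.
    if len(output_size) != 1:
        raise ValueError(
            f"It is expected output_size equals to 1, but got size {len(output_size)}")
    output_width = output_size[0]
    nbatch, channels, input_width = input_size[0], input_size[1], input_size[2]
    if not (input_width > 0 and output_width > 0):
        raise ValueError(
            "Input and output sizes should be greater than 0, but got "
            f"input (W: {input_width}) and output (W: {output_width})")
    return [nbatch, channels, output_width]

print(upsample_nearest1d_common_check([2, 3, 5], [10]))  # [2, 3, 10]
```

Because both the forward and backward meta functions call this helper, the shape validation that used to be duplicated across the two `*_template` functions now exists in one place.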

aten/src/ATen/native/native_functions.yaml

Lines changed: 6 additions & 10 deletions
@@ -8253,15 +8253,13 @@
   use_c10_dispatcher: full
   python_module: nn
   dispatch:
-    CPU: upsample_nearest1d_cpu
-    CUDA: upsample_nearest1d_cuda
+    DefaultBackend: upsample_nearest1d
 
 - func: upsample_nearest1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor
   use_c10_dispatcher: full
   python_module: nn
   dispatch:
-    CPU: upsample_nearest1d_backward_cpu
-    CUDA: upsample_nearest1d_backward_cuda
+    DefaultBackend: upsample_nearest1d_backward
 
 - func: upsample_nearest2d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor
   use_c10_dispatcher: full
@@ -8401,29 +8399,27 @@
 
 - func: upsample_nearest1d.out(Tensor self, int[1] output_size, float? scales=None, *, Tensor(a!) out) -> Tensor(a!)
   python_module: nn
+  structured: True
   dispatch:
     CPU: upsample_nearest1d_out_cpu
     CUDA: upsample_nearest1d_out_cuda
 
 - func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor
   use_c10_dispatcher: full
   python_module: nn
-  dispatch:
-    CPU: upsample_nearest1d_cpu
-    CUDA: upsample_nearest1d_cuda
+  structured_delegate: upsample_nearest1d.out
 
 - func: upsample_nearest1d_backward.grad_input(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None, *, Tensor(a!) grad_input) -> Tensor(a!)
   python_module: nn
+  structured: True
   dispatch:
     CPU: upsample_nearest1d_backward_out_cpu
     CUDA: upsample_nearest1d_backward_out_cuda
 
 - func: upsample_nearest1d_backward(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None) -> Tensor
   use_c10_dispatcher: full
   python_module: nn
-  dispatch:
-    CPU: upsample_nearest1d_backward_cpu
-    CUDA: upsample_nearest1d_backward_cuda
+  structured_delegate: upsample_nearest1d_backward.grad_input
 
 - func: upsample_nearest2d.out(Tensor self, int[2] output_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!)
   python_module: nn
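The commit summary mentions new consistency checking in tools.codegen.model for these paired entries. The invariant can be pictured roughly like this; the dict shapes and error messages are illustrative, not the actual checker:

```python
def check_structured_group(functional, out):
    # A structured group pairs an out= entry (structured: True, owns the
    # dispatch table) with a functional entry that delegates to it.
    assert out.get("structured") is True, \
        "structured_delegate must point at an entry marked structured: True"
    assert functional.get("structured_delegate") == out["func_name"], \
        "structured_delegate must name its out= variant"
    assert "dispatch" not in functional, \
        "a structured delegate gets its kernels from the out variant"

out_entry = {"func_name": "upsample_nearest1d.out", "structured": True,
             "dispatch": {"CPU": "upsample_nearest1d_out_cpu",
                          "CUDA": "upsample_nearest1d_out_cuda"}}
fn_entry = {"func_name": "upsample_nearest1d",
            "structured_delegate": "upsample_nearest1d.out"}
check_structured_group(fn_entry, out_entry)
print("ok")
```

This is why only the `.out` entries above keep per-backend dispatch lines: the functional variants no longer name any kernel of their own.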
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+#pragma once
+
+// ${generated_comment}
+
+#include <ATen/ATen.h> // TODO: improve
+#include <ATen/TensorMeta.h>
+
+namespace at {
+namespace meta {
+
+${declarations}
+
+} // namespace meta
+} // namespace at

aten/src/ATen/templates/RegisterDispatchKey.cpp

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
 #include <c10/core/Allocator.h>
 #include <ATen/DeviceGuard.h>
 #include <ATen/NativeFunctions.h>
+#include <ATen/MetaFunctions.h>
 #include <ATen/NamedTensorUtils.h>
 #include <ATen/Utils.h>
 #include <ATen/WrapDimUtils.h>

tools/codegen/api/meta.py

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+from tools.codegen.model import *
+from tools.codegen.api.types import MetaArgument
+
+import tools.codegen.api.cpp as cpp
+import tools.codegen.api.dispatcher as dispatcher
+
+from typing import Sequence
+import itertools
+
+# Follows dispatcher calling convention, but:
+#   - Mutable arguments not allowed.  Meta functions are always
+#     written in functional form.  Look at FunctionSchema.signature()
+#   - No tensor returns; instead we return a TensorMeta describing
+#     the tensor in question
+
+def name(f: FunctionSchema) -> str:
+    assert f.name.overload_name == ""
+    return str(f.name.name)
+
+def argument_type(a: Argument) -> str:
+    assert not a.is_write
+    return dispatcher.argumenttype_type(a.type, mutable=False)
+
+def returntype_type(t: Type) -> str:
+    r = cpp.valuetype_type(t)
+    if r is not None:
+        return r
+
+    if isinstance(t, BaseType):
+        if t.name == BaseTy.Tensor:
+            return 'TensorMeta'
+    elif isinstance(t, ListType):
+        raise NotImplementedError("list returns not supported yet")
+
+    raise AssertionError(f"unrecognized return type {t}")
+
+def return_type(r: Return) -> str:
+    assert not r.is_write
+    return returntype_type(r.type)
+
+def returns_type(rs: Sequence[Return]) -> str:
+    if len(rs) == 0:
+        return 'void'
+    elif len(rs) == 1:
+        return return_type(rs[0])
+    else:
+        args = ','.join(map(return_type, rs))
+        return f'std::tuple<{args}>'
+
+def argument(a: Argument) -> MetaArgument:
+    return MetaArgument(
+        type=argument_type(a),
+        name=a.name,
+        argument=a,
+    )
+
+def arguments(func: FunctionSchema) -> Sequence[MetaArgument]:
+    assert not func.out_arguments
+    return list(map(argument, itertools.chain(func.arguments, func.kwarg_only_arguments)))
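The return-type mapping in `returns_type` above is the core of the meta calling convention: Tensor returns become `TensorMeta`, everything else keeps its C++ value type. A simplified, self-contained stand-in (the real code dispatches on the codegen model's Type classes rather than strings):

```python
def returns_type(rs):
    # Simplified model of the mapping above: each return is given as a C++
    # type string, and "Tensor" is rewritten to "TensorMeta". Zero returns
    # map to void; multiple returns are packed into a std::tuple.
    def one(r):
        return 'TensorMeta' if r == 'Tensor' else r
    if len(rs) == 0:
        return 'void'
    elif len(rs) == 1:
        return one(rs[0])
    else:
        return f"std::tuple<{','.join(map(one, rs))}>"

print(returns_type(['Tensor']))             # TensorMeta
print(returns_type(['Tensor', 'int64_t']))  # std::tuple<TensorMeta,int64_t>
print(returns_type([]))                     # void
```

The same rewrite never applies to arguments: meta functions take the dispatcher argument types unchanged (minus mutability), since only outputs need to be described rather than materialized.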
