| 1 | +============================================================= |
| 2 | +NEP 16 — An abstract base class for identifying "duck arrays" |
| 3 | +============================================================= |
| 4 | + |
| 5 | +:Author: Nathaniel J. Smith <njs@pobox.com> |
| 6 | +:Status: Withdrawn |
| 7 | +:Type: Standards Track |
| 8 | +:Created: 2018-03-06 |
| 9 | +:Resolution: https://github.com/numpy/numpy/pull/12174 |
| 10 | + |
| 11 | +.. note:: |
| 12 | + |
| 13 | +    This NEP has been withdrawn in favor of the protocol-based approach |
| 14 | +    described in |
| 15 | +    `NEP 22 <http://www.numpy.org/neps/nep-0022-ndarray-duck-typing-overview.html>`__. |
| 16 | + |
| 17 | +Abstract |
| 18 | +-------- |
| 19 | + |
| 20 | +We propose to add an abstract base class ``AbstractArray`` so that |
| 21 | +third-party classes can declare their ability to "quack like" an |
| 22 | +``ndarray``, and an ``asabstractarray`` function that performs |
| 23 | +similarly to ``asarray`` except that it passes through |
| 24 | +``AbstractArray`` instances unchanged. |
| 25 | + |
| 26 | + |
| 27 | +Detailed description |
| 28 | +-------------------- |
| 29 | + |
| 30 | +Many functions, in NumPy and in third-party packages, start with some |
| 31 | +code like:: |
| 32 | + |
| 33 | +    def myfunc(a, b): |
| 34 | +        a = np.asarray(a) |
| 35 | +        b = np.asarray(b) |
| 36 | +        ... |
| 37 | + |
| 38 | +This ensures that ``a`` and ``b`` are ``np.ndarray`` objects, so |
| 39 | +``myfunc`` can carry on assuming that they'll act like ndarrays both |
| 40 | +semantically (at the Python level), and also in terms of how they're |
| 41 | +stored in memory (at the C level). But many of these functions only |
| 42 | +work with arrays at the Python level, which means that they don't |
| 43 | +actually need ``ndarray`` objects *per se*: they could work just as |
| 44 | +well with any Python object that "quacks like" an ndarray, such as |
| 45 | +sparse arrays, dask's lazy arrays, or xarray's labeled arrays. |
| 46 | + |
| 47 | +However, currently, there's no way for these libraries to express that |
| 48 | +their objects can quack like an ndarray, and there's no way for |
| 49 | +functions like ``myfunc`` to express that they'd be happy with |
| 50 | +anything that quacks like an ndarray. The purpose of this NEP is to |
| 51 | +provide those two features. |
| 52 | + |
| 53 | +Sometimes people suggest using ``np.asanyarray`` for this purpose, but |
| 54 | +unfortunately its semantics are exactly backwards: it guarantees that |
| 55 | +the object it returns uses the same memory layout as an ``ndarray``, |
| 56 | +but tells you nothing at all about its semantics, which makes it |
| 57 | +essentially impossible to use safely in practice. Indeed, the two |
| 58 | +``ndarray`` subclasses distributed with NumPy – ``np.matrix`` and |
| 59 | +``np.ma.masked_array`` – do have incompatible semantics, and if they |
| 60 | +were passed to a function like ``myfunc`` that doesn't check for them |
| 61 | +as a special case, then it may silently return incorrect results. |
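To make the hazard concrete, here is a minimal sketch. The function name ``elementwise_product`` is hypothetical and only stands in for a ``myfunc``-style helper; it assumes a NumPy version that still ships ``np.matrix``. The helper intends an elementwise product, but ``np.asanyarray`` lets an ``np.matrix`` through unchanged, and for that subclass ``*`` means matrix multiplication:

```python
import numpy as np


def elementwise_product(a, b):
    # Hypothetical helper: the author assumes ndarray semantics for ``*``.
    a = np.asanyarray(a)
    b = np.asanyarray(b)
    return a * b


data = [[1, 2], [3, 4]]

# With plain lists (converted to ndarray), ``*`` is elementwise.
plain_result = elementwise_product(data, data)

# np.matrix is an ndarray subclass, so asanyarray passes it through,
# and ``*`` silently becomes a matrix product.
matrix_result = elementwise_product(np.matrix(data), np.matrix(data))

print(plain_result.tolist())   # [[1, 4], [9, 16]]
print(matrix_result.tolist())  # [[7, 10], [15, 22]]
```

No exception is raised in the second call; the caller simply gets a different answer.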
| 62 | + |
| 63 | + |
| 64 | +Declaring that an object can quack like an array |
| 65 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 66 | + |
| 67 | +There are two basic approaches we could use for checking whether an |
| 68 | +object quacks like an array. We could check for a special attribute on |
| 69 | +the class:: |
| 70 | + |
| 71 | +    def quacks_like_array(obj): |
| 72 | +        return bool(getattr(type(obj), "__quacks_like_array__", False)) |
| 73 | + |
| 74 | +Or, we could define an `abstract base class (ABC) |
| 75 | +<https://docs.python.org/3/library/collections.abc.html>`__:: |
| 76 | + |
| 77 | +    def quacks_like_array(obj): |
| 78 | +        return isinstance(obj, AbstractArray) |
| 79 | + |
| 80 | +If you look at how ABCs work, this is essentially equivalent to |
| 81 | +keeping a global set of types that have been declared to implement the |
| 82 | +``AbstractArray`` interface, and then checking it for membership. |
| 83 | + |
| 84 | +Between these, the ABC approach seems to have a number of advantages: |
| 85 | + |
| 86 | +* It's Python's standard, "one obvious way" of doing this. |
| 87 | + |
| 88 | +* ABCs can be introspected (e.g. ``help(np.AbstractArray)`` does |
| 89 | + something useful). |
| 90 | + |
| 91 | +* ABCs can provide useful mixin methods. |
| 92 | + |
| 93 | +* ABCs integrate with other features like mypy type-checking, |
| 94 | + ``functools.singledispatch``, etc. |
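The following sketch illustrates the ABC approach using only the standard library. The ``AbstractArray`` class here is an empty stand-in for the one this NEP proposes, and ``SubclassedArray``, ``RegisteredArray`` and ``describe`` are hypothetical names chosen for the example. It shows both ways of declaring a duck array, and how ``functools.singledispatch`` picks up either kind:

```python
import abc
import functools


class AbstractArray(abc.ABC):
    """Empty stand-in for the ABC proposed in this NEP."""


class SubclassedArray(AbstractArray):
    """Declares itself a duck array by inheriting from the ABC."""


class RegisteredArray:
    """An unrelated class that is declared a duck array after the fact."""


# register() adds the class to the ABC's registry without inheritance.
AbstractArray.register(RegisteredArray)


def quacks_like_array(obj):
    return isinstance(obj, AbstractArray)


@functools.singledispatch
def describe(obj):
    return "not an array"


@describe.register(AbstractArray)
def _(obj):
    # Chosen for real subclasses and for register()ed classes alike.
    return "duck array"


print(quacks_like_array(SubclassedArray()))  # True
print(quacks_like_array(RegisteredArray()))  # True
print(quacks_like_array([1, 2, 3]))          # False
print(describe(RegisteredArray()))           # duck array
```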
| 95 | + |
| 96 | +One obvious thing to check is whether this choice affects speed. Using |
| 97 | +the attached benchmark script on a CPython 3.7 prerelease (revision |
| 98 | +c4d77a661138d, self-compiled, no PGO), on a Thinkpad T450s running |
| 99 | +Linux, we find:: |
| 100 | + |
| 101 | +    np.asarray(ndarray_obj)      330 ns |
| 102 | +    np.asarray([])              1400 ns |
| 103 | + |
| 104 | +    Attribute check, success      80 ns |
| 105 | +    Attribute check, failure      80 ns |
| 106 | + |
| 107 | +    ABC, success via subclass    340 ns |
| 108 | +    ABC, success via register()  700 ns |
| 109 | +    ABC, failure                 370 ns |
| 110 | + |
| 111 | +Notes: |
| 112 | + |
| 113 | +* The first two lines are included to put the other lines in context. |
| 114 | + |
| 115 | +* This used 3.7 because both ``getattr`` and ABCs are receiving |
| 116 | + substantial optimizations in this release, and it's more |
| 117 | + representative of the long-term future of Python. (Failed |
| 118 | + ``getattr`` doesn't necessarily construct an exception object |
| 119 | + anymore, and ABCs were reimplemented in C.) |
| 120 | + |
| 121 | +* The "success" lines refer to cases where ``quacks_like_array`` would |
| 122 | + return True. The "failure" lines are cases where it would return |
| 123 | + False. |
| 124 | + |
| 125 | +* The first measurement for ABCs is subclasses defined like:: |
| 126 | + |
| 127 | +      class MyArray(AbstractArray): |
| 128 | +          ... |
| 129 | + |
| 130 | +  The second is for subclasses defined like:: |
| 131 | + |
| 132 | +      class MyArray: |
| 133 | +          ... |
| 134 | + |
| 135 | +      AbstractArray.register(MyArray) |
| 136 | + |
| 137 | +  I don't know why there's such a large difference between these. |
| 138 | + |
| 139 | +In practice, either way we'd only do the full test after first |
| 140 | +checking for well-known types like ``ndarray``, ``list``, etc. `This |
| 141 | +is how NumPy currently checks for other double-underscore attributes |
| 142 | +<https://github.com/numpy/numpy/blob/master/numpy/core/src/private/get_attr_string.h>`__ |
| 143 | +and the same idea applies here to either approach. So these numbers |
| 144 | +won't affect the common case, just the case where we actually have an |
| 145 | +``AbstractArray``, or else another third-party object that will end up |
| 146 | +going through ``__array__`` or ``__array_interface__`` or end up as an |
| 147 | +object array. |
| 148 | + |
| 149 | +So in summary, using an ABC will be slightly slower than using an |
| 150 | +attribute, but this doesn't affect the most common paths, and the |
| 151 | +magnitude of slowdown is fairly small (~250 ns on an operation that |
| 152 | +already takes longer than that). Furthermore, we can potentially |
| 153 | +optimize this further (e.g. by keeping a tiny LRU cache of types that |
| 154 | +are known to be ``AbstractArray`` subclasses, on the assumption that most |
| 155 | +code will only use one or two of these types at a time), and it's very |
| 156 | +unclear that this even matters – if the speed of ``asarray`` no-op |
| 157 | +pass-throughs were a bottleneck that showed up in profiles, then |
| 158 | +probably we would have made them faster already! (It would be trivial |
| 159 | +to fast-path this, but we don't.) |
| 160 | + |
| 161 | +Given the semantic and usability advantages of ABCs, this seems like |
| 162 | +an acceptable trade-off. |
| 163 | + |
| 164 | +.. |
| 165 | +   CPython 3.6 (from Debian):: |
| 166 | + |
| 167 | +       Attribute check, success       110 ns |
| 168 | +       Attribute check, failure       370 ns |
| 169 | + |
| 170 | +       ABC, success via subclass      690 ns |
| 171 | +       ABC, success via register()    690 ns |
| 172 | +       ABC, failure                  1220 ns |
| 173 | + |
| 174 | + |
| 175 | +Specification of ``asabstractarray`` |
| 176 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 177 | + |
| 178 | +Given ``AbstractArray``, the definition of ``asabstractarray`` is simple:: |
| 179 | + |
| 180 | +    def asabstractarray(a, dtype=None): |
| 181 | +        if isinstance(a, AbstractArray): |
| 182 | +            if dtype is not None and dtype != a.dtype: |
| 183 | +                return a.astype(dtype) |
| 184 | +            return a |
| 185 | +        return asarray(a, dtype=dtype) |
| 186 | + |
| 187 | +Things to note: |
| 188 | + |
| 189 | +* ``asarray`` also accepts an ``order=`` argument, but we don't |
| 190 | + include that here because it's about details of memory |
| 191 | + representation, and the whole point of this function is that you use |
| 192 | + it to declare that you don't care about details of memory |
| 193 | + representation. |
| 194 | + |
| 195 | +* Using the ``astype`` method allows the ``a`` object to decide how to |
| 196 | + implement casting for its particular type. |
| 197 | + |
| 198 | +* For strict compatibility with ``asarray``, we skip calling |
| 199 | + ``astype`` when the dtype is already correct. Compare:: |
| 200 | + |
| 201 | +      >>> a = np.arange(10) |
| 202 | + |
| 203 | +      # astype() returns a new array by default, even if the dtype is unchanged: |
| 204 | +      >>> a.astype(a.dtype) is a |
| 205 | +      False |
| 206 | + |
| 207 | +      # asarray() returns the original object if possible: |
| 208 | +      >>> np.asarray(a, dtype=a.dtype) is a |
| 209 | +      True |
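The definition above can be exercised with a toy duck array. In this sketch, ``ConstantArray`` is a hypothetical class invented for the example, ``AbstractArray`` is a minimal stand-in with only ``astype`` declared abstract, and the requested dtype is normalized with ``np.dtype`` before comparing (a small addition that is not part of the definition above):

```python
import abc

import numpy as np


class AbstractArray(abc.ABC):
    """Minimal stand-in for the ABC proposed in this NEP."""

    @abc.abstractmethod
    def astype(self, dtype):
        ...


class ConstantArray(AbstractArray):
    """Toy duck array in which every element has the same value."""

    def __init__(self, value, shape, dtype=float):
        self.dtype = np.dtype(dtype)
        self.shape = tuple(shape)
        self.value = self.dtype.type(value)

    def astype(self, dtype):
        # The duck array decides for itself how casting works.
        return ConstantArray(self.value, self.shape, dtype)


def asabstractarray(a, dtype=None):
    if isinstance(a, AbstractArray):
        if dtype is not None and np.dtype(dtype) != a.dtype:
            return a.astype(dtype)
        return a
    return np.asarray(a, dtype=dtype)


duck = ConstantArray(1.0, (2, 3))
same = asabstractarray(duck)                       # passed through unchanged
converted = asabstractarray(duck, dtype=np.int32)  # cast via duck.astype()
fallback = asabstractarray([1, 2, 3])              # ordinary asarray() path
```

Here ``same is duck`` holds, ``converted`` is a new ``ConstantArray`` with dtype ``int32``, and ``fallback`` is a regular ``ndarray``.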
| 210 | + |
| 211 | + |
| 212 | +What exactly are you promising if you inherit from ``AbstractArray``? |
| 213 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 214 | + |
| 215 | +This will presumably be refined over time. The ideal of course is that |
| 216 | +your class should be indistinguishable from a real ``ndarray``, but |
| 217 | +nothing enforces that except the expectations of users. In practice, |
| 218 | +declaring that your class implements the ``AbstractArray`` interface |
| 219 | +simply means that it will start passing through ``asabstractarray``, |
| 220 | +and so by subclassing it you're saying that if some code works for |
| 221 | +``ndarray``\s but breaks for your class, then you're willing to accept |
| 222 | +bug reports on that. |
| 223 | + |
| 224 | +To start with, we should declare ``__array_ufunc__`` to be an abstract |
| 225 | +method, and add the ``NDArrayOperatorsMixin`` methods as mixin |
| 226 | +methods. |
| 227 | + |
| 228 | +Declaring ``astype`` as an ``@abstractmethod`` probably makes sense as |
| 229 | +well, since it's used by ``asabstractarray``. We might also want to go |
| 230 | +ahead and add some basic attributes like ``ndim``, ``shape``, |
| 231 | +``dtype``. |
| 232 | + |
| 233 | +Adding new abstract methods will be a bit tricky, because ABCs refuse |
| 234 | +to instantiate subclasses that leave an abstract method unimplemented; |
| 235 | +therefore, simply adding a new ``@abstractmethod`` will be a backwards |
| 236 | +compatibility break. If this becomes a problem then we can use some |
| 237 | +hacks to implement an ``@upcoming_abstractmethod`` decorator that only |
| 238 | +issues a warning if the method is missing, and treat it like a regular |
| 239 | +deprecation cycle. (In this case, the thing we'd be deprecating is |
| 240 | +"support for abstract arrays that are missing feature X".) |
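One possible shape for such a decorator is sketched below. This is only an illustration under stated assumptions: the marker attribute ``__upcoming_abstract__`` is invented for the example, the check runs in ``__init_subclass__`` (so the warning appears when the subclass is defined), and classes added through ``register()`` are not checked at all:

```python
import abc
import warnings


def upcoming_abstractmethod(func):
    # Hypothetical marker: flag the method as "will become abstract later".
    func.__upcoming_abstract__ = True
    return func


class AbstractArray(abc.ABC):
    """Stand-in ABC with one method on its way to becoming abstract."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for name, base_attr in vars(AbstractArray).items():
            if not getattr(base_attr, "__upcoming_abstract__", False):
                continue
            # If the subclass still resolves the name to the base class's
            # placeholder, it has not provided its own implementation.
            if getattr(cls, name) is base_attr:
                warnings.warn(
                    f"{cls.__name__} does not implement {name}(); "
                    "this will become an error in a future release",
                    DeprecationWarning,
                    stacklevel=2,
                )

    @upcoming_abstractmethod
    def astype(self, dtype):
        raise NotImplementedError
```

A subclass that omits ``astype`` can still be defined and instantiated, but triggers a ``DeprecationWarning``; a subclass that defines ``astype`` is silent.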
| 241 | + |
| 242 | + |
| 243 | +Naming |
| 244 | +~~~~~~ |
| 245 | + |
| 246 | +The name of the ABC doesn't matter too much, because it will only be |
| 247 | +referenced rarely and in relatively specialized situations. The name |
| 248 | +of the function matters a lot, because most existing instances of |
| 249 | +``asarray`` should be replaced by this, and in the future it's what |
| 250 | +everyone should be reaching for by default unless they have a specific |
| 251 | +reason to use ``asarray`` instead. This suggests that its name really |
| 252 | +should be *shorter* and *more memorable* than ``asarray``... which |
| 253 | +is difficult. I've used ``asabstractarray`` in this draft, but I'm not |
| 254 | +really happy with it, because it's too long and people are unlikely to |
| 255 | +start using it by habit without endless exhortations. |
| 256 | + |
| 257 | +One option would be to actually change ``asarray``\'s semantics so |
| 258 | +that *it* passes through ``AbstractArray`` objects unchanged. But I'm |
| 259 | +worried that there may be a lot of code out there that calls |
| 260 | +``asarray`` and then passes the result into some C function that |
| 261 | +doesn't do any further type checking (because it knows that its caller |
| 262 | +has already used ``asarray``). If we allow ``asarray`` to return |
| 263 | +``AbstractArray`` objects, and then someone calls one of these C |
| 264 | +wrappers and passes it an ``AbstractArray`` object like a sparse |
| 265 | +array, then they'll get a segfault. Right now, in the same situation, |
| 266 | +``asarray`` will instead invoke the object's ``__array__`` method, or |
| 267 | +use the buffer interface to make a view, or pass through an array with |
| 268 | +object dtype, or raise an error, or similar. Probably none of these |
| 269 | +outcomes are actually desirable in most cases, so maybe making it a |
| 270 | +segfault instead would be OK? But it's dangerous given that we don't |
| 271 | +know how common such code is. OTOH, if we were starting from scratch |
| 272 | +then this would probably be the ideal solution. |
| 273 | + |
| 274 | +We can't use ``asanyarray`` or ``array``, since those are already |
| 275 | +taken. |
| 276 | + |
| 277 | +Any other ideas? ``np.cast``, ``np.coerce``? |
| 278 | + |
| 279 | + |
| 280 | +Implementation |
| 281 | +-------------- |
| 282 | + |
| 283 | +1. Rename ``NDArrayOperatorsMixin`` to ``AbstractArray`` (leaving |
| 284 | + behind an alias for backwards compatibility) and make it an ABC. |
| 285 | + |
| 286 | +2. Add ``asabstractarray`` (or whatever we end up calling it), and |
| 287 | + probably a C API equivalent. |
| 288 | + |
| 289 | +3. Begin migrating NumPy internal functions to using |
| 290 | + ``asabstractarray`` where appropriate. |
| 291 | + |
| 292 | + |
| 293 | +Backward compatibility |
| 294 | +---------------------- |
| 295 | + |
| 296 | +This is purely a new feature, so there are no compatibility issues. |
| 297 | +(Unless we decide to change the semantics of ``asarray`` itself.) |
| 298 | + |
| 299 | + |
| 300 | +Rejected alternatives |
| 301 | +--------------------- |
| 302 | + |
| 303 | +One suggestion that has come up is to define multiple abstract classes |
| 304 | +for different subsets of the array interface. Nothing in this proposal |
| 305 | +stops either NumPy or third parties from doing this in the future, but |
| 306 | +it's very difficult to guess ahead of time which subsets would be |
| 307 | +useful. Also, "the full ndarray interface" is something that existing |
| 308 | +libraries are written to expect (because they work with actual |
| 309 | +ndarrays) and test (because they test with actual ndarrays), so it's |
| 310 | +by far the easiest place to start. |
| 311 | + |
| 312 | + |
| 313 | +Links to discussion |
| 314 | +------------------- |
| 315 | + |
| 316 | +* https://mail.python.org/pipermail/numpy-discussion/2018-March/077767.html |
| 317 | + |
| 318 | + |
| 319 | +Appendix: Benchmark script |
| 320 | +-------------------------- |
| 321 | + |
| 322 | +.. literalinclude:: nep-0016-benchmark.py |
| 323 | + |
| 324 | + |
| 325 | +Copyright |
| 326 | +--------- |
| 327 | + |
| 328 | +This document has been placed in the public domain. |