gh-104909: Split BINARY_OP into micro-ops #104910


Merged · 8 commits · May 31, 2023
121 changes: 74 additions & 47 deletions Python/bytecodes.c
@@ -279,73 +279,111 @@ dummy_func(

family(binary_op, INLINE_CACHE_ENTRIES_BINARY_OP) = {
BINARY_OP,
BINARY_OP_ADD_FLOAT,
BINARY_OP_MULTIPLY_INT,
BINARY_OP_ADD_INT,
BINARY_OP_ADD_UNICODE,
// BINARY_OP_INPLACE_ADD_UNICODE, // This is an odd duck.
BINARY_OP_SUBTRACT_INT,
BINARY_OP_MULTIPLY_FLOAT,
BINARY_OP_MULTIPLY_INT,
BINARY_OP_ADD_FLOAT,
BINARY_OP_SUBTRACT_FLOAT,
BINARY_OP_SUBTRACT_INT,
BINARY_OP_ADD_UNICODE,
// BINARY_OP_INPLACE_ADD_UNICODE, // See comments at that opcode.
Comment on lines -282 to +289

Member:
Minor nit: these were alphabetized before, and now they're not. I think it makes sense to reorder the implementations below, but I'm not sure there's also value in reordering these.

Member Author:
I'm not a big fan of alphabetization any more (just search :-), and I tried to let the grouping match the ordering of the definitions below, with the exception of BINARY_OP itself, which is somewhere else entirely.

TBH I'm not sure that there's a single organizing principle in the ordering in this file any more; I like to keep families together, but I also don't like to move code around unnecessarily.

If you insist I can undo this chunk.

};


inst(BINARY_OP_MULTIPLY_INT, (unused/1, left, right -- prod)) {
op(_BINARY_OP_INT_GUARD, (left, right -- left, right)) {
Member:
Also, I think maybe the name could just be _GUARD_INT or something, since this could potentially be reused for things like COMPARE_OP.

Member Author:
We could rename it once we get to that bridge... But we'd probably end up moving it around again too. Anyway, it looks like there are other concerns for these guards.

Member:
Suggested name GUARD_BOTH_INTS as it checks that both are ints, regardless of how they are used.

Member Author:
I'll use _GUARD_BOTH_INT, _GUARD_BOTH_FLOAT, and _GUARD_BOTH_UNICODE.

DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
DEOPT_IF(!PyLong_CheckExact(right), BINARY_OP);
}
Member:
It seems to me that there might be value in splitting these types of checks up: one for guarding the top of stack, and one for guarding the second item on the stack.

That way we could successfully remove uops for cases like 1 + rhs or lhs += 1, where the type of one argument is known (or inferred).
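Split guards of that sort might look something like this in the bytecodes DSL (a purely hypothetical sketch, not part of this PR; the uop names `_GUARD_TOS_INT` and `_GUARD_NOS_INT` are invented here for illustration):

```c
// Hypothetical: guard only the top of stack...
op(_GUARD_TOS_INT, (value -- value)) {
    DEOPT_IF(!PyLong_CheckExact(value), BINARY_OP);
}

// ...and, separately, guard only the item below it.
op(_GUARD_NOS_INT, (value, unused -- value, unused)) {
    DEOPT_IF(!PyLong_CheckExact(value), BINARY_OP);
}
```

For `lhs + 1`, an optimizer that can prove the constant is an int would then emit only `_GUARD_NOS_INT` before the add uop, instead of a combined guard that rechecks both operands.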

Member (@Fidget-Spinner, May 26, 2023):
Just a heads up: in the very near future, for py lazy basic block, we plan to just have a generic CHECK_INT x that checks that the xth stack entry is indeed an int. The main reason is that it would save us an opcode.

Member (@brandtbucher, May 26, 2023):
That makes sense. Just need space for a couple extra opargs. I wonder if it would make sense to have "hardcoded" or "shared" opargs as part of macro instructions.

So we could have (with straw-man syntax):

```c
macro(BINARY_OP, oparg) = _BINARY_OP_SPECIALIZE(oparg) + _BINARY_OP_ACTION(oparg);
macro(BINARY_OP_ADD_INT, oparg) = _CHECK_INT(1) + _CHECK_INT(2) + _BINARY_OP_ADD_INT(oparg);
```

I sort of like it, and maybe the scheme could be extended to share other locals or caches too.

Member:
@gvanrossum, thoughts on something like the above?

Member:
I forgot to also mention that we encountered many situations where one type is known but the other isn't. So it helps to have some granularity.

Member Author:
These are all very interesting proposals, and I'll hold up this PR until we've got some agreement on what we eventually want to do.

  • Splitting the type-checking guard into atomic pieces: this makes sense from the POV of machine code generation (fewer, simpler templates) but it would make the Tier-2 interpreter slower, due to increased opcode decoding overhead (unless the guards can be eliminated, of course). There are various possible ways to address that though (possibly even super-micro-ops :-), so maybe we should just go for it, if we can agree on a syntax.
  • Having micro-ops with parameters, notably _CHECK_INT(1). I don't know how much effort this would be in the generator, but I expect it'll be easy enough, and if we are indeed going with the smallest possible uops, we should just do it.
  • Being explicit about which uops use oparg. This feels like it might require more effort in the generator, but it also sounds like an interesting way to go.

All in all, I think we should probably have an in-person discussion about this (earliest I'm available is the week of June 5th).
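A parameterized guard along the lines of the second bullet might be spelled like this (hypothetical syntax only; neither `_CHECK_INT` nor hardcoded uop arguments exist in this PR, and the `PEEK(1)`-is-top-of-stack convention is assumed from the ceval macros):

```c
// Hypothetical: one generic guard whose hardcoded argument selects
// the stack entry to check (1 = top of stack, 2 = the item below).
op(_CHECK_INT, (--)) {
    DEOPT_IF(!PyLong_CheckExact(PEEK(oparg)), BINARY_OP);
}

macro(BINARY_OP_ADD_INT, oparg) =
    _CHECK_INT(2) + _CHECK_INT(1) + _BINARY_OP_ADD_INT(oparg);
```

Each `_CHECK_INT` could then be dropped independently once the type of the corresponding operand is known.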

Member:
I'd keep the checks for pairs of values for now.
Even if we want to separate them for optimization, we are likely to want to combine them into superinstructions for interpretation, so we might as well keep them for now.

We can always split them up later.

Member Author:
Okay, I think I'll follow Mark's guidance, and keep the current structure.


op(_BINARY_OP_MULTIPLY_INT, (unused/1, left, right -- res)) {
STAT_INC(BINARY_OP, hit);
prod = _PyLong_Multiply((PyLongObject *)left, (PyLongObject *)right);
res = _PyLong_Multiply((PyLongObject *)left, (PyLongObject *)right);
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
ERROR_IF(prod == NULL, error);
ERROR_IF(res == NULL, error);
}

inst(BINARY_OP_MULTIPLY_FLOAT, (unused/1, left, right -- prod)) {
DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
DEOPT_IF(!PyFloat_CheckExact(right), BINARY_OP);
op(_BINARY_OP_ADD_INT, (unused/1, left, right -- res)) {
STAT_INC(BINARY_OP, hit);
double dprod = ((PyFloatObject *)left)->ob_fval *
((PyFloatObject *)right)->ob_fval;
DECREF_INPUTS_AND_REUSE_FLOAT(left, right, dprod, prod);
res = _PyLong_Add((PyLongObject *)left, (PyLongObject *)right);
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
Comment on lines +308 to +309
Member:
This might benefit in the future from being two additional uops too, since they can be removed for known immortal values (although mechanically that might be difficult right now, given that we don't have a good way of sharing locals across uops).

Member Author:
Yeah, once we're over our concern for dispatch overhead in the Tier-2 interpreter, that makes a lot of sense. But I'm not quite over that (I'd still like to see a Tier-2 interpreter that is not slower than Tier-1, even without all conceivable optimizations). So maybe we should just postpone this (maybe we need a new Ideas issue about the granularity of uops).

Member:
Let's not worry about refcount stuff for now, we can always do it later.
The important thing for now is that the interpreter generator can understand instructions made up of a sequence of micro-ops. With that we can break down instructions as much as we like later on.

ERROR_IF(res == NULL, error);
}

inst(BINARY_OP_SUBTRACT_INT, (unused/1, left, right -- sub)) {
DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
DEOPT_IF(!PyLong_CheckExact(right), BINARY_OP);
op(_BINARY_OP_SUBTRACT_INT, (unused/1, left, right -- res)) {
STAT_INC(BINARY_OP, hit);
sub = _PyLong_Subtract((PyLongObject *)left, (PyLongObject *)right);
res = _PyLong_Subtract((PyLongObject *)left, (PyLongObject *)right);
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
ERROR_IF(sub == NULL, error);
ERROR_IF(res == NULL, error);
}

inst(BINARY_OP_SUBTRACT_FLOAT, (unused/1, left, right -- sub)) {
macro(BINARY_OP_MULTIPLY_INT) =
_BINARY_OP_INT_GUARD + _BINARY_OP_MULTIPLY_INT;
macro(BINARY_OP_ADD_INT) =
_BINARY_OP_INT_GUARD + _BINARY_OP_ADD_INT;
macro(BINARY_OP_SUBTRACT_INT) =
_BINARY_OP_INT_GUARD + _BINARY_OP_SUBTRACT_INT;

op(_BINARY_OP_FLOAT_GUARD, (left, right -- left, right)) {
DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
DEOPT_IF(!PyFloat_CheckExact(right), BINARY_OP);
}

op(_BINARY_OP_MULTIPLY_FLOAT, (unused/1, left, right -- res)) {
STAT_INC(BINARY_OP, hit);
double dsub = ((PyFloatObject *)left)->ob_fval - ((PyFloatObject *)right)->ob_fval;
DECREF_INPUTS_AND_REUSE_FLOAT(left, right, dsub, sub);
double dres =
((PyFloatObject *)left)->ob_fval *
((PyFloatObject *)right)->ob_fval;
DECREF_INPUTS_AND_REUSE_FLOAT(left, right, dres, res);
}

inst(BINARY_OP_ADD_UNICODE, (unused/1, left, right -- res)) {
op(_BINARY_OP_ADD_FLOAT, (unused/1, left, right -- res)) {
STAT_INC(BINARY_OP, hit);
double dres =
((PyFloatObject *)left)->ob_fval +
((PyFloatObject *)right)->ob_fval;
DECREF_INPUTS_AND_REUSE_FLOAT(left, right, dres, res);
}

op(_BINARY_OP_SUBTRACT_FLOAT, (unused/1, left, right -- res)) {
STAT_INC(BINARY_OP, hit);
double dres =
((PyFloatObject *)left)->ob_fval -
((PyFloatObject *)right)->ob_fval;
DECREF_INPUTS_AND_REUSE_FLOAT(left, right, dres, res);
}

macro(BINARY_OP_MULTIPLY_FLOAT) =
_BINARY_OP_FLOAT_GUARD + _BINARY_OP_MULTIPLY_FLOAT;
macro(BINARY_OP_ADD_FLOAT) =
_BINARY_OP_FLOAT_GUARD + _BINARY_OP_ADD_FLOAT;
macro(BINARY_OP_SUBTRACT_FLOAT) =
_BINARY_OP_FLOAT_GUARD + _BINARY_OP_SUBTRACT_FLOAT;

op(_BINARY_OP_UNICODE_GUARD, (left, right -- left, right)) {
DEOPT_IF(!PyUnicode_CheckExact(left), BINARY_OP);
DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
DEOPT_IF(!PyUnicode_CheckExact(right), BINARY_OP);
}

op(_BINARY_OP_ADD_UNICODE, (unused/1, left, right -- res)) {
STAT_INC(BINARY_OP, hit);
res = PyUnicode_Concat(left, right);
_Py_DECREF_SPECIALIZED(left, _PyUnicode_ExactDealloc);
_Py_DECREF_SPECIALIZED(right, _PyUnicode_ExactDealloc);
ERROR_IF(res == NULL, error);
}

macro(BINARY_OP_ADD_UNICODE) =
_BINARY_OP_UNICODE_GUARD + _BINARY_OP_ADD_UNICODE;

// This is a subtle one. It's a super-instruction for
// BINARY_OP_ADD_UNICODE followed by STORE_FAST
// where the store goes into the left argument.
// So the inputs are the same as for all BINARY_OP
// specializations, but there is no output.
// At the end we just skip over the STORE_FAST.
inst(BINARY_OP_INPLACE_ADD_UNICODE, (left, right --)) {
DEOPT_IF(!PyUnicode_CheckExact(left), BINARY_OP);
DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
op(_BINARY_OP_INPLACE_ADD_UNICODE, (left, right --)) {
_Py_CODEUNIT true_next = next_instr[INLINE_CACHE_ENTRIES_BINARY_OP];
assert(true_next.op.code == STORE_FAST ||
true_next.op.code == STORE_FAST__LOAD_FAST);
@@ -372,24 +410,8 @@ dummy_func(
JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP + 1);
}

inst(BINARY_OP_ADD_FLOAT, (unused/1, left, right -- sum)) {
DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
STAT_INC(BINARY_OP, hit);
double dsum = ((PyFloatObject *)left)->ob_fval +
((PyFloatObject *)right)->ob_fval;
DECREF_INPUTS_AND_REUSE_FLOAT(left, right, dsum, sum);
}

inst(BINARY_OP_ADD_INT, (unused/1, left, right -- sum)) {
DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
STAT_INC(BINARY_OP, hit);
sum = _PyLong_Add((PyLongObject *)left, (PyLongObject *)right);
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
ERROR_IF(sum == NULL, error);
}
macro(BINARY_OP_INPLACE_ADD_UNICODE) =
_BINARY_OP_UNICODE_GUARD + _BINARY_OP_INPLACE_ADD_UNICODE;

family(binary_subscr, INLINE_CACHE_ENTRIES_BINARY_SUBSCR) = {
BINARY_SUBSCR,
@@ -3302,7 +3324,7 @@ dummy_func(
top = Py_NewRef(bottom);
}

inst(BINARY_OP, (unused/1, lhs, rhs -- res)) {
op(_BINARY_OP_SPECIALIZE, (unused/1, lhs, rhs -- lhs, rhs)) {
Member:
This should be left alone.
We don't need or want to break up all instructions, just the specialized ones.
In fact, next_instr will not be available to the tier 2 optimizer, so it will have to reject this instruction.

Member Author:
Good point. In tier 2 we have to set ENABLE_SPECIALIZATION to false, since specializing a tier 2 instruction makes no sense -- that's the job of tier 1.

#if ENABLE_SPECIALIZATION
_PyBinaryOpCache *cache = (_PyBinaryOpCache *)next_instr;
if (ADAPTIVE_COUNTER_IS_ZERO(cache->counter)) {
@@ -3313,6 +3335,9 @@
STAT_INC(BINARY_OP, deferred);
DECREMENT_ADAPTIVE_COUNTER(cache->counter);
#endif /* ENABLE_SPECIALIZATION */
}

op(_BINARY_OP_ACTION, (lhs, rhs -- res)) {
assert(0 <= oparg);
assert((unsigned)oparg < Py_ARRAY_LENGTH(binary_ops));
assert(binary_ops[oparg]);
Expand All @@ -3321,6 +3346,8 @@ dummy_func(
ERROR_IF(res == NULL, error);
}

macro(BINARY_OP) = _BINARY_OP_SPECIALIZE + _BINARY_OP_ACTION;

inst(SWAP, (bottom, unused[oparg-2], top --
top, unused[oparg-2], bottom)) {
assert(oparg >= 2);