Fix!: mark vars referenced in metadata macros as metadata #4936

Open
wants to merge 12 commits into
base: main

georgesittas
Contributor

@georgesittas georgesittas commented Jul 8, 2025

Macro variable references are always treated as non-metadata today. This means that if, for example, a variable is referenced within a metadata-only macro, changing its value will result in a breaking change, which is inconsistent.

This PR alters this behavior, similar to the macro metadata-only status propagation:

  • Variables referenced within metadata-only macro definitions can be treated as metadata-only
  • Variables referenced in metadata-only macro calls can be treated as metadata-only
  • Variables referenced within metadata expressions (e.g. the audits property) can be treated as metadata-only

I intentionally say "can" instead of "will" above, because we need to factor in all references of a variable to decide whether it's a metadata-only reference. The rules implemented here are similar to those we apply for macros: a non-metadata occurrence overrules all metadata occurrences.
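As a rough illustration of the "a non-metadata occurrence overrules all metadata occurrences" rule, here is a minimal sketch. The function name `is_metadata_only` and the list-of-booleans input are hypothetical and not part of SQLMesh; treating a variable with no recorded references as non-metadata is also an assumption.

```python
def is_metadata_only(reference_statuses):
    """Return True only if every recorded reference is a metadata reference.

    `reference_statuses` is a hypothetical list with one boolean per
    occurrence of a variable (True = the occurrence is metadata-only).
    """
    statuses = list(reference_statuses)
    # Assumption: a variable with no recorded references is not metadata-only.
    return bool(statuses) and all(statuses)


# Every occurrence is metadata-only, so the variable is metadata-only.
only_metadata = is_metadata_only([True, True])

# A single non-metadata occurrence overrules the metadata ones.
mixed = is_metadata_only([True, False, True])
```

Here `only_metadata` is `True` and `mixed` is `False`.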

Additionally, this PR introduces trimming for blueprint variables. Certain blueprint variables, e.g. those used only in model names, aren't required after loading, while others are, because they may be referenced in the model's statements or in "runtime-rendered" properties (e.g., merge_filter).

The former category can be omitted from the model's python_env, thus reducing its snapshot's size, as long as a variable is only referenced in the meta block and in fields that are static after loading the model.
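The trimming idea can be sketched roughly as follows. This is not the code from this PR: `trim_blueprint_variables` and `referenced_at_runtime` are hypothetical names, and the sketch assumes the set of variables referenced in statements or runtime-rendered properties has already been collected.

```python
def trim_blueprint_variables(blueprint_vars, referenced_at_runtime):
    """Keep only the blueprint variables still needed after the model is loaded.

    blueprint_vars: dict of blueprint variable name -> value.
    referenced_at_runtime: hypothetical set of variable names referenced in the
    model's statements or in runtime-rendered properties such as merge_filter.
    """
    return {
        name: value
        for name, value in blueprint_vars.items()
        if name in referenced_at_runtime
    }


# Hypothetical blueprint: `customer` is only used in the model name,
# while `threshold` is referenced in the model's query.
blueprint_vars = {"customer": "acme", "threshold": 10}
kept = trim_blueprint_variables(blueprint_vars, {"threshold"})
```

In this sketch only `threshold` survives, so `customer` would not need to be stored in the model's python_env.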

Both of these changes are quite breaking, so I'm planning to implement a migration script to at least warn about this. I'm also planning to increase the testing coverage. (EDIT: done)

@georgesittas georgesittas marked this pull request as draft July 8, 2025 15:32
@georgesittas georgesittas force-pushed the jo/metadata_vars branch 5 times, most recently from a7d62e1 to 06c31f9 Compare July 18, 2025 12:26
@georgesittas georgesittas marked this pull request as ready for review July 18, 2025 12:28
@georgesittas georgesittas requested a review from a team July 18, 2025 13:45
@georgesittas georgesittas force-pushed the jo/metadata_vars branch 2 times, most recently from 28157fa to a9e829a Compare July 28, 2025 09:14
@georgesittas georgesittas requested a review from a team July 28, 2025 09:17
@georgesittas georgesittas force-pushed the jo/metadata_vars branch 2 times, most recently from 3b1e614 to aa7ad2c Compare July 28, 2025 10:39
@georgesittas georgesittas force-pushed the jo/metadata_vars branch 2 times, most recently from 2ba3b38 to 16d6260 Compare July 29, 2025 17:14
@georgesittas
Contributor Author

@izeigerman thank you for the review. I've addressed all comments and plan to merge once CI is green.

@georgesittas georgesittas force-pushed the jo/metadata_vars branch 2 times, most recently from ee50871 to 142dbc2 Compare July 30, 2025 13:02
@georgesittas
Contributor Author

@izeigerman since my last comment in this PR, I discovered a bug and refactored my approach to patch it. Can you please take another look when you find some bandwidth?

The bug

Given the following macros:

from sqlmesh import macro

@macro()
def macro1(evaluator):
    return evaluator.var("foo")

@macro(metadata_only=True)
def macro2(evaluator, var_value):
    return 1

and this model:

MODEL (
  name test_model,
  kind FULL,
);

SELECT
  @macro1() AS col,
  @macro2(@foo) AS col2

I verified that @foo was being marked as metadata-only, despite it being referenced within macro1, which is not metadata-only:

>>> from sqlmesh import Context
>>> ctx = Context()
>>> ctx.models['"test_model"'].python_env
{'macro1': Executable<payload: def macro1(evaluator):
    return evaluator.var('foo'), name: macro1, path: macros/test.py>, 'macro2': Executable<payload: def macro2(evaluator, var_value):
    return 1, name: macro2, path: macros/test.py, is_metadata: True>, '__sqlmesh__vars__metadata__': Executable<payload: {'foo': "'id'"}, kind: value, is_metadata: True>}

Root cause

The problematic logic was located here. The value of used_variables was {'foo': False} for the above project, while the value of macro_funcs_by_used_var was defaultdict(<class 'set'>, {'foo': {'macro2'}}).

This meant that used_variables.get(used_var) would yield False, so we fell through to the all(...) condition, which only looked at macro2. That macro is indeed metadata-only, but that alone is not enough to mark foo as metadata-only, because the reference inside macro1 was never taken into account.
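To make the failure mode concrete, here is a simplified reconstruction of the faulty condition using the values quoted above. It is a sketch, not the actual SQLMesh code: `macro_is_metadata` and `buggy_is_metadata` are hypothetical names, and the real condition may differ in detail.

```python
from collections import defaultdict

# Values reported above for the example project.
used_variables = {"foo": False}
macro_funcs_by_used_var = defaultdict(set, {"foo": {"macro2"}})

# Hypothetical lookup of each macro's metadata status.
macro_is_metadata = {"macro1": False, "macro2": True}


def buggy_is_metadata(used_var):
    # Simplified reconstruction: the first operand is falsy for `foo`, so the
    # result depends only on the macros recorded for the *direct* reference.
    # The indirect reference inside macro1 is never consulted.
    return bool(used_variables.get(used_var)) or all(
        macro_is_metadata[name] for name in macro_funcs_by_used_var[used_var]
    )


buggy_result = buggy_is_metadata("foo")
```

`buggy_result` is `True` here, even though macro1 (not metadata-only) also reads `foo`.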

Fix

I addressed the above issue in this commit. Below is my analysis that led to this refactor:

Terminology

  • Metadata status: the value of is_metadata for an Executable instance
  • Metadata-only: when a variable or macro Executable has is_metadata set to True
  • Metadata expression: expressions such as audits (...), or virtual statement blocks that only affect the metadata hash
  • Direct variable reference: a variable that is referenced directly in SQL code
  • Indirect variable reference: a variable that is referenced indirectly in SQL code, e.g., an invoked macro references it internally

Observations

  1. We can always infer the metadata status of an indirect variable reference; it inherits it from the macro that references it.

  2. We can't determine the metadata status of direct variable references that are under some macro function call until later. The reason is that the macro definition walking happens after we analyze the model's ASTs, so only once that is done do we have the full picture of the metadata status of said macro call, which in turn determines the metadata status of the variable itself. For example, a variable reference that is used as an argument of a metadata-only macro is itself metadata-only. The only exception to this rule is when the variable appears in a metadata expression, since everything under it is treated as metadata-only.

  3. If there is at least one reference of a variable outside of a metadata expression and not under any macro function call, then that variable is guaranteed to be non-metadata. Otherwise, every reference of said variable appears either in a metadata expression or under a macro function call. If there is at least one reference under a macro function call, we must defer the metadata status inference for that variable until after we walk the macro objects and know their metadata statuses. The metadata status of the variable is then the conjunction of all these statuses.
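The three observations can be sketched as a small two-phase procedure. This is an illustrative model under stated assumptions, not the refactored SQLMesh implementation: the `(variable, kind, macro)` tuples, the kind labels, and `infer_variable_metadata` are all hypothetical.

```python
from collections import defaultdict


def infer_variable_metadata(references, macro_is_metadata):
    """Combine per-reference statuses into one status per variable.

    references: iterable of (variable, kind, macro) tuples, where kind is
      "plain"         - direct reference outside metadata expressions and macro calls
      "metadata_expr" - reference inside a metadata expression (e.g. audits)
      "macro"         - reference under a macro call, or made inside a macro body
    macro_is_metadata: macro name -> bool, assumed to be known only after the
      macro definitions have been walked (this is the deferred step).
    """
    statuses = defaultdict(list)
    for variable, kind, macro in references:
        if kind == "plain":
            statuses[variable].append(False)  # guaranteed non-metadata
        elif kind == "metadata_expr":
            statuses[variable].append(True)   # everything under it is metadata
        else:
            statuses[variable].append(macro_is_metadata[macro])  # inherited
    # Conjunction: a single non-metadata reference wins.
    return {variable: all(values) for variable, values in statuses.items()}


# The project from the bug report: `foo` is read inside macro1 (indirect)
# and passed as an argument to macro2 (direct, under a macro call).
references = [("foo", "macro", "macro1"), ("foo", "macro", "macro2")]
result = infer_variable_metadata(references, {"macro1": False, "macro2": True})
```

With these inputs `result` is `{"foo": False}`, i.e. `foo` is not metadata-only, which is the outcome the fix aims for.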

@georgesittas georgesittas requested a review from izeigerman July 30, 2025 13:22
@georgesittas georgesittas force-pushed the jo/metadata_vars branch 3 times, most recently from 06b78cd to cca9524 Compare July 31, 2025 15:49
@georgesittas
Contributor Author

... and yet another bug squashed: I realized that only the top-level macro func call matters when it comes to marking variable references under it as metadata or not. Added a test that demonstrates this.
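A minimal sketch of the "only the top-level macro func call matters" rule, assuming a nested call such as `@outer(@inner(@foo))` is represented as a list of macro names ordered from outermost to innermost. The representation, the macro names, and `status_from_call_chain` are hypothetical and not taken from the test added in this PR.

```python
def status_from_call_chain(call_chain, macro_is_metadata):
    """Metadata status of a variable reference nested under macro calls.

    call_chain: macro names from the outermost call to the innermost one.
    Only the outermost (top-level) call decides the status.
    """
    top_level_macro = call_chain[0]
    return macro_is_metadata[top_level_macro]


macro_is_metadata = {"outer": False, "inner": True}

# @outer(@inner(@foo)): inner is metadata-only, but the top-level call is not.
nested_status = status_from_call_chain(["outer", "inner"], macro_is_metadata)

# @inner(@outer(@foo)): the top-level call is metadata-only.
flipped_status = status_from_call_chain(["inner", "outer"], macro_is_metadata)
```

In this model `nested_status` is `False` and `flipped_status` is `True`.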
