
Rule-based matching

Find phrases and tokens, and match entities

Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and
components not only let you find the words and phrases you’re looking for – they also give you
access to the tokens within the document and their relationships. This means you can easily access
and analyze the surrounding tokens, merge spans into single tokens or add entries to the named
entities in doc.ents.

Should I use rules or train a model?

For complex tasks, it’s usually better to train a statistical entity recognition model. However,
statistical models require training data, so for many situations, rule-based approaches are
more practical. This is especially true at the start of a project: you can use a rule-based
approach as part of a data collection process, to help you “bootstrap” a statistical model.

Training a model is useful if you have some examples and you want your system to be able to
generalize based on those examples. It works especially well if there are clues in the local
context. For instance, if you’re trying to detect person or company names, your application
may benefit from a statistical named entity recognition model.

Rule-based systems are a good choice if there’s a more or less finite number of examples that
you want to find in the data, or if there’s a very clear, structured pattern you can express with
token rules or regular expressions. For instance, country names, IP addresses or URLs are
things you might be able to handle well with a purely rule-based approach.

You can also combine both approaches and improve a statistical model with rules to handle
very specific cases and boost accuracy. For details, see the section on rule-based entity
recognition.

When should I use the token matcher vs. the phrase matcher?

The PhraseMatcher is useful if you already have a large terminology list or gazetteer
consisting of single or multi-token phrases that you want to find exact instances of in your
data. As of spaCy v2.1.0, you can also match on the LOWER attribute for fast and case-
insensitive matching.

The Matcher isn’t as blazing fast as the PhraseMatcher , since it compares across individual
token attributes. However, it allows you to write very abstract representations of the tokens
you’re looking for, using lexical attributes, linguistic features predicted by the model,
operators, set membership and rich comparison. For example, you can find a noun, followed
by a verb with the lemma “love” or “like”, followed by an optional determiner and another
token that’s at least 10 characters long.
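As a rough sketch (not from the original docs), that description could be written as a token pattern like this:

pattern = [
    {"POS": "NOUN"},
    {"POS": "VERB", "LEMMA": {"IN": ["love", "like"]}},
    {"POS": "DET", "OP": "?"},
    {"LENGTH": {">=": 10}},
]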

Token-based matching
spaCy features a rule-matching engine, the Matcher , that operates over tokens, similar to
regular expressions. The rules can refer to token annotations (e.g. the token text or tag_ , and
flags like IS_PUNCT ). The rule matcher also lets you pass in a custom callback to act on matches –
for example, to merge entities and apply custom labels. You can also associate patterns with entity
IDs, to allow some basic entity linking or disambiguation. To match large terminology lists, you can
use the PhraseMatcher , which accepts Doc objects as match patterns.

Adding patterns
Let’s say we want to enable spaCy to find a combination of three tokens:

1. A token whose lowercase form matches “hello”, e.g. “Hello” or “HELLO”.

2. A token whose is_punct flag is set to True , i.e. any punctuation.

3. A token whose lowercase form matches “world”, e.g. “World” or “WORLD”.

[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]


Important note

When writing patterns, keep in mind that each dictionary represents one token. If spaCy’s
tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce
any results. When developing complex patterns, make sure to check examples against spaCy’s
tokenization:

doc = nlp("A complex-example,!")


print([token.text for token in doc])

First, we initialize the Matcher with a vocab. The matcher must always share the same vocab with
the documents it will operate on. We can now call matcher.add() with an ID and a list of
patterns.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

The matcher returns a list of (match_id, start, end) tuples – in this case,
[('15578876784678163569', 0, 3)] , which maps to the span doc[0:3] of our original
document. The match_id is the hash value of the string ID “HelloWorld”. To get the string value,
you can look up the ID in the StringStore .

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
    span = doc[start:end]  # The matched span

Optionally, we could also choose to add more than one pattern, for example to also match
sequences without punctuation between “hello” and “world”:

patterns = [
[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
[{"LOWER": "hello"}, {"LOWER": "world"}]
]
matcher.add("HelloWorld", patterns)

By default, the matcher will only return the matches and not do anything else, like merge entities
or assign labels. This is all up to you and can be defined individually for each pattern, by passing in
a callback function as the on_match argument on add() . This is useful, because it lets you write
entirely custom and pattern-specific logic. For example, you might want to merge some patterns
into one token, while adding entity labels for other pattern types. You shouldn’t have to create
different matchers for each of those processes.

Available token attributes


The available token pattern keys correspond to a number of Token attributes . The supported
attributes for rule-based matching are:
ATTRIBUTE  DESCRIPTION

ORTH  The exact verbatim text of a token. TYPE: str

TEXT  The exact verbatim text of a token. TYPE: str

LOWER  The lowercase form of the token text. TYPE: str

LENGTH  The length of the token text. TYPE: int

IS_ALPHA, IS_ASCII, IS_DIGIT  Token text consists of alphabetic characters, ASCII characters, digits. TYPE: bool

IS_LOWER, IS_UPPER, IS_TITLE  Token text is in lowercase, uppercase, titlecase. TYPE: bool

IS_PUNCT, IS_SPACE, IS_STOP  Token is punctuation, whitespace, stop word. TYPE: bool

IS_SENT_START  Token is start of sentence. TYPE: bool

LIKE_NUM, LIKE_URL, LIKE_EMAIL  Token text resembles a number, URL, email. TYPE: bool

SPACY  Token has a trailing space. TYPE: bool

POS, TAG, MORPH, DEP, LEMMA, SHAPE  The token’s simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the Annotation Specifications. TYPE: str

ENT_TYPE  The token’s entity label. TYPE: str

Does it matter if the attribute names are uppercase or lowercase?

No, it shouldn’t. spaCy will normalize the names internally and {"LOWER": "text"} and
{"lower": "text"} will both produce the same result. Using the uppercase version is
mostly a convention to make it clear that the attributes are “special” and don’t exactly map to
the token attributes like Token.lower and Token.lower_ .

Why aren’t all token attributes supported?

spaCy can’t provide access to all of the attributes because the Matcher loops over the Cython
data, not the Python objects. Inside the matcher, we’re dealing with a TokenC struct – we
don’t have an instance of Token . This means that all of the attributes that refer to
computed properties can’t be accessed.

The uppercase attribute names like LOWER or IS_PUNCT refer to symbols from the
spacy.attrs enum table. They’re passed into a function that essentially is a big
case/switch statement, to figure out which struct field to return. The same attribute identifiers
are used in Doc.to_array , and a few other places in the code where you need to
describe fields like this.
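As a minimal sketch of that shared identifier scheme (the example text is my own), the same symbols can be passed to Doc.to_array :

import spacy
from spacy.attrs import LOWER, IS_PUNCT

nlp = spacy.blank("en")
doc = nlp("Hello, world!")
# One row per token: the hash of the lowercase form and the IS_PUNCT flag
print(doc.to_array([LOWER, IS_PUNCT]))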

Tip: Try the interactive matcher explorer


The Matcher Explorer lets you test the rule-based Matcher by creating token patterns
interactively and running them over your text. Each token can set multiple attributes like text
value, part-of-speech tag or boolean flags. The token-based view lets you explore how spaCy
processes your text – and why your pattern matches, or why it doesn’t.

Extended pattern syntax and attributes


Instead of mapping to a single value, token patterns can also map to a dictionary of properties.
For example, to specify that the value of a lemma should be part of a list of values, or to set a
minimum character length. The following rich comparison attributes are available:

EXAMPLE

# Matches "love cats" or "likes flowers"
pattern1 = [{"LEMMA": {"IN": ["like", "love"]}},
            {"POS": "NOUN"}]

# Matches tokens of length >= 10
pattern2 = [{"LENGTH": {">=": 10}}]

# Match based on morph attributes
pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}]
# "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets
# "Number=Plur|Gender=Neut" will not match
# "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset

ATTRIBUTE DESCRIPTION

IN Attribute value is member of a list.

TYPE: Any

NOT_IN Attribute value is not member of a list.

TYPE: Any

IS_SUBSET Attribute value (for MORPH or custom list attributes) is a subset of a list.

TYPE: Any

IS_SUPERSET Attribute value (for MORPH or custom list attributes) is a superset of a list.

TYPE: Any

INTERSECTS Attribute value (for MORPH or custom list attributes) has a non-empty
intersection with a list.

TYPE: Any

==, >=, <=, >, <  Attribute value is equal, greater or equal, smaller or equal, greater or smaller.

TYPE: Union[int, float]

Regular expressions
In some cases, only matching tokens and token attributes isn’t enough – for example, you might
want to match different spellings of a word, without having to add a new pattern for each spelling.

pattern = [{"TEXT": {"REGEX": "^[Uu](\.?|nited)$"}},


{"TEXT": {"REGEX": "^[Ss](\.?|tates)$"}},
{"LOWER": "president"}]

The REGEX operator allows defining rules for any attribute string value, including custom
attributes. It always needs to be applied to an attribute like TEXT , LOWER or TAG :

# Match different spellings of token texts
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]

# Match tokens with fine-grained POS tags starting with 'V'
pattern = [{"TAG": {"REGEX": "^V"}}]

# Match custom attribute values with regular expressions
pattern = [{"_": {"country": {"REGEX": "^[Uu](nited|\.?) ?[Ss](tates|\.?)$"}}}]

Important note

When using the REGEX operator, keep in mind that it operates on single tokens, not the whole
text. Each expression you provide will be matched on a token. If you need to match on the
whole text instead, see the details on regex matching on the whole text.

Matching regular expressions on the full text

If your expressions apply to multiple tokens, a simple solution is to match on the doc.text with
re.finditer and use the Doc.char_span method to create a Span from the character
indices of the match. If the matched characters don’t map to one or more valid tokens,
Doc.char_span returns None .

WHAT’S A VALID TOKEN SEQUENCE?

In the example, the expression will also match "US" in "USA" . However, "USA" is a single token and
Span objects are sequences of tokens. So "US" cannot be its own span, because it does not end on
a token boundary.

import spacy
import re

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

How can I expand the match to a valid token sequence?

In some cases, you might want to expand the match to the closest token boundaries, so you
can create a Span for "USA" , even though only the substring "US" is matched. You can
calculate this using the character offsets of the tokens in the document, available as
Token.idx . This lets you create a list of valid token start and end boundaries and leaves
you with a rather basic algorithmic problem: Given a number, find the next lowest (start token)
or the next highest (end token) number that’s part of a given list of numbers. This will be the
closest valid token boundary.

There are many ways to do this and the most straightforward one is to create a dict keyed by
characters in the Doc , mapped to the token they’re part of. It’s easy to write and less error-
prone, and gives you a constant lookup time: you only ever need to create the dict once per
Doc .

chars_to_tokens = {}
for token in doc:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i

You can then look up a character at a given position and get the index of the corresponding
token that the character is part of. Your span would then be doc[token_start:token_end] .
If a character isn’t in the dict, it means it’s the whitespace that tokens are split on. That hopefully
shouldn’t happen, though, because it’d mean your regex is producing matches with leading or
trailing whitespace.

span = doc.char_span(start, end)
if span is not None:
    print("Found match:", span.text)
else:
    start_token = chars_to_tokens.get(start)
    end_token = chars_to_tokens.get(end)
    if start_token is not None and end_token is not None:
        span = doc[start_token:end_token + 1]
        print("Found closest match:", span.text)

Operators and quantifiers


The matcher also lets you use quantifiers, specified as the 'OP' key. Quantifiers let you define
sequences of tokens to be matched, e.g. one or more punctuation marks, or specify optional
tokens. Note that there are no nested or scoped quantifiers – instead, you can build those
behaviors with on_match callbacks.

OP DESCRIPTION

! Negate the pattern, by requiring it to match exactly 0 times.

? Make the pattern optional, by allowing it to match 0 or 1 times.

+ Require the pattern to match 1 or more times.

* Allow the pattern to match zero or more times.

EXAMPLE

pattern = [{"LOWER": "hello"},


{"IS_PUNCT": True, "OP": "?"}]

Note on operator behaviour


In versions before v2.1.0, the semantics of the + and * operators behaved inconsistently. They
were usually interpreted “greedily”, i.e. longer matches were returned where possible. However, if
you specified two + or * patterns in a row and their matches overlapped, the first operator would
behave non-greedily. This quirk in the semantics was corrected in spaCy v2.1.0.

Using wildcard token patterns


While the token attributes offer many options to write highly specific patterns, you can also use an
empty dictionary, {} as a wildcard representing any token. This is useful if you know the context
of what you’re trying to match, but very little about the specific token and its characters. For
example, let’s say you’re trying to extract people’s user names from your data. All you know is that
they are listed as “User name: {username}“. The name itself may contain any character, but no
whitespace – so you’ll know it will be handled as one token.

[{"ORTH": "User"}, {"ORTH": "name"}, {"ORTH": ":"}, {}]

Validating and debugging patterns


The Matcher can validate patterns against a JSON schema with the option validate=True . This
is useful for debugging patterns during development, in particular for catching unsupported
attributes.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)
# Add match ID "HelloWorld" with unsupported attribute CASEINSENSITIVE
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"CASEINSENSITIVE": "world"}]
matcher.add("HelloWorld", [pattern])
# Raises an error:
# MatchPatternError: Invalid token patterns for matcher rule 'HelloWorld'
# Pattern 0:
# - [pattern -> 2 -> CASEINSENSITIVE] extra fields not permitted

Adding on_match rules


To move on to a more realistic example, let’s say you’re working with a large corpus of blog articles,
and you want to match all mentions of “Google I/O” (which spaCy tokenizes as
['Google', 'I', '/', 'O'] ). To be safe, you only match on the uppercase versions, avoiding
matches with phrases such as “Google i/o”.

from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = English()
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)

pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", [pattern], on_match=add_event_ent)
doc = nlp("This is a text about Google I/O")
matches = matcher(doc)

A very similar logic has been implemented in the built-in EntityRuler by the way. It also
takes care of handling overlapping matches, which you would otherwise have to take care of
yourself.

TIP: VISUALIZING MATCHES

When working with entities, you can use displaCy to quickly generate a NER visualization from
your updated Doc , which can be exported as an HTML file:

from spacy import displacy


html = displacy.render(doc, style="ent", page=True,
options={"ents": ["EVENT"]})

For more info and examples, see the usage guide on visualizing spaCy.

We can now call the matcher on our documents. The patterns will be matched in the order they
occur in the text. The matcher will then iterate over the matches, look up the callback for the match
ID that was matched, and invoke it.

doc = nlp(YOUR_TEXT_HERE)
matcher(doc)

When the callback is invoked, it is passed four arguments: the matcher itself, the document, the
position of the current match, and the total list of matches. This allows you to write callbacks that
consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in
whatever way you prefer.
ARGUMENT DESCRIPTION

matcher The matcher instance.

TYPE: Matcher

doc The document the matcher was used on.

TYPE: Doc

i Index of the current match ( matches[i] ).

TYPE: int

matches A list of (match_id, start, end) tuples, describing the matches. A match tuple
describes a span doc[start:end] .

TYPE: List[Tuple[int, int, int]]

Creating spans from matches


Creating Span objects from the returned matches is a very common use case. spaCy makes
this easy by giving you access to the start and end token of each match, which you can use to
construct a new span with an optional label. As of spaCy v3.0, you can also set as_spans=True
when calling the matcher on a Doc , which will return a list of Span objects using the
match_id as the span label.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("PERSON", [[{"lower": "barack"}, {"lower": "obama"}]])
doc = nlp("Barack Obama was the 44th president of the United States")

# 1. Return (match_id, start, end) tuples
matches = matcher(doc)
for match_id, start, end in matches:
    # Create the matched span and assign the match_id as a label
    span = Span(doc, start, end, label=match_id)
    print(span.text, span.label_)

# 2. Return Span objects directly
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)

Using custom pipeline components


Let’s say your data also contains some annoying pre-processing artifacts, like leftover HTML line
breaks (e.g. <br> or <BR/> ). To make your text easier to analyze, you want to merge those into
one token and flag them, to make sure you can ignore them later. Ideally, this should all be done
automatically as you process the text. You can achieve this by adding a custom pipeline component
that’s called on each Doc object, merges the leftover HTML spans and sets an attribute bad_html
on the token.

import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token

# We're using a component factory because the component needs to be
# initialized with the shared vocab via the nlp object
@Language.factory("html_merger")
def create_bad_html_merger(nlp, name):
    return BadHTMLMerger(nlp.vocab)

class BadHTMLMerger:
    def __init__(self, vocab):
        patterns = [
            [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
            [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
        ]
        # Register a new token extension to flag bad HTML
        Token.set_extension("bad_html", default=False)
        self.matcher = Matcher(vocab)
        self.matcher.add("BAD_HTML", patterns)

    def __call__(self, doc):
        # This method is invoked when the component is called on a Doc
        matches = self.matcher(doc)
        spans = []  # Collect the matched spans here
        for match_id, start, end in matches:
            spans.append(doc[start:end])
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
                for token in span:
                    token._.bad_html = True  # Mark token as bad HTML
        return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("html_merger", last=True)  # Add component to the pipeline
doc = nlp("Hello<br>world! <br/> This is a test.")
for token in doc:
    print(token.text, token._.bad_html)

Instead of hard-coding the patterns into the component, you could also make it take a path to a
JSON file containing the patterns. This lets you reuse the component with different patterns,
depending on your application. When adding the component to the pipeline with
nlp.add_pipe , you can pass in the argument via the config :

@Language.factory("html_merger", default_config={"path": None})
def create_bad_html_merger(nlp, name, path):
    return BadHTMLMerger(nlp, path=path)

nlp.add_pipe("html_merger", config={"path": "/path/to/patterns.json"})

Processing pipelines
For more details and examples of how to create custom pipeline components and extension
attributes, see the usage guide.

Example: Using linguistic annotations


Let’s say you’re analyzing user comments and you want to find out what people are saying about
Facebook. You want to start off by finding adjectives following “Facebook is” or “Facebook was”.
This is obviously a very rudimentary solution, but it’ll be fast, and a great way to get an idea for
what’s in your data. Your pattern could look like this:

[{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"

This translates to a token whose lowercase form matches “facebook” (like Facebook, facebook or
FACEBOOK), followed by a token with the lemma “be” (for example, is, was, or ‘s), followed by an
optional adverb, followed by an adjective. Using the linguistic annotations here is especially useful,
because you can tell spaCy to match “Facebook’s annoying”, but not “Facebook’s annoying ads”.
The optional adverb makes sure you won’t miss adjectives with intensifiers, like “pretty awful” or
“very nice”.

To get a quick overview of the results, you could collect all sentences containing a match and
render them with the displaCy visualizer. In the callback function, you’ll have access to the start
and end of each match, as well as the parent Doc . This lets you determine the sentence
containing the match, doc[start:end].sent , and calculate the start and end of the matched
span within the sentence. Using displaCy in “manual” mode lets you pass in a list of dictionaries
containing the text and entities to render.

import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matched_sents = []  # Collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing matched span
    # Append mock entity for match in displaCy style to matched_sents
    # get the match span by offsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"},
           {"POS": "ADJ"}]
matcher.add("FacebookIs", [pattern], on_match=collect_sents)  # add pattern
doc = nlp("I'd say that Facebook is evil. – Facebook is pretty cool, right?")
matches = matcher(doc)

# Serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
# (if you're not running the code within a Jupyter environment, you can
# use displacy.serve instead)
displacy.render(matched_sents, style="ent", manual=True)

Example: Phone numbers


Phone numbers can have many different formats and matching them is often tricky. During
tokenization, spaCy will leave sequences of numbers intact and only split on whitespace and
punctuation. This means that your match pattern will have to look out for number sequences of a
certain length, surrounded by specific punctuation – depending on the national conventions.

The IS_DIGIT flag is not very helpful here, because it doesn’t tell us anything about the length.
However, you can use the SHAPE flag, with each d representing a digit (up to 4 digits /
characters):
[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
{"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]

This will match phone numbers of the format (123) 4567 8901 or (123) 4567-8901. To also match
formats like (123) 456 789, you can add a second pattern using 'ddd' in place of 'dddd' . By
hard-coding some values, you can match only certain, country-specific numbers. For example,
here’s a pattern to match the most common formats of international German numbers:

[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]

Depending on the formats your application needs to match, creating an extensive set of rules like
this is often better than training a model. It’ll produce more predictable results, is much easier to
modify and extend, and doesn’t require any training data – only a set of test cases.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
           {"ORTH": "-", "OP": "?"}, {"SHAPE": "ddd"}]
matcher.add("PHONE_NUMBER", [pattern])

doc = nlp("Call me at (123) 456 789 or (123) 456 789!")
print([t.text for t in doc])
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Example: Hashtags and emoji on social media
Social media posts, especially tweets, can be difficult to work with. They’re very short and often
contain various emoji and hashtags. By only looking at the plain text, you’ll lose a lot of valuable
semantic information.

Let’s say you’ve extracted a large sample of social media posts on a specific topic, for example posts
mentioning a brand name or product. As the first step of your data exploration, you want to filter
out posts containing certain emoji and use them to assign a general sentiment score, based on
whether the expressed emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge
and label hashtags like #MondayMotivation , to be able to ignore or analyze them later.

NOTE ON SENTIMENT ANALYSIS

Ultimately, sentiment analysis is not always that easy. In addition to the emoji, you’ll also want to take
specific words into account and check the subtree for intensifiers like “very”, to increase the
sentiment score. At some point, you might also want to train a sentiment model. However, the
approach described in this example is very useful for bootstrapping rules to collect training data. It’s
also an incredibly fast way to gather first insights into your data – with about 1 million tweets, you’d be
looking at a processing time of under 1 minute.

By default, spaCy’s tokenizer will split emoji into separate tokens. This means that you can create a
pattern for one or more emoji tokens. Valid hashtags usually consist of a # , plus a sequence of
ASCII characters with no whitespace, making them easy to match as well.

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # We only want the tokenizer, so no need to load a pipeline
matcher = Matcher(nlp.vocab)

pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"]  # Positive emoji
neg_emoji = ["😞", "😠", "😩", "😢", "😭", "😒"]  # Negative emoji

# Add patterns to match one or more emoji tokens
pos_patterns = [[{"ORTH": emoji}] for emoji in pos_emoji]
neg_patterns = [[{"ORTH": emoji}] for emoji in neg_emoji]

# Function to label the sentiment
def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == "HAPPY":  # Don't forget to get string!
        doc.sentiment += 0.1  # Add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == "SAD":
        doc.sentiment -= 0.1  # Subtract 0.1 for negative sentiment

matcher.add("HAPPY", pos_patterns, on_match=label_sentiment)  # Add positive pattern
matcher.add("SAD", neg_patterns, on_match=label_sentiment)  # Add negative pattern

# Add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

doc = nlp("Hello world 😀 #MondayMotivation")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # Look up string ID
    span = doc[start:end]
    print(string_id, span.text)

Because the on_match callback receives the ID of each match, you can use the same function to
handle the sentiment assignment for both the positive and negative pattern. To keep it simple, we’ll
either add or subtract 0.1 points – this way, the score will also reflect combinations of emoji, even
positive and negative ones.

With a library like Emojipedia , we can also retrieve a short description for each emoji – for
example, 😍’s official title is “Smiling Face With Heart-Eyes”. Assigning it to a custom attribute on
the emoji span will make it available as span._.emoji_desc .

from emojipedia import Emojipedia  # Installation: pip install emojipedia
from spacy.tokens import Span  # Get the global Span object

Span.set_extension("emoji_desc", default=None)  # Register the custom attribute

def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == "HAPPY":  # Don't forget to get string!
        doc.sentiment += 0.1  # Add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == "SAD":
        doc.sentiment -= 0.1  # Subtract 0.1 for negative sentiment
    span = doc[start:end]
    emoji = Emojipedia.search(span[0].text)  # Get data for emoji
    span._.emoji_desc = emoji.title  # Assign emoji description

To label the hashtags, we can use a custom attribute set on the respective token:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

# Register token extension
Token.set_extension("is_hashtag", default=False)

doc = nlp("Hello world 😀 #MondayMotivation")
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    if doc.vocab.strings[match_id] == "HASHTAG":
        hashtags.append(doc[start:end])
with doc.retokenize() as retokenizer:
    for span in hashtags:
        retokenizer.merge(span)
        for token in span:
            token._.is_hashtag = True

for token in doc:
    print(token.text, token._.is_hashtag)

Efficient phrase matching


If you need to match large terminology lists, you can also use the PhraseMatcher and create
Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns
can contain single or multiple tokens.

Adding phrase patterns


import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to
worry about specific tokenization – for example, you can simply pass in
nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact
tokenization of the term.

Important note on creating patterns

To create the patterns, each phrase has to be processed with the nlp object. If you have a
trained pipeline loaded, doing this in a loop or list comprehension can easily become inefficient
and slow. If you only need the tokenization and lexical attributes, you can run
nlp.make_doc instead, which will only run the tokenizer. For an additional speed boost,
you can also use the nlp.tokenizer.pipe method, which will process the texts as a
stream.
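For example, the pattern creation from the example above could be sketched as:

# Process the pattern texts as a stream, running only the tokenizer
patterns = list(nlp.tokenizer.pipe(terms))
matcher.add("TerminologyList", patterns)
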
Matching on other token attributes
By default, the PhraseMatcher will match on the verbatim token text, e.g. Token.text . By
setting the attr argument on initialization, you can change which token attribute the matcher
should use when comparing the phrase pattern to the matched Doc . For example, using the
attribute LOWER lets you match on Token.lower and create case-insensitive match patterns:

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
matcher.add("Names", patterns)

doc = nlp("angela merkel and us president barack Obama")
for match_id, start, end in matcher(doc):
    print("Matched based on lowercase token text:", doc[start:end])

Important note on creating patterns

The examples here use nlp.make_doc to create Doc object patterns as efficiently as
possible and without running any of the other pipeline components. If the token attribute you
want to match on is set by a pipeline component, make sure that the pipeline component
runs when you create the pattern. For example, to match on POS or LEMMA , the pattern Doc
objects need to have part-of-speech tags set by the tagger or morphologizer . You can
either call the nlp object on your pattern texts instead of nlp.make_doc , or use
nlp.select_pipes to disable components selectively.
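
As a sketch, matching on LEMMA could look like this (the example terms are my own, and the results depend on the model’s lemmatizer):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
# Call nlp (not nlp.make_doc) so the pattern Doc has lemmas set
patterns = [nlp(term) for term in ["swim quickly"]]
matcher.add("SWIMMING", patterns)

doc = nlp("The kids swam quickly across the pool.")
for match_id, start, end in matcher(doc):
    print("Matched based on lemma:", doc[start:end].text)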

Another possible use case is matching number tokens like IP addresses based on their shape. This
means that you won’t have to worry about how those strings will be tokenized and you’ll be able to
find tokens and combinations of tokens based on a few examples. Here, we’re matching on the
shapes ddd.d.d.d and ddd.ddd.d.d :

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", [nlp("127.0.0.1"), nlp("127.127.0.0")])

doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])

In theory, the same also works for attributes like POS . For example, a pattern
nlp("I like cats") matched based on its part-of-speech tag would return a match for “I love
dogs”. You could also match on boolean flags like IS_PUNCT to match phrases with the same
sequence of punctuation and non-punctuation tokens as the pattern. But this can easily get
confusing and doesn’t have much of an advantage over writing one or two token patterns.
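
For illustration, a sketch of the POS case described above (whether it matches depends on the model’s tagger):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="POS")
matcher.add("STRUCTURE", [nlp("I like cats")])  # PRON VERB NOUN

doc = nlp("I love dogs")
for match_id, start, end in matcher(doc):
    print("Matched based on POS:", doc[start:end].text)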

Dependency Matcher (v3.0, needs model)

The DependencyMatcher lets you match patterns within the dependency parse using
Semgrex operators. It requires a model containing a parser such as the DependencyParser .
Instead of defining a list of adjacent tokens as in Matcher patterns, the DependencyMatcher
patterns match tokens in the dependency parse and specify the relations between them.

EXAMPLE

from spacy.matcher import DependencyMatcher

# "[subject] ... initially founded"


pattern = [
# anchor token: founded
{
"RIGHT_ID": "founded",
"RIGHT_ATTRS": {"ORTH": "founded"}
},
# founded -> subject
{
"LEFT_ID": "founded",
"REL_OP": ">",
"RIGHT_ID": "subject",
"RIGHT_ATTRS": {"DEP": "nsubj"}
},
# "founded" follows "initially"
{
"LEFT_ID": "founded",
"REL_OP": ";",
"RIGHT_ID": "initially",
"RIGHT_ATTRS": {"ORTH": "initially"}
}
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("FOUNDED", [pattern])
matches = matcher(doc)

A pattern added to the dependency matcher consists of a list of dictionaries, with each dictionary
describing a token to match and its relation to an existing token in the pattern. Except for the
first dictionary, which defines an anchor token using only RIGHT_ID and RIGHT_ATTRS , each
pattern should have the following keys:
NAME DESCRIPTION

LEFT_ID The name of the left-hand node in the relation, which has been defined in an
earlier node.

TYPE: str

REL_OP An operator that describes how the two nodes are related.

TYPE: str

RIGHT_ID A unique name for the right-hand node in the relation.

TYPE: str

RIGHT_ATTRS The token attributes to match for the right-hand node in the same format as
patterns provided to the regular token-based Matcher .

TYPE: Dict[str, Any]

Each additional token added to the pattern is linked to an existing token LEFT_ID by the relation
REL_OP . The new token is given the name RIGHT_ID and described by the attributes
RIGHT_ATTRS .

Important note

Because the unique token names in LEFT_ID and RIGHT_ID are used to identify tokens, the
order of the dicts in the patterns is important: a token name needs to be defined as RIGHT_ID
in one dict in the pattern before it can be used as LEFT_ID in another dict.

Dependency matcher operators


The following operators are supported by the DependencyMatcher , most of which come directly
from Semgrex:
SYMBOL DESCRIPTION

A < B A is the immediate dependent of B .

A > B A is the immediate head of B .

A << B A is the dependent in a chain to B following dep → head paths.

A >> B A is the head in a chain to B following head → dep paths.

A . B A immediately precedes B , i.e. A.i == B.i - 1 , and both are within the same
dependency tree.

A .* B A precedes B , i.e. A.i < B.i , and both are within the same dependency tree (not in
Semgrex).

A ; B A immediately follows B , i.e. A.i == B.i + 1 , and both are within the same
dependency tree (not in Semgrex).

A ;* B A follows B , i.e. A.i > B.i , and both are within the same dependency tree (not in
Semgrex).

A $+ B B is a right immediate sibling of A , i.e. A and B have the same parent and
A.i == B.i - 1 .

A $- B B is a left immediate sibling of A , i.e. A and B have the same parent and
A.i == B.i + 1 .

A $++ B B is a right sibling of A , i.e. A and B have the same parent and A.i < B.i .

A $-- B B is a left sibling of A , i.e. A and B have the same parent and A.i > B.i .

Designing dependency matcher patterns


Let’s say we want to find sentences describing who founded what kind of company:

● Smith founded a healthcare company in 2005.


● Williams initially founded an insurance company in 1987.
● Lee, an experienced CEO, has founded two AI startups.
The dependency parse for “Smith founded a healthcare company” shows types of relations and
tokens we want to match:

VISUALIZING THE PARSE

The displacy visualizer lets you render Doc objects and their dependency parse and part-of-speech
tags:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Smith founded a healthcare company")
displacy.serve(doc)

[Dependency parse visualization of “Smith founded a healthcare company”, showing the arcs nsubj (Smith ← founded), det (a ← company) and dobj (company ← founded)]

The relations we’re interested in are:

● the founder is the subject ( nsubj ) of the token with the text founded
● the company is the object ( dobj ) of founded
● the kind of company may be an adjective ( amod , not shown above) or a compound
( compound )

The first step is to pick an anchor token for the pattern. Since it’s the root of the dependency
parse, founded is a good choice here. It is often easier to construct patterns when all dependency
relation operators point from the head to the children. In this example, we’ll only use > , which
connects a head to an immediate dependent as head > child .

The simplest dependency matcher pattern will identify and name a single token in the tree:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {
        "RIGHT_ID": "anchor_founded",  # unique name
        "RIGHT_ATTRS": {"ORTH": "founded"}  # token pattern for "founded"
    }
]
matcher.add("FOUNDED", [pattern])
doc = nlp("Smith founded two companies.")
matches = matcher(doc)
print(matches)  # [(4851363122962674176, [1])]

Now that we have a named anchor token ( anchor_founded ), we can add the founder as the
immediate dependent ( > ) of founded with the dependency label nsubj :

STEP 1

pattern = [
{
"RIGHT_ID": "anchor_founded",
"RIGHT_ATTRS": {"ORTH": "founded"}
},
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "founded_subject",
"RIGHT_ATTRS": {"DEP": "nsubj"},
}
# ...
]

The direct object ( dobj ) is added in the same way:

STEP 2

pattern = [
#...
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "founded_object",
"RIGHT_ATTRS": {"DEP": "dobj"},
}
# ...
]

When the subject and object tokens are added, they are required to have names under the key
RIGHT_ID , which are allowed to be any unique string, e.g. founded_subject . These names can
then be used as LEFT_ID to link new tokens into the pattern. For the final part of our pattern,
we’ll specify that the token founded_object should have a modifier with the dependency relation
amod or compound :

STEP 3

pattern = [
# ...
{
"LEFT_ID": "founded_object",
"REL_OP": ">",
"RIGHT_ID": "founded_object_modifier",
"RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
}
]

You can picture the process of creating a dependency matcher pattern as defining an anchor token
on the left and building up the pattern by linking tokens one-by-one on the right using relation
operators. To create a valid pattern, each new token needs to be linked to an existing token on its
left. As for founded in this example, a token may be linked to more than one token on its right:

The full pattern comes together as shown in the example below:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {
        "RIGHT_ID": "anchor_founded",
        "RIGHT_ATTRS": {"ORTH": "founded"}
    },
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "founded_subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
    {
        "LEFT_ID": "founded_object",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
    }
]

matcher.add("FOUNDED", [pattern])
doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
matches = matcher(doc)

print(matches)  # [(4851363122962674176, [6, 0, 10, 9])]
# Each token_id corresponds to one pattern dict
match_id, token_ids = matches[0]
for i in range(len(token_ids)):
    print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)

Important note on speed

The dependency matcher may be slow when token patterns can potentially match many tokens
in the sentence or when relation operators allow longer paths in the dependency parse, e.g.
<< , >> , .* and ;* .

To improve the matcher speed, try to make your token patterns and operators as specific as
possible. For example, use > instead of >> if possible and use token patterns that include
dependency labels and other token attributes instead of patterns such as {} that match any
token in the sentence.
Rule-based entity recognition
The EntityRuler is a component that lets you add named entities based on pattern
dictionaries, which makes it easy to combine rule-based and statistical named entity recognition for
even more powerful pipelines.

Entity Patterns
Entity patterns are dictionaries with two keys: "label" , specifying the label to assign to the entity
if the pattern is matched, and "pattern" , the match pattern. The entity ruler accepts two types of
patterns:

1. Phrase patterns for exact string matches (string).

{"label": "ORG", "pattern": "Apple"}

2. Token patterns with one dictionary describing one token (list).

{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}

Using the entity ruler


The EntityRuler is a pipeline component that’s typically added via nlp.add_pipe .
When the nlp object is called on a text, it will find matches in the doc and add them as entities
to doc.ents , using the specified pattern label as the entity label. If matches overlap, the
pattern matching the most tokens takes priority. If they are equally long, the match occurring
first in the Doc is chosen.

from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

The entity ruler is designed to integrate with spaCy’s existing pipeline components and enhance the
named entity recognizer. If it’s added before the "ner" component, the entity recognizer will
respect the existing entity spans and adjust its predictions around it. This can significantly improve
accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add
spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To
overwrite overlapping entities, you can set overwrite_ents=True on initialization.

import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "MyCorp Inc."}]
ruler.add_patterns(patterns)

doc = nlp("MyCorp Inc. is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])

Validating and debugging EntityRuler patterns


The entity ruler can validate patterns against a JSON schema with the config setting "validate" .
See details under Validating and debugging patterns.

ruler = nlp.add_pipe("entity_ruler", config={"validate": True})

Adding IDs to patterns


The EntityRuler can also accept an id attribute for each pattern. Using the id attribute
allows multiple patterns to be associated with the same entity.

from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
ruler.add_patterns(patterns)

doc1 = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])

doc2 = nlp("Apple is opening its first big office in San Fran.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])

If the id attribute is included in the EntityRuler patterns, the ent_id_ property of the
matched entity is set to the id given in the patterns. So in the example above it’s easy to identify
that “San Francisco” and “San Fran” are both the same entity.

Using pattern files


The to_disk and from_disk methods let you save and load patterns to and from JSONL
(newline-delimited JSON) files, containing one pattern object per line.
PATTERNS.JSONL

{"label": "ORG", "pattern": "Apple"}


{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}

ruler.to_disk("./patterns.jsonl")
new_ruler = nlp.add_pipe("entity_ruler").from_disk("./patterns.jsonl")

Integration with Prodigy

If you’re using the Prodigy annotation tool, you might recognize these pattern files from
bootstrapping your named entity and text classification labelling. The patterns for the
EntityRuler follow the same syntax, so you can use your existing Prodigy pattern files in
spaCy, and vice versa.

When you save out an nlp object that has an EntityRuler added to its pipeline, its patterns are
automatically exported to the pipeline directory:

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.to_disk("/path/to/pipeline")

The saved pipeline now includes the "entity_ruler" in its config.cfg and the pipeline
directory contains a file entityruler.jsonl with the patterns. When you load the pipeline back
in, all pipeline components will be restored and deserialized – including the entity ruler. This lets
you ship powerful pipeline packages with binary weights and rules included!

Using a large number of phrase patterns


When using a large amount of phrase patterns (roughly > 10000) it’s useful to understand how the
add_patterns function of the entity ruler works. For each phrase pattern, the EntityRuler calls the
nlp object to construct a doc object. This happens in case you try to add the EntityRuler at the end
of an existing pipeline with, for example, a POS tagger and want to extract matches based on the
pattern’s POS signature. In this case you would pass a config value of
"phrase_matcher_attr": "POS" for the entity ruler.

Running the full language pipeline across every pattern in a large list scales linearly and can
therefore take a long time on large amounts of phrase patterns. As of spaCy v2.2.4 the
add_patterns function has been refactored to use nlp.pipe on all phrase patterns resulting in
about a 10x-20x speed up with 5,000-100,000 phrase patterns respectively. Even with this speedup
(but especially if you’re using an older version) the add_patterns function can still take a long
time. An easy workaround to make this function run faster is disabling the other language pipes
while adding the phrase patterns.

ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
with nlp.select_pipes(enable="tagger"):
ruler.add_patterns(patterns)

Combining models and rules


You can combine statistical and rule-based components in a variety of ways. Rule-based
components can be used to improve the accuracy of statistical models, by presetting tags, entities
or sentence boundaries for specific tokens. The statistical models will usually respect these preset
annotations, which sometimes improves the accuracy of other decisions. You can also use rule-
based components after a statistical model to correct common errors. Finally, rule-based
components can reference the attributes set by statistical models, in order to implement more
abstract logic.

Example: Expanding named entities


When using a trained named entity recognition model to extract information from your texts, you
may find that the predicted span only includes parts of the entity you’re looking for. Sometimes,
this happens if the statistical model predicts entities incorrectly. Other times, it happens if the way
the entity type was defined in the original training corpus doesn’t match what you need for your
application.

WHERE CORPORA COME FROM

Corpora used to train pipelines from scratch are often produced in academia. They contain text from
various sources with linguistic features labeled manually by human annotators (following a set of
specific guidelines). The corpora are then distributed with evaluation data, so other researchers can
benchmark their algorithms and everyone can report numbers on the same data. However, most
applications need to learn information that isn’t contained in any available corpus.

For example, the corpus spaCy’s English pipelines were trained on defines a PERSON entity as
just the person name, without titles like “Mr.” or “Dr.”. This makes sense, because it makes it easier
to resolve the entity type back to a knowledge base. But what if your application needs the full
names, including the titles?

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])

While you could try and teach the model a new definition of the PERSON entity by updating it with
more examples of spans that include the title, this might not be the most efficient approach. The
existing model was trained on over 2 million words, so in order to completely change the definition
of an entity type, you might need a lot of training examples. However, if you already have the
predicted PERSON entities, you can use a rule-based approach that checks whether they come with
a title and if so, expands the entity span by one token. After all, what all titles in this example have
in common is that if they occur, they occur in the previous token right before the person entity.

from spacy.language import Language
from spacy.tokens import Span

@Language.component("expand_person_entities")
def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # Only check for title if it's a person and not the first token
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

The above function takes a Doc object, modifies its doc.ents and returns it. Using the
@Language.component decorator, we can register it as a pipeline component so it can run
automatically when processing a text. We can use nlp.add_pipe to add it to the current
pipeline.

Editable Code spaCy v3.0 · Python 3 · via Binder

import spacy
from spacy.language import Language
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

@Language.component("expand_person_entities")
def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Add the component after the named entity recognizer
nlp.add_pipe("expand_person_entities", after="ner")

doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])


An alternative approach would be to use an extension attribute like ._.person_title and add it
to Span objects (which includes entity spans in doc.ents ). The advantage here is that the entity
text stays intact and can still be used to look up the name in a knowledge base. The following
function takes a Span object and, if it's a PERSON entity that doesn't start the document, checks
the previous token and returns the title if one is found. The Span.doc attribute gives us easy
access to the span's parent document.

def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text

We can now use the Span.set_extension method to add the custom extension attribute
"person_title" , using get_person_title as the getter function.

Editable Code spaCy v3.0 · Python 3 · via Binder

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text

# Register the Span extension as 'person_title'
Span.set_extension("person_title", getter=get_person_title)

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent._.person_title) for ent in doc.ents])


Example: Using entities, part-of-speech tags and the dependency parse

LINGUISTIC FEATURES

This example makes extensive use of part-of-speech tag and dependency attributes and related Doc ,
Token and Span methods. For an introduction on this, see the guide on linguistic features. Also see
the label schemes in the models directory for details on the labels.

Let’s say you want to parse professional biographies and extract the person names and company
names, and whether it’s a company they’re currently working at, or a previous company. One
approach could be to try and train a named entity recognizer to predict CURRENT_ORG and
PREVIOUS_ORG – but this distinction is very subtle and something the entity recognizer may
struggle to learn. Nothing about “Acme Corp Inc.” is inherently “current” or “previous”.

However, the syntax of the sentence holds some very important clues: we can check for trigger
words like “work”, whether they’re past tense or present tense, whether company names are
attached to it and whether the person is the subject. All of this information is available in the part-
of-speech tags and the dependency parse.

Editable Code spaCy v3.0 · Python 3 · via Binder

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alex Smith worked at Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])


nsubj : Nominal subject.
prep : Preposition.
pobj : Object of preposition.
NNP : Proper noun, singular.
VBD : Verb, past tense.
IN : Conjunction, subordinating or preposition.

[displacy visualization of "Alex Smith worked at Acme Corp Inc." showing the dependency arcs nsubj, prep and pobj]

spacy.displacy visualization with options={'fine_grained': True} to output the fine-grained part-of-speech tags, i.e. Token.tag_

In this example, “worked” is the root of the sentence and is a past tense verb. Its subject is “Alex
Smith”, the person who worked. “at Acme Corp Inc.” is a prepositional phrase attached to the verb
“worked”. To extract this relationship, we can start by looking at the predicted PERSON entities, find
their heads and check whether they’re attached to a trigger word like “work”. Next, we can check for
prepositional phrases attached to the head and whether they contain an ORG entity. Finally, to
determine whether the company affiliation is current, we can check the head’s part-of-speech tag.

person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]

for ent in person_entities:
    # Because the entity is a span, we need to use its root token. The head
    # is the syntactic governor of the person, e.g. the verb
    head = ent.root.head
    if head.lemma_ == "work":
        # Check if the children contain a preposition
        preps = [token for token in head.children if token.dep_ == "prep"]
        for prep in preps:
            # Check if tokens part of ORG entities are in the preposition's
            # children, e.g. at -> Acme Corp Inc.
            orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
            # If the verb is in past tense, the company was a previous company
            print({"person": ent, "orgs": orgs, "past": head.tag_ == "VBD"})

To apply this logic automatically when we process a text, we can add it to the nlp object as a
custom pipeline component. The above logic also expects that entities are merged into single
tokens. spaCy ships with a handy built-in merge_entities component that takes care of that.
Instead of just printing the result, you could also write it to custom attributes on the entity Span
– for example ._.orgs or ._.prev_orgs and ._.current_orgs , as in the sketch below.
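As a minimal sketch of that variant – the ._.orgs extension and the component name are hypothetical, and the trigger logic mirrors the snippet above – writing to a custom attribute instead of printing might look like this:

from spacy.language import Language
from spacy.tokens import Span

# Hypothetical extension to hold the organizations found for a person entity
Span.set_extension("orgs", default=None)

@Language.component("extract_person_orgs_to_attrs")
def extract_person_orgs_to_attrs(doc):
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [tok for tok in head.children if tok.dep_ == "prep"]
            # Collect ORG tokens attached to the prepositions, e.g. at -> Acme Corp Inc.
            ent._.orgs = [tok for prep in preps for tok in prep.children if tok.ent_type_ == "ORG"]
    return doc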

MERGING ENTITIES

Under the hood, entities are merged using the Doc.retokenize context manager:

with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

Editable Code spaCy v3.0 · Python 3 · via Binder

import spacy
from spacy.language import Language
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

@Language.component("extract_person_orgs")
def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
    return doc

# To make the entities easier to work with, we'll merge them into single tokens
nlp.add_pipe("merge_entities")
nlp.add_pipe("extract_person_orgs")

doc = nlp("Alex Smith worked at Acme Corp Inc.")

# If you're not in a Jupyter / IPython environment, use displacy.serve
displacy.render(doc, options={"fine_grained": True})


If you change the sentence structure above, for example to “was working”, you’ll notice that our
current logic fails and doesn’t correctly detect the company as a past organization. That’s because
the root is a participle and the tense information is in the attached auxiliary “was”:

[displacy visualization of "Alex Smith was working at Acme Corp Inc." showing the aux arc: the auxiliary "was" (VBD) attached to the participle "working" (VBG)]

To solve this, we can adjust the rules to also check for the above construction:

@Language.component("extract_person_orgs")
def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [t for t in prep.children if t.ent_type_ == "ORG"]
                # Also check the head's children for an auxiliary like "was"
                aux = [token for token in head.children if token.dep_ == "aux"]
                past_aux = any(t.tag_ == "VBD" for t in aux)
                past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
                print({'person': ent, 'orgs': orgs, 'past': past})
    return doc
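To see the pieces the adjusted check relies on, here's a small standalone sketch – the exact tags depend on the model's predictions, so the expected values in the comment assume the parse shown above:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alex Smith was working at Acme Corp Inc.")

# Find the participle and inspect its auxiliary children
working = [tok for tok in doc if tok.text == "working"][0]
aux = [tok for tok in working.children if tok.dep_ == "aux"]
print(working.tag_, [(t.text, t.tag_) for t in aux])
# Expected: VBG [('was', 'VBD')] – so past_aux is True and the verb counts as past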

In your final rule-based system, you may end up with several different code paths to cover the
types of constructions that occur in your data.
