Commit 11040f0

author
Montana Low
committed
update typos
1 parent b4703d9 commit 11040f0

1 file changed

+6
-6
lines changed

pgml-docs/docs/blog/postgres-full-text-search-is-awesome.md

Lines changed: 6 additions & 6 deletions
@@ -6,15 +6,15 @@
 August 25, 2022
 </p>

-Normalized data is a powerful tool leveraged by 10x engineering organizations. If you haven't read [Postgres Full Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/) you should, unless you're willing to take that statement at face value, without the code samples to prove it. We'll go beyond that original claim in this post, but to reiterate the main points, Postgres supports:
+Normalized data is a powerful tool leveraged by 10x engineering organizations. If you haven't read [Postgres Full Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/) you should, unless you're willing to take that statement at face value, without the code samples to prove it. We'll go beyond that claim in this post, but to reiterate the main points, Postgres supports:

 - Stemming
 - Ranking / Boost
 - Support Multiple languages
 - Fuzzy search for misspelling
 - Accent support

-This is good enough for most of the use cases out there, without introducing any additional concerns to your application. But, if you've ever tried to deliver relevant search results at scale, you'll realize that you need a lot more than these fundamentals. ElasticSearch has all kinds of best in class features, like a modified version of BM25 that is state of the art (developed in the 1970's), which is one of the many features you need beyond the Term Frequency (TF) based ranking that Postgres uses... but, _the ElasticSearch approach is a dead end_ for 2 reasons:
+This is good enough for most of the use cases out there, without introducing any additional concerns to your application. But, if you've ever tried to deliver relevant search results at scale, you'll realize that you need a lot more than these fundamentals. ElasticSearch has all kinds of best in class features, like a modified version of BM25 that is state of the art (developed in the 1970's), which is one of the many features you need beyond the Term Frequency (TF) based ranking that Postgres uses... but, _the ElasticSearch approach is a dead end_ for 2 reasons:

 1. Trying to improve search relevance with statistics like TF-IDF and BM25 is like trying to make a flying car. What you want is a helicopter instead.
 2. Computing inverse document frequency for BM25 brutalizes your search indexing performance, which leads to a [host of follow on issues via distributed computation](https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing), for the originally dubious reason.
@@ -26,25 +26,25 @@ This is good enough for most of the use cases out there, without introducing any
 <figcaption>What we were promised</figcaption>
 </figure>

-Academics have spent a decades inventing many algorithms that use orders of magnitude more compute eking out marginally better results that often aren't worth it in practice. Not to generally disparage academia, their work has consistently improved our world, but we need to pay attention to tradeoffs.
+Academics have spent decades inventing many algorithms that use orders of magnitude more compute eking out marginally better results that often aren't worth it in practice. Not to generally disparage academia, their work has consistently improved our world, but we need to pay attention to tradeoffs.

 If you actually want to meaningfully improve search results, you generally need to add new data sources. Relevance is much more often revealed by the way other things **_relate_** to the document, rather than the content of the document itself. Google proved the point 23 years ago. Pagerank doesn't rely on the page content itself as much as it uses metadata from _links to the pages_. We live in a connected world and it's the interplay among things that reveal their relevance, whether that is links for websites, sales for products, shares for social posts... It's the greater context around the document that matters.

 > _If you want to improve your search results, don't rely on expensive O(n*m) word frequency statistics. Get new sources of data instead. It's the relational nature of relevance that underpins why a relational database forms the ideal search engine._

-Postgres made the right call to avoid the costs required to compute Inverse Document Frequency in their search indexing, given its meager benefit. Instead, it offers the most feature complete relational data platform. [Elasticsearch will tell you](https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html) you can't join data in a **_naively_** distributed system at read time, because it is prohibitively expensive. Instead you'll have to join the data eagerly at indexing time, which is even more prohibitively expensive. That's good for their business since you're the one paying for it, and it will scale until it you're bankrupt.
+Postgres made the right call to avoid the costs required to compute Inverse Document Frequency in their search indexing, given its meager benefit. Instead, it offers the most feature-complete relational data platform. [Elasticsearch will tell you](https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html) you can't join data in a **_naively_** distributed system at read time, because it is prohibitively expensive. Instead you'll have to join the data eagerly at indexing time, which is even more prohibitively expensive. That's good for their business since you're the one paying for it, and it will scale until you're bankrupt.

 What you really should do, is leave the data normalized inside Postgres, which will allow you to join additional, related data at query time. It will take multiple orders of magnitude less compute to index and search a normalized corpus, meaning you'll have a lot longer (potentially forever) before you need to distribute your workload, and then maybe you can do that intelligently instead of naively. Instead of spending your time building and maintaining pipelines to shuffle updates between systems, you can work on new sources of data to really improve relevance.

-With PostgresML, you can now skip straight to full on machine learning when you have the related data. You can load your feature store into the same database as your search corpus. Each feature can live in it's own independent table, with it's own update cadence, rather than having to reindex and denormalize entire documents back to ElasticSearch, or worse, large portions of the entire corpus, when a single thing changes.
+With PostgresML, you can now skip straight to full on machine learning when you have the related data. You can load your feature store into the same database as your search corpus. Each feature can live in its own independent table, with its own update cadence, rather than having to reindex and denormalize entire documents back to ElasticSearch, or worse, large portions of the entire corpus, when a single thing changes.

 With a single SQL query, you can do multiple passes of re-ranking, pruning and personalization to refine a search relevance score.

 - basic term relevance
 - embedding similarities
 - XGBoost or LightGBM inference

-These queries can execute in miliseconds on large production sized corpora with Postgres's multiple indexing strategies. You can do all of this without adding any new infrastructure to your stack.
+These queries can execute in milliseconds on large production-sized corpora with Postgres's multiple indexing strategies. You can do all of this without adding any new infrastructure to your stack.

 The following full blown example is for demonstration purposes only. You may want to try the PostgresML Gym to work up to the full understanding.
