Functions using blas cause a segfault (SIGSEGV) #617

@eeeebbbbrrrr

After working with @levkk and @montanalow to install PostgresML (as of master: 63ebce3) on my Linux box, I discovered that functions such as pgml.cosine_similarity and pgml.norm_l1 cause Postgres to segfault.

As an example:

[v15.1][5126] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 188.973 ms
[v][] ?!> 

Postgres logs leading up to a crash against pgml.cosine_similarity() are:

/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/logging.py:65: RuntimeWarning: Error deriving logger module name, using <None>. Exception: <module '' from '/home/pg/15/data'> is a built-in module
  warnings.warn(
No sentence-transformers model found with name /home/zombodb/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.
2023-05-04 18:35:48.950 UTC [20973] LOG:  server process (PID 21218) was terminated by signal 11: Segmentation fault
2023-05-04 18:35:48.950 UTC [20973] DETAIL:  Failed process was running: select *, pgml.cosine_similarity(embed, pgml.embed('intfloat/e5-large', 'meetings with beer or wine and cheese')) from embeddings_e5large_100k limit 10;
2023-05-04 18:35:48.950 UTC [20973] LOG:  terminating any other active server processes
2023-05-04 18:35:48.953 UTC [20973] LOG:  all server processes terminated; reinitializing
2023-05-04 18:35:48.979 UTC [20973] FATAL:  Can't attach, lock is not in an empty state: PgLwLockInner
2023-05-04 18:35:48.980 UTC [20973] LOG:  database system is shut down

The backtrace from a --debug build of pgml is:

Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
(gdb) bt
#0  0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
#1  0x00007ff52b8b363a in blas::sdot (n=1024, x=..., incx=1, y=..., incy=1)
    at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/blas-0.22.0/src/lib.rs:109
#2  0x00007ff52b7c6aa6 in pgml::vectors::cosine_similarity_s (vector=..., other=...) at src/vectors.rs:304
#3  0x00007ff52b7c6d9a in pgml::vectors::cosine_similarity_s_wrapper::cosine_similarity_s_wrapper_inner (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#4  0x00007ff52b4ae1c1 in pgml::vectors::cosine_similarity_s_wrapper::{closure#0} () at src/vectors.rs:302
#5  0x00007ff52b6edb8c in std::panicking::try::do_call<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (
    data=0x7ffe798f2828) at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:483
#6  0x00007ff52b6f0f6b in __rust_try.llvm.11079318101650794703 () from /home/pg/15/lib/postgresql/pgml.so
#7  0x00007ff52b6ea049 in std::panicking::try<pgrx_pg_sys::submodules::datum::Datum, pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:447
#8  0x00007ff52b75a0f6 in std::panic::catch_unwind<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panic.rs:137
#9  0x00007ff52b765983 in pgrx_pg_sys::submodules::panic::run_guarded<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:403
#10 0x00007ff52b77111c in pgrx_pg_sys::submodules::panic::pgrx_extern_c_guard<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:380
#11 0x00007ff52b7c6c9d in pgml::vectors::cosine_similarity_s_wrapper (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#12 0x000055fa54ce4b43 in ExecInterpExpr ()
#13 0x000055fa54cf15a2 in ExecScan ()
#14 0x000055fa54d0c368 in ExecLimit ()
#15 0x000055fa54ce88a2 in standard_ExecutorRun ()

My box is a (humblebrag):

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper 3970X 32-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU max MHz:         3700.0000
    CPU min MHz:         2200.0000
    BogoMIPS:            7386.30
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    128 MiB (8 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

With an NVIDIA RTX 4080:

$ nvidia-debugdump -l
Found 1 NVIDIA devices
   Device ID:              0
   Device name:            NVIDIA GeForce RTX 4080   (*PrimaryCard)
   GPU internal ID:        GPU-b772ddf7-d413-e1bb-d1e1-8e7022c59343

Lev helped me discover that everything works after commenting out this line:

println!("cargo:rustc-link-lib=static=openblas");

With that line commented out:

[v15.1][8595] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
 norm_l1 
---------
       6
(1 row)

Time: 0.620 ms

This crash seems to be isolated to blas as I created 100k embeddings with pgml.embed() in a mere 7m 50s, using 4 parallel workers, even. So that part is good.
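
For anyone who wants to sanity-check results without going through BLAS at all, here is a minimal pure-Rust sketch of the two operations mentioned above. It is not pgml's implementation; the function names simply mirror the SQL functions, and the plain-loop `dot` helper is a stand-in for the `sdot` call seen in the backtrace. The zero-vector behavior (a NaN result) is an assumption of this sketch, not something confirmed for pgml.

```rust
/// L1 norm: the sum of the absolute values of the elements.
fn norm_l1(v: &[f32]) -> f32 {
    v.iter().map(|x| x.abs()).sum()
}

/// Dot product computed with a plain iterator instead of BLAS `sdot`.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Yields NaN when either vector has zero length (0.0 / 0.0).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let denom = dot(a, a).sqrt() * dot(b, b).sqrt();
    dot(a, b) / denom
}

fn main() {
    // Same input as the failing query: ARRAY[1,2,3]::real[]
    println!("norm_l1 = {}", norm_l1(&[1.0, 2.0, 3.0]));
    // Orthogonal vectors have no similarity.
    println!("cosine  = {}", cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]));
}
```

For the input `[1, 2, 3]` the L1 norm is 1 + 2 + 3 = 6, which matches the result returned once the static openblas link line is commented out.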

I had a thought that rebooting the computer might help since I had just stressed the GPU making all those embeddings, but naw, that didn't change anything.

One theory: since pgml links to so many libraries (probably both directly and indirectly), maybe there's some kind of symbol resolution problem and the wrong symbols are being called. Just a theory.

@thomcc might be able to offer some help with this if it's some kind of linking problem? Offering up his services as PostgresML's success is pgrx's success!
