Description
Versions
Latest main.
Bug description
Start with a clean slate (no persistent/ dir and a new empty DB).
Run Electric in dev mode (iex -S mix).
Create a table in the database called items:
$ psql postgresql://postgres:password@localhost:54321/electric
[localhost] postgres:electric=# create table items(val text);
CREATE TABLE
Attempt no. 1
Try inserting 100 mln rows into the items table:
[localhost] postgres:electric=# insert into items select generate_series::text from generate_series(1, 100000000);
While the query is still running, try requesting a shape from Electric:
$ curl -i 'http://localhost:3000/v1/shape?table=items&offset=-1'
HTTP/1.1 500 Internal Server Error
date: Mon, 14 Apr 2025 16:00:10 GMT
content-length: 102
vary: accept-encoding
cache-control: no-cache
x-request-id: GDY6qbcQUwwFCRoAAAAD
electric-server: ElectricSQL/1.0.5
access-control-allow-origin: *
access-control-expose-headers: electric-cursor,electric-handle,electric-offset,electric-schema,electric-up-to-date
access-control-allow-methods: GET, HEAD, DELETE, OPTIONS
content-type: application/json; charset=utf-8
electric-schema: {"val":{"type":"text"}}
{"message":"Unable to retrieve shape log: ** (RuntimeError) Timed out while waiting for a table lock"}
Electric's log output
18:00:06.271 pid=<0.792.0> request_id=GDY6qbcQUwwFCRoAAAAD [info] GET /v1/shape
18:00:06.308 pid=<0.792.0> request_id=GDY6qbcQUwwFCRoAAAAD [info] Query String: table=items&offset=-1
18:00:06.345 pid=<0.792.0> request_id=GDY6qbcQUwwFCRoAAAAD [debug] Table {"public", "items"} found with 1 columns
18:00:06.349 pid=<0.790.0> [debug] Starting consumer for 74391704-1744646406349
18:00:06.383 pid=<0.788.0> [debug] 1 consumers of replication stream
18:00:06.387 pid=<0.790.0> [debug] Returning shape id 74391704-1744646406349 for shape Shape.new!("public.items" [OID 16390])
18:00:06.394 pid=<0.798.0> shape_handle=74391704-1744646406349 [debug] Starting a wait on the snapshot 74391704-1744646406349 for {#PID<0.792.0>, [:alias | #Reference<0.0.101379.1137765335.1537802241.148014>]}}
18:00:06.398 pid=<0.787.0> [info] Altering identity of public.items to FULL
18:00:10.128 pid=<0.760.0> [debug] Primary Keepalive: wal_end=9712335958078128128 (86C92618/80800000) reply=0
18:00:11.388 pid=<0.798.0> shape_handle=74391704-1744646406349 [error] Snapshot creation failed for 74391704-1744646406349 because of:
** (RuntimeError) Timed out while waiting for a table lock
(elixir 1.18.1) lib/gen_server.ex:1128: GenServer.call/3
(electric 1.0.6) lib/electric/replication/publication_manager.ex:101: Electric.Replication.PublicationManager.add_shape/2
(opentelemetry_api 1.4.0) src/otel_tracer_noop.erl:59: :otel_tracer_noop.with_span/5
(electric 1.0.6) lib/electric/telemetry/open_telemetry.ex:87: anonymous fn/3 in Electric.Telemetry.OpenTelemetry.do_with_span/4
(telemetry 1.3.0) /home/alco/code/electric-sql/electric/packages/sync-service/deps/telemetry/src/telemetry.erl:324: :telemetry.span/3
(electric 1.0.6) lib/electric/shapes/consumer/snapshotter.ex:64: anonymous fn/10 in Electric.Shapes.Consumer.Snapshotter.handle_continue/2
(opentelemetry_api 1.4.0) src/otel_tracer_noop.erl:59: :otel_tracer_noop.with_span/5
(electric 1.0.6) lib/electric/telemetry/open_telemetry.ex:87: anonymous fn/3 in Electric.Telemetry.OpenTelemetry.do_with_span/4
(telemetry 1.3.0) /home/alco/code/electric-sql/electric/packages/sync-service/deps/telemetry/src/telemetry.erl:324: :telemetry.span/3
(electric 1.0.6) lib/electric/shapes/consumer/snapshotter.ex:58: Electric.Shapes.Consumer.Snapshotter.handle_continue/2
(stdlib 6.2) gen_server.erl:2335: :gen_server.try_handle_continue/3
(stdlib 6.2) gen_server.erl:2244: :gen_server.loop/7
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
18:00:11.401 pid=<0.792.0> request_id=GDY6qbcQUwwFCRoAAAAD [info] Sent 500 in 5130ms
18:00:14.659 pid=<0.760.0> [debug] Primary Keepalive: wal_end=9712335958290266496 (86C92618/8D24F980) reply=1
18:00:15.972 pid=<0.760.0> [debug] Primary Keepalive: wal_end=9712335958346563568 (86C92618/907FFFF0) reply=0
18:00:21.387 pid=<0.776.0> [error] Postgrex.Protocol (#PID<0.776.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.787.0> timed out because it queued and checked out the connection for longer than 15000ms
#PID<0.787.0> was at location:
:prim_inet.recv0/3
(postgrex 0.19.0) lib/postgrex/protocol.ex:3298: Postgrex.Protocol.msg_recv/4
(postgrex 0.19.0) lib/postgrex/protocol.ex:2292: Postgrex.Protocol.recv_bind/3
(postgrex 0.19.0) lib/postgrex/protocol.ex:2147: Postgrex.Protocol.bind_execute_close/4
(db_connection 2.7.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.7.0) lib/db_connection.ex:1558: DBConnection.run_execute/5
(db_connection 2.7.0) lib/db_connection.ex:772: DBConnection.parsed_prepare_execute/5
(db_connection 2.7.0) lib/db_connection.ex:764: DBConnection.prepare_execute/4
(postgrex 0.19.0) lib/postgrex.ex:316: Postgrex.query_prepare_execute/4
(postgrex 0.19.0) lib/postgrex.ex:328: Postgrex.query!/4
(electric 1.0.6) lib/electric/postgres/configuration.ex:155: anonymous fn/3 in Electric.Postgres.Configuration.set_replica_identity!/2
(elixir 1.18.1) lib/enum.ex:2546: Enum."-reduce/3-lists^foldl/2-0-"/3
(electric 1.0.6) lib/electric/postgres/configuration.ex:144: Electric.Postgres.Configuration.set_replica_identity!/2
(electric 1.0.6) lib/electric/postgres/configuration.ex:138: Electric.Postgres.Configuration.configure_tables_for_replication_internal!/4
(db_connection 2.7.0) lib/db_connection.ex:1756: DBConnection.run_transaction/4
(electric 1.0.6) lib/electric/replication/publication_manager.ex:297: Electric.Replication.PublicationManager.update_publication/1
(electric 1.0.6) lib/electric/replication/publication_manager.ex:218: Electric.Replication.PublicationManager.handle_info/2
(stdlib 6.2) gen_server.erl:2345: :gen_server.try_handle_info/3
(stdlib 6.2) gen_server.erl:2433: :gen_server.handle_msg/6
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
18:00:21.403 pid=<0.787.0> [warning] Failed to configure publication, retrying: %DBConnection.ConnectionError{message: "tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)", severity: :error, reason: :error}
18:00:21.395 pid=<0.798.0> shape_handle=74391704-1744646406349 [error] GenServer {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Shapes.Consumer, "74391704-1744646406349"}} terminating
** (stop) exited in: GenServer.call({:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.PublicationManager, nil}}}, {:remove_shape, Shape.new!("public.items" [OID 16390])}, 5000)
** (EXIT) time out
(elixir 1.18.1) lib/gen_server.ex:1128: GenServer.call/3
(electric 1.0.6) lib/electric/replication/publication_manager.ex:117: Electric.Replication.PublicationManager.remove_shape/2
(electric 1.0.6) lib/electric/shapes/consumer.ex:465: Electric.Shapes.Consumer.cleanup/1
(electric 1.0.6) lib/electric/shapes/consumer.ex:188: Electric.Shapes.Consumer.terminate/2
(stdlib 6.2) gen_server.erl:2393: :gen_server.try_terminate/3
(stdlib 6.2) gen_server.erl:2594: :gen_server.terminate/10
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Process Label: {:consumer, "74391704-1744646406349"}
Last message: {:"$gen_cast", {:snapshot_failed, "74391704-1744646406349", %RuntimeError{message: "Timed out while waiting for a table lock"}, [{GenServer, :call, 3, [file: ~c"lib/gen_server.ex", line: 1128]}, {Electric.Replication.PublicationManager, :add_shape, 2, [file: ~c"lib/electric/replication/publication_manager.ex", line: 101]}, {:otel_tracer_noop, :with_span, 5, [file: ~c"src/otel_tracer_noop.erl", line: 59]}, {Electric.Telemetry.OpenTelemetry, :"-do_with_span/4-fun-1-", 3, [file: ~c"lib/electric/telemetry/open_telemetry.ex", line: 87]}, {:telemetry, :span, 3, [file: ~c"/home/alco/code/electric-sql/electric/packages/sync-service/deps/telemetry/src/telemetry.erl", line: 324]}, {Electric.Shapes.Consumer.Snapshotter, :"-handle_continue/2-fun-1-", 10, [file: ~c"lib/electric/shapes/consumer/snapshotter.ex", line: 64]}, {:otel_tracer_noop, :with_span, 5, [file: ~c"src/otel_tracer_noop.erl", line: 59]}, {Electric.Telemetry.OpenTelemetry, :"-do_with_span/4-fun-1-", 3, [file: ~c"lib/electric/telemetry/open_telemetry.ex", line: 87]}, {:telemetry, :span, 3, [file: ~c"/home/alco/code/electric-sql/electric/packages/sync-service/deps/telemetry/src/telemetry.erl", line: 324]}, {Electric.Shapes.Consumer.Snapshotter, :handle_continue, 2, [file: ~c"lib/electric/shapes/consumer/snapshotter.ex", line: 58]}, {:gen_server, :try_handle_continue, 3, [file: ~c"gen_server.erl", line: 2335]}, {:gen_server, :loop, 7, [file: ~c"gen_server.erl", line: 2244]}, {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 329]}]}}
State: %{monitors: [], buffer: [], registry: :"Elixir.Registry.ShapeChanges:single_stack", otel_ctx: %{:"$__otel_baggage_ctx_key" => %{"stack_id" => {"single_stack", []}}, {:otel_tracer, :span_ctx} => {:span_ctx, 0, 0, 0, {:tracestate, []}, false, false, false, :undefined}}, inspector: {Electric.Postgres.Inspector.EtsInspector, [stack_id: "single_stack", server: {:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Postgres.Inspector.EtsInspector, nil}}}]}, stack_id: "single_stack", storage: {Electric.ShapeCache.FileStorage, %Electric.ShapeCache.FileStorage{base_path: "./persistent/shapes/single_stack", shape_handle: "74391704-1744646406349", db: {:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.ShapeCache.FileStorage, "74391704-1744646406349"}}}, data_dir: "./persistent/shapes/single_stack/74391704-1744646406349", cubdb_dir: "./persistent/shapes/single_stack/74391704-1744646406349/cubdb", snapshot_dir: "./persistent/shapes/single_stack/74391704-1744646406349/snapshots", log_dir: "./persistent/shapes/single_stack/74391704-1744646406349/log", stack_id: "single_stack", extra_opts: %{}, chunk_bytes_threshold: 25000000, version: 3}}, shape: Shape.new!("public.items" [OID 16390]), log_state: %{current_chunk_byte_size: 0, current_txn_bytes: 0}, shape_handle: "74391704-1744646406349", log_producer: {:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.ShapeLogCollector, nil}}}, shape_status: {Electric.ShapeCache.ShapeStatus, %Electric.ShapeCache.ShapeStatus{root: "./shape_cache", shape_meta_table: :"single_stack:shape_meta_table", storage: {Electric.ShapeCache.FileStorage, %{stack_id: "single_stack", chunk_bytes_threshold: 25000000, base_path: "./persistent/shapes/single_stack"}}}}, latest_offset: LogOffset.last_before_real_offsets(), pg_snapshot: nil, snapshot_started: false, awaiting_snapshot_start: [{#PID<0.792.0>, [:alias | #Reference<0.0.101379.1137765335.1537802241.148014>]}], chunk_bytes_threshold: 25000000, publication_manager: {Electric.Replication.PublicationManager, [stack_id: "single_stack"]}, db_pool: {:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.DbPool, nil}}}, run_with_conn_fn: &Electric.Shapes.Consumer.Snapshotter.run_with_conn/2, create_snapshot_fn: &Electric.Shapes.Consumer.Snapshotter.query_in_readonly_txn/7}
18:00:21.407 pid=<0.788.0> [debug] 0 consumers of replication stream
18:00:21.718 pid=<0.787.0> [info] Altering identity of public.items to FULL
18:00:22.950 pid=<0.760.0> [debug] Primary Keepalive: wal_end=9712335958656942072 (86C92618/A2FFFFF8) reply=0
18:00:31.288 pid=<0.760.0> [debug] Primary Keepalive: wal_end=9712335959026040832 (86C92618/B9000000) reply=0
18:00:36.707 pid=<0.772.0> [error] Postgrex.Protocol (#PID<0.772.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.787.0> timed out because it queued and checked out the connection for longer than 15000ms
#PID<0.787.0> was at location:
:prim_inet.recv0/3
(postgrex 0.19.0) lib/postgrex/protocol.ex:3298: Postgrex.Protocol.msg_recv/4
(postgrex 0.19.0) lib/postgrex/protocol.ex:2292: Postgrex.Protocol.recv_bind/3
(postgrex 0.19.0) lib/postgrex/protocol.ex:2147: Postgrex.Protocol.bind_execute_close/4
(db_connection 2.7.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.7.0) lib/db_connection.ex:1558: DBConnection.run_execute/5
(db_connection 2.7.0) lib/db_connection.ex:772: DBConnection.parsed_prepare_execute/5
(db_connection 2.7.0) lib/db_connection.ex:764: DBConnection.prepare_execute/4
(postgrex 0.19.0) lib/postgrex.ex:316: Postgrex.query_prepare_execute/4
(postgrex 0.19.0) lib/postgrex.ex:328: Postgrex.query!/4
(electric 1.0.6) lib/electric/postgres/configuration.ex:155: anonymous fn/3 in Electric.Postgres.Configuration.set_replica_identity!/2
(elixir 1.18.1) lib/enum.ex:2546: Enum."-reduce/3-lists^foldl/2-0-"/3
(electric 1.0.6) lib/electric/postgres/configuration.ex:144: Electric.Postgres.Configuration.set_replica_identity!/2
(electric 1.0.6) lib/electric/postgres/configuration.ex:138: Electric.Postgres.Configuration.configure_tables_for_replication_internal!/4
(db_connection 2.7.0) lib/db_connection.ex:1756: DBConnection.run_transaction/4
(electric 1.0.6) lib/electric/replication/publication_manager.ex:297: Electric.Replication.PublicationManager.update_publication/1
(electric 1.0.6) lib/electric/replication/publication_manager.ex:218: Electric.Replication.PublicationManager.handle_info/2
(stdlib 6.2) gen_server.erl:2345: :gen_server.try_handle_info/3
(stdlib 6.2) gen_server.erl:2433: :gen_server.handle_msg/6
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
18:00:36.710 pid=<0.787.0> [warning] Failed to configure publication, retrying: %Postgrex.Error{message: nil, postgres: %{code: :query_canceled, line: "3425", message: "canceling statement due to user request", file: "postgres.c", unknown: "ERROR", severity: "ERROR", pg_code: "57014", routine: "ProcessInterrupts"}, connection_id: 92, query: nil}
18:00:37.026 pid=<0.787.0> [info] Altering identity of public.items to FULL
18:00:41.243 pid=<0.760.0> [debug] Primary Keepalive: wal_end=9712335959474831336 (86C92618/D3BFFFE8) reply=0
18:00:44.660 pid=<0.760.0> [debug] Primary Keepalive: wal_end=9712335959621560376 (86C92618/DC7EE838) reply=1
18:00:52.014 pid=<0.776.0> [error] Postgrex.Protocol (#PID<0.776.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.787.0> timed out because it queued and checked out the connection for longer than 15000ms
#PID<0.787.0> was at location:
:prim_inet.recv0/3
(postgrex 0.19.0) lib/postgrex/protocol.ex:3298: Postgrex.Protocol.msg_recv/4
(postgrex 0.19.0) lib/postgrex/protocol.ex:2292: Postgrex.Protocol.recv_bind/3
(postgrex 0.19.0) lib/postgrex/protocol.ex:2147: Postgrex.Protocol.bind_execute_close/4
(db_connection 2.7.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.7.0) lib/db_connection.ex:1558: DBConnection.run_execute/5
(db_connection 2.7.0) lib/db_connection.ex:772: DBConnection.parsed_prepare_execute/5
(db_connection 2.7.0) lib/db_connection.ex:764: DBConnection.prepare_execute/4
(postgrex 0.19.0) lib/postgrex.ex:316: Postgrex.query_prepare_execute/4
(postgrex 0.19.0) lib/postgrex.ex:328: Postgrex.query!/4
(electric 1.0.6) lib/electric/postgres/configuration.ex:155: anonymous fn/3 in Electric.Postgres.Configuration.set_replica_identity!/2
(elixir 1.18.1) lib/enum.ex:2546: Enum."-reduce/3-lists^foldl/2-0-"/3
(electric 1.0.6) lib/electric/postgres/configuration.ex:144: Electric.Postgres.Configuration.set_replica_identity!/2
(electric 1.0.6) lib/electric/postgres/configuration.ex:138: Electric.Postgres.Configuration.configure_tables_for_replication_internal!/4
(db_connection 2.7.0) lib/db_connection.ex:1756: DBConnection.run_transaction/4
(electric 1.0.6) lib/electric/replication/publication_manager.ex:297: Electric.Replication.PublicationManager.update_publication/1
(electric 1.0.6) lib/electric/replication/publication_manager.ex:218: Electric.Replication.PublicationManager.handle_info/2
(stdlib 6.2) gen_server.erl:2345: :gen_server.try_handle_info/3
(stdlib 6.2) gen_server.erl:2433: :gen_server.handle_msg/6
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
18:00:52.018 pid=<0.787.0> [warning] Failed to configure publication, retrying: %Postgrex.Error{message: nil, postgres: %{code: :query_canceled, line: "3425", message: "canceling statement due to user request", file: "postgres.c", unknown: "ERROR", severity: "ERROR", pg_code: "57014", routine: "ProcessInterrupts"}, connection_id: 115, query: nil}
18:00:52.333 pid=<0.787.0> [info] Altering identity of public.items to FULL
Electric keeps trying to alter public.items's identity until it succeeds.
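For context: judging by the set_replica_identity! frames in the stack traces above, the statement Electric is retrying is presumably

ALTER TABLE public.items REPLICA IDENTITY FULL;

which needs an ACCESS EXCLUSIVE lock, while the long-running INSERT holds its ROW EXCLUSIVE lock on the table until it commits or rolls back. While the INSERT is in flight, the conflict should be visible from another session:

-- run in a separate psql session while the INSERT is running
SELECT pid, mode, granted
FROM pg_locks
WHERE relation = 'public.items'::regclass;

I'd expect the INSERT's RowExclusiveLock to show granted = t and Electric's AccessExclusiveLock to sit at granted = f until the INSERT finishes.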
Oddly enough, psql at some point fails with an error and the transaction rolls back:
PANIC: could not write to file "pg_wal/xlogtemp.109": No space left on device
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
[] :!>? \q
$ psql postgresql://postgres:password@localhost:54321/electric
[localhost] postgres:electric=# select count(1) from items;
count
───────
0
(1 row)
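Even though the transaction rolled back, all the WAL it generated had already been written (a rollback doesn't reclaim WAL), and Electric's replication slot presumably pins restart_lsn so Postgres can't recycle those segments either. Once the server is back, the size of pg_wal can be checked with (requires superuser or pg_monitor):

SELECT count(*) AS segments, pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();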
The device in question is probably hitting some limit inside Docker, because the memory usage of neither psql nor Postgres grows noticeably and I have plenty of disk space on my machine:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/dm-0 953G 97G 849G 11% /
devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs 16G 31M 16G 1% /dev/shm
efivarfs 566K 289K 273K 52% /sys/firmware/efi/efivars
tmpfs 6.3G 2.9M 6.3G 1% /run
/dev/dm-0 953G 97G 849G 11% /home
/dev/nvme1n1p2 974M 380M 527M 42% /boot
tmpfs 16G 84M 16G 1% /tmp
/dev/nvme1n1p1 599M 20M 580M 4% /boot/efi
tmpfs 3.2G 23M 3.1G 1% /run/user/1000
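To confirm it's the container's filesystem filling up rather than the host's, df inside the container should show it; the container name below is a placeholder and the data directory path assumes the official postgres image:

$ docker exec -it <postgres-container> df -h /var/lib/postgresql/data
$ docker exec -it <postgres-container> du -sh /var/lib/postgresql/data/pg_wal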
Attempt no. 2
Leaving Electric running, I try to insert 100 mln rows once again. At this point Electric has one active shape for the items table.
[localhost] postgres:electric=# insert into items select generate_series::text from generate_series(1, 100000000);
ERROR: could not extend file "base/16384/16390.3": No space left on device
HINT: Check free disk space.
Electric's log output
18:09:59.511 pid=<0.697.0> [info] Starting replication from postgres
18:09:59.519 pid=<0.697.0> [debug] Primary Keepalive: wal_end=9712335959621560377 (86C92618/DC7EE839) reply=0
18:09:59.520 pid=<0.733.0> shape_handle=74391704-1744646406349 [debug] Snapshot known for shape_handle: 74391704-1744646406349 xmin: 759, xmax: 759, xip_list:
18:10:03.973 pid=<0.733.0> shape_handle=74391704-1744646406349 [debug] Snapshot started shape_handle: 74391704-1744646406349
18:10:03.974 pid=<0.735.0> shape_handle=74391704-1744646406349 [debug] Opening snapshot chunk 0 for writing
18:10:29.514 pid=<0.697.0> [debug] Primary Keepalive: wal_end=9712335958076722936 (86C92618/806A8EF8) reply=1
18:10:59.514 pid=<0.697.0> [debug] Primary Keepalive: wal_end=9712335959381264616 (86C92618/CE2C48E8) reply=1
18:11:29.628 pid=<0.697.0> [debug] Primary Keepalive: wal_end=9712335960680346648 (86C92619/1B9AB418) reply=1
18:11:30.231 pid=<0.697.0> [error] :gen_statem {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Postgres.ReplicationClient, nil}} terminating
** (Postgrex.Error) ERROR 53100 (disk_full) could not write to data file for XID 753: No space left on device
(stdlib 6.2) gen_statem.erl:3864: :gen_statem.loop_state_callback_result/11
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Process Label: :replication_client
Queue: [info: {:tcp, #Port<0.13>, <<69, 0, 0, 0, 146, 83, 69, 82, 82, 79, 82, 0, 86, 69, 82, 82, 79, 82, 0, 67, 53, 51, 49, 48, 48, 0, 77, 99, 111, 117, 108, 100, 32, 110, 111, 116, 32, 119, 114, 105, 116, 101, 32, 116, 111, 32, ...>>}]
Postponed: []
State: {:no_state, %Postgrex.ReplicationConnection{protocol: %Postgrex.Protocol{sock: {:gen_tcp, #Port<0.13>}, connection_id: 131, connection_key: -1960164997, peer: {{127, 0, 0, 1}, 54321}, types: {Postgrex.DefaultTypes, #Reference<0.3807092993.2077097985.188528>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.3807092993.2076966913.188638>, queries: #Reference<0.3807092993.2077097985.188635>, postgres: :idle, transactions: :naive, buffer: "", disconnect_on_error_codes: [], scram: %{auth_message: "n=,r=plpCbHjyaZg0VM2Lenf+4vVd,r=plpCbHjyaZg0VM2Lenf+4vVd/hpi3iyEMnaQL4VEmIa6m0Et,s=gdcurhnueiDTw3IpnKHduA==,i=4096,c=biws,r=plpCbHjyaZg0VM2Lenf+4vVd/hpi3iyEMnaQL4VEmIa6m0Et", iterations: 4096, salt: <<129, 215, 46, 174, 25, 238, 122, 32, 211, 195, 114, 41, 156, 161, 221, 184>>}, disable_composite_types: false, messages: []}, state: {Electric.Postgres.ReplicationClient, %Electric.Postgres.ReplicationClient.State{stack_id: "single_stack", connection_manager: #PID<0.386.0>, transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [{:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.ShapeLogCollector, nil}}}]}, relation_received: {Electric.Replication.ShapeLogCollector, :handle_relation_msg, [{:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.ShapeLogCollector, nil}}}]}, publication_name: "electric_publication_default", try_creating_publication?: true, start_streaming?: false, slot_name: "electric_slot_default", slot_temporary?: false, display_settings: [], origin: "postgres", txn_collector: %Electric.Postgres.ReplicationClient.Collector{transaction: nil, tx_op_index: nil, relations: %{}}, step: :streaming, applied_wal: 9712335960680346648}}, auto_reconnect: false, reconnect_backoff: 500, streaming: 500}}
Callback mode: :handle_event_function, state_enter: false
18:11:30.259 pid=<0.386.0> [debug] Handling the exit of the replication client #PID<0.697.0> with reason %Postgrex.Error{message: nil, postgres: %{code: :disk_full, line: "3997", message: "could not write to data file for XID 753: No space left on device", file: "reorderbuffer.c", unknown: "ERROR", severity: "ERROR", pg_code: "53100", routine: "ReorderBufferSerializeChange"}, connection_id: 131, query: nil}
18:11:30.259 pid=<0.386.0> [warning] Reconnecting in 2000ms
18:11:32.260 pid=<0.386.0> [debug] Starting replication client for stack single_stack
18:11:32.273 pid=<0.736.0> [debug] ReplicationClient step: pg_info_query
18:11:32.273 pid=<0.386.0> [info] Reconnection succeeded after 2014ms
18:11:32.275 pid=<0.736.0> [info] Postgres server version = 170001, system identifier = 7493198447926501411, timeline_id = 1
18:11:32.275 pid=<0.736.0> [debug] ReplicationClient step: create_publication_query
18:11:32.276 pid=<0.736.0> [debug] ReplicationClient step: create_slot
18:11:32.277 pid=<0.736.0> [debug] Found existing replication slot
18:11:32.277 pid=<0.736.0> [debug] ReplicationClient step: set_display_setting
18:11:32.277 pid=<0.736.0> [debug] ReplicationClient step: set_display_setting
18:11:32.278 pid=<0.736.0> [debug] ReplicationClient step: set_display_setting
18:11:32.291 pid=<0.736.0> [debug] ReplicationClient step: set_display_setting
18:11:32.292 pid=<0.736.0> [debug] ReplicationClient step: set_display_setting
18:11:32.293 pid=<0.736.0> [debug] ReplicationClient step: start_streaming
18:11:32.293 pid=<0.736.0> [debug] ReplicationClient step: start_replication_slot
18:11:32.293 pid=<0.736.0> [info] Starting replication from postgres
18:11:33.325 pid=<0.736.0> [debug] Primary Keepalive: wal_end=9712335960680346649 (86C92619/1B9AB419) reply=0
18:12:03.324 pid=<0.736.0> [debug] Primary Keepalive: wal_end=9712335958046330696 (86C92618/7E9ACF48) reply=1
18:12:33.325 pid=<0.736.0> [debug] Primary Keepalive: wal_end=9712335959345128320 (86C92618/CC04E380) reply=1
18:13:03.512 pid=<0.736.0> [debug] Primary Keepalive: wal_end=9712335960648904552 (86C92619/19BAEF68) reply=1
18:13:04.828 pid=<0.736.0> [error] :gen_statem {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Postgres.ReplicationClient, nil}} terminating
** (Postgrex.Error) ERROR 53100 (disk_full) could not write to data file for XID 753: No space left on device
(stdlib 6.2) gen_statem.erl:3864: :gen_statem.loop_state_callback_result/11
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Process Label: :replication_client
Queue: [info: {:tcp, #Port<0.34>, <<69, 0, 0, 0, 146, 83, 69, 82, 82, 79, 82, 0, 86, 69, 82, 82, 79, 82, 0, 67, 53, 51, 49, 48, 48, 0, 77, 99, 111, 117, 108, 100, 32, 110, 111, 116, 32, 119, 114, 105, 116, 101, 32, 116, 111, 32, ...>>}]
Postponed: []
State: {:no_state, %Postgrex.ReplicationConnection{protocol: %Postgrex.Protocol{sock: {:gen_tcp, #Port<0.34>}, connection_id: 157, connection_key: -783689893, peer: {{127, 0, 0, 1}, 54321}, types: {Postgrex.DefaultTypes, #Reference<0.3807092993.2077097985.188528>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.3807092993.2076966925.185296>, queries: #Reference<0.3807092993.2077097997.185293>, postgres: :idle, transactions: :naive, buffer: "", disconnect_on_error_codes: [], scram: %{auth_message: "n=,r=JEJQ9Z248C6kVt9muLqEA/QY,r=JEJQ9Z248C6kVt9muLqEA/QYRQLBQYGs4ZGCOm5Bl/aB9Otp,s=gdcurhnueiDTw3IpnKHduA==,i=4096,c=biws,r=JEJQ9Z248C6kVt9muLqEA/QYRQLBQYGs4ZGCOm5Bl/aB9Otp", iterations: 4096, salt: <<129, 215, 46, 174, 25, 238, 122, 32, 211, 195, 114, 41, 156, 161, 221, 184>>}, disable_composite_types: false, messages: []}, state: {Electric.Postgres.ReplicationClient, %Electric.Postgres.ReplicationClient.State{stack_id: "single_stack", connection_manager: #PID<0.386.0>, transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [{:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.ShapeLogCollector, nil}}}]}, relation_received: {Electric.Replication.ShapeLogCollector, :handle_relation_msg, [{:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.ShapeLogCollector, nil}}}]}, publication_name: "electric_publication_default", try_creating_publication?: true, start_streaming?: false, slot_name: "electric_slot_default", slot_temporary?: false, display_settings: [], origin: "postgres", txn_collector: %Electric.Postgres.ReplicationClient.Collector{transaction: nil, tx_op_index: nil, relations: %{}}, step: :streaming, applied_wal: 9712335960648904552}}, auto_reconnect: false, reconnect_backoff: 500, streaming: 500}}
Callback mode: :handle_event_function, state_enter: false
18:13:04.830 pid=<0.386.0> [debug] Handling the exit of the replication client #PID<0.736.0> with reason %Postgrex.Error{message: nil, postgres: %{code: :disk_full, line: "3997", message: "could not write to data file for XID 753: No space left on device", file: "reorderbuffer.c", unknown: "ERROR", severity: "ERROR", pg_code: "53100", routine: "ReorderBufferSerializeChange"}, connection_id: 157, query: nil}
18:13:04.830 pid=<0.386.0> [warning] Reconnecting in 2000ms
An observation
I was surprised to see that while the INSERT statement was executing, in addition to the Postgres backend OS process consuming 100% CPU, the walsender process was also consuming 100% CPU at the same time, periodically spamming Electric with Primary keepalive messages. Presumably the walsender process starts preparing the transaction for streaming even though it hasn't committed yet.
htop displays the walsender command as
postgres: walsender postgres electric 192.168.80.1(43510) START_REPLICATION
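The ReorderBufferSerializeChange routine in the disk_full errors above fits this: logical decoding buffers each in-progress transaction and, once it grows past logical_decoding_work_mem, serializes the pending changes to disk under pg_replslot/, even for a transaction that never commits. If that's what is eating the disk, the spill counters for Electric's slot should be non-zero (PostgreSQL 14+):

SELECT slot_name, spill_txns, spill_count,
       pg_size_pretty(spill_bytes) AS spilled
FROM pg_stat_replication_slots;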
Another oddity is that long after psql has given up and printed the "Check free disk space." hint, the walsender process keeps using 100% CPU, and at some point during that post-insert processing Electric logs the out-of-disk-space error it gets from Postgrex.
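If serializing the uncommitted transaction is indeed the mechanism, raising logical_decoding_work_mem (default 64MB) would only postpone the spilling for an insert this large rather than prevent it, but it seems like the relevant knob:

SHOW logical_decoding_work_mem;
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();  -- reload config; takes effect for subsequent decoding sessions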