Commit 671f3e1

2 parents 61c71a4 + ecac018 commit 671f3e1

69 files changed: 270 additions and 9423 deletions.

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ os:
 compiler:
   - gcc
   - clang
-install: cpanm IPC::Run DBD::Pg
+install: cpanm IPC::Run DBD::Pg Proc::ProcessTable
 before_script: ./configure --enable-tap-tests && make -j4
 env:
   #- TESTDIR=.

README.md

Lines changed: 23 additions & 46 deletions
@@ -1,65 +1,42 @@
-# Postgres_cluster
+# postgres_cluster
 
 [![Build Status](https://travis-ci.org/postgrespro/postgres_cluster.svg?branch=master)](https://travis-ci.org/postgrespro/postgres_cluster)
 
 Various experiments with PostgreSQL clustering perfomed at PostgresPro.
 
-This is mirror of postgres repo with several changes to the core and few extra extensions.
+This is a mirror of postgres repo with several changes to the core and a few extra extensions.
 
 ## Core changes:
 
-* Transaction manager interface (eXtensible Transaction Manager, xtm). Generic interface to plug distributed transaction engines. More info at [[https://wiki.postgresql.org/wiki/DTM]] and [[http://www.postgresql.org/message-id/flat/F2766B97-555D-424F-B29F-E0CA0F6D1D74@postgrespro.ru]].
+* Transaction manager interface (eXtensible Transaction Manager, xtm). Generic interface to plug distributed transaction engines. More info on [postgres wiki](https://wiki.postgresql.org/wiki/DTM) and on [the email thread](http://www.postgresql.org/message-id/flat/F2766B97-555D-424F-B29F-E0CA0F6D1D74@postgrespro.ru).
 * Distributed deadlock detection API.
-* Logical decoding of two-phase transactions.
-
+* Logical decoding of transactions.
 
 ## New extensions:
 
-* pg_tsdtm. Coordinator-less transaction management by tracking commit timestamps.
-* multimaster. Synchronous multi-master replication based on logical_decoding and pg_dtm.
-
-
-## Changed extension:
-
-* postgres_fdw. Added support of pg_tsdtm.
-
-## Installing multimaster
-
-1. Build and install postgres from this repo on all machines in cluster.
-1. Install contrib/raftable and contrib/mmts extensions.
-1. Right now we need clean postgres installation to spin up multimaster cluster.
-1. Create required database inside postgres before enabling multimaster extension.
-1. We are requiring following postgres configuration:
-    * 'max_prepared_transactions' > 0 -- in multimaster all writing transaction along with ddl are wrapped as two-phase transaction, so this number will limit maximum number of writing transactions in this cluster node.
-    * 'synchronous_commit - off' -- right now we do not support async commit. (one can enable it, but that will not bring desired effect)
-    * 'wal_level = logical' -- multimaster built on top of logical replication so this is mandatory.
-    * 'max_wal_senders' -- this should be at least number of nodes - 1
-    * 'max_replication_slots' -- this should be at least number of nodes - 1
-    * 'max_worker_processes' -- at least 2*N + 1 + P, where N is number of nodes in cluster, P size of pool of workers(see below) (1 raftable, n-1 receiver, n-1 sender, mtm-sender, mtm-receiver, + number of pool worker).
-    * 'default_transaction_isolation = 'repeatable read'' -- multimaster isn't supporting default read commited level.
-1. Multimaster have following configuration parameters:
-    * 'multimaster.conn_strings' -- connstrings for all nodes in cluster, separated by comma.
-    * 'multimaster.node_id' -- id of current node, number starting from one.
-    * 'multimaster.workers' -- number of workers that can apply transactions from neighbouring nodes.
-    * 'multimaster.use_raftable = true' -- just set this to true. Deprecated.
-    * 'multimaster.queue_size = 52857600' -- queue size for applying transactions from neighbouring nodes.
-    * 'multimaster.ignore_tables_without_pk = 1' -- do not replicate tables without primary key
-    * 'multimaster.heartbeat_send_timeout = 250' -- heartbeat period (ms).
-    * 'multimaster.heartbeat_recv_timeout = 1000' -- disconnect node if we miss heartbeats all that time (ms).
-    * 'multimaster.twopc_min_timeout = 40000' -- rollback stalled transaction after this period (ms).
-    * 'raftable.id' -- id of current node, number starting from one.
-    * 'raftable.peers' -- id of current node, number starting from one.
-1. Allow replication in pg_hba.conf.
-
-## Multimaster status functions
+The following table describes the features and the way they are implemented in our four main extensions:
 
-* mtm.get_nodes_state() -- show status of nodes on cluster
-* mtm.get_cluster_state() -- show whole cluster status
-* mtm.get_cluster_info() -- print some debug info
-* mtm.make_table_local(relation regclass) -- stop replication for a given table
+| |commit timestamps |snapshot sharing |
+|---------------------------:|:----------------------------:|:----------------------------------:|
+|**distributed transactions**|[`pg_tsdtm`](contrib/pg_tsdtm)|[`pg_dtm`](contrib/pg_dtm) |
+|**multimaster replication** |[`mmts`](contrib/mmts) |[`multimaster`](contrib/multimaster)|
 
+### [`mmts`](contrib/mmts)
+An implementation of synchronous **multi-master replication** based on **commit timestamps**.
 
+### [`multimaster`](contrib/multimaster)
+An implementation of synchronous **multi-master replication** based on **snapshot sharing**.
 
+### [`pg_dtm`](contrib/pg_dtm)
+An implementation of **distributed transaction** management based on **snapshot sharing**.
 
+### [`pg_tsdtm`](contrib/pg_tsdtm)
+An implementation of **distributed transaction** management based on **commit timestamps**.
 
+### [`arbiter`](contrib/arbiter)
+A distributed transaction management daemon.
+Used by `pg_dtm` and `multimaster`.
 
+### [`raftable`](contrib/raftable)
+A key-value table replicated over Raft protocol.
+Used by `mmts`.

contrib/mmts/Cluster.pm

Lines changed: 65 additions & 9 deletions
@@ -3,6 +3,7 @@ package Cluster;
 use strict;
 use warnings;
 
+use Proc::ProcessTable;
 use PostgresNode;
 use TestLib;
 use Test::More;
@@ -103,6 +104,7 @@ sub configure
         multimaster.use_raftable = true
         multimaster.heartbeat_recv_timeout = 1000
         multimaster.heartbeat_send_timeout = 250
+        multimaster.max_nodes = 3
         multimaster.ignore_tables_without_pk = true
         multimaster.twopc_min_timeout = 2000
     ));
@@ -129,13 +131,21 @@ sub start
 sub stopnode
 {
     my ($node, $mode) = @_;
-    my $port = $node->port;
-    my $pgdata = $node->data_dir;
-    my $name = $node->name;
+    return 1 unless defined $node->{_pid};
     $mode = 'fast' unless defined $mode;
-    diag("stopping node $name ${mode}ly at $pgdata port $port");
-    next unless defined $node->{_pid};
+    my $name = $node->name;
+    diag("stopping $name ${mode}ly");
+
+    if ($mode eq 'kill') {
+        killtree($node->{_pid});
+        return 1;
+    }
+
+    my $pgdata = $node->data_dir;
     my $ret = TestLib::system_log('pg_ctl', '-D', $pgdata, '-m', 'fast', 'stop');
+    my $pidfile = $node->data_dir . "/postmaster.pid";
+    diag("unlink $pidfile");
+    unlink $pidfile;
     $node->{_pid} = undef;
     $node->_update_pid;
 
@@ -147,6 +157,51 @@ sub stopnode
     return 1;
 }
 
+sub stopid
+{
+    my ($self, $idx, $mode) = @_;
+    return stopnode($self->{nodes}->[$idx]);
+}
+
+sub killtree
+{
+    my $root = shift;
+    diag("killtree $root\n");
+
+    my $t = new Proc::ProcessTable;
+
+    my %parent = ();
+    #my %cmd = ();
+    foreach my $p (@{$t->table}) {
+        $parent{$p->pid} = $p->ppid;
+        # $cmd{$p->pid} = $p->cmndline;
+    }
+
+    if (!defined $root) {
+        return;
+    }
+    my @queue = ($root);
+    my @killist = ();
+
+    while (scalar @queue) {
+        my $victim = shift @queue;
+        while (my ($pid, $ppid) = each %parent) {
+            if ($ppid == $victim) {
+                push @queue, $pid;
+            }
+        }
+        diag("SIGSTOP to $victim");
+        kill 'STOP', $victim;
+        unshift @killist, $victim;
+    }
+
+    diag("SIGKILL to " . join(' ', @killist));
+    kill 'KILL', @killist;
+    #foreach my $victim (@killist) {
+    # print("kill $victim " . $cmd{$victim} . "\n");
+    #}
+}
+
 sub stop
 {
     my ($self, $mode) = @_;
@@ -155,12 +210,13 @@ sub stop
 
     my $ok = 1;
     diag("stopping cluster ${mode}ly");
-    foreach my $node (@$nodes)
-    {
+
+    foreach my $node (@$nodes) {
         if (!stopnode($node, $mode)) {
             $ok = 0;
-            if (!stopnode($node, 'immediate')) {
-                BAIL_OUT("failed to stop $node immediately");
+            if (!stopnode($node, 'kill')) {
+                my $name = $node->name;
+                BAIL_OUT("failed to kill $name");
             }
         }
     }
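
The `killtree` helper added above walks the process table breadth-first from the root pid, sends `SIGSTOP` to each process as it is dequeued, and prepends it to the kill list, so the final `SIGKILL` list is in reverse discovery order. Below is a minimal sketch of just that traversal in C, run against an in-memory table rather than real processes. The names `proc_entry` and `collect_tree` are hypothetical and do not appear in the commit, and no signals are sent.

```c
#include <assert.h>
#include <stddef.h>

/* One row of a process table snapshot: a pid and its parent's pid. */
struct proc_entry {
    int pid;
    int ppid;
};

/*
 * Breadth-first walk from root over table[0..n). Writes root and its
 * descendants to out in reverse discovery order (root last) and returns
 * how many pids were written. At most out_cap pids are collected.
 */
static size_t collect_tree(const struct proc_entry *table, size_t n,
                           int root, int *out, size_t out_cap)
{
    size_t head = 0, tail = 0;

    assert(out != NULL);
    assert(table != NULL || n == 0);
    if (out_cap == 0)
        return 0;

    /* out doubles as the BFS queue: [head, tail) is still to be visited. */
    out[tail++] = root;
    while (head < tail) {
        int victim = out[head++];
        for (size_t i = 0; i < n; i++) {
            /* Skip self-parented rows so they are not queued again. */
            if (table[i].ppid == victim && table[i].pid != victim &&
                tail < out_cap)
                out[tail++] = table[i].pid;
        }
    }

    /* Reverse in place, mirroring the effect of Perl's unshift. */
    for (size_t i = 0, j = tail - 1; i < j; i++, j--) {
        int tmp = out[i];
        out[i] = out[j];
        out[j] = tmp;
    }
    return tail;
}
```

In the Perl version the stop signal goes out in discovery order, parents first, presumably so that a stopped parent cannot spawn new children while the list is still being built.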

contrib/mmts/README.md

Lines changed: 52 additions & 50 deletions
@@ -1,53 +1,55 @@
-# Postgres Multimaster
+# `mmts`
+
+An implementation of synchronous **multi-master replication** based on **commit timestamps**.
+
+## Usage
+
+1. Install `contrib/raftable` and `contrib/mmts` on each instance.
+1. Add these required options to the `postgresql.conf` of each instance in the cluster.
+
+```sh
+max_prepared_transactions = 200 # should be > 0, because all
+# transactions are implicitly two-phase
+max_connections = 200
+max_worker_processes = 100 # at least (2 * n + p + 1)
+# this figure is calculated as:
+# 1 raftable worker
+# n-1 receiver
+# n-1 sender
+# 1 mtm-sender
+# 1 mtm-receiver
+# p workers in the pool
+max_parallel_degree = 0
+wal_level = logical # multimaster is build on top of
+# logical replication and will not work otherwise
+max_wal_senders = 10 # at least the number of nodes
+wal_sender_timeout = 0
+default_transaction_isolation = 'repeatable read'
+max_replication_slots = 10 # at least the number of nodes
+shared_preload_libraries = 'raftable,multimaster'
+multimaster.workers = 10
+multimaster.queue_size = 10485760 # 10mb
+multimaster.node_id = 1 # the 1-based index of the node in the cluster
+multimaster.conn_strings = 'dbname=... host=....0.0.1 port=... raftport=..., ...'
+# comma-separated list of connection strings
+multimaster.use_raftable = true
+multimaster.heartbeat_recv_timeout = 1000
+multimaster.heartbeat_send_timeout = 250
+multimaster.ignore_tables_without_pk = true
+multimaster.twopc_min_timeout = 2000
+```
+1. Allow replication in `pg_hba.conf`.
+
+## Status functions
+
+`create extension mmts;` to gain access to these functions:
+
+* `mtm.get_nodes_state()` -- show status of nodes on cluster
+* `mtm.get_cluster_state()` -- show whole cluster status
+* `mtm.get_cluster_info()` -- print some debug info
+* `mtm.make_table_local(relation regclass)` -- stop replication for a given table
 
 ## Testing
 
-The testing process involves multiple modules that perform different tasks. The
-modules and their APIs are listed below.
-
-### Modules
-
-#### `combineaux`
-
-Governs the whole testing process. Runs different workloads during different
-troubles.
-
-#### `stresseaux`
-
-Puts workloads against the database. Writes logs that are later used by
-`valideaux`.
-
-* `start(id, workload, cluster)` - starts a `workload` against the `cluster`
-and call it `id`.
-* `stop(id)` - stops a previously started workload called `id`.
-
-#### `starteaux`
-
-Manages the database nodes.
-
-* `deploy(driver, ...)` - deploys a cluster using the specified `driver` and
-other parameters specific to that driver. Returns a `cluster` instance that is
-used in other methods.
-* `cluster->up(id)` - adds a node named `id` to the `cluster`.
-* `cluster->down(id)` - removes a node named `id` from the `cluster`.
-* `cluster->drop(src, dst, ratio)` - drop `ratio` packets flowing from node
-`src` to node `dst`.
-* `cluster->delay(src, dst, msec)` - delay packets flowing from node `src` to
-node `dst` by `msec` milliseconds.
-
-#### `troubleaux`
-
-This is the troublemaker that messes with the network, nodes and time.
-
-* `cause(cluster, trouble, ...)` - causes the specified `trouble` in the
-specified `cluster` with some trouble-specific parameters.
-* `fix(cluster)` - fixes all troubles caused in the `cluster`.
-
-#### `valideaux`
-
-Validates the logs of stresseaux.
-
-#### `reporteaux`
-
-Generates reports on the test results. This is usually a table that with
-`trouble` vs `workload` axes.
+* `make -C contrib/mmts check` to run TAP-tests.
+* `make -C contrib/mmts xcheck` to run blockade tests. The blockade tests require `docker`, `blockade`, and some other packages installed, see [requirements.txt](tests2/requirements.txt) for the list. You might also want to gain superuser privileges to run these tests successfully.
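
The `max_worker_processes` comment in the new `postgresql.conf` sample gives the lower bound as `2 * n + p + 1` and lists the workers behind it. Below is a small sketch in C that adds up the same breakdown. It assumes `n` is the number of nodes in the cluster and `p` is the value of `multimaster.workers`. The function name `mtm_min_worker_processes` is hypothetical and is not part of the extension.

```c
#include <assert.h>

/*
 * Lower bound for max_worker_processes, summed from the breakdown in the
 * README comment. Equals 2 * n_nodes + pool_workers + 1.
 */
static int mtm_min_worker_processes(int n_nodes, int pool_workers)
{
    assert(n_nodes >= 1);
    assert(pool_workers >= 0);

    int raftable     = 1;            /* 1 raftable worker     */
    int receivers    = n_nodes - 1;  /* n-1 receiver          */
    int senders      = n_nodes - 1;  /* n-1 sender            */
    int mtm_sender   = 1;            /* 1 mtm-sender          */
    int mtm_receiver = 1;            /* 1 mtm-receiver        */

    return raftable + receivers + senders + mtm_sender + mtm_receiver
           + pool_workers;           /* p workers in the pool */
}
```

For a three-node cluster with `multimaster.workers = 10` this gives 17, so the sample value `max_worker_processes = 100` leaves ample headroom.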

contrib/mmts/arbiter.c

Lines changed: 2 additions & 2 deletions
@@ -340,8 +340,8 @@ static void MtmScheduleHeartbeat()
     if (!stop) {
         enable_timeout_after(heartbeat_timer, MtmHeartbeatSendTimeout);
         send_heartbeat = true;
-        PGSemaphoreUnlock(&Mtm->votingSemaphore);
     }
+    PGSemaphoreUnlock(&Mtm->votingSemaphore);
 }
 
 static void MtmSendHeartbeat()
@@ -377,7 +377,7 @@ static void MtmSendHeartbeat()
 
 void MtmCheckHeartbeat()
 {
-    if (send_heartbeat) {
+    if (send_heartbeat && !stop) {
         send_heartbeat = false;
         enable_timeout_after(heartbeat_timer, MtmHeartbeatSendTimeout);
         MtmSendHeartbeat();
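
The two `arbiter.c` hunks change what happens once `stop` is set: `MtmScheduleHeartbeat` now calls `PGSemaphoreUnlock` on `votingSemaphore` unconditionally, and `MtmCheckHeartbeat` no longer re-arms the timer or sends a heartbeat. Below is a simplified model of the post-commit control flow in C, with the timer, the semaphore and the network send replaced by counters. The `hb_*` names are hypothetical, and nothing here calls PostgreSQL APIs.

```c
#include <stdbool.h>

/* Counters standing in for the timer, semaphore and network send. */
struct hb_state {
    bool stop;
    bool send_heartbeat;
    int  timer_arms;
    int  sem_unlocks;
    int  heartbeats_sent;
};

/* Models MtmScheduleHeartbeat after the change. */
static void hb_schedule(struct hb_state *s)
{
    if (!s->stop) {
        s->timer_arms++;
        s->send_heartbeat = true;
    }
    s->sem_unlocks++;  /* now outside the if: released even when stopping */
}

/* Models MtmCheckHeartbeat after the change. */
static void hb_check(struct hb_state *s)
{
    if (s->send_heartbeat && !s->stop) {
        s->send_heartbeat = false;
        s->timer_arms++;
        s->heartbeats_sent++;
    }
}
```

In this model, a heartbeat left pending when `stop` becomes true is never sent and the timer is not re-armed, while a later `hb_schedule` call still releases the semaphore.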
