Conflict-free Replicated
Data Types (CRDTs)
for collaborative environments
Marc Shapiro, INRIA & LIP6
Nuno Preguiça, U. Nova de Lisboa
Carlos Baquero, U. Minho
Marek Zawirski, INRIA & UPMC
Conflict-free objects for largescale distribution
•Large, dynamic graph
•Incremental, parallel,
asynchronous:
- updates
- processing
Conflict-free Replicated Data Types
Shared Mutable data
• Read replicate
• Updates?
Novel, principled approach:
Conflict-free objects
Can we design useful object types
without any synchronisation
whatsoever?
Can we build practical systems
from such objects?
2
Conflict-free objects for largescale distribution
•Large, dynamic graph
•Incremental, parallel,
asynchronous:
- updates
- processing
Conflict-free Replicated Data Types
Shared Mutable data
• Read replicate
• Updates?
Novel, principled approach:
Conflict-free objects
Can we design useful object types
without any synchronisation
whatsoever?
Can we build practical systems
from such objects?
2
Conflict-free objects for largescale distribution
•Large, dynamic graph
•Incremental, parallel,
asynchronous:
- updates
- processing
Conflict-free Replicated Data Types
Shared Mutable data
• Read replicate
• Updates?
Novel, principled approach:
Conflict-free objects
Can we design useful object types
without any synchronisation
whatsoever?
Can we build practical systems
from such objects?
2
Conflict-free objects for largescale distribution
•Large, dynamic graph
•Incremental, parallel,
asynchronous:
- updates
- processing
Conflict-free Replicated Data Types
Shared Mutable data
• Read replicate
• Updates?
Novel, principled approach:
Conflict-free objects
Can we design useful object types
without any synchronisation
whatsoever?
Can we build practical systems
from such objects?
2
Conflict-free objects for largescale distribution
•Large, dynamic graph
•Incremental, parallel,
asynchronous:
- updates
- processing
Conflict-free Replicated Data Types
Shared Mutable data
• Read replicate
• Updates?
Novel, principled approach:
Conflict-free objects
Can we design useful object types
without any synchronisation
whatsoever?
Can we build practical systems
from such objects?
2
Replication for beginners
Replicated data
Share data Replicate at many locations
• Performance: local reads
• Availability: immune from network failure
• Fault-tolerance: replicate computation
• Scalability: load balancing
•Fault tolerance
Updates
•and parallelism too?
• Push to all replicas
•Conflict!!
• Conflicts: Consistency?
Conflict-free Replicated Data Types
4
Strong consistency
Preclude conflicts
• All replicas execute updates
in same total order
• Any deterministic object
•Simultaneous Nway agreement
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
Conflict-free Replicated Data Types
•Very general
•Correct
•Doesn't scale
5
Strong consistency
1
Preclude conflicts
• All replicas execute updates
in same total order
• Any deterministic object
•Simultaneous Nway agreement
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
Conflict-free Replicated Data Types
•Very general
•Correct
•Doesn't scale
5
Strong consistency
2
1
Preclude conflicts
• All replicas execute updates
in same total order
• Any deterministic object
•Simultaneous Nway agreement
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
Conflict-free Replicated Data Types
•Very general
•Correct
•Doesn't scale
5
Strong consistency
3
2
1
Preclude conflicts
• All replicas execute updates
in same total order
• Any deterministic object
•Simultaneous Nway agreement
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
Conflict-free Replicated Data Types
•Very general
•Correct
•Doesn't scale
5
Strong consistency
4
3
2
1
Preclude conflicts
• All replicas execute updates
in same total order
• Any deterministic object
•Simultaneous Nway agreement
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
Conflict-free Replicated Data Types
•Very general
•Correct
•Doesn't scale
5
Strong consistency
5
4
3
2
1
Preclude conflicts
• All replicas execute updates
in same total order
• Any deterministic object
•Simultaneous Nway agreement
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
Conflict-free Replicated Data Types
•Very general
•Correct
•Doesn't scale
5
Strong consistency
6
5
4
3
2
1
Preclude conflicts
• All replicas execute updates
in same total order
• Any deterministic object
•Simultaneous Nway agreement
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
Conflict-free Replicated Data Types
•Very general
•Correct
•Doesn't scale
5
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Conflict!
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
delivery
• Eventual, reliable•Availability
++
•Parallelism++
On conflict
•Latency -• Arbitrate
•Complexity ++
• Roll back
Consensus moved to background
Conflict-free Replicated Data Types
6
Strong Eventual Consistency
•Available, responsive
•More parallelism
•No conflicts
•No rollback
Update local + propagate
• No synch
• Expose intermediate state
• Eventual, reliable delivery
No conflict
• Deterministic outcome of
concurrent updates
No consensus: ≤ n-1 faults
Not universal
Solves the CAP problem
Conflict-free Replicated Data Types
7
Strong Eventual Consistency
•Available, responsive
•More parallelism
•No conflicts
•No rollback
Update local + propagate
• No synch
• Expose intermediate state
• Eventual, reliable delivery
No conflict
• Deterministic outcome of
concurrent updates
No consensus: ≤ n-1 faults
Not universal
Solves the CAP problem
Conflict-free Replicated Data Types
7
Strong Eventual Consistency
•Available, responsive
•More parallelism
•No conflicts
•No rollback
Update local + propagate
• No synch
• Expose intermediate state
• Eventual, reliable delivery
No conflict
• Deterministic outcome of
concurrent updates
No consensus: ≤ n-1 faults
Not universal
Solves the CAP problem
Conflict-free Replicated Data Types
7
Strong Eventual Consistency
•Available, responsive
•More parallelism
•No conflicts
•No rollback
Update local + propagate
• No synch
• Expose intermediate state
• Eventual, reliable delivery
No conflict
• Deterministic outcome of
concurrent updates
No consensus: ≤ n-1 faults
Not universal
Solves the CAP problem
Conflict-free Replicated Data Types
7
Strong Eventual Consistency
•Available, responsive
•More parallelism
•No conflicts
•No rollback
Update local + propagate
• No synch
• Expose intermediate state
• Eventual, reliable delivery
No conflict
• Deterministic outcome of
concurrent updates
No consensus: ≤ n-1 faults
Not universal
Solves the CAP problem
Conflict-free Replicated Data Types
7
The challenge:
What interesting objects can
we design with no
synchronisation whatsoever?
Portfolio of CRDTs
Register
Counter
• Last-Writer Wins
• Unlimited
• Multi-Value
• Non-negative
Set
Graphs
• Grow-Only
• Directed
• 2P
• Monotonic DAG
• Edit graph
• Observed-Remove
Map
Sequence
• Edit sequence
• Set of Registers
Conflict-free Replicated Data Types
9
Set design alternatives
•linearisable: sequential
order
•equivalent to real-time
order
•Requires consensus
Sequential specification:
• {true} add(e) {e ∈ S}
• {true} remove(e) {e ∉ S}
{true} add(e) || remove(e) {????}
• linearisable?
• add wins?
• remove wins?
• last writer wins?
• error state?
Conflict-free Replicated Data Types
10
Observed-Remove Set
s
s1
{}
{}
s2
s3
{}
Conflict-free Replicated Data Types
•Can never remove
more tokens than
exist
•Op order removed
tokens have been
previously added
11
Observed-Remove Set
s
s1
{} add(a)
S
{}
s2
s3
{}
Conflict-free Replicated Data Types
{aα}
•Can never remove
more tokens than
exist
•Op order removed
tokens have been
previously added
11
Observed-Remove Set
s
s1
s2
s3
{} add(a)
S
{aα}
{} add(a)
S
{aβ}
{}
Conflict-free Replicated Data Types
•Can never remove
more tokens than
exist
•Op order removed
tokens have been
previously added
11
Observed-Remove Set
s
s1
s2
s3
{} add(a)
S
{aα}
{} add(a)
S
{aβ}
{}
Conflict-free Replicated Data Types
{aβ}
M
{aβ}
•Can never remove
more tokens than
exist
•Op order removed
tokens have been
previously added
11
Observed-Remove Set
s
s1
s2
s3
{} add(a)
S
{aα}
{} add(a)
S
{aβ}
{}
Conflict-free Replicated Data Types
{aα}
{aβ}
M
{aβ}
M
{aβ, aα}
•Can never remove
more tokens than
exist
•Op order removed
tokens have been
previously added
11
Observed-Remove Set
s
s1
s2
s3
rmv (a)
{} add(a)
S
{aα}
{} add(a)
S
{aβ}
{}
Conflict-free Replicated Data Types
S
{aα}
{aα}
{aβ}
M
{aβ}
M
{aβ, aα}
•Can never remove
more tokens than
exist
•Op order removed
tokens have been
previously added
11
Observed-Remove Set
s
s1
s2
s3
rmv (a)
{} add(a)
S
{aα}
{} add(a)
S
{aβ}
{}
Conflict-free Replicated Data Types
S
{aα}
{aα}
{aβ}
M
{aβ}
M
{aα}
{aβ, aα}
M
•Can never remove
more tokens than
exist
•Op order removed
{aβ, aα}tokens have been
previously added
11
Observed-Remove Set
s
s1
s2
s3
rmv (a)
{} add(a)
S
{aα}
{} add(a)
S
{aβ}
{}
Conflict-free Replicated Data Types
S
{aα}
{aα}
{aβ}
M
{aβ}
M
M
{aβ}
{aα}
{aβ, aα}
M
{aβ, aα}
•Can never remove
more tokens than
exist
•Op order removed
{aβ, aα}tokens have been
previously added
11
Observed-Remove Set
s
s1
s2
s3
rmv (a)
{} add(a)
S
{aα}
{} add(a)
S
{aβ}
{}
S
{aα}
{aα}
{aβ}
M
{aβ}
M
M
{aβ}
{aα}
{aβ, aα}
M
{aβ, aα}
•Can never remove
more tokens than
exist
•Op order removed
{aβ, aα}tokens have been
previously added
• Payload: added, removed (element, unique-token)
•
•
•
•
add(e) = A ≔ A ∪ {(e, α)}
Remove: all unique elements observed
remove(e) = R ≔ R ∪ { (e, –) ∈ A}
lookup(e) = ∃ (e, –) ∈ A \ R
merge (S, S') = (A ∪ A', R ∪ R')
{true} add(e) || remove(e) {e ∈ S}
Conflict-free Replicated Data Types
11
OR-Set
Set: solves Dynamo Shopping Cart anomaly
Optimisations
• Just mark tombstones
• Garbage-collect tombstones
• Operation-based approach
Conflict-free Replicated Data Types
12
Graph design alternatives
Graph = (V, E) where E ⊆ V ! V
Sequential specification:
• {v,v' ∈ V} addEdge(v,v') {…}
• {∄(v,v') ∈ E} removeVertex(v) {…}
Concurrent: removeVertex(v') || addEdge(v,v')
• linearisable?
•for our Web Search Engine
• addEdge wins?
application, removeVertex
wins
• removeVertex wins? •Do
not check precondition at
add/remove
• etc.
Conflict-free Replicated Data Types
13
Graph
Payload = OR-Set V, OR-Set E
Updates add/remove to V, E
• addVertex(v), removeVertex(v)
• addEdge(v,v'), removeEdge(v,v')
Do not enforce invariant a priori
• lookupEdge(v,v') = (v,v') ∈ E
∧ v ∈ V ∧ v' ∈ V
removeVertex(v') || addEdge(v,v')
• remove wins
"
Conflict-free Replicated Data Types
14
Graph
Payload = OR-Set V, OR-Set E
Updates add/remove to V, E
• addVertex(v), removeVertex(v)
• addEdge(v,v'), removeEdge(v,v')
Do not enforce invariant a priori
• lookupEdge(v,v') = (v,v') ∈ E
∧ v ∈ V ∧ v' ∈ V
removeVertex(v') || addEdge(v,v')
• remove wins
"
Conflict-free Replicated Data Types
14
Graph
Payload = OR-Set V, OR-Set E
Updates add/remove to V, E
• addVertex(v), removeVertex(v)
• addEdge(v,v'), removeEdge(v,v')
Do not enforce invariant a priori
• lookupEdge(v,v') = (v,v') ∈ E
∧ v ∈ V ∧ v' ∈ V
removeVertex(v') || addEdge(v,v')
• remove wins
"
Conflict-free Replicated Data Types
14
Graph
Payload = OR-Set V, OR-Set E
Updates add/remove to V, E
• addVertex(v), removeVertex(v)
• addEdge(v,v'), removeEdge(v,v')
Do not enforce invariant a priori
• lookupEdge(v,v') = (v,v') ∈ E
∧ v ∈ V ∧ v' ∈ V
removeVertex(v') || addEdge(v,v')
• remove wins
"
Conflict-free Replicated Data Types
14
Graph
Payload = OR-Set V, OR-Set E
Updates add/remove to V, E
• addVertex(v), removeVertex(v)
• addEdge(v,v'), removeEdge(v,v')
Do not enforce invariant a priori
• lookupEdge(v,v') = (v,v') ∈ E
∧ v ∈ V ∧ v' ∈ V
removeVertex(v') || addEdge(v,v')
• remove wins
"
Conflict-free Replicated Data Types
14
Co-operative editing
• Local
constraint
implies globally
acyclic
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
⊢
• add:
bet ween
alreadyordered
elements
• strong
order
takes
preceden
ce over
weak
• begin and
end
sentinels
•This spec is
implemented
directly by
WOOT
•Clean
⊣
• Surface view:
summarises
total order
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
• Local
constraint
implies globally
acyclic
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
⊢
• add:
bet ween
alreadyordered
elements
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• strong
order
takes
preceden
ce over
weak
• begin and
end
sentinels
•This spec is
implemented
directly by
WOOT
•Clean
⊣
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
• Local
constraint
implies globally
acyclic
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
⊢
• add:
bet ween
alreadyordered
elements
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
• Local
constraint
implies globally
acyclic
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
⊢
• add:
bet ween
alreadyordered
elements
N
β
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
• Local
constraint
implies globally
acyclic
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
R
γ
⊢
• add:
bet ween
alreadyordered
elements
N
β
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
R
γ
⊢
• add:
bet ween
alreadyordered
elements
N
β
I
α
• Local
constraint
implies globally
acyclic
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
R
γ
⊢
• add:
bet ween
alreadyordered
elements
N
β
I
α
• Local
constraint
implies globally
acyclic
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
R
γ
⊢
• add:
bet ween
alreadyordered
elements
N
β
I
α
• Local
constraint
implies globally
acyclic
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
R
γ
⊢
• add:
bet ween
alreadyordered
elements
N
β
I
α
• Local
constraint
implies globally
acyclic
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Co-operative editing
{x,z ∈ graphi ∧ x < z}
add-betweeni (x, y, z)
{y ∈ graphi ∧ x<y<z}
• Deep (internal)
view:
• DAG representing
insert order
R
γ
⊢
• add:
bet ween
alreadyordered
elements
N
β
I
α
• Local
constraint
implies globally
acyclic
I
δ
• ensure
consistent
ordering at all
replicas
Conflict-free Replicated Data Types
• begin and
end
sentinels
• strong
order
takes
preceden
ce over
weak
•This spec is
implemented
directly by
WOOT
•Clean
⊣
A
ε
• Surface view:
summarises
total order
I I NN R R I AA
⊢
⊢ αα ββ γ γδ εε ⊣
⊣
Continuum
⊢
I
N
0
100
⊣
Assign each element a unique real number
• position
Real numbers not appropriate
• approximate by tree
Conflict-free Replicated Data Types
16
Continuum
⊢
I
N
A
0
100
101
⊣
Assign each element a unique real number
• position
Real numbers not appropriate
• approximate by tree
Conflict-free Replicated Data Types
16
Continuum
⊢
I
N
I
A
0
100
100.5
101
⊣
Assign each element a unique real number
• position
Real numbers not appropriate
• approximate by tree
Conflict-free Replicated Data Types
16
Continuum
⊢
L
’
I
N
-1.01
-1.00
0
100
R
100.25
I
A
100.5
101
⊣
Assign each element a unique real number
• position
Real numbers not appropriate
• approximate by tree
Conflict-free Replicated Data Types
16
Treedoc binary tree
0
I
1
1
N
•Low arity:
binary,
quaternary
R
0
I
A
•Compact, lowarity tree
•In the following
slides, will
=L’INRI
Binary naming tree:
• compact, self-adjusting •logarithmic:
• Logarithmic properties assuming tree
add appends leaf non-destructive, IDs don’t change
remove: tombstone, IDs don't change
Conflict-free Replicated Data Types
17
Treedoc binary tree
0
0
L
•Low arity:
binary,
quaternary
I
R
1
1
N
0
I
A
•Compact, lowarity tree
•In the following
slides, will
=L’INRI
Binary naming tree:
• compact, self-adjusting •logarithmic:
• Logarithmic properties assuming tree
add appends leaf non-destructive, IDs don’t change
remove: tombstone, IDs don't change
Conflict-free Replicated Data Types
17
Treedoc binary tree
0
0
L
•Low arity:
binary,
quaternary
I
R
1
1
N
0
I
A
•Compact, lowarity tree
•In the following
slides, will
’
=L’INRI
Binary naming tree:
• compact, self-adjusting •logarithmic:
• Logarithmic properties assuming tree
add appends leaf non-destructive, IDs don’t change
remove: tombstone, IDs don't change
Conflict-free Replicated Data Types
17
Treedoc binary tree
0
0
L
•Low arity:
binary,
quaternary
I
R
1
1
N
0
I
A
•Compact, lowarity tree
•In the following
slides, will
’
=L’INRI
Binary naming tree:
• compact, self-adjusting •logarithmic:
• Logarithmic properties assuming tree
add appends leaf non-destructive, IDs don’t change
remove: tombstone, IDs don't change
Conflict-free Replicated Data Types
17
Treedoc binary tree
0
0
L
•Low arity:
binary,
quaternary
I
R
1
1
N
0
I
A
•Compact, lowarity tree
•In the following
slides, will
’
=L’INRI
Binary naming tree:
• compact, self-adjusting •logarithmic:
• Logarithmic properties assuming tree
add appends leaf non-destructive, IDs don’t change
remove: tombstone, IDs don't change
Conflict-free Replicated Data Types
17
Layered Treedoc
binary tree
Conflict-free Replicated Data Types
18
Layered Treedoc
sparse 864-ary tree
Site 34
Site 79
binary tree
Conflict-free Replicated Data Types
18
Layered Treedoc
sparse 864-ary tree
Site 34
Site 79
Site 22
binary tree
Conflict-free Replicated Data Types
18
Layered Treedoc
Site 34
sparse 864-ary tree
Site 34
Site 79
Site 22
binary tree
Conflict-free Replicated Data Types
18
Layered Treedoc
Site 34
sparse 864-ary tree
Site 34
Site 22
Site 79
Site 34
Site 66
Site 79
binary tree
Conflict-free Replicated Data Types
18
Layered Treedoc
Site 34
sparse 864-ary tree
Site 34
Site 22
Site 79
Site 34
Site 66
Site 79
binary tree
Edit: Binary tree
Concurrency: Sparse tree
Conflict-free Replicated Data Types
18
The theory
Two simple conditions
for correctness without
synchronisation
Query
client
s
s1
s2
s3
•Example: Amazon shopping cart is
replicated
•unspecified client, e.g., Web frontend
•One or more
•load-balancer, failures may direct
client to different replicas
Local at source replica
• Client's choice
Conflict-free Replicated Data Types
20
Query
client
s
s1.q(a)
s1
S
s2
s3
•Example: Amazon shopping cart is
replicated
•unspecified client, e.g., Web frontend
•One or more
•load-balancer, failures may direct
client to different replicas
Local at source replica
• Client's choice
Conflict-free Replicated Data Types
20
Query
client
s
s1.q(a)
s1
S
s2.q(b)
s2
S
s3
•Example: Amazon shopping cart is
replicated
•unspecified client, e.g., Web frontend
•One or more
•load-balancer, failures may direct
client to different replicas
Local at source replica
• Client's choice
Conflict-free Replicated Data Types
20
State-based replication
client
s
s1
s2
s3
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send si payload
• On delivery: merge payloads m
Conflict-free Replicated Data Types
•merge t wo valid
states
•produce valid state
•no historical info
available
•Inefficient if
payload is large
21
State-based replication
client
s
s1.u(a)
s1
S
s2
s3
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send si payload
• On delivery: merge payloads m
Conflict-free Replicated Data Types
•merge t wo valid
states
•produce valid state
•no historical info
available
•Inefficient if
payload is large
21
State-based replication
client
s
s1.u(a)
s1
S
s2.u(b)
S
s2
s3
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send si payload
• On delivery: merge payloads m
Conflict-free Replicated Data Types
•merge t wo valid
states
•produce valid state
•no historical info
available
•Inefficient if
payload is large
21
State-based replication
s
s1.u(a)
s1
S
s2.u(b)
S
s2
s3
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send si payload
• On delivery: merge payloads m
Conflict-free Replicated Data Types
•merge t wo valid
states
•produce valid state
•no historical info
available
•Inefficient if
payload is large
21
State-based replication
s
s1.u(a)
s1
S
s1
s2.u(b)
S
s2
M
s2.m(s1)
s3
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send si payload
• On delivery: merge payloads m
Conflict-free Replicated Data Types
•merge t wo valid
states
•produce valid state
•no historical info
available
•Inefficient if
payload is large
21
State-based replication
s
s1.u(a)
s1
S
s1
s2.u(b)
S
s2
M
s2.m(s1)
s3
s2
M
s3.m(s2)
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send si payload
• On delivery: merge payloads m
Conflict-free Replicated Data Types
•merge t wo valid
states
•produce valid state
•no historical info
available
•Inefficient if
payload is large
21
State-based replication
s
s1.u(a)
s1
s1.m(s2)
S
s1
s2.u(b)
S
s2
s2
M
s2.m(s1)
s3
M
s2
M
s3.m(s2)
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send si payload
• On delivery: merge payloads m
Conflict-free Replicated Data Types
•merge t wo valid
states
•produce valid state
•no historical info
available
•Inefficient if
payload is large
21
State-based: monotonic semilattice CRDT
s
s1.u(a)
s1
s1.m(s2)
S
s2.u(b)
S
s2
s1
s2
M
s2.m(s1)
s3
M
s2
M
s3.m(s2)
If
• payload type forms a semi-lattice
• updates are increasing
• merge computes Least Upper Bound
•⊔ = Least
Upper Bound
LUB = merge
•no reference
to history
then replicas converge to LUB of last values
Example: Payload = int, merge = max
Conflict-free Replicated Data Types
22
Operation-based replication
s
s1.u(a)
s1
S
s2
S
s2.u(b)
•push to all replicas
eventually
•push small updates
- more efficient than
state-based
s3
At source:
Eventually, at all replicas:
• prepare
update local replica
•
• broadcast to all replicas
Conflict-free Replicated Data Types
23
Operation-based replication
s
s1.u(a)
s1
S
s2
S
s2.u(b)
s3
b
b
s1.u(b)
D
•push to all replicas
eventually
•push small updates
- more efficient than
state-based
D
s3.u(b)
At source:
Eventually, at all replicas:
• prepare
update local replica
•
• broadcast to all replicas
Conflict-free Replicated Data Types
23
Operation-based replication
s
s1.u(a)
s1
S
s2
S
s1.u(b)
b
D
a
s2.u(b)
s3
s2.u(a)
D
b
a
D
s3.u(b)
•push to all replicas
eventually
•push small updates
- more efficient than
state-based
D
s3.u(a)
At source:
Eventually, at all replicas:
• prepare
update local replica
•
• broadcast to all replicas
Conflict-free Replicated Data Types
23
Op-based: commute
s
s1.u(a)
s1
S
s2
S
s1.u(b)
b
D
a
s2.u(b)
s3
CRDT
s2.u(a)
D
b
a
D
s3.u(b)
D
s3.u(a)
•Delivery order ≃ ensures
downstream precondition
•happened-before or weaker
If:! •! (Liveness) all replicas execute all operations
! ! ! in delivery order
• (Safety) concurrent operations all commute
Then: replicas converge
Conflict-free Replicated Data Types
24
Monotonic semi-lattice
commutative
•Systematic transformation
•Inefficient
• Hand-crafted op-based implementation
1. A state-based object can emulate an
operation-based object, and vice-versa
2. State-based emulation of a CvRDT is a
CmRDT
3. Operation-based emulation of a CvRDT is a
CmRDT
Conflict-free Replicated Data Types
25
Operation-based OR-Set
Payload: S = { (e,α), (e, β), (e', γ), … }
! !
where α, β,… unique
Operations:
• lookup(e) = ∃ α: (e, α) ∈ S
Conflict-free Replicated Data Types
•Set of IDs
associated
with a in the
old state
•As observed
by source!
26
Operation-based OR-Set
Payload: S = { (e,α), (e, β), (e', γ), … }
•Set of IDs
! !
where α, β,… unique
associated
with a in the
Operations:
old state
•As observed
• lookup(e) = ∃ α: (e, α) ∈ S
by source!
• add(e) = S ≔ S ∪ {(e, α)} where α fresh
Conflict-free Replicated Data Types
26
Operation-based OR-Set
Payload: S = { (e,α), (e, β), (e', γ), … }
•Set of IDs
! !
where α, β,… unique
associated
with a in the
Operations:
old state
•As observed
• lookup(e) = ∃ α: (e, α) ∈ S
by source!
• add(e) = S ≔ S ∪ {(e, α)} where α fresh
• remove (e) =
! (at source) R = {(e, α) ∈ S}
! (downstream) S ≔ S \ R
Conflict-free Replicated Data Types
26
Operation-based OR-Set
Payload: S = { (e,α), (e, β), (e', γ), … }
•Set of IDs
! !
where α, β,… unique
associated
with a in the
Operations:
old state
•As observed
• lookup(e) = ∃ α: (e, α) ∈ S
by source!
• add(e) = S ≔ S ∪ {(e, α)} where α fresh
• remove (e) =
! (at source) R = {(e, α) ∈ S}
! (downstream) S ≔ S \ R
No tombstones
Conflict-free Replicated Data Types
26
Operation-based OR-Set
Payload: S = { (e,α), (e, β), (e', γ), … }
•Set of IDs
! !
where α, β,… unique
associated
with a in the
Operations:
old state
•As observed
• lookup(e) = ∃ α: (e, α) ∈ S
by source!
• add(e) = S ≔ S ∪ {(e, α)} where α fresh
• remove (e) =
! (at source) R = {(e, α) ∈ S}
! (downstream) S ≔ S \ R
No tombstones
{true} add(e) || remove(e) {e ∈ S}
Conflict-free Replicated Data Types
26
Ongoing work
CRDTs
for P2P
& Cloud Computing
ConcoRDanT: ANR 2010–2013
Systematic study of conflict-free design space
• Theory and practice
• Characterise invariants
• Library of data types
Not universal
• Conflict-free vs. conflict semantics
• Move consensus off critical path, non-critical ops
Conflict-free Replicated Data Types
28
CRDT + dataflow
Set
Web
site 1
spam
detector
Whitelist
Graph
Web
site 2
Web
site 3
crawl
Content
DB
Map URL to
last 2 versions
extract
links
extract
words
Graph
Words
Map word to
Set of URLs
Incremental, asynchronous processing
• Replicate, shard CRDTs near the edge
• Propagate updates ≈ dataflow
• Throttle according to QoS metrics
(freshness, availability, cost, etc.)
Scale: sharded
Synchronous processing: snapshot, at centre
Conflict-free Replicated Data Types
29
OR-Set + Snapshot
Read consistent snapshot
• Despite concurrent, incremental updates
Unique token = time (vector clock)
• α = Lamport (process i, counter t)
• UIDs identify snapshot version
• Snapshot: vector clock value
• Retain tombstones until not needed
lookup(e, t) = ∃ (e, i, t')∈A : t'>t ∧ ∄ (e, i, t')∈R: t'>t
Conflict-free Replicated Data Types
30
Sharded OR-Set
Very large objects
requires
• Independent shards •(Dynamic:
consensus to
rebalance)
• Static: hash
Statically-Sharded CRDT
• Each shard is a CRDT
• Update: single shard
• No cross-object invariants
• The combination remains a CRDT
Statically Sharded OR-Set
• Combination of smaller OR-Sets
• Snapshot: clock across shards
Conflict-free Replicated Data Types
31
Take aways
Principled approach
• Strong Eventual Consistency
Two sufficient conditions:
• State: monotonic semi-lattice
• Operation: commutativity
Useful CRDTs
• Register, Counter, Set, Map (KVS),
Graph, Monotonic DAG, Sequence
Future work
• Snapshot, sharding, dataflow
• A wee bit of synchronisation
Conflict-free Replicated Data Types
32
Portfolio of CRDTs
Register
• Last-Writer Wins
• Multi-Value
Set
• Grow-Only
• 2P
• Observed-Remove
Map
• Set of Registers
Conflict-free Replicated Data Types
Counter
• Unlimited
• Non-negative
Graphs
• Directed
• Monotonic DAG
• Edit graph
Sequence
• Edit sequence
33