Benefits and Challenges of Implementing Riak in Erlang
Steve Vinoski - http://basho.com/
Erlang
Requirements
Large number of concurrent activities
Distributed systems
Continuous operation for years
Live updates and maintenance
Hardware/software fault-tolerant
Concurrency - common theme
Erlang processes are much more lightweight than OS threads
Reliability - what it gives you
Isolation - processes communicate only by message passing
Distribution - works across nodes
Linking/supervision/monitoring - one process takes action when another fails
Small language; few elements; functional - relatively easy to learn
Variables are immutable; no globals
Flow control via pattern matching; recursion
Concurrency primitives
Processes, not mutexes, etc.
Selective receive lets you receive specific messages from anywhere in the message queue (even if other types are ahead of them)
OTP framework - everybody uses this
Riak
CAP theorem
A distributed highly available eventually consistent highly scalable open source key-value database written primarily in Erlang.
Modeled after Amazon Dynamo. See Andy Gross’s 5 years later talk.
Also provides MapReduce, secondary indexes, full-text search
Built for operational ease - it just runs
Multiple clients - .NET, Java, Node.js, etc.
Consistent hashing - no sharding
Stores replicas - N/R/W values - adjustable
(The ring is very similar to the Security Now discussion of Tor.)
Vector clock - how it determines stale value - number of operations actors performed and a timestamp
All the nodes in a cluster are peers - no masters or slaves
Nodes exchange their understanding of ring state via gossip protocol
Riak uses the Erlang mesh for this
Can simulate multi-node installment on a single machine (nice for development)
At about 150 nodes, the cluster doesn’t scale well.
Control vs. Data
Distributed Erlang is good for control plane, not so good for data plane
Sending large data can block
Use TCP, UDP, etc. directly for data plane traffic
Don’t mix control plane and data plane traffic
Riak still does this in a few places, unfortunately (they’re going to fix this)
Hinted Handoff
Fallback vnode holds data for unavailable primary vnode
Hands it off once primary becomes available
Read Repair
Vnode with stale data is repaired via asynchronous update
Eventual consistency
Active anti-entropy (AAE) - can actively repair stale value before it’s read
Monitoring is great for cleaning up after aborted operations.
Pattern-matching is an elegant way to parse binary data.
gen_fsm - one of the OTP library behaviors (finite state machine)
In Erlang, everybody uses these behaviors; makes for more readable code.
Let It Crash - Joe Armstrong’s doctoral thesis (he created Erlang)
Business logic goes in Workers; Supervisors are very simple and just start and watch Workers. Little can go wrong with Supervisors.
Erlang/OTP System Facilities
Get status of OTP process
Trace function calls, messages
Put a trace on a process - VERY powerful
Seldom need a debugger
Be careful - you can take a system down by improper tracing
Linking with C/C++
NIF - Native Implemented Function
Can ref count binaries
Portable interface to OS
Can crash the whole VM with a bad NIF
NIF calls execute within a VM scheduler thread - can block the thread
NIFs should only block for a millisecond or less
Put long-running activities in their own threads.
Eunit - unit testing
QuickCheck - create model of SUT, and it runs randomly-generated tests against it (like PEX?) - shrinks the test case for easier debugging - awesome
Watch your memory.
Hot code loading really works.
Understand the Erlang VM.
A DSL for distributed systems.
A Little Riak Book, Riak Handbook
Elixir - Ruby-like language on Erlang VM - also some Lisp-style languages
O’Reilly Erlang book is also very good. (Francesco Cesarini, Simon Thompson)
(And he gave me a free book for asking a question! - Bill S.)
Evolving Java
Brian Goetz - Java Language Architect, Oracle
Compatibility - avoiding breaking existing programs - takes up a lot of his time.
Serialization - big regret.
What Would James Do?
Java SE 8 - modernizing Java - Summer 2013
Alonzo Church - 1930’s Lambda Calculus
Language features are only enablers.
Better libraries are what gets things done.
In 1995, few mainstream languages supported closures.
Today, Java is just about the last holdout. (Even C++ added them recently.)
In 1995, everything was sequential. (Single processor.) For loops, iterators, etc.
In today’s world, that’s an impediment to parallelizing code.
Mutability was the default.
Today, that’s the wrong default.
External Iteration
There’s a lot of accidental complexity in a for loop that’s mutating items in a collection. We want to separate the “what” from the “how".
Internal Iteration
Don’t ask me, just do it.
The library can use parallelism, out-of-order, laziness.
(This is an argument for ForEach() as a method. - Bill S.)
Lambda Expressions
What is the type of a lambda?
Functional Interfaces
Predicate<String> isEmpty = s -> s.isEmpty();Runnable r = () -> { System.Out.printLn("Boo!"); };Problem - Interface Evolution
If we had lambdas when we designed Collection libraries, our libraries would look nothing like they do.
New feature: default methods (similar to abstract)
You can add a virtual interface method with a default implementation.
If an existing implementation doesn’t have the new method, it still works (using the default implementation).
Multiple inheritance of behavior, not state.
This is not full-blown traits (like Scala).
Not designed for monkey-patching.
It’s All About the Libraries
Best way to evolve the platform - cheaper, more scalable, etc.
Lambdas Enable Better APIs
Boundary between client and library becomes more permeable
The library owns the “how", the client owns the “what”
Key effect on APIs is: more composability
You want to build behavior by composition.
Sorting today: The code reads nothing like the problem statement.
You want the code to look as much as possible like the requirements.
Bulk operations on Collections
Exposing aggregate operations on collections.
(Very LINQ-y.)
You have to ask for parallelism, but you don’t have to ask very loudly.
Guy Steele’s talk on How to Think about Parallel Programming: Not!
(Highly recommended by presenter.)
Streams
New abstraction
Stream of values - not a data structure
Pipelined
Lazy
(like IEnumerable in C# - Bill S.)
Parallelism
Let the experts write the libraries, and let the libraries handle the hard problems.
JDK7 added Fork/Join - lots of boilerplate code
parallelStream() lets you do map-reduce easily.
Aggregation
sum() is a special case of what functional languages tend to call fold
Reductions work really nicely in parallel.
Composable.
Gentle push towards a more functional style of programming.
People will have to unlearn some things. Want pure methods, immutability, etc.
Tuples coming!?
Democratizing Machine Learning With C# - LINQ goes ML
Erik Meijer
(There are almost certainly transcription errors here; these are entirely my fault. - Bill S.)
Problem: Should I play golf or not?
Streaming, real-time weather information
Want to find Func<Weather, Play>
Once I have this, I can create a real-time LINQ query using IObservable<Weather> as an input.
But I don’t know how to write that function.
What I have is a history of the weather and my calendar history of golf-playing.
I can create a Dictionary<Weather, Play> of examples (test cases).
ML is nothing more than a computer doing test-driven development.
A dictionary is a representation of a finite function.
You want a function that works for all arguments.
People have data and want to write a query to extract data from it, typically.
But how do I group the data? How do I order it? How do I filter it?
They could circle the data points on a whiteboard, but may not be able to write the query.
From example data, you’re generating a function.
If you thought the impedance mismatch between (relational) queries and programs was large, then the mismatch between ML and programming is astronomical.
A query has data and a function, and gives you an answer.
ML has data and answers, and gives you a function.
BigML - super-easy way to do ML (commercial - ka-ching!)
Supports Actional Model Download - gives you the function you’re looking for in your choice of languages.
Writing a decision tree algorithm isn’t that hard.
If you write your algorithms using queries, then you can parallelize them.
Bayes
Mathematicians are extremely sloppy.
Bayes is just monads.
When they talk about vectors, they really mean classes.
Play -> Distribution(Outlook) is the probability monad.
P(Weather|Play) is the way mathematicians represent this.
A monad is a type constructor with some functions.
Probability Monad
Type constructor:
Distribution<T> = List<Pair<T, double>>
Unit:
Return :: T -> Distribution<T>
Return t = [(t,1.0)]
Bind:
Bind :: Dist<S> x (S -> Dist<T>) –> Dist<T>
Bind ds f = [ (t, ps*pt) | (s,ps) <- ds, (t,pt) <- f(s) ]
The trick of functional programming is let the types do the work for you.
I do what the types tell me.
From our data we can compute P(Outlook|Play), P(Temperature|Play), etc.
But we want to know P(Play|Weather).
Bayes’ Theorem gives us a way to invert monadic functions.
It holds for any monad.
There’s a type error in it, though.
Given:
a :: A, f:: A -> B
Want:
g :: A -> AxB
g = a -> (a, f(a))
Using n for intersection, we can type-correct Bayes:
P(BnA|A) = P(B) * P(BnA| B)/P(A)
Keynote: Open Systems - Actors and Cloud
Erik Meijer
(His new company is Applied Duality - watch for great things!)
http://thisisindexed.com/
http://en.wikipedia.org/wiki/Henk_Barendregt
Mike Lee’s “One Minute” keynote
To hack, to help other developers, to solve real problems, to fix the world.
Duality - things have opposite aspects, but you should look at them as one whole. They’re not good or bad.
Nothing’s more different from SQL than Rx.
Question the relational database.
Does your data really look like this? It’s tied to the physical format.
Duality between Modelers (DB designers) & Developers. Plants vs. Zombies.
Modelers always talk about Nouns. Customers, Orders, Line Items.
There’s no reuse in SQL. There’s no difference between a table and a type.
There’s no abstraction.
It’s statically typed. It’s baked in to the layout of your files.
In order to make your database run fast, you have to have intimate knowledge of your data storage (to do joins, etc.).
It’s not really declarative, because you have to know what keys to join on. It doesn’t allow nested results, so you have to have the GROUP BY column in the SELECT. WITH RECURSIVE is another example - you have to explicitly write it out.
Declarative is relative - C is declarative to an assembly programmer.
Optimization - query plan - consultants tweak the SQL to get the plan you wanted - but why can’t you tweak the query plan directly? It’s just function pipeline.
ACID is also a lie - you set your transaction isolation level to get performance. Why so many hints?
The closed world assumption is the presumption that what is not currently known to be true is false.
The fallacies of Distributed Computing - well known
The fallacies of Declarative Computing
1. Exceptions do not exist.
2. Statistics are precise.
3. Memory is infinite.
4. There are no side-effects.
5. Schema doesn’t change.
6. There is one developer
7. Compilation time is free.
8. The language is homogeneous.
Once you go outside of SQL, the optimizer doesn’t know anything about this.
It’s really a B-tree. Just give me that and I’ll be happy! Leaky abstractions are a good thing! Then you know the performance characteristics. We don’t need all these layers of crust. Sometimes you need to go under the hood.
JavaScript people should not use futures, they should use comonads.
Task<T> is a co-monad in C#. See ContinueWith(Func<Task<T>, TResult>).
Avoids the need for multiple futures (one for failure case, etc.)
Comonads are gumball machines.
.Result - remove a gumball
Once out, never in. (It’s not hygenic.)
.Select (s/b map) - turns a gumball machine into a beer machine WITHOUT taking the gumballs out of the machine!
Comonad - .Franchise() - can create a machine of gumball machines
Can compose Franchise with Select using ContinueWith to create a beer machine franchise.
ContinueWith is the duality of SelectMany (aka flatmap).
Developers like Verbs.
We want to perform operations on things, while hiding how they’re accomplished.
We want to swap out implementations of interfaces. You can’t do that in SQL.
ConcurrentDictionary<K,V> implements the same interfaces as Dictionary<K,V>. If single-core code was written against the original interface, it would still work with multiple cores.
A dictionary is nothing more than a database.
Dependency Injection is a work-around for a deficiency in our programming languages.
Moving to the cloud - collections as a service (data structures that live in the cloud) - access with async/await (dream)
Developers don’t want REST. They want to program against objects. The first thing you do is write a wrapper. With an API, you can change the underlying implementation; they’re more abstract.
Actor Framework for Windows (JavaScript) - open source
Similar to database transactions and recovery - intercept all the mutations so you can replay the log (I think this is ActorFx, AKA Ax)
An Actor is something where I send it updates and out come a stream of changes. In Rx we call that a Subject. (The streams are IObservable.)
A UI is also a Subject. As is a web browser, as is a web server.
Dream: Compose communicating stream processors to build applications. Like Erlang, only language-independent. Rx everywhere.
Like Yahoo Pipes but non-visual (developers like code).
Trying to move Rx to an Apache project.
Need to wrap existing services into these Subjects so we can have a consistent interface.
Reactive Message Queue (Qx)
Decouple producer & consumer in space & time.
Either can be offline.
We don’t care about type systems.
Comonads are used when you have an asynchronous communication that gives you one value; Rx is used when you … that gives you multiple values.
Pat Helland’s article on Condos & Clouds
You Are The Subject.
Get rid of all the noise and just give me essence.
One of the great things about the NoSQL movement is that developers are now empowered.
We’re still working through Seven Languages in Seven Weeks in our lunch-and-learn book club at work. In the Cedar Rapids .Net Users Group book club, we’re almost finished with Real-World Functional Programming with examples in F# and C#.
And I’m learning that I hate ceremony.
In my day job, I write mostly C#. I’m used to that - typing curly braces, ending lines with semicolons, declaring types - it’s automatic. A good IDE helps a lot.
More and more, I wonder why the compiler doesn’t do more of this. F# is also a strongly-typed .NET language, and it can infer most types.
I’m starting to appreciate the simplicity of the F#/Ruby way of doing things.
As a side note, why are Apple and Microsoft bringing back the Bad Old Days of memory management? I have no desire to code in Objective-C or C++. I don’t want to use COM. Even the Linux folks seem to have a love affair with C/C++. Give me memory management. Give me modern languages.
And give me less ceremony.
:: Next Page >>
Development Central is the blog of Bill Sorensen, a professional software developer. Much of this will relate to C#, .NET, and OOP in general.
Disclaimer
These postings are provided "AS IS" with no warranties and confer no rights.
| Next >