MAD stands for Magnetic Agile Deep (also mad skills!)— goals for their integration with data mining pipelines and statistic properties.
They provide support for a range of machine learning categories, including supervised learning, unsupervised learning, descriptive statisticsi (e.g. count-min sketch), and support modules (e.g. array operation).
- User Defined Aggregates: transition function to combine two transition states into a new state (and further an optional merge function for parallel execution, pretty standard pattern for dataflow executions)
- Multipass Iteration
- Counted Iteration via Virtual Tables: basically there is no for loop in SQL, so we create a dummy table with n entries which we join the single iteration with to simulate a loop.
- Window Aggregates for Stateful Iteration: this is already supported in some commercial (the
OVERclause). Basically it runs aggregation on the result set (not original database and not nested queries). One example:
SELECT rid, value, (SELECT SUM(value) FROM table) AS total, FROM table;versus
SELECT rid, value, SUM(value) OVER() AS total, FROM table; -- note the OVER() means over all rowsWhile I find the use case for traditional SQL processing a bit contrived, I could see how this is useful for linear algebra. Note that
OVERis often used with
PARTITION BYwhich divides the query result set into partitions.
- Recursive Queries: this is usually implemented as
WITH RECURSIVE(called Common Table Express). Below is a simple example from PostgreSQL documentation,
tbecomes a set of integer 1 through 100.
WITH RECURSIVE t(n) AS ( VALUES (1) UNION ALL SELECT n+1 FROM t WHERE n < 100 )Less contrived use cases of recursion is for hierarchical or tree-structured data.
- Driver Queries: to better utilize databases’s data processing flow, the driver generates queries with temporary tables to ensure that data doesnt move from the database to the python layer.
- Templated Queries: SQL is limited to first order logic, but sometimes we want to modify the query itself, which motivates the use of templated queries whose details are filled in later.
Micro level focuses on data representation and inner loops. For example, sparse matrices is a big concern.
The paper then went on to describe how a few common ML routines are implemented, including least squares, linear regression and k-means.
- Many statistical methods boil down to linear algebra expressions over matrices, and matrix operations (e.g. multiplication) is very easily distributed/parallelized.
- “When implemented correctly, the performance of the scripting language code is not critical, since its logic is invoked only occasionally to kick off much larger bulk tasks that are executed by the core database engine.”
- “It is important to make a clear distinction between the inter-iteration state (the output of the UDA’s final function) and intra-iteration state (as maintained by the UDA’s transition and merge functions).” — I think intuitively this corresponds to where the pipeline of the dataflow needs to block, since the inter-iteration needs to wait for the intra to have merged before processing.
- Convex optimization makes it possible to make very declarative statements about the ML processes.
Birds Eye View of Other ML Systems
The approaches are categorized as the following:
- Top-down, language-based: e.g. R/Matlab.
- Imperative: “an analyst takes the responsibility of expressing how to parallelize the data access”
- Declarative: declare the problem is sufficient
- Framework-based: provide building blocks e.g. MADlib, Graphlab. “the goal is to provide the analyst with low-level primitives and co- ordination primitives built on top of a data processing substrate”
- Other: Pregel: “each function can be viewed as a node that sends and receives messages to its neighbors in the graph” (great one line summary!)
- It seems for system builders that one way to prove a system’s sufficiency is just to enumerate through cases and scenarios that need to be supported.
- It’s interesting and unfortunate that this didn’t quite catch on as the promise sounds very good. Maybe in a field where moving fast and showing incremental progress is everything, any kind of upfront investment is too much?