Class | Description |
---|---|
Version | Describes the version of the DataRush installation. |
Exception | Description |
---|---|
DRException | Base for exceptions thrown by DataRush. |
ParseExpressionException | Exception indicating an error while parsing a value expression. |
Pervasive's DataRush is a framework, written in Java, for developing dataflow applications. But what is dataflow? It is an alternative to the standard von Neumann model of computation. Typically, we think of a program as a series of instructions, each executed one after the other by a processor that tracks its progress with an instruction pointer. In dataflow, on the other hand, a program is a collection of computations joined to one another by channels that transmit data in one direction. Conceptually, you can think of this structure as a directed graph with data channels as edges and nodes performing computation on the data. Each node operates only when data is available; the data flowing through the network is all that is needed to organize the computation. The immediate advantage is that many of the nodes can operate simultaneously, allowing dataflow applications to take advantage of hardware with multiple processor cores. Notice that the concurrency happens outside the nodes: the developer does not have to deal with threads, deadlock detection, starvation, or concurrent memory access to build parallelism into an application.
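To make the model concrete, here is a minimal sketch in plain Java rather than DataRush code: two nodes joined by a one-way channel (a `BlockingQueue`), each running on its own thread. The `END_OF_DATA` sentinel is an assumption made for the sketch; the node bodies contain no synchronization of their own, and the data arriving on the channel is all that coordinates them.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Conceptual sketch only: a two-node dataflow graph built from plain Java. */
public class TinyDataflow {
    private static final String END_OF_DATA = "\u0000EOD";  // marks the end of the stream

    public static void main(String[] args) throws InterruptedException {
        // The channel: a one-way, bounded queue carrying data between the two nodes.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

        // Source node: pushes each record into the channel, then the sentinel.
        Thread source = new Thread(() -> {
            try {
                for (String record : List.of("alpha", "beta", "gamma")) {
                    channel.put(record);
                }
                channel.put(END_OF_DATA);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Sink node: operates only when data is available on its input channel.
        Thread sink = new Thread(() -> {
            try {
                for (String record = channel.take(); !record.equals(END_OF_DATA); record = channel.take()) {
                    System.out.println("processed " + record.toUpperCase());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Both nodes run simultaneously; the data flowing through the channel
        // is the only coordination between them.
        source.start();
        sink.start();
        source.join();
        sink.join();
    }
}
```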
The DataRush framework supplies a simple model for writing nodes and composing them into self-contained graphs in Java, a system for running applications on any standard JVM, and a suite of development tools. DataRush allows you to plug any Java code into a node as long as you abide by an unrestrictive contract and a few simple conventions. This means you can link existing applications by the data they communicate to one another or, better yet, pull existing code apart and adapt it to run as a dataflow application. The code within a node may even be threaded itself, allowing you to fine-tune parallelism at every level of the application.
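The class below illustrates that idea but is not the actual DataRush node contract: an arbitrary piece of existing Java logic (any `UnaryOperator<String>`) is wrapped in a node, and the node's body uses its own thread pool, so work is parallelized inside the node as well as between nodes. The queue-based channels and the `END_OF_DATA` sentinel are, again, assumptions made for the sketch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

/**
 * Illustration only, not the DataRush node contract: existing Java logic is
 * wrapped in a node, and the node's body runs its own thread pool, so the
 * per-record work is parallelized inside the node as well as between nodes.
 */
public class ThreadedNode implements Runnable {
    public static final String END_OF_DATA = "\u0000EOD";

    private final BlockingQueue<String> input;
    private final BlockingQueue<String> output;
    private final UnaryOperator<String> existingCode;   // any pre-existing Java logic
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public ThreadedNode(BlockingQueue<String> input, BlockingQueue<String> output,
                        UnaryOperator<String> existingCode) {
        this.input = input;
        this.output = output;
        this.existingCode = existingCode;
    }

    @Override
    public void run() {
        try {
            for (String record = input.take(); !record.equals(END_OF_DATA); record = input.take()) {
                final String r = record;
                // Hand each record to the internal pool; results are emitted as
                // they complete, so their order is not preserved.
                pool.submit(() -> {
                    try {
                        output.put(existingCode.apply(r));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);  // drain in-flight work
            output.put(END_OF_DATA);                     // tell the downstream node we are done
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```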
DataRush operators are either individual nodes or graphs describing the structure of a subset of the dataflow graph. An operator provides a black-box abstraction: it requires inputs and configuration information and produces outputs. Most applications contain at least some operators from the extensive standard library provided in the SDK, which shows the reuse potential of these constructs.
Operators are simply Java classes and so can be compiled with any standard Java compiler. Execution can be as simple as invoking a method when running embedded DataRush applications, or you can use the tools provided in the SDK to conveniently configure and run your DataRush applications from the command line. The diagnostic tools included in the framework are a powerful way to visually identify performance bottlenecks so you can restructure your application for optimal throughput.
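As a hypothetical stand-in for the embedded style, not the DataRush SDK API, the sketch below assembles nodes as plain Runnables and runs the whole graph by invoking a single method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical stand-in for an embedded runner (not the DataRush API): nodes
 * are plain Runnables, and running the assembled graph is just a method call
 * that starts one thread per node and waits for all of them to finish.
 */
public class EmbeddedGraph {
    private final List<Runnable> nodes = new ArrayList<>();

    public EmbeddedGraph add(Runnable node) {
        nodes.add(node);
        return this;
    }

    /** Running the application is simply invoking this method. */
    public void run() throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (Runnable node : nodes) {
            Thread t = new Thread(node);
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Assemble a trivial two-node graph and run it in-process.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);
        new EmbeddedGraph()
                .add(() -> {                                   // source node
                    try {
                        channel.put("hello");
                        channel.put("EOD");                    // end-of-data marker
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                })
                .add(() -> {                                   // sink node
                    try {
                        for (String s = channel.take(); !s.equals("EOD"); s = channel.take()) {
                            System.out.println(s.toUpperCase());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                })
                .run();
    }
}
```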
DataRush excels at batch processing large amounts of relational data on SMP hardware. When tables are read (whether from a database, delimited or fixed-width text, a staging data set, or even XML), each row is passed through the system as a collection of fields. This means DataRush immediately exploits vertical parallelism, allowing concurrent operations on different columns in your data. Further, if you link two nodes to one another, the latter may begin processing the results of the former before all the data has streamed into the initial node. This is known as pipeline parallelism and is enjoyed by virtually every DataRush application. Finally, if you need to apply the same operation to each row independently, you can partition the data using tools from the library to exploit horizontal parallelism; a sketch of this partitioning approach follows the table below.
See the table below for a brief overview of the types of parallelism:
Type | Description |
---|---|
pipeline | If you have more than one node connected sequentially, you're automatically exploiting pipeline parallelism as each can begin operation as soon as some output of the previous is ready. |
horizontal | If you're operating on large amounts of independent data, you can partition it to perform the same operation on different batches simultaneously. |
vertical | If your data has more than one column, you can perform operations on the various columns independently. |
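The following plain-Java sketch, which does not use the DataRush library, shows the idea behind horizontal parallelism: the rows are split into partitions, the same operation (here, a sum over a single numeric column) is applied to every partition simultaneously, and the partial results are combined at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of horizontal parallelism: partition the rows, process every partition in parallel. */
public class PartitionedSum {
    public static void main(String[] args) throws Exception {
        // Toy data set: each row carries a single numeric field.
        List<Long> rows = new ArrayList<>();
        for (long i = 1; i <= 1_000_000; i++) {
            rows.add(i);
        }

        int partitions = Runtime.getRuntime().availableProcessors();
        int chunk = (rows.size() + partitions - 1) / partitions;

        // One task per partition, each applying the same operation (a sum) to its batch.
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            int start = Math.min(p * chunk, rows.size());
            int end = Math.min(start + chunk, rows.size());
            List<Long> slice = rows.subList(start, end);
            tasks.add(() -> slice.stream().mapToLong(Long::longValue).sum());
        }

        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(tasks)) {
                total += partial.get();   // combine the per-partition results
            }
            System.out.println("total = " + total);   // 500000500000
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real DataRush application the library's partitioning operators play the role of the manual slicing and thread pool shown here; the point of the sketch is only the shape of the work, not the API.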
With these tools in hand, DataRush is ideally suited for financial and surveillance data processing, ETL, and scientific research. The framework is currently embedded in a commercial data profiling product used by customers ranging from multinational banks to government agencies. You might try DataRush if you want to:
Copyright © 2016 Actian Corporation. All rights reserved.