Find information about using and setting up Integrator
This online documentation is under active development, and only contains sandbox installation directions. The Integrator for Spark usage guide is available here.
Want to try Integrator as quick as possible? Try out our spark version in the sandbox available for download here
Integrator is a tool that helps you do data integration and build data warehouses on powerful platforms such as IBM PureData for Analytics (Netezza), Amazon RedShift, Apache Spark, and other systems.
A job is a reusable component that does some defined work. A single job may read from one or more data sources and write to one or more targets. In most cases, the processing within a job is atomic and consistent, meaning the job as a whole will either completely succeed or fail. Data will either be committed or rolled back to its pre-job state based on job success or failure. In either case, the Integrator engine will clean up and remove all temporary tables and other intermediate processing artifacts.
A batch is a collection of jobs that are run together. The batch defines the order in which the component jobs are run. Jobs within a batch are connected by a graph which illustrates how control flow passes from job to job.
We use the term workflow to refer to jobs or batches. A workflow can execute multiple network of workflows and can have infinite hierarchy.
Just as jobs are the components of batches, steps are the components of jobs. Consider the following example: A job might extract data from two tables, combine the two datasets, massage the data into the form required by a target table, and then load the data to the target. Such a job might be composed of the following operations: (1) Extract the data from the first source, (2) extract data from the second source, (3) join the two datasets, and (4) load the resulting dataset into the target table. Such a job would be composed of four steps, as shown below.