Install PDI/Kettle and Agile PDI in a Development Environment
This is where you’ll install and play with one of the most interesting, well-crafted, user-friendly and enjoyable applications I have ever used.
It is also the heart of the Pentaho BI suite: the tool to Extract, Transform & Load (ETL), process data and execute jobs. It was formerly named Kettle, is now known as PDI (Pentaho Data Integration), and is also called Spoon after the name of its executable file.
We’ll install PDI as a desktop development tool.
1. Get and Install PDI.
Go to the Pentaho files on SourceForge here and download the latest stable release (version 4.2 should be out by July 2011; it’s a remarkable new version, so check the improvements).
Double-click on the file and extract its contents into a new folder:
You can delete the .bat files. Make all the .sh files executable, especially spoon.sh (right-click on the file and use the Permissions tab). Then, in a terminal, start it with:
Close the dialog that offers to open a repository. You should now be in PDI, a development environment:
Note: You don’t have to configure anything or add drivers for common databases. I told you: this open source application is the result of a great community and is a very well-crafted product, as you’ll see.
Oops, there is a glitch between the interface and the new Ubuntu 11.04 scrollbars: they don’t work, and they don’t let you drop steps onto the canvas. The solution I took was to disable them, as shown on the PDI forums here.
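If you prefer to script the permission change on the .sh files instead of using the Permissions tab, here is a minimal Python sketch. The folder name `data-integration` is only an assumption for illustration; point it at wherever you extracted PDI.

```python
import stat
from pathlib import Path


def make_scripts_executable(pdi_dir):
    """Add the execute bit to every .sh file directly inside pdi_dir."""
    changed = []
    for script in sorted(Path(pdi_dir).glob("*.sh")):
        mode = script.stat().st_mode
        # Keep the existing permissions and add execute for user, group and others.
        script.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        changed.append(script.name)
    return changed


if __name__ == "__main__":
    # Hypothetical location: replace with the folder where you extracted PDI.
    print(make_scripts_executable("data-integration"))
```

It only touches files ending in .sh, so the .bat files are left alone whether or not you deleted them.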
2. Meet the Application
There are several resources you should browse and revisit as you familiarize yourself with the concepts of ETL and this tool:
- An 18-slide presentation by the project founder, manager and lead developer of PDI, explaining its capabilities. Slide #11, ‘use-cases’, lists some of its uses. [video].
- Check a guide that comes with your download at
- A detailed Spoon user guide in the Pentaho wiki.
- Some videos on YouTube explain specific extractions. They may seem too complicated at first, and sometimes they are more about the process than about each component, but check these vlogs: mattcasters, BIOpenSource, DiddySteiner, ETLTools, fechever75, LaboratoriosSIUCV, opensourcebi and more.
- The most important cookbook you have is the set of sample files on your disk at:
- Also get this old guide. It’s no longer distributed with PDI, but it was very useful to me because it lists the main ‘steps’.
- Continue with more articles from the Pentaho wiki.
3. Define a DB Connection
In the left panel, right-click on Transformations and click New. Then, again in the left panel, select the View tab.
Right-click on the Database connections node, select New and fill in the dialog with your data:
Here you can see the values for our MySQL database and the result of the test.
Close the dialogs, right-click on the MySQL connection in the left panel and select Share. Now, when you save the transformation, the connection data will be available to other extractions.
You have to save the transformation as a .ktr (XML) file. Create a folder for your transformations; that will make it easy to sync them or back them up.
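Because a .ktr file is plain XML, you can inspect your transformations folder with a few lines of Python. This is only a sketch: the element names used here (`connection`, `name`, `type`, `server`, `database`) are what I would expect to find in a .ktr file, so open one of yours in a text editor and adjust them if they differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed child elements of each <connection>; verify against your own files.
FIELDS = ("name", "type", "server", "database")


def list_connections(ktr_path):
    """Return one tuple of FIELDS values per <connection> in a .ktr file."""
    root = ET.parse(ktr_path).getroot()
    found = []
    for conn in root.findall("connection"):
        # findtext returns the default when the child element is absent.
        found.append(tuple(conn.findtext(tag, default="") for tag in FIELDS))
    return found


def scan_folder(folder):
    """Map each .ktr file name in folder to its list of connections."""
    return {p.name: list_connections(p) for p in sorted(Path(folder).glob("*.ktr"))}
```

Calling `scan_folder` on your transformations folder gives a quick overview of which file points at which database, which is handy before you sync or back up the folder.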
Note: An Oracle 10g connection using JNDI (no net client) looks like this:
Connection to Oracle 10g
4. Execute a Transformation
A nice way to start learning about PDI, ETL and data warehousing is to open the samples folder and check the component names and their notes, which are self-explanatory. If you double-click on a component you will see the parameters that specify its behavior. If you right-click on it you can select options to see the description, the input or output fields, the text description (you should document the intention of the activity here), a preview of a sample run, etc.
Once you have reviewed some transformations, I recommend one to start with: create an object that is fundamental to multidimensional analysis, the time dimension. This is a table with one row for each day in the calendar and columns showing special attributes like month, quarter, year or weekend. That makes it easy to select dates based on those columns and then select the values in the fact table using only the indexed records that contain those dates.
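To make the idea concrete before you open the PDI examples, here is a minimal sketch in plain Python of what a time dimension contains. It is not how PDI builds it, and the column names are my own choice for illustration.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20110704
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_month": day.day,
            "day_of_week": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "is_weekend": day.isoweekday() >= 6,
        })
        day += timedelta(days=1)
    return rows


rows = build_date_dimension(date(2011, 1, 1), date(2011, 12, 31))
# Select the weekend days of the third quarter using only dimension columns.
q3_weekends = [r["date_key"] for r in rows if r["quarter"] == 3 and r["is_weekend"]]
```

The `date_key` values in `q3_weekends` are what you would then use to pick the matching records from the fact table.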
A nice specification for a time dimension table is listed in this post by Nicholas Goodman. His blog has very interesting information too.
Check these pages, download the examples and run them in your environment.
- Kettle Tip: Using java locales for a Date Dimension – Sept 2007 (link).
In this post, Roland Bouman shows a simplified extraction and then shows how to connect to a database, write the SQL that creates the table and execute it.
Then he explains three more steps to generate the data.
- HowTo: Create a date dimension with PDI (Kettle) – March 2010 (link)
The author adds more characteristics for each day and uses more PDI steps to obtain them: calculators, filters (select), lookups. Think of it as version 2.0 of the previous example.
- Building a detailed Date Dimension with Pentaho Kettle – Sept 2010 (link)
Here, Slawomir Chodnicki briefly explains the considerations behind his design. One important thing is how he introduces the concept of updating the data in your dimensions just by re-running the transformation; this is something we must get used to. It matters if your job crashes and you have to rebuild the process, or if you need to be able to continue from a given point.
The file contains some errors in the JavaScript steps (some variables are referenced but not defined); it is an opportunity to see the debugger messages of PDI.
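The "just re-run it" idea is worth seeing outside PDI. The sketch below uses SQLite from Python's standard library rather than a PDI step, and the table and column names are hypothetical; the point is only that loading by key lets a second run update rows instead of duplicating them.

```python
import sqlite3


def load_dimension(conn, rows):
    """Insert or overwrite dimension rows keyed by date_key; safe to re-run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_date ("
        "date_key INTEGER PRIMARY KEY, full_date TEXT, is_holiday INTEGER)"
    )
    # INSERT OR REPLACE overwrites the row that has the same primary key,
    # so a second run updates existing days instead of duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO dim_date (date_key, full_date, is_holiday) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_dimension(conn, [(20110704, "2011-07-04", 0), (20110705, "2011-07-05", 0)])
# Re-run with a corrected attribute: same keys, so no new rows appear.
load_dimension(conn, [(20110704, "2011-07-04", 1), (20110705, "2011-07-05", 0)])
print(conn.execute("SELECT COUNT(*) FROM dim_date").fetchone()[0])  # prints 2
```

After a crash you can run the same load again from the start and end up with the same table, which is exactly the habit the post encourages.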
5. Working without a Repository
If you are working with a team of developers you should create a repository. It’s simple: just click on the New button and, with a user that has DB privileges on MySQL, create the database.
Then you will get a single area for your programs and avoid versioning and syncing problems, your connections also get stored, etc. But if you are one or two people (normal for a pilot project) it is best to avoid using one. You can just sync and back up your program folders. Also, you don’t need to change the normal way the BI server looks for programs.
The repository really needs its own post.
Ok, if you can’t wait, read this from John Dzilvelis.
6. PDI Agile Plugin
[Edit July 2011:] From version 4.2 RC1 onward, the plug-in is already included.
Head over here and download the 1.0 version of the Modeling and Visualization Plugin, aka the Analyzer plugin, which was LucidEra ClearView before Pentaho bought it.
It adds prototyping, data source visualization and model creation on data snapshots, so you can drag and drop and save your work.
Unzip the content of pmv-1.0.2-stable.zip into:
It will create a folder named agile-bi.
Start Spoon and you will notice three buttons on the top right: the normal view, Model and Visualize. Check this video from arubawayne.
I was confused about this note, but the code is open source except for the Analyzer presentation layer, and it’s available here.
7. Additional articles:
These are medium and complex topics:
- An example of the ‘generate documentation’ step, so you can add descriptions to your extractions and make use of this new step: here
- Error handling. Since version 3.8, error flow is available from every step; here is how to use it properly.
- Handling of configuration and variables: here
- An impressive ‘excel output’ plugin (more complex but more impressive than the default step), so you can generate formatted reports: here.
Note: on 4.2, the excel step is integrated in the PDI.
- Connect PDI to SAP BI as a web service here.
- A good sample chapter from the book “Pentaho Data Integration 4 Cookbook”: A transformation, A report from PDI data, PDI jobs from the BI Server process/ PUC, PUC-PDI-CDA, dashboard and data from PDI.
8. Ruby Plugin
[Edit August 31, 2011]