Complete Pentaho Installation on Ubuntu, Part 6

Install PDI/Kettle and Agile PDI in a Development Environment

This is where you’ll install and play with one of the most interesting, well crafted, user friendly and enjoyable applications I have ever used.

It is also the heart of the Pentaho BI suite as the tool to Extract, Transform & Load (ETL), process data and execute jobs. It’s name was Kettle, its former name, now is known as PDI for Pentaho Data Integration, and also Spoon, it’s executable file name.

We’ll install PDI as a desktop development tool.

1. Get and Install PDI.

Go to Pentaho files in sourceforge here and download the latest stable release (version 4.2 should be up by july 2011. Its a remarkable new version, check the improvements).

Double click on the file and extract it’s content into a new folder:

/Pentaho/data-integration

You can delete the .bat files and make all the .sh files, specially the spoon.sh as an executable file (right click on permisions tab). And in command terminal start it with:

./spoon.sh

Close the window dialog that offers to open a repository. You should be in the PDI, a development environment:

Pentaho Data Integration

PDI

Note: You don’t have to configure anything or add drivers for common databases.  I told you, this open source application is the result of a great community and is a very well crafted product, you’ll see.

[Edit]
Ups, There is a glitch on the interface and the new ubuntu 11.04 scrollbars as they don’t work or let you put steps into the canvas. The solution I took is disable them as shown on the PDI forums here.

2. Meet the Application

There are several resources you should browse and revisit them as you familiarize yourself with the concepts of ETL and this tool:

  1. 18 slides presentation by the project founder, manager and lead developer of PDI, explaining its capabilities. Slide #11 ‘use-cases’ list some of its uses. [video].
  2. Check a guide that comes with your download at
    /Pentaho/data-integration/docs/English/getting_started_with_pdi.pdf
  3. A detailed spoon user guide in the Pentaho wiki.
  4. Some videos on you-tube explain specific extractions that may seem too complicated at first and sometimes are about the process more than each component, but check those vblogs: mattcasters, BIOpenSource, DiddySteiner, ETLTools, , fechever75, LaboratoriosSIUCV, opensourcebi and more.
  5. The most important cookbook you have are the sample files in your disk at:
    /Pentaho/data-integration/samples/transformations/
  6. Also get this old guide, it’s no longer distributed with the PDI but it was very usefull to me as it lists the main ‘steps’.
  7. Continue with more articles from the Pentaho wiki.

3. Define a DB Connection

In the left panel, right click on Transformations. Click new. Now again on the left panel select the view tab.

Left Panel

Right click on the Database connections node, select new and fill the dialog with your data:

Here you can see the values for our MySql database and the result of the test.

Close the dialogs and right click on the left panel in the mysql connection, select share. Now when you save the transformation the connection data will be available for other extractions.

You have to save the transformation, a .ktr [xml] file. Create a folder for your transformations, that will make it easy to sync or backup them up:

/Pentaho/data-integration/development/

Note: An Oracle 10g connection using jndi (no net client) looks like this:

Conection to Oracle 10g

Conection to Oracle 10g

4. Execute a Transformation

A nice way to start learning about the PDI, ETL and datawarehousing is by opening the samples folder and check the components names and its notes, those are self explanatory. If you double click on them you will see the parameters that specify each behavior. If you right click on them you can select options to see the description, input or output fields, the text description -you should document the intention of the activity in here-, preview a sample run, etc.

Once you have reviewed some transformations I recommend one to start, that is create an object fundamental to multidimensional analisys: the time dimension. This is a table with a row for each day in the calendar, has columns showing special attributes like months, quarters, years, weekends So its easy to select dates based on those columns and then select the values in the fact table just with the indexed records which contain those needed dates.

A nice specification for a time dimension table is listed in this post of Nicholas Goodman. His blog has very interesting information too.

Check this pages, download the examples and run them in your environment.

  • Kettle Tip: Using java locales for a Date Dimension – Sept 2007 (link).
    In this post, Roland Bouman, shows a simplified extraction and then proceeds to show how to connect to a database, use a SQL to create the table and execute it.
    Then it explains three more steps to generate the data.
    In here you will see the more difficult part of using PDI, the javascript step.
  • HowTo: Create a date dimension with PDI (Kettle) – March 2010 (link)
    Geschrieben Von adds more characteristics for a day and uses more PDI steps to obtain them: calculators, filters (select), lookups. This will be version 2.0 of the last example.
  • Building a detailed Date Dimension with Pentaho Kettle – Sept 2010 (link)
    In here, Slawomir Chodnicki explains briefly the desing considerations in his design. One important thing here is how he introduces the concept of updating your data on dimensions jus by re-run the transformation, this is something we must get used to. It is important if your job crashes and you have to rebuid the process or being capable to continue from a given point.
    The file contains some erros on the java scrpts steps -some variables are not defined but referenced-, it is an oportunity to see the debugger messages of PDI.

5. Working without a Repository

If you are working with a developer team you shoud create a repository. Its simple, just click on new button and with a user with DB privileges on MySql, create the database.

Then you will get a single area for your programs and avoid versioning and syncing problems, your connection also get stored, etc. But if you are one or two people (normal for a pilot project) it is best to avoid using one. You can just synck and back up your program folders. Also you don’t need to change the normal way the BI server seeks programs.

The repository really needs its special post.

Ok, if you can’t wait, read this from

6. PDI Agile Plugin

[Edit July 2011:] On version 4.2 RC1, and on, the plug-in is already included.

Head over here and download de 1.0 version for the Modeling and Visualization Plugin, aka the Analyzer Plugin for LucidEra Cleariew before being bought by Pentaho.

It adds prototyping, datasource visualization and modeling creation on data snapshots, so you drag and drop and save your work.

Unzip the content of pmv-1.0.2-stable.zip into:

/Pentaho/data-integration/plugins/spoon

It will create a folder named agile-bi.

Start spoon and you will notice three butons on the top right. That’s normal view, model and visualize. Check this video from .

I was confuse about this note but the code is open source except the analyzer presentation layer and its available here.

7. Additional articles:

These are medium and complex topics:

  • An example of the ‘generate documentation’ step. So you can add descriptions to your extractions and make use of this new step: here
  • Error handling. Since version 3.8 error flow is available from every step here is how to use it properly.
  • Handling of configuration and variables: here
  • An introduction to regular expresions. A must read for the javascript step: here
  • An impresive plugin ‘excel output’ (more complex but more impresive than the default step). So you can generate formated reports: here.
    Note: on 4.2, the excel step is integrated in the PDI.
  • Connect PDI to SAP BI as a web service here.
  • Good chapter book sample “Pentaho Data Integration 4 Cookbook”: A transformation, A report from PDI data, PDI jobs from the BI Server process/ PUC, PUC-PDI-CDA, dashboard and data from PDI.
  • In the javascript step you can see very usefull sample code for each function. DEinspanjer explains with more detail than this: On left panel -> open Transform Functions -> open Date Functions -> right click on dateDiff -> select Sample.

8. Ruby Plugin

[Edit August 31, 2011]
Its news to me that there is another way to do the scripting than the javascript step. Now you can do your process in ‘Ruby’ with another plugin that you just need to unzip on the plugin directory: ruby-scripting-plugin.

Advertisement

Complete Pentaho Installation on Ubuntu, Part 5

Prepare your ‘Production’ Server

There are a few things we can do to our working server:

Usefull Changes

  1. Add your own logo
  2. Remove user selection login option
  3. Modify login screen text
  4. Increase connection timeout
  5. Increase memory limit for apache
  6. Set up a publishing password
  7. Change your passwords
  8. Remove warning messages.

Optional Activities

  1. Rename your Server and port if necesary
  2. Use a connection pool to avoid disconnection
  3. Move the database to other machine (not standard)
  4. Customize the BI server messages (rarely needed)
  5. Email account setup
  6. Automatic startup
  7. Additional notes

Not all of them are needed in your installation, choose which one you’ll use and which ones will be common in your developer and production server. Work up to that point and make a file system [folders] copy to deploy in your server machine. Continue the configuration on that server.

I come from a developer background where an additional quality server would be set up to test work in progress and deployment packages before releasing them into production but in BI I think that’s too much, the two environments are enough to avoid shooting yourself on the foot either by over-confidence, late hour distractions or human error.

In any case the backup policy of the servers should be independent. One for continuing operation depending on how fast a machine can be set up with the last data snapshot used. The other one is an incremental backup so you’ll won’t loose more than one or few day’s work.

To do the backup you shut down the DB and BI servers and copy the folders, just remember to set up the replacement machines with the same names. Or do a SQL backup with MySql Administrator or ‘Percona XtraBackup’ that can do a hot -operational- DB backup for your InnoDB tables (ISAM tables use a table locks). For the BI folder, copying the files are enough.

1. Add your own logo

To change the top-right image of ‘Pentaho’ overwrite the logo.png image in:

3.8 [152×60 pixels]: /Pentaho/biserver-ce/tomcat/webapps/pentaho/mantle/

3.9 [152×75 pixels]: /Pentaho/biserver-ce/tomcat/webapps/pentaho/mantle/themes/onyx/images and …/themes/slate/images

By the way, the login screen logo [224×94 pixels] in 3.9 is at:

/Pentaho/biserver-ce/tomcat/webapps/pentaho-style/images/login/logo.png

2. Remove user selection login option

If you have tested your users and your admin user and password (for authorization you’ll need Admin, Authenticated and ceo roles) you can hide the user selector from the login dialog. Edit the following XML file:

/Pentaho/biserver-ce/pentaho-solutions/system/pentaho.xml

and change the content from true to false:

3.8: <login-show-users-list>false</login-show-users-list> 
3.9 and 3.10: <login-show-sample-users-hint>false</login-show-sample-users-hint> 

3. Modify login screen text

Pentaho uses the first screen to present their product to new users. You might want to replace them with your project objectives, links for specific user manuals, tips, instructional videos*, news or local resources.

Specifically -I think- you’ll want to modify the scary text “Supplied free of charge with no support… no maintenance, no warranty” into something more conforting like “Pentaho Open Source Enterprise Quality Software supported by [use your contact data]”.

You can use the ‘Komposer’ editor to easily modify the text in:

/Pentaho/biserver-ce/tomcat/webapps/pentaho/jsp/PUCLogin.jsp

* For screencast videos I recomend wink (windows freeware) it produces flash files. Or if you are adventurous, use services with text-to-speach characters like http://www.xtranormal.com or http://goanimate.com/, short videos are great for instructional guides.

This has been fixed, this is no longer needed

[Edit: In 3.9 4.0 this file should be edited to remove the demo users to apear if you click on the Evaluation Login link. Search for joe, and remove the following text (strikedthrough) and replace de text striked and bold like this:
<tr>    <td style=”font-size: .8em;”>  <strong>Administrator</strong><br>     User Name: joe<br>  Password: password</td>     <td style=”font-size: .8em;”>  <strong>Business User</strong><br>     User Name: suzy<br> Password: password</td>
</tr>  <tr>    <td colspan=”3″ style=”padding: 4px 20px 0 0; font-size: .8em;”>    <a href=”http://www.pentaho.com/helpmeout/&#8221; target=”_blank”>Request free evaluation support.</a>    <img src=”/pentaho-style/images/login/help_link.png” width=”20″ height=”20″ align=”absbottom”>    </td>    <td style=”font-size: .8em;”>Contact [your data] for user accounts and support </td> </tr>

4. Increase connection timeout

The default browser session connection time out is 30 minutes, you can change that modifying the file:

/Pentaho/biserver-ce/tomcat/conf/web.xml

The unit is in minutes so for 3 hours set it to:

<session-timeout>180</session-timeout>

and also with the same xml tag in:

/Pentaho/biserver-ce/tomcat/webapps/pentaho/WEB-INF/web.xml

5. Increase memory limit for apache

This activity was a must in previous versions, but now the parameters are ok. Anyway if you want to play with the variable -I havent found benchmarks on it- you can edit:

/Pentaho/biserver-ce/start-pentaho.sh

set memory minimum (Xms) and maximum (Xmx) in tomcat, for example: -Xms516m -Xmx1024m

CATALINA_OPTS="-Xms256m -Xmx768m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000"

6. Set up a publishing password

On future articles we’ll be saving queries, metada or OLAP models into the production web server. But you need to set the ‘publishing password’. In the file:

/Pentaho/biserver-ce/pentaho-solutions/system/publisher_config.xml

type a password:

<publisher-password>[your-password]</publisher-password>

7. Change your passwords

There are several steps for this to work:

  • Set up a new user as administrator.
    – Start PAC (administrator console).
    – Add Admin and ceo roles to your selected user.
  • Stop PAC and the BI server.
  • Change the database passwords.
    – Enter theMySql  administrator
    – Change the passwords for: pentaho_admin, pentaho_user and hibuser.
  • Re-edit the following files with the new passwords:
    – hibuser password in:
    /Pentaho/biserver-ce/pentaho-solutions/system/hibernate/
    mysql5.hibernate.cfg.xml
    -All user’s passwords in:
    /Pentaho/biserver-ce/pentaho-solutions/system/simple-jndi/
    jdbc.properties

    – hibuser password in:
    /Pentaho/biserver-ce/pentaho-solutions/system/
    applicationContext-spring-security-jdbc.xml
    and (tanks to Raul Chavarría and mahamood notes):
    /Pentaho/biserver-ce/pentaho-solutions/system/
    applicationContext-spring-security-hibernate.properties
    – hibuser and pentaho_user passwords in this identical files:
    /Pentaho/biserver-ce/tomcat/conf/Catalina/localhost/pentaho.xml
    /Pentaho/biserver-ce/tomcat/webapps/pentaho/META-INF/context.xml
  • Edit the URL tag and change joe and password to your new user and password in
    <URL>…userid=joe&amp;password=password</URL> in the file:
    /Pentaho/biserver-ce/pentaho-solutions/system/olap/datasources.xml
  • There are some examples (.xaction files)… but check them as you click on them and they report errors.
  • Reset the server and test.
  • In the PAC delete the demo users: joe, pat, suzy, tiffany

8. Remove warning messages

In 3.9.stable and up the OLAP results of jpivot shows a warning about being deprecated in favor of the new PAT OLAP browser:

JPivot has been replaced by Pentaho Analyzer.
It is provided as a convenience but will no longer be enhanced or offically supported by Pentaho.

and I think it will not be appreciated by your users. As it is really not true that PAT has all jpivot needed functionallity, not yet you can eliminate it with a CSS tag in:

/Pentaho/biserver-ce/tomcat/webapps/pentaho/adhoc/styles/

there are two files: jpivot.css y jpivotIE6.css.You need to add this at the bottom:

#deprecatedWarning { display: none; }

In the query tool WAQR, there is also a message to remove

Message Remove

In the file adhoc.css  at the bottom add:

#waqrDeprecatedAlert { display: none; }

9. Rename your Server (and Port if necesary)

Your ‘localhost’ url must be changed so it can be reached from any computer in your network. Use your machine name as its already mapped on the network.

The default port is 8080 that’s a port for web development so it is probably used on a server, check with your system administrator for a free port or take the next one available like the enterprise edition does. If its a new machine you can leave it as it is.

The same files and lines are used to change the server name and port. The most important one is the tomcat (or web application server) services definition file:

/Pentaho/biserver-ce/tomcat/webapps/pentaho/WEB-INF/web.xml

search for ‘localhost’ and modify [your server]:[your port] like BISERVER1:8086

<param-name>fully-qualified-server-url</param-name>
<param-value>http://
BISERVER1:8086/pentaho/</param-value>

now that we are in this file we’ll add the IP address from the machines will be connecting from the administration console, use a comma separated list:

<param-name>TrustedIpAddrs</param-name>
<param-value>[your.own.pcs.ip]<param-value>

save it and edit:

/Pentaho/biserver-ce/tomcat/conf/server.xml

First change the port (On 3.9 4.0):

<Connector URIEncoding=”UTF-8″ port=”8086” protocol=”HTTP/1.1″  
           connectionTimeout=”20000″  
           redirectPort=”8443″ />

then the server:

<Engine name=”Catalina” defaultHost=”BISERVER1“>

<Host name=”BISERVER1”  appBase=”webapps”
unpackWARs=”true” autoDeploy=”true”
xmlValidation=”false” xmlNamespaceAware=”false”>

Some other files may need to be edited depending if you’re puting services in diferent machines but if they remain in your server localhost will be ok, for example, the OLAP services will work with localhost in:

/Pentaho/biserver-ce/pentaho-solutions/system/olap/datasources.xml

but if you changed the port you’ll need to change it to [your-server]:[your-port] like:

<URL>http://biserver1:8086/…

[Edit with the Saiku update this is no longer necesary]
If you installed saiku then edit also:

/Pentaho/biserver-ce/tomcat/webapps/saiku/WEB-INF/web.xml

There are other files to modify if you change the localhost-port but depend on each service. Like the metadata.mxi from the schema editor, it will be mentiones on its own post.

10. Use a connection pool

For running an uninterrupted service or inclusive for more than 8 hours you might want to use a connection pool manager that adds more advantages than just maintaining an open connection to MySQL at all times.

In the Pentaho forum they recommended c3p0 as a connection pool. So download the file

c3p0-0.9.1.2.jar

and put it into library folders:

/Pentaho/biserver-ce/tomcat/webapps/pentaho/WEB-INF/lib
/Pentaho/biserver/administration-console/lib

And add this lines:

<!– connection pool–>
<property name=”connection.provider_class”>org.hibernate.connection.C3P0ConnectionProvider</property>
<property name=”hibernate.cp3p0.acquire_increment”>3</property>
<property name=”hibernate.cp3p0.idle_test_period”>10</property>
<property name=”hibernate.cp3p0.min_size”>5</property>
<property name=”hibernate.cp3p0.max_size”>75</property>
<property name=”hibernate.cp3p0.max_statements”>0</property>
<property name=”hibernate.cp3p0.timeout”>25200</property>
<property name=”hibernate.cp3p0.preferredTestQuery”>Select 1</property>
<property name=”hibernate.cp3p0.testConnectionOnCheckout”>true</property>
<property name=”hibernate.c3p0.acquireRetryAttempts”>30</property>
<property name=”hibernate.c3p0.acquireIncrement”>5</property>
<property name=”hibernate.c3p0.idleConnectionTestPeriod”>300</property>

to your hibernate connection file in:

/Pentaho/biserver-ce/pentaho-solutions/system/hibernate/mysql5.hibernate.cfg.xml

just at the top, after the tag <session-factory>. Restart your server.

11. If you’ll move the database to other machine

Update your server and ports, users and password on:

/Pentaho/biserver-ce/pentaho-solutions/system/olap/datasources.xml
/Pentaho/biserver-ce/pentaho-solutions/system/simple-jndi/jdbc.properties

Change your strings [new-server]:[new-port]

12. If you’ll need to customize the BI server messages

Some messages in the BI server (console app) can be easily changed in case you’ll need more personalized texts. There is an excelent blog article from Prashant Raju.

The tooltip description of every program can be set with the .properties file:

/Pentaho/biserver-ce/pentaho-solutions/index.properties

Even if you’ll want to edit every message you can do it in:

/Pentaho/biserver-ce/tomcat/webapps/pentaho/mantle/messages/messages.properties
/Pentaho/biserver-ce/tomcat/webapps/pentaho/mantleLogin/messages/ MantleLoginMessages.properties

Remember that every language has its sufix, for example the spanish files are ‘_es’ like messages_es.properties. And if you’re using special caracters you better make sure youre saving in UTF-8 coding (editor scren when you save).

13. Email account setup

The BI server can use a mail server if you have one set up already but in case you don’t get authorization to connect to it you can always send mail from the automated processes (ETL) using your mail acount. This is how I did it and we’ll review it with PDI.

The setup for the server is easy, if you know some parameters from your email server. Edit

/Pentaho/biserver-ce/pentaho-solutions/system/smtp-email/email_config.xml

Put your data in:

<mail.smtp.host>[your.mail.ip.address]<mail.smtp.host>
<mail.userid>[your-user]</mail.userid>
<mail.password>[your-password]</mail.password>

We’ll test it with the pentaho designer,

14. Automatic startup

The correct way to do this is adding a user and group with a perfect defined user account, make scripts and placing them on the correct folders.

The fast track is based on the pentaho wiki like:

  • Edit a file in a terminal/command window, type:
    sudo gedit /etc/init.d/pentaho
  • Add this text (Important: Don’t forget to use YOUR user folder):
    case “$1” in
    start)
    /home/[your-user]/Pentaho/biserver-ce/start-pentaho.sh > /tmp/pentaho.out
    ;;
    stop)
    home/[your-user]/Pentaho/biserver-ce/stop-pentaho.sh
    ;;
    esac
    exit 0
  • Save and exit. Then make the file executable:
    sudo chmod +x /etc/init.d/pentaho
  • update the boot scripts
    sudo update-rc.d pentaho defaults

You can restart your server.

Note: How to set up a tomcat pentaho windows service:

  1. Open a terminal window as administrator.
    Click on start on exec, type cmd, right click on list, select run as admin
  2. Change folder to your pentaho-tomcat excecutables:
    C:\Pentaho\biserver-ce\tomcat\bin
  3. Register the service:
    tomcat6.exe //IS//Tomcat6 –DisplayName=”PentahoBIServer”
  4. Start the configuration application:
    tomcat6w //ES//PentahoBIServer
  5. On ‘Log On’ tab, select the windows user you used for the instalation or use another one that has ownership of the folder as some temporal files requires it.
  6. On ‘java’ tab tipe:
    1. Virtual machine:
      C:\Program Files\Java\jdk1.6.0_25\jre\bin\server\jvm.dll
      Classpath:
      C:\Pentaho\biserver-ce\tomcat\bin\bootstrap.jar CD:\Pentaho\biserver-ce\tomcat\bin\tomcat-juli.jar
    2. On ‘java options’
      a. type (use your path):
      -Dcatalina.base=C:\Pentaho\biserver-ce\tomcat      -Dcatalina.home=C:\Pentaho\biserver-ce\tomcat      -Djava.endorsed.dirs=C:\Pentaho\biserver-ce\tomcat\endorsed      -XX:MaxPermSize=256m      -Dsun.rmi.dgc.client.gcInterval=3600000      -Dsun.rmi.dgc.server.gcInterval=3600000      -Djava.io.tmpdir=C:\Pentaho\biserver-ce\tomcat\temp      -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager      -Djava.util.logging.config.file=C:\Pentaho\biserver-ce\tomcat\conf\logging.properties
      b. Initial memory pool:        512
      c. For max memory pool:  1024
      d. Thread stack size:          64
    3. On ‘startup’ tab:
      a. Class:           org.apache.catalina.startup.Bootstrap
      b. Arguments:  start
      c. Mode:           jvm
    4. On ‘shutdown’ tab:
      a. Class            org.apache.catalina.startup.Bootstrap
      b. Arguments   stop
      c. Mode           jvm
  7. Apply and close the application
  8. Use services.msc and change its properties to automatic start
  9. Restart your PC-Server

15. Additional Notes

A. There are some errors in the console or logs (/tomcat/logs/) that are not really important, but you can get rid of one (sooner or later someone is going to ask about them): To avoid the “WARNING: Security role name PENTAHO_ADMIN used in an <auth-constraint> without being defined in a <security-role>”. Add before the security-constraint tag in:

/Pentaho/biserver-ce/tomcat/webapps/pentaho/WEB-INF/web.xml

add

<security-role>
   <description>security role</description>
   <role-name>PENTAHO_ADMIN</role-name>
</security-role>
<security-constraint>

That’s it. If you know something else to add, please let me know.