Performing Lookups and Transformations with Talend

December 22, 2009 Appirio

Ward Loving

In this article, I’d like to explore the workhorse of the Talend Open Studio – the tMap component. As you’d expect, this component allows you to visually map fields from an input source to fields on an output stream. It also gives you the ability to do lookups to other input files or databases while performing transformations on your data with basic Java expressions. We’re going to manipulate some text files to give you a feel for this component’s capabilities, but before we dive in let’s discuss a scenario or two where you might need to use the tMap component:

Use Case A: You’re on a space walk doing some routine maintenance and you look up and see that your toolbox is slowly drifting out of reach into geosynchronous orbit.

Use Case B: You have some data that you need to move from one system to another. Three individual address fields in the source data need to be consolidated into a single text area field in the destination system. You also have a requirement to ensure that only valid email addresses are to be included the output files, but you don’t want to lose a whole record if an email address is invalid.

Sadly, data migration skills are probably not going to help you much with Use Case A. You can skip this article. But, with Use Case B, you’re in business. We’ll start where we left off in the TEST TALEND project that I created in the previous article in June. We won’t be reusing the code from the previous example but you might want to walk through the steps of setting up the repository and creating a new project if you haven’t done that before. You can find and download the latest Talend Open Studio by going to http://www.talend.com.

Open up the Talend Open Studio and select a project. Right-click on the Job Designs folder in the Repository tab and select Create Job. Go ahead and name the job – I called mine Second Job.

Fig 1 – Create Job

I’m working quite a bit with data migrations to Salesforce.com these days, and although you can pull data directly from Salesforce using Talend, I often find myself using Talend to work with files rather than pulling data out of Salesforce.com directly. Talend doesn’t (yet) allow us to adjust the number of records exported in each batch, so it is much quicker to use the Salesforce.com Data Loader to pull records out of Salesforce.com. The difference in performance between the two tools is not a big deal for a few thousand rows – but when you’re talking hundreds of thousands or millions of rows of data, every optimization is helpful. For this example let’s use Talend’s tFileInputDelimited component, which can process the ‘lingua franca’ of Salesforce.com data migrations – the csv file. You can find this under File->Input in the components list, or you can just type tFileInput into the Palette Search field.

Fig 2 – Search Palette

Drag the tFileInputComponent onto the workspace.

Fig 3 – Workspace

We’re going to point this tFileInputComponent to a file. You can simply cut and paste the following data into a file called “sample_input_data.csv”:

FirstName, LastName, Email, Address1, Address2, Address3, GenderCode
John, Smith, jsmith@yahoo.com, 666 5th Ave, 19th Floor, Suite 15, M
Jane, Doe, jdoe@gmail.com, 444 8th Ave, Snyder Plaza, Building 7, F
Ralph, Emerson, invalid@email!com, 108 Street,, Apt 3B, M
Edgar, Poe, epoe@scarystory.com, 1000 Boston Street,,, M

For this example, I’ve put the file in my C:\data directory. Now if you double-click on the tFileInputComponent in the workspace, it will bring up the Basic Settings of the Component tab at the bottom of the workspace.

Fig 4 – Component Tab – Basic Settings

I’ve highlighted the following settings in Fig 4 above which need to be updated:

  • Select the File Name for the sample_input_data file.
  • Select a comma, rather than a semi-colon, as the field separator.
  • Select the csv option checkbox.
  • Put “1” in the Header field to designate that your input file has column headers.

We’ll also need to input the schema for our csv file, so click Edit Schema and enter the column headers for your file as shown in the figure below. Clicking the green plus sign adds a column entry and clicking the red cross removes one. There are also some handy controls here to import and export your schema entries. This can save you a lot of time when you are working with the same schema across multiple files. Click OK when you’ve completed the schema entries.

Fig 5 – Schema Dialog

Now switch to the Advanced Settings panel and click the “Trim All Columns” checkbox. As far as I can tell, there is almost no reason why you wouldn’t want to trim the fields of your csv files. Making it a habit to check this will save you a lot of headaches performing matching against your lookup files.
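As a concrete illustration of the kind of mismatch that trimming prevents, here is a plain Java sketch (not Talend-generated code, just standard string handling) of a lookup key that arrives with a stray trailing space:

```java
public class TrimDemo {
    public static void main(String[] args) {
        // "M " with a trailing space, as it might arrive from a csv field
        String rawCode = "M ";
        System.out.println(rawCode.equals("M"));        // false: lookup match fails
        System.out.println(rawCode.trim().equals("M")); // true:  match succeeds
    }
}
```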

Fig 6 – Component Tab – Advanced Settings

You can also change the name of your component in the workspace on the View panel. I’ve changed the label for my tFileInputComponent to “Sample Input File” by updating the Label Format field.

Fig 7 – Component Tab – View

Next we’ll create a simple lookup file to process our Gender Codes. Cut and paste the following data into another file called “gender_lookup.csv”:

Code,Value
M,Male
F,Female

We’ll need to repeat the steps shown in Figures 3 – 7 for the second tFileInputComponent which is pointing to the gender lookup file:

  • Drag the tFileInputComponent onto the workspace
  • Update the Basic Settings – File Name, Separator, CSV Option, and Header
  • Create the schema for the Lookup file. It will have two columns: Code and Value.
  • Check the “Trim all columns” checkbox on the Advanced Tab
  • Update the component to a more user-friendly name (Optional). I’ve named mine “Gender Lookup”

Now, in addition to input, our migration process is also going to have output. Let’s stick with the delimited file format for now and create another csv file for the output document using the tFileOutputDelimited component. It can be found in the File->Output section of the Component Palette. Go ahead and drag the tFileOutputDelimited component into the workspace.

Fig 8 – Component Tab – Basic Settings – tFileOutput Component

The steps here to update a tFileOutputComponent aren’t that different from the tFileInputComponent:

  • Select the File Name for the sample_output_data file.
  • Select a comma, rather than a semi-colon, as the field separator.
  • Select the Include Header checkbox to designate that you’d like your output file to have column headers.

We’ll also need to update the schema for the output component. Remember we have a requirement to consolidate the address fields from three separate fields down to one, so our schema is going to change a little from the input file version.

Fig 9 – tFileOutput Schema Dialog

Click OK when you’ve finished entering your schema. Now we’ll go to the Advanced Settings panel and click the “CSV Options” checkbox to ensure that any wandering commas are properly escaped.

Fig 10 – Component Tab – Advanced Settings – tFileOutputComponent

Switch to the View panel to update the name of the tFileOutputComponent. I’ve updated mine to “Sample Output File”. Your workspace should look something like this.

Fig 11 – Workspace with Input and Output Components

All our efforts have been in preparation for this moment — so don’t blow this big guy. If you can take the pressure, and I know that you can, select the tMap component from the component palette under the Processing header and drag it onto the workspace.

Fig 12 – tMap

The crowd roars. That wasn’t so bad now, was it?

The next step is to create the flow for our job. Right click the Sample Input File component and select Row->Main as shown below.

Fig 13 – Create Flow

Connect the little plug to your tMap component. You also need to do the same thing with your Gender Lookup tFileInputComponent. When you right click and select Row->Main from the tMap to the Sample Data Output component the following dialog will pop up. I’ve named the flow “toOutputFile” as shown below. It is possible to create multiple outputs for each job but we just need one for this example.

Fig 14 – Output Name Dialog

A second dialog will come up asking if you want to get the schema of the target component. You do. When you’ve done all this your job should look something like this:

Fig 15 – Workspace with Connected Components

Double-click on the tMap component to bring up the Map Editor. Your input fields are on the left side of the screen and your output fields are on the right. To map one field to another simply drag the field from the left to the right side of the screen and drop it into the Expression column.

Fig 16 – Mapping Fields

After matching row1.FirstName to toOutputFile.FirstName you’ll see a yellow arrow marking the association. Match up the LastName field as well.

You can concatenate all three address fields together by simply dragging each of them to the single address field on the output side of the map editor.

Fig 17 – Address Mapping

Unfortunately, this turns out to be a little crude. If one of the address fields is blank, then you end up with unsightly gaps in your address. In Salesforce.com most street address fields are text areas, so it’s nice to preserve the address lines that are captured in the source data. Click the button labeled ‘…’ next to the toOutputFile Address field in the tMap Editor to bring up the Expression Editor. The button is marked with a red arrow in Figure 17.

Fig 18 – Expression Editor

Cut and paste the following code into the expression editor for more adult address processing:

(row1.Address1 != null && row1.Address1.length() > 0 ? row1.Address1 : "") +
(row1.Address2 != null && row1.Address2.length() > 0 ? "\n" + row1.Address2 : "") +
(row1.Address3 != null && row1.Address3.length() > 0 ? "\n" + row1.Address3 : "")

This code simply checks to see if a value is present before adding a linefeed to the address line entry. Click Ok to update the expression for your Address field and return to the Map Editor.
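If you want to sanity-check the expression outside of Talend, the same logic can be wrapped in a plain Java method. The class and method names below are hypothetical – this is a sketch of the logic, not anything Talend generates:

```java
public class AddressJoinDemo {

    // Mirrors the tMap expression: append a linefeed only when the
    // next address line actually has content.
    static String joinAddress(String a1, String a2, String a3) {
        return (a1 != null && a1.length() > 0 ? a1 : "")
             + (a2 != null && a2.length() > 0 ? "\n" + a2 : "")
             + (a3 != null && a3.length() > 0 ? "\n" + a3 : "");
    }

    public static void main(String[] args) {
        // Full address: three lines, no gaps
        System.out.println(joinAddress("666 5th Ave", "19th Floor", "Suite 15"));
        // Sparse address: no trailing linefeeds
        System.out.println(joinAddress("1000 Boston Street", "", ""));
    }
}
```

Feeding it a record with empty middle lines shows there are no blank gaps in the result, which is exactly the behavior we want in the output file.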

To filter out invalid email addresses we can add a regular expression to the Expression Editor. Click the ‘…’ button next to the Email field and enter this expression:

(row1.Email != null &&
row1.Email.matches("^[\\w\\.-]+@([\\w\\-]+\\.)+[A-Za-z]{2,4}$") ? row1.Email.toLowerCase() : "")
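Outside of Talend, you can check what this expression keeps and drops with a small standalone sketch (hypothetical class and method names; note that each regex backslash must be doubled inside a Java string literal):

```java
public class EmailFilterDemo {

    // Same pattern as the tMap expression; backslashes are doubled
    // because this is a Java string literal.
    static final String EMAIL_PATTERN = "^[\\w\\.-]+@([\\w\\-]+\\.)+[A-Za-z]{2,4}$";

    static String cleanEmail(String email) {
        return (email != null && email.matches(EMAIL_PATTERN))
                ? email.toLowerCase() : "";
    }

    public static void main(String[] args) {
        System.out.println(cleanEmail("JSmith@Yahoo.com"));   // kept, lowercased
        System.out.println(cleanEmail("invalid@email!com"));  // replaced with ""
    }
}
```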

This regular expression isn’t a perfect email validator, but it keeps out most of the riff-raff. If an email doesn’t match the regular expression, it probably isn’t a valid address and we’re just replacing the invalid text with an empty string. An Email type field in Salesforce.com will actually bounce records which have an invalid email format when you load them to the system, which may not be the desired behavior. We’ve only scratched the surface of the capabilities of the TOS Expression Editor. It has all kinds of string, numerical, and date processing functions available to you. The most frequent idiom that I use is a simple java ternary operator. In fact, we could use this idiom to process our gender codes if we wanted to:

(row1.GenderCode != null && row1.GenderCode.equals("M") ? "Male" :
     (row1.GenderCode != null && row1.GenderCode.equals("F") ? "Female" : ""))

This code simply resolves the GenderCode to “Male” or “Female” if the code equals “M” or “F”, and returns an empty string if it doesn’t match either one. But let’s resolve our gender codes in more stylish fashion using a lookup.

The simplicity of this example belies the power of this capability. It is possible to resolve foreign key relationships in this way between files with hundreds of thousands or even millions of records. As a way to create foreign key relationships for data migrations, Talend’s lookups are more powerful than Excel VLookups and much more repeatable. There is no limit to the number of lookups you can include. My personal record is eleven in a single job.

Although Talend can write temp files to disk for large lookups (that is the button that looks like a disk drive on the row2 header in the Map Editor), I have found that you can run into memory issues if your data input has several million rows and your lookup file has several million rows as well. There’s only so much heap to go around. The way I’ve worked around this limitation is to divide the input file into parts using a separate Talend job, and then iterate through the parts using a tFileList component. You can check the Talend component documentation for more information about tFileList configuration.
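Under the hood, a tMap lookup behaves like an in-memory hash join. Here is a rough Java sketch of the idea – hypothetical names, hard-coded lookup data instead of reading gender_lookup.csv, and not Talend’s actual generated code:

```java
import java.util.HashMap;
import java.util.Map;

public class GenderLookupDemo {

    // The lookup table as tMap would hold it in memory;
    // hard-coded here instead of loaded from gender_lookup.csv.
    static final Map<String, String> GENDER_LOOKUP = new HashMap<>();
    static {
        GENDER_LOOKUP.put("M", "Male");
        GENDER_LOOKUP.put("F", "Female");
    }

    // Returns null for an unmatched code; with Inner Join checked,
    // tMap drops the whole row instead of emitting it.
    static String resolve(String code) {
        return GENDER_LOOKUP.get(code);
    }

    public static void main(String[] args) {
        System.out.println(resolve("M")); // Male
        System.out.println(resolve("X")); // null -> row would be dropped
    }
}
```

Because the lookup is a hash probe per row rather than a scan, this is what lets Talend join files with millions of rows – as long as the lookup table fits in the heap.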

To create a lookup relationship between the Gender Code and the Gender Lookup file, drag the row1.GenderCode field down to the Expr. Key column of the row2.Code field. You should see a purple arrow marking the association – or is it mauve? You know, these are the things that keep me up at night.

Fig 19 – Lookup Mapping

Next, drag the row2.Value field to the toOutputFile.Gender field on the right side of the editor window. That’s all there is to it. Your complete mapping should look like the following.

Fig 20 – Completed Mapping

If you click the Inner Join checkbox on the header of the row2 editor, shown in Fig 19 above, only row1 records that have a matching gender code will be written to your output file. Click OK to close the Map Editor and let’s run this bad boy. Select the Run tab at the bottom of the screen and press the Run button.

Fig 21 – Run the Second Job

Check your output file to see the results. It should look something like this:

FirstName,LastName,Email,Address,Gender
"John","Smith","jsmith@yahoo.com","666 5th Ave
19th Floor
Suite 15","Male"
"Jane","Doe","jdoe@gmail.com","444 8th Ave
Snyder Plaza
Building 7","Female"
"Ralph","Emerson","","108 Street
Apt 3B","Male"
"Edgar","Poe","epoe@scarystory.com","1000 Boston Street","Male"

Well, you haven’t retrieved a $100K grease gun from the abyss, but you have added a couple of tricks to your data migration toolkit. I encourage you to download the Open Studio and try it out.
