For my Netflix Search Engine project, Flicksery, I get data from Netflix as XML files. The catalog file for instant titles is around 600 MB. Following is a sample XML entry for one title/movie.
As you can see it is not very easy to read. Not only the alignment, but the number of elements too. May be we could use a better text editor or an XML editor to see this properly. But it still would be very difficult to play around with the data and do any kind of transformation or analysis. Talend Open Studio is something that can be really useful to analyze the data embedded within the large XML.
In this post we will try and analyze the XML file using some really neat features available in Talend Open Studio.
Let’s get started:
Open up Talend Open Studio and create a new project – NetflixDataAnalysis:
Right click on Job Designs and select Create job – InstantFeedTest
Right click on File xml and select Create file xml:
This brings up a wizard. Enter the details as shown below and click Next:
In step 2 of the wizard select Input XML and click, Next.
In step 3 of the wizard select the XML file. For this test, I took only part of the XML file as loading the entire 600 MB file would cause Java Heap issues and would prevent the file from loading correctly. As we just want to analyze and see the different fields available in the XML, a sample should be sufficient. Once the file is selected you should see the schema of the XML file in the Schema Viewer.
Step 4 is where you start to see the real power of Talend Open Studio. The Source Schema list on the left displays the schema of the XML file. The Target Schema section provides you with a way of defining an output schema for the XML. Using XPath you can now define the required elements from the input XML file. You can drag the element which will repeat itself in the XML to Xpath loop expression section. In this case the element catalog_title is the element that embeds all information for a single movie/title.
Next, you can traverse through all the nodes on the left and drag the required elements to the right under the Fields to extract section. You can also provide custom column names under the Column Name section. Once you are done dragging all the required fields, click on Refresh Preview to see a preview of the data. This preview helps one get a quick idea of how the data will be parsed. Click Finish.
Double click on the job, InstantFeedTest to open it up in the workspace. Drag the newly created XML Metadata, NetflixInstantFeed. Also, drag the tFileOutputDelimited_1 component from the Palette on the right.
Select the tFileOutputDelimited_1 node and go to the “Component” tab at the bottom of the workspace. Update the configurations, Field Separator to “,” and set the File Name to the desired path and name. We are now ready to test out our job. Click on the Run icon on the toolbar to execute the job. Once executed you should see processing status as shown below:The above job is going to read the XML file, extract the fields and generate a comma separated text file with the extracted data.
http://api-public.netflix.com/catalog/titles/movies/780726,The Mummy,http://cdn0.nflximg.net/images/0270/2990270.jpg,When British archaeologists uncover the ancient sarcophagus of a mummified Egyptian priest (Boris Karloff), they foolishly ignore its warning not to open the box. Now brought back to life, the mummy tries to resurrect the soul of his long-dead love.,,Top 100 Thrills nominee,,,1346482800,4102444800,NR,MPAA,4388,1.77:1,1932,Classic Movies,3.5,1387701023800,,,
As you can see the big XML node has now more readable as a simple comma separated record. This was a simple one-to-one mapping to from XML to CSV. Talend Open Studio is way more powerful than this. You can add a new component to the job to apply transformations to data coming in from the XML.
As you see in the above record the first column/value in the comma separated record is a link. All I am interested is in the last 6 digits of the link. I want my final output to have only 6 digits and not the entire link. To do this delete the link between NetflixInstantFeed and tFileOutputDelimited_1 nodes. Next, drag tMap_1 component from the Pallete to the Job workspace. Right click on NetflixInstantFeed, select, Row->Main and join it to tMap_1. Next, right click on tMap_1, select, Row->New Output (Main) and join it to tFileOutputDelimited_1 node. You will be prompted to enter a name. The name entered for this example is processed_op. Once done, the job should now look as shown below:
Select the tMap_1 component and click on the Map Editor button on Component tab at the bottom of the workspace. The Map Editor opens up with the metadata information from the XML file on the left and desired output on the right. As you see below I have dragged all the columns from the left to the right. The only modification I have done is for the “id” column. I have applied a function to get only the last 6 digits from the right.
As you can see we can easily apply functions to transform the input data. The function shown above is a StringHandling function. There are several other functions that can be applied using the Expression Builder as shown below:
After you are done applying your function, click OK to close the screen. Now, you can re-run the job to see if the transformation has been applied correctly. After a successful run you should see the results of the job as shown below:
Let us look at the output file to see the effect of the transformation:
780726,The Mummy,http://cdn0.nflximg.net/images/0270/2990270.jpg,When British archaeologists uncover the ancient sarcophagus of a mummified Egyptian priest (Boris Karloff), they foolishly ignore its warning not to open the box. Now brought back to life, the mummy tries to resurrect the soul of his long-dead love.,,Top 100 Thrills nominee,,,1346482800,4102444800,NR,MPAA,4388,1.77:1,1932,Classic Movies,3.5,1387701023800,,,
We have successfully built a job to transform an XML file to a comma separated file. This job can be exported and run as a standalone job on any environment running Java. Also, we chose to output the data to CSV file, however, Talend Open Studio can read from multiple data formats, databases and also write to different file formats or directly insert into databases.
This was just a quick introduction (tip of the iceberg) to experience the usefulness of Talend Open Studio for data processing. The features of this tool are vast and can’t be covered in a blog post. But this should get you started on using Talend. Hope you found this fast paced tutorial useful.