What is Dataflow Discoverer?
Dataflow Discoverer (DFLD) is a command-line discovery utility that detects SSIS dataflow columns and writes their metadata to a SQL Server lookup table. DFLD detects dataflows nested within containers at any level of nesting.

Why use Dataflow Discoverer?
The SSIS dataflow column metadata generated by the DFLD utility can be used to enhance the error-reporting capability of your SSIS packages. For example, it lets you resolve the name of the ErrorColumn reported by an Error Output, which SSIS does not expose by default.

What are the minimum system requirements to use Dataflow Discoverer?
  1. .NET Framework 4 or above
  2. SQL Server database (any version)

How to install Dataflow Discoverer?
Verify that your system meets the minimum requirements, then:
  1. Download the SQL script from the Downloads page and execute it on your SQL Server. By default the script creates a SQL Server database named ETLMetadata and a lookup table named DFLMetadata. You can also create the DFLMetadata table in any other SQL Server database; in that case, comment out the database-creation segment of the script. (A sketch of what the script creates is shown after this list.)
  2. Download the DFLD executable from the Downloads page and copy it to a local folder. It is recommended to add the executable's folder to the System Path environment variable.
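
The sketch below approximates what the downloaded script creates. It is reconstructed from the metadata table documented later on this page and assumes the default dbo schema, so treat the column names and data types as indicative only; the script on the Downloads page is authoritative.

-- Rough sketch of the objects created by the script (not the actual script)
CREATE DATABASE ETLMetadata;
GO
USE ETLMetadata;
GO
CREATE TABLE dbo.DFLMetadata (
    PackageID             VARCHAR(38),   -- unique GUID of the package
    PackageName           VARCHAR(200),
    DataflowID            VARCHAR(38),   -- GUID of the dataflow
    DataflowName          VARCHAR(200),
    IOType                CHAR(1),       -- I = input collection, O = output collection
    IOName                VARCHAR(200),
    IOID                  INT,
    ColumnName            VARCHAR(200),
    ColumnDesc            VARCHAR(500),
    LineageID             INT,           -- identifies the column in the dataflow buffer
    ColumnID              INT,           -- identifies a column within a dataflow
    ColumnDataType        VARCHAR(30),
    ColumnPrecision       INT,
    ColumnScale           INT,
    UpstreamComponentName VARCHAR(200),
    SortKeyPosition       INT,
    ColumnCodePage        INT,
    MappedColumnID        INT,
    UsageType             INT,
    ColumnType            INT
);
GO
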
How to use Dataflow Discoverer?
From the command prompt, type:
dfld /param1:value1 /param2:value2 /param3:value3 ... /paramN:valueN

List of Parameters:
  • /package (shorthand /p), mandatory: Fully qualified path of the SSIS package in the file system. If the path contains spaces, enclose it in double quotes.
  • /server (shorthand /s), mandatory: IP address or name of the database server hosting the DFLMetadata table. If an instance name applies, specify it as ServerName\InstanceName.
  • /database (shorthand /d), mandatory: Name of the database hosting the DFLMetadata table. This is usually ETLMetadata, unless you have deployed the DFLMetadata table to a different database.
  • /user (shorthand /u), optional: Database user name for SQL authentication. Not required for Windows authentication.
  • /pwd (shorthand /pwd), optional: Password of the database user for SQL authentication. Not required for Windows authentication.


If any parameter value contains spaces, enclose it in double quotes. From the command prompt, type dfld /? for more help.
Examples:
  • Database uses Windows authentication
dfld /p:C:\users\eagle\populateDim.dtsx /s:localhost /d:ETLMetadata
  • Database uses SQL Server authentication
dfld /p:C:\users\eagle\populateRef.dtsx /s:localhost /d:ETLMetadata /u:userxyz /pwd:XXXXXXX
  • Package path contains spaces
dfld /p:"C:\users\eagle\populate fact.dtsx" /s:localhost /d:ETLMetadata

On successful completion, the console displays summary statistics for the run.

What metadata is captured by Dataflow Discoverer?

  • PackageID (Varchar(38)): Unique GUID of the package.
  • PackageName (Varchar(200)): Name of the package. It is recommended to keep package names under 200 characters.
  • DataflowID (Varchar(38)): GUID of the dataflow.
  • DataflowName (Varchar(200)): Name of the dataflow. It is recommended to keep dataflow names under 200 characters.
  • IOType (Char(1)): Flag indicating whether the collection is an input or an output. I = input collection, O = output collection.
  • IOName (Varchar(200)): Name of the input/output collection.
  • IOID (Int): ID of the input/output collection.
  • ColumnName (Varchar(200)): Name of the column. It is recommended to keep column names under 200 characters.
  • ColumnDesc (Varchar(500))
  • LineageID (Int): Integer value that identifies the column in the dataflow buffer.
  • ColumnID (Int): Unique value that identifies a column within a dataflow in the package.
  • ColumnDataType (Varchar(30))
  • ColumnPrecision (Int)
  • ColumnScale (Int)
  • UpstreamComponentName (Varchar(200))
  • SortKeyPosition (Int)
  • ColumnCodePage (Int)
  • MappedColumnID (Int)
  • UsageType (Int)
  • ColumnType (Int)
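
Once a package has been discovered, the captured metadata can be browsed with an ordinary query. The example below is illustrative only: it assumes the default dbo schema and uses the populateDim.dtsx package from the earlier examples; the exact value stored in PackageName (with or without the .dtsx extension) depends on how DFLD records it, so adjust the filter to suit.

-- Illustrative query: list the columns DFLD captured for one package
SELECT DataflowName, IOType, IOName, ColumnName, ColumnDataType, LineageID
FROM dbo.DFLMetadata
WHERE PackageName LIKE 'populateDim%'
ORDER BY DataflowName, IOName, ColumnName;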


How to get name of ErrorColumn from the Error Outputs in my SSIS Package?
From your SSIS dataflow, look up the DFLMetadata table to fetch the name of the ErrorColumn. Check out the post "SSIS: How to get name of ErrorColumn from Error Output?", which demonstrates how this is done.
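
As a minimal sketch of the lookup involved: the query below assumes the default dbo schema, that the ErrorColumn value exposed on the error output carries the lineage ID of the failing column (the behaviour in the SSIS versions this utility targets; verify for your version), and that PackageID is stored in the same {GUID} format as the System::PackageID variable. OLE DB-style parameter markers are shown for illustration; the linked post shows the full implementation.

-- Sketch of the lookup used in the error-output path of a dataflow
SELECT ColumnName
FROM dbo.DFLMetadata
WHERE PackageID = ?   -- map to System::PackageID of the running package
  AND LineageID = ?;  -- map to the ErrorColumn value from the error output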

At what stage in my ETL development cycle should I use Dataflow Discoverer?
You can generate the dataflow column metadata using DFLD at any stage in your SSIS package development cycle. It is recommended to refresh your dataflow column metadata just before deployment when your package is ready to go.

How to capture/refresh Dataflow column metadata from multiple packages using Dataflow Discoverer?
Keeping DFLMetadata up to date is a typical operational requirement. This can be automated by creating an SSIS package that scans your deployment folder for package (*.dtsx) files and, for each package found, executes DFLD using an Execute Process Task in the control flow. Check out the post "DFLD: How to capture/refresh Dataflow column metadata from multiple packages using Dataflow Discoverer?", which demonstrates how this is done.

Where could I find the documentation for the source code of Dataflow Discoverer?
Download the source code from the Downloads page. The source code is documented well enough to be self-explanatory.

Author's Blog

http://bennyaustin.wordpress.com
