Wednesday, May 12, 2010
Informatica Random
The Integration Service ensures that the target database code page is a superset of the source database code page.
Wednesday, May 5, 2010
Lookup (Informatica)
Transformation type:
Passive
Connected/Unconnected
The Lookup transformation is used to look up data in a flat file, relational table, view, synonym, or source qualifier.
The following tasks can be performed with a Lookup transformation:
- Get a related value. Retrieve a value from the lookup table based on a value in the source. For example, the source has an employee ID. Retrieve the employee name from the lookup table.
- Perform a calculation. Retrieve a value from a lookup table and use it in a calculation. For example, retrieve a sales tax percentage, calculate a tax, and return the tax to a target.
- Update slowly changing dimension tables. Determine whether rows exist in a target.
Relational or flat file lookup:
Use ODBC for a relational lookup; use the Flat File Wizard for a flat file lookup.
Pipeline lookup:
Perform a lookup on application sources such as JMS, MSMQ, or SAP. Drag the source into the mapping and associate the Lookup transformation with the source qualifier.
Connected or unconnected lookup:
A connected Lookup transformation receives source data, performs a lookup, and returns data to the pipeline. An unconnected Lookup transformation is not connected to a source or target. A transformation in the pipeline calls the Lookup transformation with a :LKP expression. The unconnected Lookup transformation returns one column to the calling transformation.
Points to remember:
- An unconnected lookup can be called from more than one port.
- An unconnected lookup can be called by one or more transformations.
- More than one column can be returned from an unconnected lookup by concatenating the fields in the SQL override (see the sketch after this list).
- A persistent lookup cache can be saved in a file and shared by multiple lookups.
- A persistent lookup cache is used when the cached data is not going to change for a long time. This cache can be shared across multiple phases of the project.
- To improve performance when processing relational or flat file lookup sources, create a pipeline Lookup transformation instead of a relational or flat file Lookup transformation.
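As an illustration of the SQL-override point above, here is a minimal sketch of an override that concatenates two columns into a single return port. The table and column names (EMPLOYEES, EMP_NAME, DEPT_NAME, EMP_ID) are hypothetical, and Oracle-style string concatenation is assumed.
-- Hypothetical lookup SQL override for an unconnected lookup:
-- EMP_NAME and DEPT_NAME are concatenated into one return port (EMP_INFO),
-- which the calling expression can split apart downstream.
SELECT E.EMP_NAME || '~' || E.DEPT_NAME AS EMP_INFO,
       E.EMP_ID AS EMP_ID
FROM   EMPLOYEES E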
Cache the lookup source to improve performance. If you cache the lookup source, you can use a dynamic or static cache. By default, the lookup cache remains static and does not change during the session.
With a dynamic cache, the Integration Service inserts or updates rows in the cache. When you cache the target table as the lookup source, you can look up values in the cache to determine if the values exist in the target. The Lookup transformation marks rows to insert or update the target.
Option | Lookup Type | Description
Lookup SQL Override | Relational | Overrides the default SQL statement to query the lookup table. Specifies the SQL statement you want the Integration Service to use for querying lookup values. Use only with the lookup cache enabled. For more information, see Lookup Query.
Lookup Table Name | Pipeline Relational | The name of the table or source qualifier from which the transformation looks up and caches values. When you create the Lookup transformation, choose a source, target, or source qualifier as the lookup source. You can also import a table, view, or synonym from another database when you create the Lookup transformation. If you enter a lookup SQL override, you do not need to enter the Lookup Table Name.
Lookup Source Filter | Relational | Restricts the lookups the Integration Service performs based on the value of data in Lookup transformation input ports. For more information about filtering source data from lookups, see Filtering Lookup Source Rows.
Lookup Caching Enabled | Flat File Pipeline Relational | Indicates whether the Integration Service caches lookup values during the session. When you enable lookup caching, the Integration Service queries the lookup source once, caches the values, and looks up values in the cache during the session. Caching the lookup values can improve session performance. When you disable caching, each time a row passes into the transformation, the Integration Service issues a select statement to the lookup source for lookup values. Note: The Integration Service always caches flat file and pipeline lookups.
Lookup Policy on Multiple Match | Flat File Pipeline Relational | Determines which rows the Lookup transformation returns when it finds multiple rows that match the lookup condition. You can select the first or last row returned from the cache or lookup source, or report an error. Or, you can allow the Lookup transformation to use any value. When you configure the Lookup transformation to return any matching value, the transformation returns the first value that matches the lookup condition. It creates an index based on the key ports instead of all Lookup transformation ports. If you do not enable the Output Old Value On Update option, the Lookup Policy On Multiple Match option is set to Report Error for dynamic lookups. For more information about lookup caches, see Lookup Caches.
Lookup Condition | Flat File Pipeline Relational | Displays the lookup condition you set in the Condition tab.
Connection Information | Relational | Specifies the database containing the lookup table. You can define the database in the mapping, session, or parameter file. By default, the Designer specifies $Source if you choose an existing source table and $Target if you choose an existing target table when you create the Lookup transformation. You can override these values in the session properties. The Integration Service fails the session if it cannot determine the type of database connection.
Source Type | Flat File Pipeline Relational | Indicates that the Lookup transformation reads values from a relational table, flat file, or source qualifier.
Tracing Level | Flat File Pipeline Relational | Sets the amount of detail included in the session log.
Lookup Cache Directory Name | Flat File Pipeline Relational | Specifies the directory used to build the lookup cache files when you configure the Lookup transformation to cache the lookup source. Also used to save the persistent lookup cache files when you select the Lookup Persistent option. By default, the Integration Service uses the $PMCacheDir directory configured for the Integration Service.
Lookup Cache Persistent | Flat File Pipeline Relational | Indicates whether the Integration Service uses a persistent lookup cache, which consists of at least two cache files. If a Lookup transformation is configured for a persistent lookup cache and persistent lookup cache files do not exist, the Integration Service creates the files during the session. Use only with the lookup cache enabled.
Lookup Data Cache Size / Lookup Index Cache Size | Flat File Pipeline Relational | Default is Auto. Indicates the maximum size the Integration Service allocates to the data cache and the index cache in memory. You can configure a numeric value, or you can configure the Integration Service to determine the cache size at run time. If you configure the Integration Service to determine the cache size, you can also configure a maximum amount of memory for the Integration Service to allocate to the cache. If the Integration Service cannot allocate the configured amount of memory when initializing the session, it fails the session. When the Integration Service cannot store all the data cache data in memory, it pages to disk. Use with the lookup cache enabled.
Dynamic Lookup Cache | Flat File Pipeline Relational | Indicates to use a dynamic lookup cache. Inserts or updates rows in the lookup cache as it passes rows to the target table. Use only with the lookup cache enabled.
Output Old Value On Update | Flat File Pipeline Relational | Use with dynamic caching enabled. When you enable this property, the Integration Service outputs old values out of the lookup/output ports. When the Integration Service updates a row in the cache, it outputs the value that existed in the lookup cache before it updated the row based on the input data. When the Integration Service inserts a new row in the cache, it outputs null values. When you disable this property, the Integration Service outputs the same values out of the lookup/output and input/output ports. This property is enabled by default.
Cache File Name Prefix | Flat File Pipeline Relational | Use with persistent lookup cache. Specifies the file name prefix to use with persistent lookup cache files. The Integration Service uses the file name prefix as the file name for the persistent cache files it saves to disk. Only enter the prefix. Do not enter .idx or .dat. You can enter a parameter or variable for the file name prefix. Use any parameter or variable type that you can define in the parameter file. For information about using parameter files, see the Workflow Administration Guide. If the named persistent cache files exist, the Integration Service builds the memory cache from the files. If the named persistent cache files do not exist, the Integration Service rebuilds the persistent cache files.
Recache From Lookup Source | Flat File Pipeline Relational | Use with the lookup cache enabled. When selected, the Integration Service rebuilds the lookup cache from the lookup source when it first calls the Lookup transformation instance. If you use a persistent lookup cache, it rebuilds the persistent cache files before using the cache. If you do not use a persistent lookup cache, it rebuilds the lookup cache in memory before using the cache.
Insert Else Update | Flat File Pipeline Relational | Use with dynamic caching enabled. Applies to rows entering the Lookup transformation with the row type of insert. When enabled, the Integration Service inserts new rows in the cache and updates existing rows. When disabled, the Integration Service does not update existing rows. For more information about defining the row type, see Using Update Strategy Transformations with a Dynamic Cache.
Update Else Insert | Flat File Pipeline Relational | Use with dynamic caching enabled. Applies to rows entering the Lookup transformation with the row type of update. When enabled, the Integration Service updates existing rows, and inserts a new row if it is new. When disabled, the Integration Service does not insert new rows. For more information about defining the row type, see Using Update Strategy Transformations with a Dynamic Cache.
Datetime Format | Flat File | Click the Open button to select a datetime format. Define the format and field width. Milliseconds, microseconds, or nanoseconds formats have a field width of 29. If you do not select a datetime format for a port, you can enter any datetime format. Default is MM/DD/YYYY HH24:MI:SS. The Datetime format does not change the size of the port.
Thousand Separator | Flat File | If you do not define a thousand separator for a port, the Integration Service uses the properties defined here. You can choose no separator, a comma, or a period. Default is no separator.
Decimal Separator | Flat File | If you do not define a decimal separator for a particular field in the lookup definition or on the Ports tab, the Integration Service uses the properties defined here. You can choose a comma or a period decimal separator. Default is period.
Case-Sensitive String Comparison | Flat File Pipeline | The Integration Service uses case-sensitive string comparisons when performing lookups on string columns. For relational lookups, the case-sensitive comparison is based on the database support.
Null Ordering | Flat File Pipeline | Determines how the Integration Service orders null values. You can choose to sort null values high or low. By default, the Integration Service sorts null values high. This overrides the Integration Service configuration to treat nulls in comparison operators as high, low, or null. For relational lookups, null ordering is based on the database support.
Sorted Input | Flat File Pipeline | Indicates whether or not the lookup file data is sorted. This increases lookup performance for file lookups. If you enable sorted input, and the condition columns are not grouped, the Integration Service fails the session. If the condition columns are grouped, but not sorted, the Integration Service processes the lookup as if you did not configure sorted input. For more information about sorted input, see Flat File Lookups.
Lookup Source is Static | Flat File Pipeline Relational | The lookup source does not change in a session.
Pre-build Lookup Cache | Flat File Pipeline Relational | Allow the Integration Service to build the lookup cache before the Lookup transformation requires the data.
Subsecond Precision | Relational | Specifies the subsecond precision for datetime ports. For relational lookups, you can change the precision for databases that have an editable scale for datetime data. You can change subsecond precision for Oracle Timestamp, Informix Datetime, and Teradata Timestamp datatypes. Enter a positive integer value from 0 to 9. Default is 6 (microseconds).
Connected Lookup | Unconnected Lookup
Receives input values directly from the pipeline. | Receives input values from the result of a :LKP expression in another transformation.
Use a dynamic or static cache. | Use a static cache.
Cache includes the lookup source columns in the lookup condition and the lookup source columns that are output ports. | Cache includes all lookup/output ports in the lookup condition and the lookup/return port.
Can return multiple columns from the same row or insert into the dynamic lookup cache. | Designate one return port (R). Returns one column from each row.
If there is no match for the lookup condition, the Integration Service returns the default value for all output ports. If you configure dynamic caching, the Integration Service inserts rows into the cache or leaves it unchanged. | If there is no match for the lookup condition, the Integration Service returns NULL.
If there is a match for the lookup condition, the Integration Service returns the result of the lookup condition for all lookup/output ports. If you configure dynamic caching, the Integration Service either updates the row in the cache or leaves the row unchanged. | If there is a match for the lookup condition, the Integration Service returns the result of the lookup condition into the return port.
Pass multiple output values to another transformation. Link lookup/output ports to another transformation. | Pass one output value to another transformation. The lookup/output/return port passes the value to the transformation calling the :LKP expression.
Supports user-defined default values. | Does not support user-defined default values.
To be continued...
Tuesday, May 4, 2010
Performance Tuning (Informatica)
- Filter rows at the source qualifier (see the SQL sketch after this list)
- Sort data at the source qualifier, whenever needed, instead of using a Sorter transformation
- Use filter transformations as close to the source as possible
- Use sorted input for aggregator
- Use database partitioning
- Use workflow partitioning
- Avoid multiple datatype conversions
- Use variables for repetitive calculations
- Avoid joiners
- Use sorted input for joiner
- In joiners, use the source with less data as the master
- From the source qualifier, take out only the ports that are required
- In lookups, delete the ports that are not required from the transformation
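The first two points can often be pushed into a source qualifier SQL override. A minimal sketch, assuming a hypothetical ORDERS table:
-- Filter and sort at the database instead of adding Filter/Sorter
-- transformations downstream.
SELECT ORDER_ID,
       CUSTOMER_ID,
       ORDER_AMOUNT
FROM   ORDERS
WHERE  ORDER_STATUS = 'SHIPPED'     -- filter rows at the source
ORDER  BY CUSTOMER_ID, ORDER_ID     -- sort at the source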
Error at Data Flow Task [Pivot [28]]: Output column "TransMonth" (49) cannot be mapped to PivotKey input column.
TITLE: Package Validation Error
------------------------------
Package Validation Error
------------------------------
ADDITIONAL INFORMATION:
Error at Data Flow Task [Pivot [28]]: Output column "TransMonth" (49) cannot be mapped to PivotKey input column.
Error at Data Flow Task [SSIS.Pipeline]: "component "Pivot" (28)" failed validation and returned validation status "VS_ISBROKEN".
Error at Data Flow Task [SSIS.Pipeline]: One or more component failed validation.
Error at Data Flow Task: There were errors during task validation.
(Microsoft.DataTransformationServices.VsIntegration)
------------------------------
BUTTONS:
OK
------------------------------
Resolution:
1. The TransMonth column had been added to the list of output columns. Removing this column from the output columns fixed the issue.
Note: There can be many reasons for this error.
Row Sampling (SSIS)
The Row Sampling transformation is used to obtain a randomly selected subset of an input dataset. You can specify the exact size of the output sample, and specify a seed for the random number generator.
There are many applications for random sampling. For example, a company that wanted to randomly select 50 employees to receive prizes in a lottery could use the Row Sampling transformation on the employee database to generate the exact number of winners.
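For comparison only, a rough T-SQL sketch of that exact-count random pick, assuming a hypothetical Employees table (the Row Sampling transformation does this inside the data flow, without a query):
-- Pick exactly 50 random employees; NEWID() gives a random ordering.
-- Unlike Row Sampling, there is no repeatable seed here.
SELECT TOP (50) EmployeeID, EmployeeName
FROM   dbo.Employees
ORDER  BY NEWID()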
The Row Sampling transformation is also useful during package development for creating a small but representative dataset. You can test package execution and data transformation with richly representative data, but more quickly because a random sample is used instead of the full dataset. Because the sample dataset used by the test package is always the same size, using the sample subset also makes it easier to identify performance problems in the package.
The Row Sampling transformation creates a sample dataset by selecting a specified number of the transformation input rows. Because the selection of rows from the transformation input is random, the resultant sample is representative of the input. You can also specify the seed that is used by the random number generator, to affect how the transformation selects rows.
Using the same random seed on the same transformation input always creates the same sample output. If no seed is specified, the transformation uses the tick count of the operating system to create the random number. Therefore, you could use the same seed during testing, to verify the transformation results during the development and testing of the package, and then change to a random seed when the package is moved into production.
This transformation is similar to the Percentage Sampling transformation, which creates a sample dataset by selecting a percentage of the input rows.
http://msdn.microsoft.com/en-us/library/ms141200.aspx
Percentage Sampling (SSIS)
The Percentage Sampling transformation creates a sample data set by selecting a percentage of the transformation input rows. The sample data set is a random selection of rows from the transformation input, to make the resultant sample representative of the input.
The Percentage Sampling transformation is especially useful for data mining. By using this transformation, you can randomly divide a data set into two data sets: one for training the data mining model, and one for testing the model.
The Percentage Sampling transformation is also useful for creating sample data sets for package development. By applying the Percentage Sampling transformation to a data flow, you can uniformly reduce the size of the data set while preserving its data characteristics. The test package can then run more quickly because it uses a small, but representative, data set.
http://msdn.microsoft.com/en-us/library/ms139864.aspx
OLE DB Command (SSIS)
The OLE DB Command transformation runs an SQL statement for each row in a data flow. For example, you can run an SQL statement that inserts, updates, or deletes rows in a database table.
You can configure the OLE DB Command Transformation in the following ways:
- Provide the SQL statement that the transformation runs for each row.
- Specify the number of seconds before the SQL statement times out.
- Specify the default code page.
DELETE FROM DimProduct WHERE ProductKey = ?
The OLE DB Command transformation provides the parameter names and you cannot modify them. The parameter names are Param_0, Param_1, and so on.
http://msdn.microsoft.com/en-us/library/ms141138.aspx
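A hypothetical two-parameter variant, to show how the positional parameters line up with input columns (table and column names are illustrative):
-- Runs once per data flow row; the first ? maps to Param_0 and the
-- second ? maps to Param_1 in the column mappings.
UPDATE DimProduct
SET    ListPrice  = ?   -- Param_0: new price from the input row
WHERE  ProductKey = ?   -- Param_1: key from the input row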
Multicast (SSIS)
The Multicast transformation distributes its input to one or more outputs. This transformation is similar to the Conditional Split transformation. Both transformations direct an input to multiple outputs. The difference between the two is that the Multicast transformation directs every row to every output, and the Conditional Split directs a row to a single output.
There is nothing to configure in the Multicast Transformation Editor other than the names of the outputs.
http://msdn.microsoft.com/en-us/library/ms137701.aspx
Merge Join (SSIS)
The Merge Join transformation provides an output that is generated by joining two sorted datasets using a FULL, LEFT, or INNER join. For example, you can use a LEFT join to join a table that includes product information with a table that lists the country/region in which a product was manufactured. The result is a table that lists all products and their country/region of origin. For more information, see Using Joins.
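For orientation, a rough SQL equivalent of that LEFT join, with hypothetical table and column names (the Merge Join transformation additionally requires both inputs to be sorted on the join key):
SELECT p.ProductName,
       m.CountryRegion
FROM   Product p
LEFT JOIN ManufacturingLocation m
       ON p.ProductID = m.ProductID
ORDER  BY p.ProductID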
You can configure the Merge Join transformation in the following ways:
- Specify whether the join is a FULL, LEFT, or INNER join.
- Specify the columns the join uses.
- Specify whether the transformation handles null values as equal to other nulls.
http://msdn.microsoft.com/en-us/library/ms141775.aspx
Merge (SSIS)
The Merge transformation combines two sorted datasets into a single dataset. The rows from each dataset are inserted into the output based on values in their key columns.
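Conceptually, merging two sorted inputs on a key yields the same rows as a UNION ALL followed by an ORDER BY on that key; a hedged SQL sketch with hypothetical staging tables:
SELECT CustomerID, OrderDate, Amount FROM OrdersEast
UNION ALL
SELECT CustomerID, OrderDate, Amount FROM OrdersWest
ORDER  BY CustomerID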
By including the Merge transformation in a data flow, you can perform the following tasks:
- Merge data from two data sources, such as tables and files.
- Create complex datasets by nesting Merge transformations.
- Remerge rows after correcting errors in the data.
Use the Union All transformation instead of the Merge transformation in the following situations:
- The transformation inputs are not sorted.
- The combined output does not need to be sorted.
- The transformation has more than two inputs.
http://msdn.microsoft.com/en-us/library/ms141703.aspx
Lookup (SSIS)
The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset.
You can configure the Lookup transformation in the following ways:
- Select the connection manager that you want to use. If you want to connect to a database, select an OLE DB connection manager. If you want to connect to a cache file, select a Cache connection manager.
- Specify the table or view that contains the reference dataset.
- Generate a reference dataset by specifying an SQL statement.
- Specify joins between the input and the reference dataset.
- Add columns from the reference dataset to the Lookup transformation output.
- Configure the caching options.
The Lookup transformation supports the following database providers for the OLE DB connection manager:
- SQL Server
- Oracle
- DB2
- If there is no matching entry in the reference dataset, no join occurs. By default, the Lookup transformation treats rows without matching entries as errors. However, you can configure the Lookup transformation to redirect such rows to a no match output. For more information, see Lookup Transformation Editor (General Page) and Lookup Transformation Editor (Error Output Page).
- If there are multiple matches in the reference table, the Lookup transformation returns only the first match returned by the lookup query. If multiple matches are found, the Lookup transformation generates an error or warning only when the transformation has been configured to load all the reference dataset into the cache. In this case, the Lookup transformation generates a warning when the transformation detects multiple matches as the transformation fills the cache.
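When the reference dataset is generated from an SQL statement, the query can also limit the reference columns and pre-aggregate on the join key so that at most one match exists per key; a hypothetical sketch:
-- Hypothetical reference-dataset query for the Lookup transformation:
-- keep only the needed columns and return one row per ProductKey.
SELECT ProductKey,
       MAX(ListPrice) AS ListPrice
FROM   dbo.DimProduct
GROUP  BY ProductKey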
http://msdn.microsoft.com/en-us/library/ms141821.aspx
Import Column (SSIS)
The Import Column transformation reads data from files and adds the data to columns in a data flow. Using this transformation, a package can add text and images stored in separate files to a data flow. For example, a data flow that loads data into a table that stores product information can include the Import Column transformation to import customer reviews of each product from files and add the reviews to the data flow.
The Import Column Transform is a partner to the Export Column Transform. These transforms do the work of translating physical files from system file storage paths into database BLOB-type fields and vice versa.
You can configure the Import Column transformation in the following ways:
- Specify the columns to which the transformation adds data.
- Specify whether the transformation expects a byte-order mark (BOM).
http://msdn.microsoft.com/en-us/library/ms141262.aspx
Export Column (SSIS)
The Export Column transformation reads data in a data flow and inserts the data into a file.
For example, if the data flow contains product information, such as a picture of each product, you could use the Export Column transformation to save the images to files.
Unlike the other transformations, the Export Column Transform doesn't need a destination to create the file.
You can configure the Export Column transformation in the following ways:
- Specify the data columns and the columns that contain the path of files to which to write the data.
- Specify whether the data-insertion operation appends or truncates existing files.
- Specify whether a byte-order mark (BOM) is written to the file.
http://msdn.microsoft.com/en-us/library/ms139818.aspx
Derived Column (SSIS)
The Derived Column transformation creates new column values by applying expressions to transformation input columns. An expression can contain any combination of variables, functions, operators, and columns from the transformation input. The result can be added as a new column or inserted into an existing column as a replacement value. The Derived Column transformation can define multiple derived columns, and any variable or input columns can appear in multiple expressions.
You can use this transformation to perform the following tasks:
- Concatenate data from different columns into a derived column. For example, you can combine values from the FirstName and LastName columns into a single derived column named FullName, by using the expression FirstName + " " + LastName.
- Extract characters from string data by using functions such as SUBSTRING, and then store the result in a derived column. For example, you can extract a person's initial from the FirstName column, by using the expression SUBSTRING(FirstName,1,1).
- Apply mathematical functions to numeric data and store the result in a derived column. For example, you can change the length and precision of a numeric column, SalesTax, to a number with two decimal places, by using the expression ROUND(SalesTax, 2).
- Create expressions that compare input columns and variables. For example, you can compare the variable Version against the data in the column ProductVersion, and depending on the comparison result, use the value of either Version or ProductVersion, by using the expression ProductVersion == @Version? ProductVersion : @Version.
- Extract parts of a datetime value. For example, you can use the GETDATE and DATEPART functions to extract the current year, by using the expression DATEPART("year",GETDATE()).
http://msdn.microsoft.com/en-us/library/ms141069.aspx
Data Mining Query (SSIS)
The Data Mining Query transformation performs prediction queries against data mining models. This transformation contains a query builder for creating Data Mining Extensions (DMX) queries.
The Data Mining Query Transformation typically is used to fill in gaps in your data or predict a new column for your Data Flow. This transformation runs a data-mining query and adds the output to the Data Flow. It also can optionally add columns, such as the probability of a certain condition being true.
A few great scenarios for this transformation would be the following:
- You could take columns, such as number of children, household income, and marital status, to predict a new column that states whether the person owns a house or not.
- You could predict what customers would want to buy based on their shopping cart items.
- You could fill in the blanks in your data where customers didn't enter all the fields in a questionnaire.
Data Conversion (SSIS)
The Data Conversion transformation converts the data in an input column to a different data type and then copies it to a new output column.
You can configure the Data Conversion transformation in the following ways:
- Specify the columns that contain the data to convert and the types of data conversions to perform.
- Specify whether the transformation output columns use the quicker, but locale-insensitive, fast parsing routines that Microsoft SQL Server Integration Services provides or the standard locale-sensitive parsing routines.
http://msdn.microsoft.com/en-us/library/ms141706.aspx
Monday, May 3, 2010
Copy Column (SSIS)
The Copy Column transformation creates new columns by copying input columns and adding the new columns to the transformation output. Later in the data flow, different transformations can be applied to the column copies.
This transformation has one input and one output. It does not support an error output.
http://msdn.microsoft.com/en-us/library/ms137741.aspx
Conditional Split (SSIS)
The Conditional Split transformation can route data rows to different outputs depending on the content of the data.
It is similar to a CASE decision structure in a programming language.
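As a rough analogy, a Conditional Split with two conditions and a default output behaves like the following SQL CASE (hypothetical columns); as in the transformation, the first condition that evaluates to true wins:
SELECT OrderID,
       CASE
           WHEN OrderAmount >= 1000 THEN 'HighValue'
           WHEN OrderAmount >= 100  THEN 'MediumValue'
           ELSE 'DefaultOutput'
       END AS RoutedOutput
FROM   Orders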
You can configure the Conditional Split transformation in the following ways:
- Provide an expression that evaluates to a Boolean for each condition you want the transformation to test.
- Specify the order in which the conditions are evaluated. Order is significant, because a row is sent to the output corresponding to the first condition that evaluates to true.
- Specify the default output for the transformation. The transformation requires that a default output be specified.
http://msdn.microsoft.com/en-us/library/ms137886.aspx
Character Map (SSIS)
The Character Map transformation applies string functions, such as conversion from lowercase to uppercase, to character data. This transformation operates only on column data with a string data type.
You configure the Character Map transformation in the following ways:
- Specify the columns to convert.
- Specify the operations to apply to each column.
Cache (SSIS)
It allows you to load a cache file on disk in the Data Flow.
The Cache Transform can be used to populate a cache file in the Data Flow as a transform, and then immediately used, or it can be used as a destination and then used by another package or Data Flow in the same package.
This cache file allows you to perform lookups against large datasets.
It allows you to share the same lookup cache across many Data Flows.
Advantages:
As the data to be looked up is cached in a file, the SSIS engine does not need to connect to the database for each incoming row. So the lookups are faster.
Sunday, May 2, 2010
Joiner (Informatica)
Transformation type:
Active
Connected
The Joiner transformation joins sources with at least one matching column.
The two input pipelines include a master pipeline and a detail pipeline or a master and a detail branch. The master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target.
- You cannot use a Joiner transformation when either input pipeline contains an Update Strategy transformation.
- You cannot use a Joiner transformation if you connect a Sequence Generator transformation directly before the Joiner transformation.
To improve performance for a sorted Joiner transformation, use the source with fewer duplicate key values as the master.
By default, when you add ports to a Joiner transformation, the ports from the first source pipeline display as detail sources. Adding the ports from the second source pipeline sets them as master sources.
Use one or more ports from the input sources of a Joiner transformation in the join condition.
Normal Join
With a normal join, the Integration Service discards all rows of data from the master and detail source that do not match, based on the condition.
Master Outer Join
A master outer join keeps all rows of data from the detail source and the matching rows from the master source. It discards the unmatched rows from the master source.
Detail Outer Join
A detail outer join keeps all rows of data from the master source and the matching rows from the detail source. It discards the unmatched rows from the detail source.
Full Outer Join
A full outer join keeps all rows of data from both the master and detail sources.
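In SQL terms, with the detail source on the left and the master source on the right (hypothetical DETAIL_SRC and MASTER_SRC tables joined on ITEM_NO), the four join types correspond roughly to the following:
-- Normal join: only rows that match in both sources.
SELECT d.ITEM_NO, d.QTY, m.ITEM_NAME
FROM   DETAIL_SRC d
INNER JOIN MASTER_SRC m ON d.ITEM_NO = m.ITEM_NO
-- Master outer join: DETAIL_SRC d LEFT JOIN MASTER_SRC m (keeps all detail rows)
-- Detail outer join: DETAIL_SRC d RIGHT JOIN MASTER_SRC m (keeps all master rows)
-- Full outer join:   DETAIL_SRC d FULL OUTER JOIN MASTER_SRC m (keeps all rows from both)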
Option | Description
Case-Sensitive String Comparison | If selected, the Integration Service uses case-sensitive string comparisons when performing joins on string columns.
Cache Directory | Specifies the directory used to cache master or detail rows and the index to these rows. By default, the cache files are created in a directory specified by the process variable $PMCacheDir. If you override the directory, make sure the directory exists and contains enough disk space for the cache files. The directory can be a mapped or mounted drive.
Join Type | Specifies the type of join: Normal, Master Outer, Detail Outer, or Full Outer.
Null Ordering in Master | Not applicable for this transformation type.
Null Ordering in Detail | Not applicable for this transformation type.
Tracing Level | Amount of detail displayed in the session log for this transformation. The options are Terse, Normal, Verbose Data, and Verbose Initialization.
Joiner Data Cache Size | Data cache size for the transformation. Default cache size is 2,000,000 bytes. If the total configured cache size is 2 GB or more, you must run the session on a 64-bit Integration Service. You can configure a numeric value, or you can configure the Integration Service to determine the cache size at runtime. If you configure the Integration Service to determine the cache size, you can also configure a maximum amount of memory for the Integration Service to allocate to the cache.
Joiner Index Cache Size | Index cache size for the transformation. Default cache size is 1,000,000 bytes. If the total configured cache size is 2 GB or more, you must run the session on a 64-bit Integration Service. You can configure a numeric value, or you can configure the Integration Service to determine the cache size at runtime. If you configure the Integration Service to determine the cache size, you can also configure a maximum amount of memory for the Integration Service to allocate to the cache.
Sorted Input | Specifies that data is sorted. Choose Sorted Input to join sorted data. Using sorted input can improve performance. For more information about working with sorted input, see Using Sorted Input.
Transformation Scope | Specifies how the Integration Service applies the transformation logic to incoming data. You can choose Transaction, All Input, or Row. For more information, see Working with Transactions.
Adding Transformations to the Mapping (Sorted Input)
When you add transformations between the sort origin and the Joiner transformation, use the following guidelines to maintain sorted data:
Do not place any of the following transformations between the sort origin and the Joiner transformation:
- Custom
- Unsorted Aggregator
- Normalizer
- Rank
- Union transformation
- XML Parser transformation
- XML Generator transformation
- Mapplet, if it contains one of the above transformations
You can place a sorted Aggregator transformation between the sort origin and the Joiner transformation if you use the following guidelines:
- Configure the Aggregator transformation for sorted input using the guidelines in Using Sorted Input.
- Use the same ports for the group by columns in the Aggregator transformation as the ports at the sort origin.
- The group by ports must be in the same order as the ports at the sort origin.
When you join the result set of a Joiner transformation with another pipeline, verify that the data output from the first Joiner transformation is sorted.
Tip: You can place the Joiner transformation directly after the sort origin to maintain sorted data.
If you use a sorted Aggregator transformation between the sort origin and the Joiner transformation, treat the sorted Aggregator transformation as the sort origin when you define the join condition. Use the following guidelines when you define join conditions:
- The ports you use in the join condition must match the ports at the sort origin.
- When you configure multiple join conditions, the ports in the first join condition must match the first ports at the sort origin.
- When you configure multiple conditions, the order of the conditions must match the order of the ports at the sort origin, and you must not skip any ports.
- The number of sorted ports in the sort origin can be greater than or equal to the number of ports at the join condition.
Example of a Join Condition
For example, you configure Sorter transformations in the master and detail pipelines with the following sorted ports:
1. ITEM_NO
2. ITEM_NAME
3. PRICE
When you configure the join condition, use the following guidelines to maintain sort order:
- You must use ITEM_NO in the first join condition.
- If you add a second join condition, you must use ITEM_NAME.
- If you want to use PRICE in a join condition, you must also use ITEM_NAME in the second join condition.
If you skip ITEM_NAME and join on ITEM_NO and PRICE, you lose the sort order and the Integration Service fails the session.
Joining Data from a Single Source
You may want to join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data. When you join the data using this method, you can maintain the original data and transform parts of that data within one mapping. You can join data from the same source in the following ways:
- Join two branches of the same pipeline.
- Join two instances of the same source.
Joining Two Branches of the Same Pipeline
When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the source qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input.
For example, you have a source with the following ports:
- Employee
- Department
- Total Sales
In the target, you want to view the employees who generated sales that were greater than the average sales for their departments. To do this, you create a mapping with the following transformations:
- Sorter transformation. Sorts the data.
- Sorted Aggregator transformation. Averages the sales data and groups by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.
- Sorted Joiner transformation. Uses a sorted Joiner transformation to join the sorted aggregated data with the original data.
- Filter transformation. Compares the average sales data against the sales data for each employee and filters out employees with less than above-average sales.
Mapping that Joins Two Branches of a Pipeline
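A rough SQL equivalent of this mapping, assuming a hypothetical SALES source with Employee, Department, and Total_Sales columns:
-- Aggregate per department, join the averages back to the detail rows,
-- and keep employees whose sales exceed their department average.
SELECT s.Employee,
       s.Department,
       s.Total_Sales
FROM   SALES s
JOIN  (SELECT Department,
              AVG(Total_Sales) AS Avg_Sales
       FROM   SALES
       GROUP  BY Department) d
  ON   s.Department = d.Department
WHERE  s.Total_Sales > d.Avg_Sales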
Note: You can also join data from output groups of the same transformation, such as the Custom transformation or XML Source Qualifier transformation. Place a Sorter transformation between each output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.
Joining two branches might impact performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.
Joining Two Instances of the Same Source
You can also join same source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances. If you want to join unsorted data, you must create two instances of the same source and join the pipelines.
Mapping that Joins Two Instances of the Same Source
Note: When you join data using this method, the Integration Service reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.
Guidelines
Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:
- Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.
- Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
- Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
- Join two instances of a source if one pipeline may process slower than the other pipeline.
- Join two instances of a source if you need to join unsorted data.