Friday, April 30, 2010

Audit(SSIS)

The Audit Transform allows you to add auditing data to your Data Flow. In the age of HIPAA and
Sarbanes-Oxley (SOX) audits, you often must be able to track who inserted data into a table and
when. This transform helps you meet that requirement.

The following are some of the options you'll have available to you:
  • Execution Instance GUID: The GUID that identifies the execution instance of the package
  • PackageID: The unique ID for the package
  • PackageName: The name of the package
  • VersionID: The version GUID of the package
  • ExecutionStartTime: The time the package began
  • MachineName: The machine that the package ran on
  • UserName: The user that started the package
  • TaskName: The Data Flow Task name that holds the Audit Task
  • TaskID: The unique identifier for the Data Flow Task that holds the Audit Task
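Outside of SSIS, the same idea can be sketched in a few lines of Python: stamp each row with audit metadata as it flows through. The column names below mirror the list above, but the package and task names are made up for illustration.

```python
import os
import socket
from datetime import datetime

def add_audit_columns(rows, package_name, task_name):
    """Stamp each row with audit metadata, mirroring some of the
    columns the Audit Transform exposes (names are illustrative)."""
    audit = {
        "PackageName": package_name,          # name of the package
        "TaskName": task_name,                # Data Flow Task name
        "MachineName": socket.gethostname(),  # machine the load ran on
        # USER (Linux) or USERNAME (Windows) identifies who started the load
        "UserName": os.environ.get("USER") or os.environ.get("USERNAME") or "unknown",
        "ExecutionStartTime": datetime.now(), # when the load started
    }
    return [{**row, **audit} for row in rows]

rows = [{"OrderID": 1, "Amount": 99.50}]
audited = add_audit_columns(rows, "LoadOrders.dtsx", "DFT Load Orders")
```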


Aggregate(SSIS)

The Aggregate Transform allows you to aggregate data in the Data Flow, applying the functions you
would normally use with a T-SQL GROUP BY statement, such as Average, Minimum, Maximum, and Count.
The most important option is the Operation drop-down box, where you can select the following:
  • Group By: Breaks the dataset into groups by the column you specify.
  • Average: Averages the selected column's numeric data.
  • Count: Counts the records in a group.
  • Count Distinct: Counts the distinct non-NULL values in a group.
  • Minimum: Returns the minimum numeric value in the group.
  • Maximum: Returns the maximum numeric value in the group.
  • Sum: Returns the sum of the selected column's numeric data in the group.
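These operations map directly onto ordinary group-and-aggregate logic. As a rough illustration (Python rather than SSIS, with made-up rows), the Operation choices correspond to:

```python
from collections import defaultdict

# Toy rows: aggregate Amount per Store, mirroring the
# Group By / Count / Sum / Average / Min / Max operations above.
rows = [
    {"Store": "A", "Amount": 10.0},
    {"Store": "A", "Amount": 30.0},
    {"Store": "B", "Amount": 5.0},
]

groups = defaultdict(list)
for r in rows:
    groups[r["Store"]].append(r["Amount"])   # Group By: Store

summary = {
    store: {
        "Count": len(vals),
        "Sum": sum(vals),
        "Average": sum(vals) / len(vals),
        "Minimum": min(vals),
        "Maximum": max(vals),
    }
    for store, vals in groups.items()
}
```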

    Thursday, April 29, 2010

    Wednesday, April 28, 2010

    Financial Data Model

    http://www.google.com/url?sa=t&source=web&ct=res&cd=7&ved=0CCIQFjAG&url=http%3A%2F%2Fwww.polydata.calpoly.edu%2Fdashboards%2Fobiee_conf%2Fpresentations%2FFinancialDataModelOverview.ppt&rct=j&q=example+scd2+in+finance&ei=ZvbYS7vDG4O78ga034mwBQ&usg=AFQjCNG1XhPyCuPA2VsE5iCQN5K24FTHWA

    How To Manage Current and Historical Information Within Your Data Warehouse

    http://www.exforsys.com/tutorials/data-warehousing/how-to-manage-current-and-historical-information-within-your-data-warehouse.html

    Infa Bloggers

    http://radhakrishnasarma.blogspot.com/

    Some important posts:
    Designing the mapping/workflow efficiently - part I
    Designing the mapping/workflow efficiently - part II

    DW Terms

    Dimension:
    A dimension table contains descriptions of the facts.
    Dimensions are generally non-numeric and correspond to the who, what, when, or where aspects of a question.
    For example, the Store table contains store names and addresses; the Product table contains product and packaging information; and the Period table contains month, quarter, and year values.

    Fact:
    Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed.
    The facts, also called measures, are generally numeric and correspond to the how much or how many aspects of a question.

    Conformed Dimensions:
    Dimensions that are shared across multiple fact tables are called conformed dimensions.
    e.g. Time, Geography

    Factless Fact:

    A factless fact table is a table that contains nothing but dimensional keys. Ralph Kimball's early article on the subject is still the best place to learn about them.
    There are two types of factless fact tables. The first captures an event: the event establishes a relationship among members of various dimensions, but there is no measured value. The existence of the relationship itself is the fact.
    This type of fact table can itself be used to generate useful reports, because you can count occurrences against various criteria. For example, a factless fact table could capture the number of stores in a particular region, or student attendance (the example Ralph uses). The following questions can then be answered:
    • Which class has the least attendance?
    • Which teachers taught the most students?
    • What is the average attendance of a given course?
    All of these queries are based on COUNT() with GROUP BY. The interesting metrics come from nested GROUP BY queries, where you first count and then apply other aggregate functions such as AVG, MAX, or MIN.
    The other type of factless fact table is what Ralph calls a coverage table. It is used to support negative analysis, for example finding a store that did not sell a product in a given period. To produce such a report, you need a fact table that captures all the possible combinations; you can then figure out what is missing.
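As a rough sketch (Python rather than SQL, with invented keys), counting over an event-style factless fact table looks like this:

```python
from collections import Counter

# Factless fact table: each row is just dimension keys for one
# attendance event (student_id, class_id, date_id) - no measure.
attendance = [
    (101, "MATH101", 20100405),
    (102, "MATH101", 20100405),
    (101, "HIST200", 20100406),
]

# "Which class has the least attendance?" is a COUNT ... GROUP BY.
by_class = Counter(class_id for _, class_id, _ in attendance)
least_attended = min(by_class, key=by_class.get)
```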

    What is normalization?

    Normalization is the process of designing a data model to efficiently store data in a database. The end result is that redundant data is eliminated, and only data related to the attribute is stored within the table.
    For example, let's say we store City, State and ZipCode data for Customers in the same table as Other Customer data. With this approach, we keep repeating the City, State and ZipCode data for all Customers in the same area. Instead of storing the same data again and again, we could normalize the data and create a related table called City. The "City" table could then store City, State and ZipCode along with IDs that relate back to the Customer table, and we can eliminate those three columns from the Customer table and add the new ID column.
    Normalization rules have been broken down into several forms. People often refer to the third normal form (3NF) when talking about database design. This is what most database designers try to achieve: In the conceptual stages, data is segmented and normalized as much as possible, but for practical purposes those segments are changed during the evolution of the data model. Various normal forms may be introduced for different parts of the data model to handle the unique situations you may face.
    Whether you have heard about normalization or not, your database most likely follows some of the rules, unless all of your data is stored in one giant table. We will take a look at the first three normal forms and the rules for determining the different forms here.


    Rules for First Normal Form (1NF) (No repeating groups, Primary keys)
    Eliminate repeating groups. This table contains repeating groups of data in the Software column.

     
    Computer    Software
    1           Word
    2           Access, Word, Excel
    3           Word, Excel
    To follow First Normal Form, we store one software title per record.

     
    Computer    Software
    1           Word
    2           Access
    2           Word
    3           Word
    3           Excel
    Rules for Second Normal Form (2NF) (1NF and no partial dependencies on the primary key)
    Eliminate redundant data, in addition to meeting 1NF. This table stores the software names themselves, which is redundant data.

     
    Computer    Software
    1           Word
    2           Access
    2           Word
    3           Word
    3           Excel
    To eliminate the redundant storage of data, we create two tables. The first stores a SoftwareID reference into a new table that holds the unique list of software titles.

     
    Computer    SoftwareID
    1           1
    2           2
    2           1
    3           3
    3           1
     

     
    SoftwareID    Software
    1             Word
    2             Access
    3             Excel
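The 2NF decomposition above can also be sketched programmatically. This Python fragment (illustrative only) builds the software lookup table and rewrites the computer table to use the foreign key:

```python
# Repeated software names, as in the 2NF example above.
installs = [(1, "Word"), (2, "Access"), (2, "Word"),
            (3, "Excel"), (3, "Word")]

# Assign each distinct software title a SoftwareID (the lookup table).
software_ids = {}
for _, name in installs:
    software_ids.setdefault(name, len(software_ids) + 1)

# The computer table now stores only the foreign key, not the name.
computer_software = [(c, software_ids[n]) for c, n in installs]
```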
    Rules for Third Normal Form (3NF) (2NF and no transitive dependencies)
    Eliminate columns not dependent on the key, in addition to meeting 1NF and 2NF. This table mixes data about the computer with data about the user.

     
    Computer    User Name    User Hire Date    Purchased
    1           Joe          4/1/2000          5/1/2003
    2           Mike         9/5/2003          6/15/2004
    To eliminate columns not dependent on the key, we create the following tables. Now the data stored in the Computer table relates only to the computer, and the data stored in the User table relates only to the user.

     
    Computer    Purchased
    1           5/1/2003
    2           6/15/2004
     

     
    User    User Name    User Hire Date
    1       Joe          4/1/2000
    2       Mike         9/5/2003
     

     
    Computer    User
    1           1
    2           2

    What does normalization have to do with SQL Server?

    To be honest, the answer here is nothing. SQL Server, like any other RDBMS, couldn't care less whether your data model follows any of the normal forms. You could create one table and store all of your data in one table or you can create a lot of little, unrelated tables to store your data. SQL Server will support whatever you decide to do. The only limiting factor you might face is the maximum number of columns SQL Server supports for a table.
    SQL Server does not force or enforce any rules that require you to create a database in any of the normal forms. You are able to mix and match any of the rules you need, but it is a good idea to try to normalize your database as much as possible when you are designing it. People tend to spend a lot of time up front creating a normalized data model, but as soon as new columns or tables need to be added, they forget about the initial effort that was devoted to creating a nice clean model.
    To assist in the design of your data model, you can use the DaVinci tools that are part of SQL Server Enterprise Manager.

    Advantages of normalization

    1. Smaller database: By eliminating duplicate data, you reduce the overall size of the database.
    2. Better performance:
    a. Narrow tables: More finely divided tables have fewer columns, so more records fit on each data page.
    b. Fewer indexes per table mean faster maintenance tasks, such as index rebuilds.
    c. You join only the tables you actually need.

    Disadvantages of normalization

    1. More tables to join: By spreading out your data into more tables, you increase the need to join tables.
    2. Tables contain codes instead of real data: Repeated data is stored as codes rather than meaningful data. Therefore, there is always a need to go to the lookup table for the value.
    3. Data model is difficult to query against: The data model is optimized for applications, not for ad hoc querying.

    Summary

    Your data model design is both an art and a science. Balance what works best to support the application that will use the database against storing data in an efficient and structured manner. For transaction-based systems, a highly normalized design is usually the way to go: it keeps data consistent throughout the database while still performing well. For reporting-based systems, a less normalized design is usually the better approach: you eliminate the need to join many tables, queries run faster, and the database is much more user friendly for ad hoc reporting.

    SCDs

    When the data warehouse receives notification that an existing row in a
    dimension has in some way changed, there are three basic responses.
    We call these three basic responses Type 1, Type 2, and Type 3 slowly changing
    dimensions (SCDs).
     
    Type 1 SCD
     
    In a Type 1 SCD, we simply overwrite the existing row; no history is maintained.
     
    e.g.
     
    Primary Key    Natural Key    Prod Name    Category      Package Type
    4321           AT04           Sprite       Soft Drink    Glass
    4321           AT04           Sprite       Soft Drink    Plastic

    (the same row, shown before and after the Package Type is overwritten)
     
    Type 2 SCD
     
    In a Type 2 SCD, history is maintained accurately in the dimension, and changes are properly associated with the related facts.
    When the warehouse is notified that an existing dimension row has changed, instead of overwriting it, we insert a new row at the moment of change.
    This new row is assigned a new surrogate primary key, and from that point forward this key is used in all fact tables for which the dimension is a foreign key.
    Type 2 SCD perfectly partitions history, because each detailed version of a dimension entity is correctly connected to its own span of fact table records.
     
    Kimball recommends that dimension tables carry optional, useful information about Type 2 changes. He suggests adding the following five fields to dimension tables processed with Type 2 logic:
     
    1. Calendar Date: foreign key (date of change)
    2. Row Effective DateTime: exact date-time of the change
    3. Row End DateTime: exact date-time of the next change
    4. Reason for Change: a text field
    5. Current Flag: current/expired
     
    Primary Key    Natural Key (Emp ID)    Designation    Department    Calendar Date    Start Date     End Date       Reason for Change    Current Flag
    4321           E00234                  Grade I        Sales         31-Mar-2008      04-Apr-2007    31-Mar-2008    Joining              0
    4322           E00234                  Grade II       Sales         31-Mar-2009      01-Apr-2008    31-Mar-2009    Promotion            0
    4323           E00234                  Grade II       Marketing     01-Apr-2009      01-Apr-2009    31-Mar-2099    Dept Changed         1
     

    Kimball also recommends not setting the Row End DateTime to NULL, as NULLs can give erroneous results in BETWEEN logic. Instead, set it to an arbitrary far-future value.
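A minimal sketch of this Type 2 logic, using an in-memory list of dicts in place of the dimension table (field names and the 2099 end date are illustrative):

```python
from datetime import datetime

FAR_FUTURE = datetime(2099, 3, 31)  # arbitrary far-future value, never NULL

# Current state of the employee dimension (one active version).
dim = [
    {"key": 4321, "emp_id": "E00234", "designation": "Grade I", "dept": "Sales",
     "start": datetime(2007, 4, 4), "end": FAR_FUTURE, "current": 1},
]

def apply_scd2(dim, emp_id, changes, change_date, reason):
    """Expire the current row for emp_id, then insert a new version
    with a fresh surrogate key (sketch of the Type 2 response)."""
    new_key = max(r["key"] for r in dim) + 1
    for row in dim:
        if row["emp_id"] == emp_id and row["current"]:
            row["end"] = change_date          # close out the old version
            row["current"] = 0
            new_row = {**row, **changes, "key": new_key,
                       "start": change_date, "end": FAR_FUTURE,
                       "current": 1, "reason": reason}
            dim.append(new_row)
            return new_key

apply_scd2(dim, "E00234", {"designation": "Grade II"},
           datetime(2008, 4, 1), "Promotion")
```

Fact rows loaded before the change keep the old surrogate key (4321), which is what ties each fact to the correct historical version.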
     
    Type 3 SCD
     
    A Type 3 SCD is used when a change happens to a dimension record but the old value remains valid as a second choice.
    Instead of inserting a new row when the change takes place, a new column is added to the table, and the old value is placed in that column before it is overwritten.
     
     
    e.g.
     
    Primary Key    Natural Key (Prod ID)    Prod Name      Size    Category      Colour
    1127648        A 107B                   Denim Pants    30      Men's Wear    Blue

    After the category changes, the old value moves to a new Old Category column:

    Primary Key    Natural Key (Prod ID)    Prod Name      Size    Category        Old Category    Colour
    1127648        A 107B                   Denim Pants    30      Leisure Wear    Men's Wear      Blue
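The Type 3 move-then-overwrite step can be sketched in a few lines (Python, with illustrative field names):

```python
# Sketch of a Type 3 change: preserve the old category in a
# dedicated column, then overwrite the current value in place.
product = {"key": 1127648, "prod_id": "A 107B", "name": "Denim Pants",
           "category": "Men's Wear", "old_category": None}

def apply_scd3(row, new_category):
    row["old_category"] = row["category"]  # keep prior value as second choice
    row["category"] = new_category         # then overwrite in place
    return row

apply_scd3(product, "Leisure Wear")
```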

    Logical Vs Physical Data Modeling

    Monday, April 19, 2010

    Attach .mdf without .ldf

    sp_attach_single_file_db @dbname = 'Customers_Orders', @physname = 'C:\Program Files\Microsoft SQL Server\MSSQL10.ICE\MSSQL\DATA\Customers_and_Orders.mdf'

    Note: sp_attach_single_file_db is deprecated; the recommended equivalent is CREATE DATABASE ... FOR ATTACH_REBUILD_LOG, which rebuilds the missing log file on attach.