The Data Warehouse Toolkit

Ralph Kimball, Margy Ross

Mentioned 34

Ralph Kimball invented a data warehousing technique called 'dimensional modelling' and popularised it in his first Wiley bestseller The Data Warehouse Toolkit. Since then dimensional modelling has become the most widely accepted technique for data warehouse design. Since the first edition, Kimball has improved on his earlier techniques and created many new ones. In this second edition, he provides a comprehensive collection of all of them, from basic to advanced, and strategies for optimising data warehouse design for common business applications. He includes examples for retail sales, inventory management, procurement, orders and invoices, customer relationship management, accounting, financial services, telecommunication and utilities, health care, insurance and more. He also presents unique modelling techniques for e-commerce and shows strategies for optimising performance. A companion Web site provides updates on dimensional modelling techniques, links to related sites and source code where appropriate.

More on Amazon.com

Mentioned in questions and answers.

What advice would you have for someone who wants to start in the BI (Business Intelligence) domain?

I'd like to know where and what I should start with: books, blogs, webcasts... what I should pay attention to and what I should stay away from.

Are the Microsoft technologies worthwhile?

I would urge you to read this book; it might seem kind of outdated, but the same theory still applies today. It is probably the best starter for general BI.

The Data Warehouse Toolkit - Ralph Kimball

Regarding Microsoft's BI offering, it is a medium-sized toolset that can do the job for your first steps (I have more experience with Cognos, though). I haven't used MS tools since 2005, so I can't tell you much about them.

In case you happen to be interested in Cognos, I have a few videos which can be of help: Cognos Tutorials

Good luck with your project.

Get the Kimball books (especially this one: http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/0471200247). For starters you may want to begin with the MS BI framework in The Microsoft Data Warehouse Toolkit and SQL Server Enterprise (the MS BI bundle with the database, ETL and reporting); it's easy and readily available, especially if you are a student with MSDNAA, where you can get the Enterprise edition for free!

I really enjoy database design and the whole concept of managing data semantically and all the logic that comes with it.

My knowledge level when it comes to databases is, however, quite basic (I would guess) - I can model data relationships correctly with ER diagrams and junction tables, handling many-to-many, one-to-many, etc. I'm experienced when it comes to programming in general, and I figure my database knowledge is like knowing the basics of object-oriented programming, i.e. how to model a Car class inheriting from the Vehicle class and containing Wheel objects, and so on.

Now I would like to further my knowledge about relational databases so that I may confidently say to an employer that I can handle the subject on a professional level.

All I can handle right now is probably my movie database in the back end of my personal website, which would probably collapse if I were Amazon and had to store millions of movies. So there's the subject of scalability, right? I'm sure there's a pretty "standard" array of subjects/concepts within database design that you simply must understand and be able to apply in real life if you're going to work with databases on a professional level.

So, I would be very grateful if any database gurus in the field could namedrop some areas, concepts, case studies or anything that would be beneficial to study to get really good at databases. I'm sure there's a vast science lurking here, and I want it.

Thanks in advance!

A whole other area is dimensional modelling and data warehousing.

I had been working with relational modelling for years, and then I read The Data Warehouse Toolkit and received an entirely new view of how it could be used.

Objective:

Get the number of times something happened between two times when the order of magnitude of the count is 100,000 - 10,000,000.

Current implementation:

  • Using PostgreSQL
  • Each "incident" is recorded as a separate row in a table

The columns:

  • Incident type
  • Date-Time that it occurred

The query to get the count (pseudocode):

COUNT rows WHERE time_occurred > <begin_time> AND time_occurred < <end_time>

The problem:

This works, but the query is very inefficient and takes about 40 seconds to respond. As I understand it, PostgreSQL is not a good database to use for this type of query.

I sat down and thought up a few ways that this type of query could be indexed and executed in O(log n) time, so I know it is possible.

What tools should I be using to do this? Should we be using a different database to store the count rows? Is there a package we could install on top of PostgreSQL to do this easily? What are our options?

Note:

Not sure if I was clear about this. The result of COUNT should be on the order of 100,000 - 10,000,000. This means that the number of rows that match the query would be on the order of 100,000 - 10,000,000. The actual number of rows in the table is an order of magnitude more.

Thanks so much!

This is exactly the problem that dimensional modelling and data warehousing is designed to solve.

A previous project I worked on built a data warehouse in Ruby in a couple of weeks in order to deal with queries like this, and exposed it to the main app with a simple REST API. Basically you extract your data and transform it into a 'Star Schema', which is highly optimized for queries like the one you describe.

PostgreSQL is well suited to being the data warehouse database.
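To give a feel for the approach (the table and column names below are assumptions, not from the question), one common piece of the solution is an aggregate fact table: pre-summarise the raw incidents per hour once, then answer range counts from the much smaller summary table.

-- Sketch only; names and the hourly grain are assumptions.
CREATE TABLE fact_incident_hourly (
    incident_type   integer   NOT NULL,
    hour_bucket     timestamp NOT NULL,   -- time_occurred truncated to the hour
    incident_count  bigint    NOT NULL,
    PRIMARY KEY (incident_type, hour_bucket)
);

-- Periodic load from the raw incident table
-- (sketch assumes each load covers only fully completed hours):
INSERT INTO fact_incident_hourly (incident_type, hour_bucket, incident_count)
SELECT incident_type, date_trunc('hour', time_occurred), COUNT(*)
FROM incidents
WHERE time_occurred >= :last_load_time
GROUP BY incident_type, date_trunc('hour', time_occurred);

-- The slow COUNT now scans only summary rows
-- (assumes the requested range falls on hour boundaries):
SELECT SUM(incident_count)
FROM fact_incident_hourly
WHERE hour_bucket >= :begin_time
  AND hour_bucket <  :end_time;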

It is a very detailed subject, and a great starter resource is this: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247

I've been tasked with creating a dynamic report builder to extend our current product that should allow our users to configure with relative ease a useful report drawing data from what they've inputted into the system. Currently we customize these reports manually, and this process involves a developer (me) taking the requirements of the report (fields, aggregate totals, percentages, etc) and publishing the results as a relatively interactive page that allows for the ability to 'drill down' for more information in the record rows, etc.

The reports are not extremely complicated, but they're involved enough that programmatically generating these reports doesn't seem possible. I feel like creating an interface to allow the users to customize the look of the report shouldn't be too difficult, though involved in and of itself. Where I am at a loss is how to create an interface that will allow users who have absolutely no 'programming' literacy the ability to easily generate the SQL queries that will pull the information they need.

In fact, they need to be able to create these queries and access the bowels of their inputted data without ever being aware of what they're really doing. I feel that for this to work as required, the generation of the report has to be as indistinguishable from magic as possible. The user should be able to drag and drop what he/she needs from sets of possible data and magically produce a report.

I'm up for the challenge of course, but I really don't know where to start. Once I get the gears moving, resolving individual issues will be 'easy' ( well actually more like part of the process), but getting off the ground has been challenging and frustrating. If anyone can offer me a direction to search in, I'm not afraid of putting in the hours. Thank you for your time, and I look forward to some positive suggestions.

Have a look at what data warehouses do (e.g. The Data Warehouse Toolkit). They create several basic tables that are very wide, contain a lot of redundant data, and each cover a certain aspect of the database.

I would create several such wide views and let the users select a single view as the basis for a dynamic report. They can then choose the columns to display, the sorting and the grouping. But they cannot choose any additional tables or views.

Of course, a typical view must really cover everything regarding a certain aspect of the database. Let's assume you have an Order Items view. Such a view would contain all items of all orders offering hundreds of columns that cover:

  • The order product ID, product name, regular price, discount, paid price, price incl. the associated part of the shipping cost etc.
  • The order ID, order date, delivery date, the shipping cost etc.
  • The customer ID, customer name, customer address etc.
  • Each date consists of several columns: full date, day of year, month, year, quarter, quarter with year etc.
  • Each address consists of the full address, the city, the state, the area, the area code, etc.

That way, the dynamic reporting is rather easy to use because the users don't need to join any tables but have all the data they need.
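As a rough sketch of what one such wide view might look like (all table and column names below are made up for illustration, and the date-part functions vary by database):

-- Hypothetical "Order Items" reporting view: one row per order item,
-- with order, customer and date attributes denormalized onto it.
CREATE VIEW v_order_items AS
SELECT
    oi.order_item_id,
    p.product_id,
    p.product_name,
    oi.regular_price,
    oi.discount,
    oi.paid_price,
    o.order_id,
    o.order_date,
    YEAR(o.order_date)    AS order_year,
    QUARTER(o.order_date) AS order_quarter,
    c.customer_id,
    c.customer_name,
    c.city,
    c.state
FROM order_items oi
JOIN orders    o ON o.order_id    = oi.order_id
JOIN products  p ON p.product_id  = oi.product_id
JOIN customers c ON c.customer_id = o.customer_id;

The report builder then only ever selects, filters, groups and sorts the columns of one such view; all the joins are hidden inside it.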

For a traffic accounting system I need to store large numbers of datasets about internet packets sent through our gateway router (containing timestamp, user ID, destination or source IP, number of bytes, etc.).

This data has to be stored for some time, at least a few days. Easy retrieval should be possible as well.

What is a good way to do this? I already have some ideas:

  • Create a file for each user and day and append every dataset to it.

    • Advantage: It's probably very fast, and data is easy to find given a consistent file layout.
    • Disadvantage: It's not easily possible to see e.g. all UDP traffic of all users.
  • Use a database

    • Advantage: It's very easy to find specific data with the right SQL query.
    • Disadvantage: I'm not sure if there is a database engine that can efficiently handle a table with possibly hundreds of millions of datasets.
  • Perhaps it's possible to combine the two approaches: Using an SQLite database file for each user.

    • Advantage: It would be easy to get information for one user using SQL queries on his file.
    • Disadvantage: Getting overall information would still be difficult.

But perhaps someone else has a very good idea?

Thanks very much in advance.

First, get The Data Warehouse Toolkit before you do anything.

You're doing a data warehousing job, you need to tackle it like a data warehousing job. You'll need to read up on the proper design patterns for this kind of thing.

[Note Data Warehouse does not mean crazy big or expensive or complex. It means Star Schema and smart ways to handle high volumes of data that's never updated.]

  1. SQL databases are slow, but that slowness buys you flexible retrieval.

  2. The filesystem is fast. It's a terrible thing for updating, but you're not updating, you're just accumulating.

A typical DW approach for this is the following.

  1. Define the "Star Schema" for your data. The measurable facts and the attributes ("dimensions") of those facts. Your fact appears to be # of bytes. Everything else (address, timestamp, user id, etc.) is a dimension of that fact.

  2. Build the dimensional data in a master dimension database. It's relatively small (IP addresses, users, a date dimension, etc.) Each dimension will have all the attributes you might ever want to know. This grows, people are always adding attributes to dimensions.

  3. Create a "load" process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (# of bytes). This may update the dimension to add a new user or a new address. Generally, you're reading fact rows, doing lookups and writing fact rows that have all the proper FK's associated with them.

  4. Save these load files on the disk. These files aren't updated. They just accumulate. Use a simple notation, like CSV, so you can easily bulk load them.

When someone wants to do analysis, build them a datamart.

For the selected IP address or time frame or whatever, get all the relevant facts, plus the associated master dimension data and bulk load a datamart.

You can do all the SQL queries you want on this mart. Most of the queries will devolve to SELECT COUNT(*) and SELECT SUM(bytes) with various GROUP BY, HAVING and WHERE clauses.
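To make that concrete, here is a minimal sketch of the star schema and a typical mart query (all names and values are assumptions, not a prescribed design):

-- Dimensions: small, descriptive, always joined to.
CREATE TABLE dim_date    (date_key integer PRIMARY KEY, full_date date, year smallint, month smallint, day smallint);
CREATE TABLE dim_user    (user_key integer PRIMARY KEY, user_id varchar(32), department varchar(64));
CREATE TABLE dim_address (address_key integer PRIMARY KEY, ip_address varchar(45), network varchar(64));

-- Fact: one row per accumulated log record; bytes is the measure.
CREATE TABLE fact_traffic (
    date_key    integer NOT NULL REFERENCES dim_date,
    user_key    integer NOT NULL REFERENCES dim_user,
    address_key integer NOT NULL REFERENCES dim_address,
    protocol    char(3) NOT NULL,   -- e.g. TCP/UDP; could also be its own dimension
    bytes       bigint  NOT NULL
);

-- "All UDP traffic per user for a week" then devolves to the usual shape:
SELECT u.user_id, SUM(f.bytes) AS total_bytes
FROM fact_traffic f
JOIN dim_user u ON u.user_key = f.user_key
JOIN dim_date d ON d.date_key = f.date_key
WHERE f.protocol = 'UDP'
  AND d.full_date BETWEEN DATE '2009-01-01' AND DATE '2009-01-07'
GROUP BY u.user_id;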

I am interested in learning more about data warehousing. I see terms like "dimension", "snowflake schema" and "star schema" thrown about. Where would one start in learning about this stuff? Are there good books or Internet resources?

ETL is in this space too right?

Wikipedia's resources on Data Warehousing are good.

Reading any of Ralph Kimball's books, such as "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling "

Yes, ETL is in this space.

You may also be interested in Column oriented databases.

Vertica have a blog with a few posts regarding how they're often better for what data warehouses are used for. For example "Reflections on the Kimball Data Warehouse "Bible": Time for a New Testament?" and "The Truth About MPP & Data Warehousing"

As a corollary to this question, I was wondering if there were good comparative studies I could consult and pass along about the advantages of letting the RDBMS do the join optimization vs. systematically denormalizing in order to always access a single table at a time.

Specifically I want information about :

  • Performance of normalisation versus denormalisation.
  • Scalability of normalized vs denormalized systems.
  • Maintainability issues of denormalization.
  • Model consistency issues with denormalization.

A bit of history to see where I am going here: our system uses an in-house database abstraction layer, but it is very old and cannot handle more than one table. As such, all complex objects have to be instantiated using multiple queries on each of the related tables. Now, to make sure the system always uses a single table, heavy systematic denormalization is used throughout the tables, sometimes flattening two or three levels deep. As for n-n relationships, they seem to have worked around them by carefully crafting the data model to avoid such relations and always falling back on 1-n or n-1.

The end result is a convoluted, overly complex system where customers often complain about performance. When analyzing such bottlenecks, they never question the basic premises on which the system is based and always look for other solutions.

Did I miss something? I think the whole idea is wrong, but I somehow lack the irrefutable evidence to prove (or disprove) it. This is where I am turning to your collective wisdom to point me towards good, well-accepted literature that can convince the other fellows on my team that this approach is wrong (or convince me that I am just too paranoid and dogmatic about consistent data models).

My next step is building my own test bench and gathering results; since I hate reinventing the wheel, I want to know what is already out there on the subject.

---- EDIT Notes: the system was first built with flat files, without a database system... Only later was it ported to a database, because a client insisted on the system using Oracle. They did not refactor but simply added support for relational databases to the existing system. Flat-file support was later dropped, but we are still awaiting refactoring to take advantage of the database.

As far as I know, Dimensional Modeling is the only technique of systematic denormalization that has some theory behind it. This is the basis of data warehousing techniques.

DM was pioneered by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has also written a raft of books. The book that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I haven't read it yet.

There's no doubt that denormalization improves performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical ecommerce application), and you need it to be fastest to query the Products in a given Order, then you can store data in a denormalized way to support that, and gain some benefit.

But this makes it more awkward and inefficient to query all Orders for a given Product. If you have an equal need to make both types of queries, you should stick with the normalized design. This strikes a compromise, giving both queries similar performance, though neither will be as fast as they would be in the denormalized design that favored one type of query.

Additionally, when you store data in a denormalized way, you need to do extra work to ensure consistency. I.e. no accidental duplication and no broken referential integrity. You have to consider the cost of adding manual checks for consistency.
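To make the trade-off concrete, a small sketch (all table and column names here are hypothetical):

-- Normalized: order/product is a junction table; either direction costs one join.
SELECT p.product_id, p.product_name
FROM order_items oi
JOIN products p ON p.product_id = oi.product_id
WHERE oi.order_id = 42;

-- Denormalized for "products in an order": product attributes are copied onto
-- the order line, so that direction needs no join...
SELECT product_id, product_name, list_price
FROM order_items_denorm
WHERE order_id = 42;

-- ...but every copied attribute must now be kept consistent by hand.
UPDATE order_items_denorm
SET product_name = 'New name'
WHERE product_id = 7;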

I accumulated quite a lot of data in raw form (CSV and binary) - 4GB per day for a few months, to be precise.

I decided to join the civilized world and use a database to access the data, and I wondered what the correct layout would be; the format is quite simple: a few rows for every time tick (bid, ask, timestamp, etc.) x up to 0.5 million/day x hundreds of financial instruments x months of data.

There is a MySQL server with MyISAM (which I understood would be the correct engine for this type of usage) running on commodity hardware (2 x 1GB RAID 0 SATA, Core 2 @ 2.7GHz).

What would be the correct layout of the database? What should the tables/indices look like? What are the general recommendations for this scenario? What pitfalls would you predict along the way?

Edit: my common usage will be simple queries to extract time series information for a specific date and instruments, e.g.

SELECT (ask + bid) / 2
  FROM quotes              -- table name not given in the question; assumed here
  WHERE instrument = 'GOOG'
  AND date = '01-06-2008'
  ORDER BY timeStamp;

Edit: I tried to stuff all my data in one table indexed by the timeStamp but it was way too slow - therefore I reckoned it would take a more elaborate scheme.

Or perhaps consider a star schema, dimensions and facts. Ralph Kimball has some nice stuff to tell you how to go about it.

Imagine that you are handed a new book everyday from an author. The book is a work in progress. He does not tell you what he has changed or added.

Your job is to identify the changes and additions, and pass ONLY these along to the publisher (who does not have time to read the entire book everyday)

For the purposes of this problem, the book is comprised of 1m lines of ascii text and growing (actually a MySQL backup file).

My current idea is to make a secure hash (SHA256 for example) of each line (1k Chars) and store it on HD. Since the hash is only 32bytes the file is only 32MB.

Then when we get the next file tomorrow, we go through it line by line, creating a new hash for each line and comparing it to the hash from the previous day.

When the process is finished we overwrite the hash file ready for the next day.

The comparison uses a binary search method of string comparison (> and < operators). This returns a result in an average of four iterations.

I have not coded a btree index solution yet, but how would you tackle this?

This is a technique used for incremental loading on a data warehouse. In the situation where you do not have the ability to identify changed data within a source system, you can take out a snapshot of the data and compare it with your last snapshot to identify the differences. This technique even gets a mention in Ralph Kimball's book on the subject and is used in an application I was involved in the design of.

You need a hashing algorithm with a very wide key as this approach is vulnerable to birthday attacks. MD5 or any of the SHA family would be good. It also cannot detect deletions without a post-process that goes through the difference looking for missing natural keys. This computation actually needs to be aware of the table structure.
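If the daily snapshots land in staging tables, the comparison itself is plain set logic. A sketch with assumed table and column names:

-- One row per natural key plus a hash of the rest of the row.
-- New or changed rows are the ones to pass along to the incremental load.
SELECT t.natural_key, t.row_hash
FROM snapshot_today t
LEFT JOIN snapshot_yesterday y ON y.natural_key = t.natural_key
WHERE y.natural_key IS NULL          -- added since yesterday
   OR y.row_hash <> t.row_hash;      -- changed since yesterday

-- Deletions need the opposite direction, as noted above:
SELECT y.natural_key
FROM snapshot_yesterday y
LEFT JOIN snapshot_today t ON t.natural_key = y.natural_key
WHERE t.natural_key IS NULL;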

I'm new to modeling star schemas, fresh from reading the Data Warehouse Toolkit.

I have a business process that has clients and non-clients calling into conference calls with some of our employees.

My fact table, call it "Audience", will contain a measure of how long an attending person was connected to the call, and the cost of this person's connection to the call. The grain is "individual connection to the conference call".

Should I use my conformed Client dimension and create a non-client dimension (for the callers that are not yet clients) this way (omitting dimensions that are not part of this question):

First potential model

Or would it be OK/better to have a non-conformed Attending dimension related to the conformed Client dimension in this manner:

Second potential model

Or is there a better/standard mechanism to model business processes like this one?

Edit:

What about using model 2 above, but creating a view on top of the client dimension table and the attending dimension to make it look like it is only one dimension?

Is that an acceptable alternative to Damir's answer below?

There is no need to split clients into two tables (dimensions). Simply put all clients, active and prospective, into the same dimension table. You could then introduce an IsActive attribute (column) to distinguish between paying clients and prospects. Sooner or later you will use a data mining tool to learn more about clients and what distinguishes people who are willing to pay for your service from those who are not. In order for the algorithm to work, you have to supply data for both groups of people -- those who are paying and those who are not paying. To summarize, prospects belong to the same table as paying clients.
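A minimal sketch of that single dimension (column names are illustrative only):

CREATE TABLE dimClient (
    client_key  integer      PRIMARY KEY,  -- surrogate key referenced by the fact table
    client_id   varchar(20)  NOT NULL,     -- natural key from the source system
    client_name varchar(100) NOT NULL,
    is_active   char(1)      NOT NULL      -- 'Y' = paying client, 'N' = prospect
);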

With this, you can use your model No. 1. Make sure that measures in the fact table make sense. For example, if call_id = 123 had 10 people participating, then

SELECT SUM(cost_of_connection)
FROM factAudience
WHERE call_id = 123;

should return the total cost of the call, not something meaningless -- like 10x the real cost.

EDIT

A "paying client" and a "prospect client" are both a type of a client, therefore belong to the same dimension table -- dimClient. Somewhere in the DW there is a factSale (or similar) with FK to the dimSale. Even if you do not have a column in dimClient to differentiate between paying and prospects -- you can still get paying clients by joining factSale and dimClient.

"Who is a customer?" is a common debate when introducing a DW in an organization. In order to be able to analyze client acquisition, retention, conversion, etc., prospects have the same treatment as paying customers -- at least in the DW. Keep in mind that acquiring and creating new customers is on the top of the list for (almost) any CEO.

I am a SQL Server DBA but so far in my career journey only worked with OLTP systems. I want to start learning analysis services in SQL Server but I need some getting started resources.

What is a good book to learn the concepts of dimensional modeling? Are there any good videos or webcasts that can explain to me the concepts behind dimensional modeling, like facts, dimensions, star schema, snowflake schema (I just know those words but not their meaning)?

I know I could have searched this forum for similar questions, but I need some good recommendations based on experience (e.g. if you ask me how to learn T-SQL, my recommendation is to read Itzik Ben-Gan's T-SQL Fundamentals... is there any equivalent gold-standard book for learning OLAP?)

Thanks in advance.

Learn Microsoft BI
39 videos cover themes from basics of business intelligence to the details of SQL Server Analysis Services 2008. All videos are free of charge, but you should register to download them.

Microsoft resources for solution partners and developers
You can find quite interesting videos on BI essentials, data warehousing, SSIS, SSRS, SSAS in the following trainings there:
SQL Server 2008 Training for ISV Developers (10 Parts)
Business Intelligence Essentials Web Seminar Series

Books:
"Smart Business Intelligence Solutions with Microsoft SQL Server 2008" by Lynn Langit, Kevin S. Goff, Davide Mauri, Sahil Malik, and John Welch
"Microsoft SQL Server 2005 Analysis Services Step by Step" by Reed Jacobson, Stacia Misner, Hitachi Consulting

I also made a companion series (on YouTube) to my book - on how to use SSAS in BIDS 'How do I...BI?' - here

Also spend some time with Ralph Kimball and build a solid understanding of dimensional modeling. If you master that, building cubes in Analysis Services is straightforward.

The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)

Kimball : DimensionalModeling :: Itzik : TSQL

Let's say that we have D_PROCESS, D_WORKER and D_STATUS as dimensions, and the fact F_EVENT that links a process (what) with a worker (who's in charge) and the "current" status.

The process status changes over time. Should we store in F_EVENT one line per process/status/worker, or one line per process/worker, and "somewhere else" one line per status change for a given process/worker?

I'm new to data warehousing and it's hard to find best practices/tutorials related to data modeling.

Read The Data Warehouse Toolkit by Ralph Kimball for a good introduction to dimensional modeling.

It sounds like you are storing a process change event in F_EVENT. If this process has a defined beginning and end, I would build a snapshot fact table which would let you track the process over time (simply updating the row each time the process moves from one step to another).

EDIT:

I'll try to make a general case using your dimensions as examples.

For D_PROCESS: a "process" isn't usually modeled as a dimension, and you called it a "what", so I'm going to rename this to "D_ACCOUNT".

The basic data model will be for a "tax processing system" in which WORKERS are processing ACCOUNTS, and each ACCOUNT/WORKER combination has several possible "STATUSES" of where this process currently stands.

D_ACCOUNT
    ACCOUNT_NUMBER
    ACCOUNT_TYPE

D_WORKER
    WORKER_ID
    FIRST_NAME
    LAST_NAME
    BADGE_NUMBER
    SHIFT

D_STATUS
    STATUS_ID
    STATUS_NAME

Now if I want to report on all "events" that have happened to an Account, performed by a worker, I can build a Transaction-level fact table F_EVENT:

F_EVENT
    ACCOUNT_ID
    WORKER_ID
    STATUS_ID
    EVENT_TIME_ID
    Metrics taken at time of the measurement (Cost, Worker time spent, etc)

We call the unique combination of dimensions that identifies a row the Granularity or Grain of the fact table.

The grain of this table is Account, Worker, Status, and Time. It answers questions like "How much time did my workers on shift 3 spend processing accounts on Wednesday?" or "How many events occurred that changed the processing status to 'CLOSED'?"

I'm not sure how much this type of table would help.

Instead, say you are interested in tracking the process itself as it moves through various statuses. I'm going to assume that the status always moves forward in time, from "NOT STARTED" to "IN PROCESS" to "CLOSED".

I'll build what Kimball calls an "Accumulating Snapshot" fact table.

F_TAXPROCESSING
    ACCOUNT_ID
    WORKER_ID 
    CURRENT_STATUS_ID
    NOT_STARTED_DTTM
    NOT_STARTED_FLAG
    IN_PROCESS_DTTM
    IN_PROCESS_FLAG
    CLOSED_DTTM
    CLOSED_FLAG

This table's grain is Account, Worker. This table keeps track of the "process" by updating the date/time of the change to the status, and a flag when that status has been reached.

This allows you to track the process over time, letting you see how many accounts have reached the "IN PROCESS" status, how long it took to get there, et cetera.
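For example, with such a table you could measure the pipeline per worker shift; a sketch that reuses the illustrative names above (the flag encoding and the date arithmetic are assumptions, and the subtraction will need DATEDIFF or its equivalent in some databases):

SELECT w.SHIFT,
       COUNT(*) AS accounts_in_process,
       AVG(f.IN_PROCESS_DTTM - f.NOT_STARTED_DTTM) AS avg_time_to_in_process
FROM F_TAXPROCESSING f
JOIN D_WORKER w ON w.WORKER_ID = f.WORKER_ID
WHERE f.IN_PROCESS_FLAG = 1      -- assumed flag encoding
GROUP BY w.SHIFT;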

SAS programmers at my company work with researchers to analyze data stored in a number of text files around 1Tb in size. The resulting SAS processes can take days to run. Whenever the researchers want to change a question slightly, the processes have to be re-run, requiring further hours or days.

The SAS programmers approached our DBA team for a way of storing their data with the aim of greatly improving query performance.

Two main difficulties are:

  1. We have only a handful of example queries, and there is no particularly typical set of queries to expect.
  2. Many of the queries will be of a form like

    SELECT COUNT(DISTINCT id) FROM t WHERE a = true AND b = 3 AND c BETWEEN 3 AND 10;

but in which the WHERE filter parameters are unknown and could include any combination of columns and attributes. This is to say, it seems to me (having read up a bit about data warehouses) that our requirements exclude a typical data warehouse approach in which we perform some aggregations and work with a higher granularity of records.

I'm looking for any resources that speak to designing databases with similar constraints. In Bill Inmon's Building the Data Warehouse, he briefly mentions "exploration warehouses" and "data mining warehouses". Using these terms I found this article that was slightly helpful: "Designing the Data Warehouse for Effective Data Mining" [pdf], but that's more or less it. Most of what I find when searching re: "data mining" regards OLAP.

I'm a novice DBA and I've been tasked with coming up with some suggestions for this design. I think at this point my most helpful suggestion will be to suggest we design to avoid expensive joins as much as possible. I'm out on a limb here--not expecting miracles, but any sage advice or reading recommendations would be very welcome.

Thanks!

Read everything you can by Ralph Kimball.

http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247

Your typical query (SELECT aggregate FROM fact JOIN dimension WHERE criteria) is the sweet spot for the star schema.

Forget "data mining". It isn't a helpful term.

Focus on "Star Schema". Build the right data structure.

I'm implementing analytics for medical software. The data to be processed is mainly appointment-related. I'm planning to implement a star schema for generating reports. I have a few doubts:

  1. My data can change; for example, an appointment can be marked as cancelled later. I read that changing data in a star schema is not a good idea. If not, what is a better approach?
  2. The data for my fact tables will be inserted by a background task when the data is added to my main database. Is constant insertion of data into the fact table an issue, as reports are viewed in the application almost any time?
  3. I am planning to implement it in MySQL, and if someone can point me to some posts related to the performance of MySQL with this kind of structure, it would be great. Also, which is the better engine to implement this schema, InnoDB or MyISAM?

Thanks.

I'll try to answer in general terms that are not tied to a specific database technology (I'm a MS SQL Server DWH person).

To address your specific questions ...

"1.My data can change like a appointment can be marked as cancelled later, i read that changing data in star schema is not a good idea. If not what is a better approach."

There are two main table types in DWHs: fact tables and dimension tables.

Changing fact or dimensional data in a star schema is perfectly valid. It is not considered good practice to delete dimension records from a DWH.

You need to make a choice of type 1 (overwrite history) or type 2 (keep history) changes to data (Slowly Changing Dimension).

I'm not sure if you are suggesting deleting fact records here, but a better approach would be to have a flag on each fact record to indicate the status of the appointment (booked/used/cancelled/etc.), and if a patient cancels their appointment, change the fact record from status=booked to status=cancelled rather than actually deleting it. This way you can also track the number of cancelled appointments.
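In SQL terms that is an UPDATE of the fact row rather than a DELETE; a sketch with illustrative names only:

-- Mark the appointment's fact row as cancelled instead of deleting it.
UPDATE fact_appointment
SET status_key = (SELECT status_key FROM dim_status WHERE status_name = 'CANCELLED')
WHERE appointment_id = 12345;

-- Cancelled appointments per month then fall out of a normal star query:
SELECT d.year, d.month, COUNT(*) AS cancelled_appointments
FROM fact_appointment f
JOIN dim_status s ON s.status_key = f.status_key
JOIN dim_date   d ON d.date_key   = f.appointment_date_key
WHERE s.status_name = 'CANCELLED'
GROUP BY d.year, d.month;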

To add a complication, you could have your fact table keep history as well, so that you can show an "as at" position, i.e. be able to show the state of the database as at a particular point in time. This increases the size of your database quite a bit, depending on the number of changes that occur.

"2.The data to my fact tables will inserted by a background task when the data is added to my main database. Is constant insertion of data to fact table a issue as reposts are viwed in the application almost anytime."

You need to have a discussion around the frequency of updates and the importance of having up-to-date data. Generally (as I'm sure you are aware) DWHs are not OLTP systems, so they aren't meant to be constantly updated with new data and reported on against the most up-to-date data. If you really want that, you want an OLTP system.

That said, we have implemented a DWH that did updates every 20 minutes. This had a SQL DWH with an OLAP cube sitting on top. I'm not sure if MySQL has OLAP technology, but I feel sure there is some open-source version available. There are several flavours of OLAP (MOLAP/ROLAP/HOLAP) which give different trade-offs between performance and data currency.

You would normally want to separate the DWH itself from the reporting DB layer, especially if there are many users.

"3.I am planning to implement it in mysql, and if someone can point me to some post releated to performance of mysql with this kind of structure it would be great. Also which is a better engine to implement this schema Innodb or Myisam"

I'll have to pass on this question. I used to know a bit about InnoDB and MyISAM, but it's been about 8 years since I played with the technology.

A very good book on star schema DWH design is Ralph Kimball's Data Warehouse Toolkit.

I'm looking for a way to quickly obtain an accurate average inventory for reporting purposes.


BACKGROUND:

I'm looking for a way to determine Gross Margin Return On Investment (GMROI) for inventory items where the inventory levels are not constant over time (i.e. some items may be out of stock and then overstocked, whilst others will be constant and never out of stock).

GMROI = GrossProfit / AverageInventory

say over 1 year

These need to be obtained on the fly, batch processing is not an option.


THE PROBLEM:

Given that the relational database used only has current stock levels, I can calculate back to a historic stock, say:

HistoricStock = CurrentStock - Purchases + Sales

But I really want an average inventory, not just a single point in time.

I could calculate back a series of points and then average them, but I'm worried about the calculation overhead (and to a lesser extent the accuracy), given I want to do this on the fly. I could create a data warehouse and bank the data, but I'm concerned about blowing out the database size (i.e. stock holding per barcode per location per day for, say, 2 years).

From memory, the integral of the inventory/time graph divided by the time interval would give the average inventory, but how do you integrate real-world data without a formula or lots of small time strips?


Any ideas or references would be appreciated.

Thanks B.

In general this seems like a good case for developing an inventory fact table, but exactly how you would implement it depends a lot on your data and source systems.

If you haven't already, I would get the Data Warehouse Toolkit; chapter 3 is about inventory data management. As you mentioned, you can create an inventory fact table and load a daily snapshot of inventory levels from the source system, then you can easily calculate whatever averages you need from the data warehouse, not from the source system.

You mentioned that you're concerned about the volume of data, although you didn't say how many rows per day you would add. But data warehouses can be designed to handle very large tables using table partitioning or similar techniques, and you could also calculate "running averages" after adding each day's data if the calculation takes a very long time for any reason.
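As a sketch of what that daily snapshot and the GMROI denominator could look like (all names are assumptions, and date arithmetic syntax varies by engine):

-- Periodic snapshot fact: one row per item per location per day.
CREATE TABLE fact_inventory_daily (
    date_key         integer       NOT NULL,
    item_key         integer       NOT NULL,
    location_key     integer       NOT NULL,
    quantity_on_hand integer       NOT NULL,
    inventory_value  decimal(12,2) NOT NULL,
    PRIMARY KEY (date_key, item_key, location_key)
);

-- Average inventory over the last year for one item: a plain AVG over the snapshots.
SELECT f.item_key, AVG(f.inventory_value) AS avg_inventory
FROM fact_inventory_daily f
JOIN dim_date d ON d.date_key = f.date_key
WHERE f.item_key = :item_key
  AND d.full_date >= CURRENT_DATE - INTERVAL '1' YEAR
GROUP BY f.item_key;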

I've used Excel PivotTable to analyze data from my database because it allows me to "slice and dice" very quickly. As we know what is in our database tables, we all can write SQL queries that do what PivotTable does.

But I am wondering why PivotTable can construct the queries so fast while it knows nothing about the data and the meanings/relationship between the data fields we give it?

Put the question in another way, how can we build ad-hoc SQL queries in such a fast and efficient way? ("Use PivotTable, of course!", yep, but what I want is a programmatic way).

@Mark Ransom is definitely onto something with the notion of Excel keeping the data in memory, making it faster computationally. It's also possible that Excel pre-indexes datasets in such a way that makes it more responsive than your database.

There's one significant, non-algorithmic possibility for why it's faster: Excel, in Pivot Table usage, has no concept of a join. When you're fetching the data ad hoc from your database, any joins or correlations between tables will result in further lookups, scans, index loads, etc. Since Excel has all the data in a single location (RAM or no), it can perform lookups without having to pre-form datasets. If you were to load your database data into a temp table, it would be interesting to see how ad hoc queries against that table stacked up, performance-wise, against Excel.

One thing's certain, though: although databases are excellent tools for producing accurate reports, a traditionally-normalized database will be far less than optimal for ad hoc queries. Because normalized data structures focus on integrity above all else (if I may take that liberty), they sacrifice ad hoc optimization for the sake of keeping all the data sensible. Although this is a poor example, consider this normalized schema:

+--------+     +---------+
|tblUsers|     |luGenders|
+--------+     +---------+
|userID  |     |genderID |
|genderID|     |gender   |
+--------+     +---------+

SELECT * FROM luGenders;
> 1 Female
> 2 Male

If, in this example, we wished to know the number of female/male users in our system, the database would need to process the join and behave accordingly (again, this is a bad example due to the low number of joins and low number of possible values, which generally should bring about some database-engine optimisation). However, if you were to dump this data to Excel, you'd still incur some database penalty to pull the data, but actually pivoting the data in Excel would be fairly speedy. It could be that this notion of up-front, fixed-cost penalty is being missed by your idea of Excel being quicker than straight ad hoc queries, but I don't have the data to comment.

The most tangential point, though, is that while general databases are good for accuracy, they often suck at ad hoc reports. To produce ad hoc reports, it's often necessary to de-normalize ("warehouse") the data in a more queryable structure. Looking up info on data warehousing will provide a lot of good results on the subject.

Moral of the story: having a fully algorithmic, fast ad hoc query system is an awesome ideal, but is less than practical given space and time constraints (memory and people-hours). To effectively generate an ad hoc system, you really need to understand the use cases of your data, and then denormalize it effectively.

I'd highly recommend The Data Warehouse Toolkit. For the record, I'm no DBA, I'm just a lowly analyst who spends 80 hours per week munging Excel and Oracle. I know your pain.

Our organisation lacks any data mining or analytical tools, so I'm trying to persuade them to implement a Business Intelligence solution using Microsoft SQL Server 2008 R2. They've asked for a prototype, so they can get a better idea of what Business Intelligence can do for them.

I'm assuming that the prototype will consist of -

  • A subset of data from a critical application
  • Integration Services (SSIS): Used to clean the data subset?
  • Analysis Services (SSAS): Used to create and maintain a dimensional model based on that data subset? Data Mining?
  • Reporting Services (SSRS): Used to create, maintain and update a 'dashboard' of reports.

I want to show how a Business Intelligence solution with data mining and analytic capabilities can help their organisation perform better.

As this is the first time I've done this, I'd value advice from other people on whether this prototype is realistic or not. And does anyone know of any easily-accessible real-life examples that I can show them?

my thoughts …

  • Don’t overestimate the size (in terms of time) of a new DWH project.
  • Start with something not very complex and well understood in terms of business rules.

The biggest problem we have had with new DWH projects/pilots (we are a DWH consultancy so have a number of clients) is getting management support of it. Often the DWH will be sponsored by someone in IT and there is no real board level support, so the project takes a long time to progress and it is difficult to get resources for it.

The best projects we have found are ones that have levels of support in three areas: Management (board level), IT and Business (preferably someone with good understanding of the business rules involved).

Have a look at Ralph Kimball’s Data Warehouse Toolkit which goes through different styles of DWH design for different industries. It is very good!

The tools I expect you would use (I’ve added a couple of technologies here)

  • SSIS (ETL tool) is used to Extract (from source systems), Transform (data into the appropriate form for the load) and Load (into Dim and Fact tables)
  • SSAS (OLAP tool) is used to create/process an OLAP cube. Warning: there is quite a large learning curve on this tool!!
  • SSRS (reporting tool) is used to create static and dynamic reports/dashboards.
  • MS Excel. There are free data mining add-ins that, when connected to an OLAP cube, allow very interesting DM to be performed.
  • Windows Sharepoint Services (WSS) (comes free with a Windows Server operating systems) to deploy your SSRS reports onto.

This is a good prototype scope in terms of technologies (if you are from the MS background), but the spread of technologies is very large and for one person coming in cold to them, this is perhaps unrealistic.

The most critical thing is to get the DWH star schema design correct (and there will be a number of solutions that are “correct” for any set of data), otherwise your OLAP cube design will be flawed and very difficult to build.

I would get a local DWH consultant to validate your design before you have built it. Make sure you give them a very tight scope of not changing much otherwise most consultants will “tinker” with things to make them “better”.

Good Luck!

I don't know a good way to maintain sums depending on dates in a SQL database.

Take a database with two tables:

Client

  • clientID
  • name
  • overdueAmount

Invoice

  • clientID
  • invoiceID
  • amount
  • dueDate
  • paymentDate

I need to propose a list of the clients and order it by overdue amount (the sum of the client's unpaid past invoices). On a big database it isn't possible to calculate this in real time.

The problem is the maintenance of an overdue amount field on the client. The amount of this field can change at midnight from one day to the other even if nothing changed on the invoices of the client.

This sum changes if an invoice is paid, if a new invoice is created whose due date is already past, or if a due date is now past that wasn't yesterday...

The only solution I found is to recalculate every night this field on every client by summing the invoices respecting the conditions. But it's not efficient on very big databases.

I think it's a common problem and I would like to know if a best practice exists?

You should read about data warehousing. It will help you to solve this problem. It sounds similar to what you just said:

"The only solution I found is to recalculate every night this field on every client by summing the invoices respecting the conditions. But it's not efficient on very big databases."

But there is more to it than that. When you read about it, try to forget about normalization. Its main intention is to 'show' data, not to 'manage' data. So you may feel weird at the beginning, but if you understand why we need data warehousing, it becomes very, very interesting.

This is a book that can be a good start: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247 - a classic.

I have a series of text files (one per year) which contain the answers to an annual survey/questionnaire. Column headings in the files refer to question numbers, and each row represents one person's answers to the questions e.g.

Q1,   Q2,   Q3, ...
P1A1, P1A2, P1A3, ...
P2A1, P2A2, P2A3, ...
etc.
[where Q1 is Question 1, P1 is Person 1 and A1 is Answer 1].

The questionnaire uses the same core set of questions each year, but each year new questions are also added and some old ones are removed. My tables therefore have lots of fields in common, but they're not all the same e.g.

Year 1             |  Year 2           |  Year 3              etc.
Q1,   Q2,   Q3     |  Q1,   Q2,   Q4   |  Q1,   Q2,   Q5
P1A1, P1A2, P1A3   |  P1A1, P1A2, P1A4 |  P1A1, P1A2, P1A5
P2A1, P2A2, P2A3   |  P2A1, P2A2, P2A4 |  P2A1, P2A2, P2A5

In this example Q1 and Q2 are the core questions, while Qs 3, 4 & 5 depend on the survey year.

Removed question numbers are never re-used: if a new question is added, it's given a completely new number. The real surveys have between 300 and 500 questions, and approximately 40,000 people respond each year.

I want to combine all of this data into a single table, the column headings of which would be the set of distinct headings in the original files, plus a column for the year. For years where a question isn't relevant, I'd like to have nulls e.g.

Year,  Q1,   Q2,   Q3,   Q4,   Q5
1,     P1A1, P1A2, P1A3, Null, Null
1,     P2A1, P2A2, P2A3, Null, Null
2,     P1A1, P1A2, Null, P1A4, Null
2,     P2A1, P2A2, Null, P2A4, Null
3,     P1A1, P1A2, Null, Null, P1A5
3,     P2A1, P2A2, Null, Null, P2A5

Essentially, I just want to append columns where they match and otherwise have nulls, but if possible I'd like to generate the set of column headings for the final table automatically (from the input tables) as the number of questions in the survey is large and the prospect of working-out and then typing the Create Table statement is horrific! I suppose what I'm after is some kind of variation on UNION ALL that doesn't require me to specify all the columns in order and which can handle the non-matching columns.

Does anyone have any tips or suggestions, please? I was originally hoping to use SQLite, but I'm willing to try pretty-much anything. I'm not a databasing expert, but I have a basic understanding of SQL and can do a bit of Python scripting if that's any help.

If you've made it this far thanks for reading! I thought this would be a very simple problem, but it's surprisingly difficult to describe in detail.

Step 1. Stop coding.

Step 2. Buy a book on data warehousing and star-schema design. Ralph Kimball's Data Warehouse Toolkit, for example. http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247

Step 3. Design a proper star schema. Question is a dimension. Time (specifically year) is a dimension. Respondent (if you know them) is a dimension. Response (individual survey instrument) is a dimension. Facts are answers to questions in a given year. Each survey becomes multiple rows of facts.

Step 4. Load your star schema with your various years of data. The dimensions (questions) are a little complex to load because each year reuses some questions and adds new ones. Once the dimensions are loaded, the fact rows are pretty easy to load.

Step 5. Now you can write a quick reporting application to extract the requisite data from the various fact rows and assemble the desired report.
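A rough sketch of step 3, with illustrative names only (answers are modelled as facts, so a question that wasn't asked in a given year simply has no rows):

CREATE TABLE dim_question (
    question_key  integer PRIMARY KEY,
    question_code varchar(10) NOT NULL,    -- 'Q1', 'Q2', ...
    question_text varchar(500)
);

CREATE TABLE dim_year (
    year_key integer PRIMARY KEY,
    year     smallint NOT NULL
);

CREATE TABLE fact_answer (
    year_key      integer NOT NULL REFERENCES dim_year,
    question_key  integer NOT NULL REFERENCES dim_question,
    respondent_id integer NOT NULL,   -- degenerate dimension if respondents are anonymous
    answer_text   varchar(100)
);

This replaces the wall of NULL columns in the single-table approach, and the wide year-by-question report becomes a query (or pivot) over fact_answer.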

I'm doing an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info).

Then we get them to answer six short questions with 5 possible values much worse, worse, same, better, much better.

Of course, over time we will not get the same participants: some will drop out and some new ones will sign up. So I'm trying to decide how best to build a DB and code (I hope to use Python, NumPy?) to allow for ongoing collection and analysis by the various categories defined by the initial demographic data. As of now we have 700 or so participants, so the dataset is not too big. I.e.: demographics (UID, north/south, residential/commercial), then the answers to the six questions for Q1, the same for Q2 and so on; then I need to be able to slice, dice and average the values for the quarterly answers by the various demographics to see trends over time.

The averaging, grouping and so forth is modestly complicated by having differing participants each quarter

Any pointers to design patterns for this sort of DB? and analysis? Is this a sparse matrix?

This is a Data Warehouse. Small, but a data warehouse.

You have a Star Schema.

You have Facts:

  • response values are the measures

You have Dimensions:

  • time period. This has many attributes (year, quarter, month, day, week, etc.) This dimension allows you to accumulate unlimited responses to your survey.

  • question. This has some attributes. Typically your questions belong to categories or product lines or focus areas or anything else. You can have lots of question "category" columns in this dimension.

  • participant. Each participant has unique attributes and reference to a Demographic category. Your demographic category can -- very simply -- enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.

Get Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns. http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.

Also, since you're doing Data Warehousing, look at all the [Data Warehousing] questions on Stack Overflow. Read every Data Warehousing blog you can find.

There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.

I'm new to data warehousing, but I think my question can be answered relatively easily. I built a star schema with a dimension table 'product'. This table has a column 'PropertyName' and a column 'PropertyValue'. The dimension therefore looks a little like this:

surrogate_key | natural_key (productID) | PropertyName | PropertyValue | ...
    1              5                          Size           20          ...
    2              5                          Color          red
    3              6                          Size           20
    4              6                          Material       wood

and so on.

In my fact table I always use the surrogate keys of the dimensions. Because of the PropertyName and PropertyValue columns, my natural key isn't unique/identifying anymore, so I get way too many rows in my fact table.

My question now is: what should I do with the property columns? Would it be best to put each property into a separate dimension, like a size dimension, a color dimension and so on? I have about 30 different properties. Or shall I create columns for each property in the fact table? Or make one dimension with all properties?

Thanks in advance for any help.

Your dimension table 'product' should look like this:

surrogate_key | natural_key (productID) | Color | Material | Size | ...
    1              5                      red     (null)     20     ...
    2              6                      (null)  wood       20     ...

If you have too many properties, try to group them into another dimension. For example, Color and Material can be attributes of another dimension if you can have the same product with the same ID and same price in another color or material. Your fact table can then identify a product with two keys: product_id and colormaterial_id...
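One common way to get from the name/value rows to that flattened dimension is a pivot during the load; a sketch assuming the pairs sit in a staging table:

-- Flatten property name/value pairs into one row per product.
SELECT
    product_id,
    MAX(CASE WHEN PropertyName = 'Color'    THEN PropertyValue END) AS Color,
    MAX(CASE WHEN PropertyName = 'Material' THEN PropertyValue END) AS Material,
    MAX(CASE WHEN PropertyName = 'Size'     THEN PropertyValue END) AS Size
FROM staging_product_properties
GROUP BY product_id;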

Reading recommendation: The Data Warehouse Toolkit, Ralph Kimball

I'm tasked with creating a datawarehouse for a client. The tables involved don't really follow the traditional examples out there (product/orders), so I need some help getting started. The client is essentially a processing center for cases (similar to a legal case). Each day, new cases are entered into the DB under the "cases" table. Each column contains some bit of info related to the case. As the case is being processed, additional one-to-many tables are populated with events related to the case. There are quite a few of these event tables, example tables might be: (case-open, case-dept1, case-dept2, case-dept3, etc.). Each of these tables has a caseid which maps back to the "cases" table. There are also a few lookup tables involved as well.

Currently, the reporting needs relate to exposing bottlenecks in the various stages and the granularity is at the hour level for certain areas of the process.

I may be asking too much here, but I'm looking for some direction as to how I should setup my Dim and Fact tables or any other suggestions you might have.

I suggest you check out Kimball's books, particularly this one, which should have some examples to get you thinking about applications to your problem domain.

In any case, you need to decide if a dimensional model is even appropriate. It is quite possible to treat a 3NF database as an 'enterprise data warehouse' with different indexes or summaries, or whatever.

Without seeing your current schema, it's REALLY hard to say. Sounds like you will end up with several star models with some conformed dimensions tying them together. So you might have a case dimension as one of your conformed dimensions. The facts from each other table would be in fact tables which link both to the conformed dimension and any other dimensions appropriate to the facts, so for instance, if there is an employee id in case-open, that would link to an employee conformed dimension, from the case-open-fact table. This conformed dimension might be linked several times from several of your subsidiary fact tables.

Kimball's modeling method is fairly straightforward, and can be followed like a recipe. You need to start by identifying all your facts, grouping them into fact tables, identifying individual dimensions on each fact table and then grouping them as appropriate into dimension tables, and identifying the type of each dimension.

I use a DATE master table for looking up dates and other values in order to control several events, intervals and calculations within my app. It has rows for every single day beginning from 01/01/1990 to 12/31/2041.

One example of how I use this lookup table is:

  1. A customer pawned an item on: JAN-31-2010
  2. Customer returns on MAY-03-2010 to make an interest pymt to avoid forfeiting the item.
  3. If he pays 1 month's interest, the employee enters a "1" and the app looks up the pawn
    date (JAN-31-2010) in the date master table and puts FEB-28-2010 in the applicable interest pymt date. FEB-28 is returned because FEB-31 doesn't exist! If 2010 were a leap year, it would've returned FEB-29.
  4. If customer pays 2 months, MAR-31-2010 is returned. 3 months, APR-30... If customer pays more than 3 months or another period not covered by the date lookup table, employee manually enters the applicable date.

Here's what the date lookup table looks like:


{ Copyright 1990:2010, Frank Computer, Inc. }

{ DBDATE=YMD4- (correctly sorted for faster lookup) }

CREATE TABLE     datemast 
(
 dm_lookup       DATE,    {lookup col used for obtaining values below}
 dm_workday      CHAR(2), {NULL=Normal Working Date,}
                          {NW=National Holiday(Working Date),}
                          {NN=National Holiday(Non-Working Date),}
                          {NH=National Holiday(Half-Day Working Date),}
                          {CN=Company Proclamated(Non-Working Date),}
                          {CH=Company Proclamated(Half-Day Working Date)}

 {several other columns omitted}

 dm_description CHAR(30), {NULL, holiday description or any comments}
 dm_day_num     SMALLINT, {number of elapsed days since begining of year}
 dm_days_left   SMALLINT, {number of remaining days until end of year}

 dm_plus1_mth   DATE,     {plus 1 month from lookup date}
 dm_plus2_mth   DATE,     {plus 2 months from lookup date}
 dm_plus3_mth   DATE,     {plus 3 months from lookup date}
 dm_fy_begins   DATE,     {fiscal year begins on for lookup date}
 dm_fy_ends     DATE,     {fiscal year ends on for lookup date}
 dm_qtr_begins  DATE,     {quarter begins on for lookup date}
 dm_qtr_ends    DATE,     {quarter ends on for lookup date}
 dm_mth_begins  DATE,     {month begins on for lookup date}
 dm_mth_ends    DATE,     {month ends on for lookup date}
 dm_wk_begins   DATE,     {week begins on for lookup date}
 dm_wk_ends     DATE,     {week ends on for lookup date}

 {several other columns omitted}
)
IN "S:\PAWNSHOP.DBS\DATEMAST"; 

Is there a better way of doing this or is it a cool method?
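
For reference, the lookup described in step 3 is a single indexed read against the table above; a minimal sketch (the date-literal format depends on your DBDATE setting):

-- Returns FEB-28-2010, MAR-31-2010 and APR-30-2010 for a JAN-31-2010 pawn date.
SELECT dm_plus1_mth, dm_plus2_mth, dm_plus3_mth
FROM   datemast
WHERE  dm_lookup = '2010-01-31';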

This is a reasonable way of doing things. If you look into data warehousing, you'll find that those systems often use a similar structure for the date/time dimension table. Since there are fewer than 20,000 rows in the roughly fifty-year span you're using, there isn't a huge amount of data.

There's an assumption that the storage gives better performance than doing the computations; that most certainly isn't clear-cut, since the computations are not that hard (though neither are they trivial) and any disk access is very slow in computational terms. However, the convenience of having all the information in one table may well outweigh having to keep track of an appropriate computation for each of the derived values stored in it.
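
If the computed route is ever reconsidered, note that at least some engines already clamp month arithmetic to the end of a short month. A sketch in T-SQL (SQL Server), shown only as a point of comparison with the lookup table, not as part of the original Informix application:

-- DATEADD clamps to the last valid day of the target month.
DECLARE @pawn_date DATE = '2010-01-31';

SELECT DATEADD(MONTH, 1, @pawn_date) AS plus1_mth,   -- 2010-02-28
       DATEADD(MONTH, 2, @pawn_date) AS plus2_mth,   -- 2010-03-31
       DATEADD(MONTH, 3, @pawn_date) AS plus3_mth;   -- 2010-04-30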

As part of my role at the firm I'm at, I've been forced to become the DBA for our database. Some of our tables have row counts approaching 100 million, and many of the things that I know how to do in SQL Server (like joins) simply break down at this level of data. I'm left with a couple of options:

1) Go out and find a DBA with experience administering VLDBs. This is going to cost us a pretty penny and come at the expense of other work that we need to get done. I'm not a huge fan of it.

2) Most of our data is historical data that we use for analysis. I could simply create a copy of our database schema and start from scratch with new data, putting on hold any analysis of our current data until I find a proper way to solve the problem (this is my current "best" solution).

3) Reach out to the developer community to see if I can learn enough about large databases to get us through until I can implement solution #1.

Any help that anyone could provide, or any books you could recommend would be greatly appreciated.

Here are a few thoughts, but none of them are quick fixes:

  1. Develop an archival strategy for the data in your large tables. Create tables with similar formats to the existing transactional tables and copy the data out into those tables on a periodic basis (see the sketch after this list). If you can get away with whacking the archived data out of the transactional system, then fine.

  2. Develop a relational data warehouse to store the large data sets, complete with star schemas consisting of fact tables and dimensions. For an introduction to this approach there is no better book (IMHO) than Ralph Kimball's Data Warehouse Toolkit.

  3. For analysis, consider using MS Analysis Services for pre-aggregating this data for fast querying.

  4. Of course, you could also look at your indexing strategy within the existing database. Be careful with any changes as you could add indexes that would improve querying at the cost of insert and transactional performance.

  5. You could also research partitioning in SQL Server.

  6. Don't feel bad about bringing in a DBA on contract basis to help out...
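
As a minimal sketch of the archival idea in point 1 (T-SQL; the dbo.Transactions table and its TransactionDate column are hypothetical stand-ins for your own large tables):

-- One-time: create an archive table with the same shape as the source.
SELECT *
INTO   dbo.Transactions_Archive
FROM   dbo.Transactions
WHERE  1 = 0;

-- Periodic job: copy rows older than a cutoff, then remove them from the source.
-- In practice you would do the DELETE in batches to keep the log manageable.
DECLARE @cutoff DATETIME;
SET @cutoff = DATEADD(YEAR, -2, GETDATE());

BEGIN TRANSACTION;

INSERT INTO dbo.Transactions_Archive
SELECT * FROM dbo.Transactions WHERE TransactionDate < @cutoff;

DELETE FROM dbo.Transactions WHERE TransactionDate < @cutoff;

COMMIT TRANSACTION;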

To me, your best bet would be to begin investigating movement of that data out of the transactional system if it is not necessary for day to day use.

Of course, you are going to need to pick up some new skills for dealing with these amounts of data. Whatever you decide to do, make a backup first!

One more thing you should do is ensure that your I/O is being spread appropriately across as many spindles as possible. Your data files, log files and SQL Server tempdb data files should all be on separate drives for a database system that large.

Hi, I am new to data warehousing and I need to design a fact constellation schema for a retail store network. Can anybody suggest a good tutorial? I have rarely seen any on the web. Please help me; it's really urgent.

The Data Warehouse Toolkit is essential reading for data warehousing, and includes an entire chapter where retail sales is used as an example.

I am a web developer and I would like to shift my field to data warehousing. Can anyone please give me some idea of which languages or tools I need to learn, like Cognos, DataStage, ETL?

Or, if anyone currently working can guide me on how to start, I will be thankful. Do I need to learn Oracle, given that I know MySQL and SQL? My basic understanding of databases is good. Any books?

Here are some books to start with. The toolkits you choose can be based upon whichever toolkit is in use where you are working. Most data warehousing professionals adhere to either Kimball's methods or Inmon's, so having knowledge of both should help you in the future. I included the other because I respect the work of Imhoff and Geiger. I was lucky when I started: I was asked to take over maintenance of an existing proof-of-concept data warehouse. Good luck, and if you need any help, let us know on StackOverflow!

The books I list below are general design and methodology books which are technology neutral. With good knowledge of these methodologies, and some specific training on tools, you should be able to work with any BI stack. You might want to try several toolkits to find one that fits your style. I have worked with the Microsoft toolkit almost exclusively, but I have touched on Cognos and Hyperion at various times in the past, and the place I am on contract to right now is switching to Informatica for ETL. They all have good and bad points, so it comes down to being adaptable and a quick learner, just like any other IT job.

As a web developer, you probably used many tools depending upon what you did, i.e. you learned HTML first and then did some CSS and maybe some JavaScript, and you may have also used tools like ColdFusion. But the point is that once you knew the basics, like the ones covered in the books below, the tools were secondary.
Kimball's Data Warehouse Toolkit
Mastering Data Warehouse Design (Imhoff, Galemmo, Geiger)
Inmon's Building the Data Warehouse

I am on SQL Server 2008 R2 and I am currently developing a database structure which contains seasonal values for some products.

By seasonal I mean that those values won't be useful after a particular date in terms of customer use. But those values will be used for statistical purposes by internal staff.

On the sales web site, we will add a feature for product search, and one of my aims is to make this search as optimized as possible. But the more rows there are in the table, the slower this search will become. So I am considering archiving the unused values.

I can handle auto archiving with SQL Server Jobs automatically. No problem there. But I am not sure how I should archive those values.

The best way I can come up with is to create another table inside the same database with the same columns and put the archived rows there.

Example :

My main table is named ProductPrices and a primary key has been defined for it. Then I created another table named ProductPrices_archive. I created a primary key field for this table as well, and the same columns as the ProductPrices table except for the ProductPrices primary key value. I don't think it is useful to archive that value (is that assumption correct?).

For internal use, I am considering putting the values of the two tables together with UNION (is that the correct way?).

This database is meant to be used for a long time and it should be designed with the best possible structure. I am not sure if I am missing something here for the long run.

Any advice would be appreciated.

gbn brings up some good approaches. I think the "right" longer-term answer for you is the third option, though.

It sounds like you have two business use cases for your data:

1.) Real-time Online Transaction Processing (OLTP). This covers the POS transactions, inventory management, and quick "how did receipts look today, how is inventory, are we having any operational problems?" kinds of questions, and it keeps the business running day to day. Here you want the data necessary to conduct operations, and you want a database optimized for updates/inserts/etc.

2.) Analytical questions/reporting. This is looking at month-over-month numbers, year-over-year numbers, running averages. These are the strategic questions that look at a complete picture of your history: you'll want to see how last year's Christmas seasonal items did against this year's, and maybe even compare those numbers with the seasonal items from that same period five years ago. Here you want a database that contains a lot more data than your OLTP system. You want to throw away as little history as possible, and you want a database largely optimized for reporting and answering questions, probably more denormalized. You want the ability to see things as they were at a certain time, so the Type 2 SCDs mentioned by gbn would be useful here.

It sounds to me like you need to create a reporting database. You can call it a data warehouse, but that term scares people these days. It doesn't need to be scary; if you plan it properly, it doesn't have to take you 6 years and 6 million dollars to make ;-)

This is definitely a longer term answer but in a couple years you'll be happy you spent the time creating one. A good book to understand the concept of dimensional modeling and thinking about data warehouses and their terminology is The Data Warehouse Toolkit.
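
On the UNION question raised above: a view over the live and archive tables is the usual way to present both sets to internal reporting, and UNION ALL avoids the duplicate-elimination step that a plain UNION performs. A minimal sketch, assuming the ProductPrices / ProductPrices_archive tables from the question; the ProductId, Price and SeasonEndDate columns are purely illustrative:

CREATE VIEW dbo.ProductPricesAll
AS
SELECT ProductId, Price, SeasonEndDate
FROM   dbo.ProductPrices
UNION ALL
SELECT ProductId, Price, SeasonEndDate
FROM   dbo.ProductPrices_archive;

Internal staff can then query dbo.ProductPricesAll without caring which table a row currently lives in.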

I am trying to build a data warehouse using the OLTP database and was wondering if I need to define the Fact Tables first or the Dimensions?

Please help?

Regards Yogi

It's a very iterative process; usually you go this way:

  1. Gather business requirements
  2. Identify facts and dimensions
  3. Design
  4. Implement

It's of course very simplified. Steps 1-3 usually consume about 80% of the whole project and consist of many sub-steps.

I recommend you read Kimball's book on the subject: http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/0471200247

-- UPDATE

I found another guideline which might lead you through creating successful BI solution:

  1. Create a data source
  2. Create a data source view
  3. Create one or more dimensions
  4. Create a cube
  5. Deploy the database

This will lead you to the conclusion that you need to have dimensions and facts already in place. This is both right and wrong.

You might find that it is impossible to gather all the requirements at once, so what you should start with is (and create tables accordingly):

  1. Identify & create dimensions
  2. Identify & create facts
  3. Connect those, then add cube calculations, KPIs, and whatever else is required (a minimal sketch of steps 1-2 follows at the end of this answer)

Source

Repeat steps 1-3 as you learn more from your customer.

This iterative nature comes from various factors. Customers are not always aware of all their requirements; sometimes those are not even functional ones, so how could they know? We as developers are not domain experts; we usually learn as we go.
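
A minimal sketch of why steps 1 and 2 fall in that order: the fact table's foreign keys depend on the dimension tables already existing (and, when loading, on the dimension rows already being present). The table and column names here are hypothetical:

CREATE TABLE DimDate
(
 DateKey      INTEGER PRIMARY KEY,  -- e.g. 20100131
 FullDate     DATE,
 CalendarYear INTEGER
);

CREATE TABLE DimCustomer
(
 CustomerKey  INTEGER PRIMARY KEY,  -- surrogate key
 CustomerName VARCHAR(100),
 Region       VARCHAR(50)
);

-- Created and loaded last, because its keys reference the dimensions.
CREATE TABLE FactSales
(
 DateKey      INTEGER NOT NULL REFERENCES DimDate (DateKey),
 CustomerKey  INTEGER NOT NULL REFERENCES DimCustomer (CustomerKey),
 SalesAmount  DECIMAL(18,2),
 Quantity     INTEGER
);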

I recently started a new job in BI. I have done very well at adapting my T-SQL skills to working on a BI data warehouse. My problem is that the field is so diverse and the tools available are almost limitless that I don't know what to focus my research on. Can someone point me in the right direction? I'm not looking for a "how to" but more best-practice type of information. If you have worked in BI for several years, where would you tell an entry-level employee to start? What do you wish you had known when you started that you know now?

I constantly feel like I have too many options, and that if I don't pick the correct path my work will be useless... I bet someone else has experienced this when starting to work in BI.

I agree with Dan about starting with the tasks assigned. Do not just search the internet for answers; spend time reading books. By finding the solution in books, you learn a lot along the way. By googling your queries (which I'm sure have already been answered), you do not learn anything new.

Apart from that, you can start learning the basics of dimensional modelling, star schemas, snowflakes, slowly changing dimensions, etc. Get used to these buzzwords. A good book to start with is Ralph Kimball's Data Warehouse Toolkit (http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/0471200247)

Vijay

I've been requested to study "the tools Oracle provides to create a data warehouse". I have studied basic DW concepts, data cubes, data mining, etc. in theory, and I also know how to build a basic DW with a "classic" RDBMS like Access (OK, let's say MySQL...)

Now it's time to get my hands on something more practical, so...

  1. Can you suggest a few books, possibly free?
  2. Does Oracle provide free tools to build a DW for demo/educational purposes?

In general, Oracle has two ETL tools: Oracle Warehouse Builder (OWB) and Oracle Data Integrator (ODI). The latter (ODI) is the newer product, and I would expect it to be the successor that will replace OWB at some point in the (possibly distant) future; some people may want to kill me for such a forecast. ;-) Anyway, those are rather expensive tools, and Oracle has "nothing" for free. They actually have some things, like JDeveloper, but it works best with their database, which is free only for non-commercial use, and with some additional limitations. ;-)

If you want to start with the data warehousing concept, you should read one of Ralph Kimball's or Bill Inmon's books (or both, as they represent different concepts); it's important to know the concepts first. After that, you should try to find your ETL tool. I am not sure, but Oracle generally seems keen on Inmon's approach.

You can experiment with Talend Open Studio first, which is free and yet very functional. Once you know it, you will be able to easily switch to other tools, as they are generally very similar.

Scenario:

I have some main tables:
MAIN_TABLE1
MAIN_TABLE2
These main tables are in a relational schema.

Now, I need to move the data from this relational schema to some other flat tables:

flat_Parent_Table
flat_Child_Table1
flat_Child_Table2
flat_GrandChild1
flat_GrandChild2

Now one record from MAIN_TABLE1 goes to multiple flat tables. What should the logic be to move the data into these flat tables and to create a relational schema over them?

This is an Extract, Transform, Load (ETL) problem. If you're using SQL Server 2005 or above, you can use SSIS. If not, you can use DTS. The precise logic for how you de-normalize your data depends on what you are trying to accomplish. There are many good books on that subject. Also, check out the Kimball Group.
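
Whatever tool you pick, the core of the transform usually boils down to a set of INSERT ... SELECT statements from the relational source into each flat target. A rough sketch; since the real layout of MAIN_TABLE1/MAIN_TABLE2 isn't shown, every column name below is hypothetical:

-- One parent row per source record.
INSERT INTO flat_Parent_Table (parent_id, customer_name, order_date)
SELECT m.id, m.customer_name, m.order_date
FROM   MAIN_TABLE1 m;

-- Child rows: the repeating attributes of the same source record,
-- keyed back to the parent so the flat tables keep a relational shape.
INSERT INTO flat_Child_Table1 (parent_id, line_no, product_code, quantity)
SELECT m.id, d.line_no, d.product_code, d.quantity
FROM   MAIN_TABLE1 m
JOIN   MAIN_TABLE2 d ON d.main1_id = m.id;

The grandchild tables follow the same pattern, joining one level further down.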

We have a database that has been growing for around 5 years. The main table has near 100 columns and 700 million rows (and growing).

The common use case is to count how many rows match a given criteria, that is: select count(*) where column1='TypeA' and column2='BlockC'.

The other use case is to retrieve the rows that match a criteria.

The queries started by taking a bit of time, now they take a couple of minutes. I want to find some DBMS that allows me to make the two use cases as fast as possible.

I've been looking into some column-store databases and Apache Cassandra, but I still have no idea what the best option is. Any ideas?

I am going to assume this is an analytic (historical) database with no current data. If not, you should consider separating your dbs.

You are going to want a few features to help speed up analysis:

  • Materialized views. This is essentially pre-calculating values and then storing the results for later analysis. MySQL and Postgres (coming soon in Postgres 9.3) do not support this, but you can mimic them with triggers (see the sketch after this list).

  • Easy OLAP analysis. You could use the Mondrian OLAP server (Java), though Excel doesn't talk to it easily; JasperSoft and Pentaho do.

  • You might want to change the schema for easier OLAP analysis, i.e. a star schema. A good book on that:

http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247/ref=pd_sim_b_1
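
For the count-by-two-columns use case in the question, a materialized view (or a trigger-maintained summary table on engines that lack them) turns the slow scan into a single-row lookup. A sketch in PostgreSQL 9.3+ syntax; the events table and its columns stand in for the real schema:

-- Pre-aggregate the counts once, then refresh on a schedule.
CREATE MATERIALIZED VIEW event_counts AS
SELECT column1, column2, COUNT(*) AS row_count
FROM   events
GROUP  BY column1, column2;

CREATE INDEX ON event_counts (column1, column2);

-- Periodic refresh (e.g. nightly):
REFRESH MATERIALIZED VIEW event_counts;

-- The original minutes-long query becomes:
SELECT row_count
FROM   event_counts
WHERE  column1 = 'TypeA' AND column2 = 'BlockC';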

If you want open source, I'd go with Postgres (it doesn't choke on big queries like MySQL can), plus Mondrian, plus Pentaho.

If not open source, then the best bang for the buck is likely Microsoft SQL Server with Analysis Services.

We have a product that runs Sql Server Express 2005 and uses mainly ASP.NET. The database has around 200 tables, with a few (4 or 5) that can grow from 300 to 5000 rows per day and keep a history of 5 years, so they can grow to have 10 million rows.
We have built a reporting platform, that allows customers to build reports based on templates, fields and filters.
We have faced performance problems almost since the beginning. We try to keep report display times under 10 seconds, but some of them go up to 25 seconds (especially for customers with a long history).
We keep checking indexes and trying to improve the queries, but we get the feeling that there's only so much we can do. Of course, the fact that the queries are generated dynamically doesn't help with the optimization. We also added a few tables that keep redundant data, but then we have the added problem of keeping this data up to date, and SQL Express also has a limit on database size.
We are now at a point where we have to decide if we want to give up real-time reports, or maybe cut the history, to get better performance.
I would like to ask what is the recommended approach for this kind of systems.
Also, should we start looking at third-party tools/platforms? I know OLAP can be an option, but can we make it work on SQL Server Express, or at least with a license that is cheap enough to distribute to thousands of deployments?

Thanks

Query execution in a case like this can be improved by maintaining a denormalized version of your database. SQL Express does not offer much in the way of BI "out of the box," so your best option (within the constraints of SQL Express) is to design a solution manually.

This means designing a denormalized version of your database, to which you export data for reporting. Be aware that denormalized databases take up a lot more space. Here is a book that deals with designing data warehouses.

You should also look into architectural options that would centralize the data in a fully-featured RDBMS.
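
As a rough illustration of the denormalized reporting copy described above (T-SQL; the Orders, OrderLines, Products and Customers tables are hypothetical, since the real 200-table schema isn't shown):

-- One wide, pre-joined row per order line, so report queries never
-- have to join the transactional tables at run time.
CREATE TABLE SalesFlat
(
 OrderDate    DATETIME,
 CustomerName VARCHAR(100),
 ProductName  VARCHAR(100),
 Quantity     INTEGER,
 LineAmount   DECIMAL(18,2)
);

-- Periodic export job (e.g. nightly), filtered to new rows only.
DECLARE @LastExportDate DATETIME;
SET @LastExportDate = '20130101';   -- hypothetical incremental cutoff

INSERT INTO SalesFlat (OrderDate, CustomerName, ProductName, Quantity, LineAmount)
SELECT o.OrderDate, c.Name, p.Name, ol.Quantity, ol.Quantity * ol.UnitPrice
FROM   Orders o
JOIN   OrderLines ol ON ol.OrderId = o.OrderId
JOIN   Products p    ON p.ProductId = ol.ProductId
JOIN   Customers c   ON c.CustomerId = o.CustomerId
WHERE  o.OrderDate >= @LastExportDate;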