Thursday, October 28, 2010

Challenging Stonebraker’s Assertions On Data Warehouses - Part 1

I have tremendous respect for Michael Stonebraker. He is a true visionary, and what I like most about him is his drive and passion to commercialize academic concepts. ACM recently published his article “My Top 10 Assertions About Data Warehouses.” If you haven’t read it, I would encourage you to do so.

I agree with some of his assertions and disagree with a few. I am grounded in reality, but I do have a progressive viewpoint on this topic. This is my attempt to bring an alternate perspective to the rapidly changing BI world as I see it. I hope readers take it as constructive criticism. This post has been sitting in my draft folder for a while; I finally managed to publish it. This is Part 1, covering assertions 1 through 5. Part 2, with the rest of the assertions, will follow in a few days.

“Please note that I have a financial interest in several database companies, and may be biased in a number of different ways.”

I appreciate Stonebraker’s disclaimer. I do believe that his view is skewed toward what he has seen and invested in. I don’t believe there is anything wrong with that. I like it when people put their money where their mouth is.

As you might know, I work for SAP, but this is my independent blog and these are my views, not those of SAP. I also try hard to keep SAP product and strategy references off this blog, to maintain a neutral perspective and avoid any possible conflict of interest.

Assertion 1: Star and snowflake schemas are a good idea in the data warehouse world.

This reads like an incomplete statement. Star and snowflake schemas are a good idea because they have been proven to perform well in the data warehouse world with row and column stores. However, I have started to see emerging NoSQL-based data warehouse architectures that are far from a star or a snowflake; they are, in fact, schemaless.
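For readers less familiar with the terminology, here is a minimal sketch of a star schema, using Python's built-in sqlite3 module and hypothetical table and column names: a central fact table referencing dimension tables by key, queried with joins and aggregation.

# A minimal star schema sketch: one fact table referencing two dimension
# tables by surrogate key. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, calendar_date TEXT, quarter TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (date_id INTEGER, product_id INTEGER,
                              units INTEGER, revenue REAL);  -- real fact tables carry many more attributes

    INSERT INTO dim_date    VALUES (1, '2010-10-01', 'Q4'), (2, '2010-10-02', 'Q4');
    INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware'), (11, 'Gadget', 'Hardware');
    INSERT INTO fact_sales  VALUES (1, 10, 5, 50.0), (1, 11, 2, 40.0), (2, 10, 7, 70.0);
""")

# A typical star-schema query: join the fact table to its dimensions and aggregate.
for row in conn.execute("""
    SELECT d.quarter, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.quarter, p.category
"""):
    print(row)  # ('Q4', 'Hardware', 160.0)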

“Star and Snowflake schemas are clean, simple, easy to parallelize, and usually result in very high-performance database management system (DBMS) applications.”

The following statement contradicts the statement above.

“However, you will often come up with a design having a large number of attributes in the fact table; 40 attributes are routine and 200 are not uncommon. Current data warehouse administrators usually stand on their heads to make "fat" fact tables perform on current relational database management systems (RDBMSs).”

There are a couple of problems with this assertion:
  1. The schema is not simple: with 200 attributes, fat fact tables, and complex joins, what exactly is simple?
  2. Efficient parallelization of a query depends on many factors beyond the schema: how the data is stored and partitioned, the performance of the database engine, and the hardware configuration, to name a few.
"If you are a data warehouse designer and come up with something other than a snowflake schema, you should probably rethink your design.”

Really?

The requirement that the schema has to be perfect upfront has introduced most of the problems in the BI world. I call it the design-time latency: the time between when a business user decides what report or information to request and when she gets it (mostly the wrong one). The problem is that you can only report based on what you have in your DW and what’s tuned.

This is why the schemaless approach seems more promising: it can cut down the design-time latency by allowing business users to explore the data and run ad hoc queries without locking into a specific structure.
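As a contrast to the star schema sketch above, here is a minimal illustration of the schemaless idea, in plain Python with hypothetical records: documents with different attributes live side by side, and an ad hoc question can be answered without designing a schema up front.

# Schemaless sketch: each record is a free-form document; attributes vary by record.
# All names and values here are hypothetical.
events = [
    {"type": "order",  "region": "EMEA", "revenue": 120.0},
    {"type": "order",  "region": "APJ",  "revenue": 80.0, "discount": 0.1},
    {"type": "return", "region": "EMEA", "refund": 30.0},
]

# An ad hoc question asked after the fact, with no upfront schema design:
# "What is the total order revenue by region?"
revenue_by_region = {}
for e in events:
    if e.get("type") == "order":
        region = e.get("region", "UNKNOWN")
        revenue_by_region[region] = revenue_by_region.get(region, 0.0) + e.get("revenue", 0.0)

print(revenue_by_region)  # {'EMEA': 120.0, 'APJ': 80.0}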

Assertion 2: Column stores will dominate the data warehouse market over time, replacing row stores.

This assertion assumes that there are only two ways of organizing data: in a row store or in a column store. This is not true. See my NoSQL explanation above, and my post “The Future Of BI In The Cloud,” for an alternate storage approach.

This assertion also assumes that access performance is tightly dependent on how the data is stored. While this is true in most cases, many vendors are challenging this assumption by introducing an acceleration layer on top of the storage layer. This approach makes it feasible to achieve consistent query performance through a clever acceleration architecture that acts as an access layer and does not depend on how the data is stored and organized.
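Roughly, the idea is an access layer that answers queries from its own optimized representation regardless of how the underlying store keeps the data. Here is a minimal, purely illustrative sketch in Python; all class and method names are hypothetical:

# Sketch of an acceleration layer sitting in front of an arbitrary storage layer.
# The storage layer could be row-based, column-based, or anything else; the
# accelerator memoizes query results so access performance no longer depends
# on the physical layout.

class SlowStore:
    """Stand-in for any storage layer, regardless of physical organization."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate):
        return [r for r in self.rows if predicate(r)]

class Accelerator:
    """Access layer that caches results keyed by a query identifier."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def query(self, key, predicate):
        if key not in self.cache:              # cold: hit the storage layer once
            self.cache[key] = self.store.scan(predicate)
        return self.cache[key]                 # warm: served from the access layer

store = SlowStore([{"region": "EMEA", "revenue": 100}, {"region": "APJ", "revenue": 50}])
fast = Accelerator(store)
print(fast.query("emea_sales", lambda r: r["region"] == "EMEA"))  # first call scans the store
print(fast.query("emea_sales", lambda r: r["region"] == "EMEA"))  # second call comes from the cache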

“Since fact tables are getting fatter over time as business analysts want access to more and more information, this architectural difference will become increasingly significant. Even when "skinny" fact tables occur or where many attributes are read, a column store is still likely to be advantageous because of its superior compression ability."

I don’t agree that the solution to business analysts wanting more information should be fatter fact tables. Even if this is true, how will a column store be advantageous when the data grows beyond the point where compression isn’t that useful?
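For context, the compression advantage Stonebraker refers to comes from the fact that values within a single column are homogeneous and repetitive. A toy sketch in Python, with hypothetical data, of how run-length encoding collapses a column but does little for interleaved row-wise values:

# Toy illustration of why column layouts compress well: run-length encoding (RLE)
# over a single, repetitive column. Data and sizes are hypothetical.
from itertools import groupby

rows = [("EMEA", "open"), ("EMEA", "open"), ("EMEA", "closed"),
        ("APJ", "open"), ("APJ", "open"), ("APJ", "open")]

# Row store: values of different columns are interleaved, little repetition to exploit.
row_layout = [value for row in rows for value in row]

# Column store: each column is stored contiguously and is highly repetitive.
region_column = [r[0] for r in rows]
status_column = [r[1] for r in rows]

def rle(values):
    """Run-length encode a sequence: [(value, run_length), ...]."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

print(rle(region_column))  # [('EMEA', 3), ('APJ', 3)]  -> 2 entries instead of 6
print(rle(status_column))  # [('open', 2), ('closed', 1), ('open', 3)]
print(rle(row_layout))     # interleaved values barely compress: 12 runs of length 1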

“For these reasons, over time, column stores will clearly win”

Even if it were only about rows versus columns, the column store may not be a clear commercial winner in the marketplace. Runtime performance is just one of many factors that customers consider when investing in DW and business intelligence.

“Note that almost all traditional RDBMSs are row stores, including Oracle, SQLServer, Postgres, MySQL, and DB2.”

Exactly!

Row stores, with optimization and acceleration, have demonstrated reasonably good performance and remain competitive. Not that I favor one over the other, but not all row-based DW are so large, growing so rapidly, or suffering such serious performance issues that they warrant a switch from rows to columns.

This leads me to my last issue with this assertion: what about a hybrid store, combining row and column? Many vendors are trying to figure this one out, and if they are successful, it could change the BI outlook. I will wait and watch.
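Conceptually, a hybrid store keeps (or derives) both representations and routes each request to whichever layout suits it: row-wise access for fetching a whole record, column-wise access for scanning a single attribute. A minimal, hypothetical sketch:

# Hypothetical sketch of a hybrid row/column store: the same data is kept in both
# layouts, and each access pattern is routed to the layout that serves it best.

class HybridStore:
    def __init__(self, rows):
        self.rows = list(rows)          # row layout: good for whole-record lookups
        self.columns = {}               # column layout: good for single-attribute scans
        for row in rows:
            for key, value in row.items():
                self.columns.setdefault(key, []).append(value)

    def get_record(self, index):
        return self.rows[index]         # served from the row layout

    def aggregate(self, column, func=sum):
        return func(self.columns[column])  # served from the column layout

store = HybridStore([{"id": 1, "revenue": 100}, {"id": 2, "revenue": 250}])
print(store.get_record(0))         # {'id': 1, 'revenue': 100}
print(store.aggregate("revenue"))  # 350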

Assertion 3: The vast majority of data warehouses are not candidates for main memory or flash memory.

I am assuming that he is referring to flash used as an extension of main memory and not flash as block storage. That said, SSD block storage has huge potential in the BI world.

“It will take a long time before main memory or flash memory becomes cheap enough to handle most warehouse problems.”

Not all DW are growing at the same speed. One size does not fit all. Even if I agree that the price won’t go down significantly, at the current price point, main memory and flash memory can speed up many DW without breaking the bank.

The cost of flash memory is a small fraction of the overall cost of a DW: hardware, licenses, maintenance, and people. If the added cost of flash memory makes the business more agile, reduces maintenance costs, and allows companies to make faster decisions based on smarter insights, it’s worth it. The upfront capital cost is not the only deciding factor for BI systems.

“As such, non-disk technology should only be considered for temporary tables, very "hot" data elements, or very small data warehouses.”

This is easier said than done. Customers will spend significantly more time and energy on a complicated architecture to isolate the hot elements and run them on a different software/hardware configuration.
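Even a stripped-down sketch of that isolation hints at the extra moving parts: tracking access frequency, picking a threshold policy, and routing requests to the right tier, all of which have to be kept in sync as access patterns change. The names and numbers below are hypothetical:

# Hypothetical sketch of hot/cold data tiering: classify elements by access
# frequency and place the "hot" ones on a faster tier. Even this toy version
# needs tracking, a threshold policy, and routing logic -- the operational
# complexity the customer takes on.
from collections import Counter

access_log = ["orders", "orders", "orders", "customers", "orders", "archive_2005"]
HOT_THRESHOLD = 3  # arbitrary policy knob

counts = Counter(access_log)
hot_tier  = {table for table, n in counts.items() if n >= HOT_THRESHOLD}  # e.g. flash / main memory
cold_tier = {table for table in counts if table not in hot_tier}          # e.g. disk

def route(table):
    """Decide which configuration serves a given table."""
    return "flash" if table in hot_tier else "disk"

print(hot_tier, cold_tier)    # hot: {'orders'}  cold: {'customers', 'archive_2005'}
print(route("orders"))        # flash
print(route("archive_2005"))  # disk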

Assertion 4: Massively parallel processor (MPP) systems will be omnipresent in this market.

Yes, MPP is the future; no disagreements. The assertion is not about on-premise versus the cloud, but I truly believe that the cloud is the future for MPP. There are other BI issues that need to be addressed before the cloud becomes a good platform for a massive-scale DW, but the cloud will beat any other platform when it comes to MPP with computational elasticity.
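For illustration, here is a minimal scatter-gather sketch of the MPP idea in plain Python, simulating the nodes with local processes: partition the data, aggregate each partition in parallel, and combine the partial results. On an elastic cloud platform, the number of workers could grow and shrink with the workload.

# Scatter-gather sketch of the MPP idea: each worker aggregates its own data
# partition, and the coordinator combines the partial results. A real MPP
# system distributes partitions across nodes; multiprocessing stands in here.
from multiprocessing import Pool

def partial_sum(partition):
    """Work done independently on each node/partition."""
    return sum(row["revenue"] for row in partition)

if __name__ == "__main__":
    partitions = [
        [{"revenue": 10.0}, {"revenue": 20.0}],  # partition on node 1
        [{"revenue": 5.0}],                      # partition on node 2
        [{"revenue": 7.5}, {"revenue": 2.5}],    # partition on node 3
    ]
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(partial_sum, partitions)  # scatter
    print(sum(partials))                              # gather: 45.0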

Assertion 5: "No knobs" is the only thing that makes any sense.

“In other words, look for "no knobs" as the only way to cut down DBA costs.”

I agree that “no knobs” is what customers should strive for to simplify and streamline their DW administration, but I don’t expect these knobs to significantly drive down the overall operational cost, or even the cost associated with the DBAs. Not all DBAs have a full-time job managing and tuning the DW. DW deployments go through a cycle of tasks that includes requirements gathering, schema design, ETL design, and so on. Tuning, or using the “knobs,” is just one of many tasks that DBAs perform. A no-knobs approach would certainly take some burden off a DBA’s shoulders, but I disagree that it would result in significant DBA cost savings.

For a fairly large deployment, there is significant cost associated with the number of IT layers responsible for channeling reports to the business users. There is an opportunity to invest in the right kind of architecture, the right technology stack for the DW, and the tools on top of it to help increase the ratio of business users to BI IT. This should also help speed up the decision-making process based on the insights gained from the data. Isn’t that the purpose of having a DW to begin with? I see self-service BI as the only way to make IT scale. Instead of cutting DBA costs, I would rather focus on scaling BI IT with the same budget and broader coverage among the business users in an organization.

Monday, October 25, 2010

The Future Of BI In The Cloud



Actual numbers vary based on whom you ask, but the general consensus is that Business Intelligence (BI) and analytics in the cloud is a fast-growing market. IDC expects a compound annual growth rate (CAGR) of 22.4% through 2013. This growth is primarily driven by two kinds of SaaS applications. The first kind is a purpose-specific, analytics-driven application for business processes such as financial planning, cost optimization, and inventory analysis. The second kind is a self-service horizontal analytics application or tool that allows customers and ISVs to analyze data and create, embed, and share analyses and visualizations.

The category that is still nascent and will require significant work is traditional general-purpose BI on large data warehouses (DW) in the cloud. For most enterprises, not only are all the DW on-premise, but the majority of the business systems that feed data into these DW are on-premise as well. If these enterprises were to adopt BI in the cloud, it would mean moving all the data, the warehouses, and the associated processes such as ETL into the cloud. But then, the biggest opportunities to innovate in the cloud exist outside of it. I see significant potential to build black-box, appliance-style systems that sit on-premise and encapsulate the on-premise complexity of moving the data to the cloud: ETL, lifecycle management, and integration.

Assuming that enterprises succeed in moving data to the cloud, I see a couple of challenges that, if treated as opportunities, will spur the most BI innovation in the cloud.

Traditional OLAP data warehouses don’t translate well into the cloud:

The majority of on-premise data warehouses run on some flavor of a relational or columnar database, and most BI tools use SQL to access data from these DW. These databases are not inherently designed to run natively on the cloud. On top of that, the optimizations performed on these DW, such as sharding, indices, and compression, don’t translate well into the cloud either, since the cloud is a horizontally elastic, scale-out platform and not a vertically integrated, scale-up system.

Organizations are rethinking their persistence options, as well as their access languages and algorithms, while moving their data to the cloud. Recently, Netflix started moving its systems into the cloud. It’s not a BI system, but it has similar characteristics, such as a high volume of read-only data and a few index-based look-ups. The new system uses S3 and SimpleDB instead of Oracle (on-premise). During this transition, Netflix picked availability over consistency. Eventual consistency is certainly an option that BI vendors should consider in the cloud. I have also started seeing DW in the cloud that use HDFS, Dynamo, and Cassandra. Not all relational and columnar DW systems will translate well into NoSQL, but I cannot overemphasize the importance of re-evaluating persistence stores and access options when you decide to move your data into the cloud.

Hive, a DW infrastructure built on top of Hadoop, is a MapReduce-meets-SQL approach. Facebook has 15 petabytes of data in its DW running Hive to support its BI needs. Very few companies require such scale, but the best thing about this approach is that you can grow linearly, technologically as well as economically.
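The essence of the MapReduce-meets-SQL idea is that a declarative GROUP BY compiles down to a map phase that emits key/value pairs and a reduce phase that aggregates per key. Here is a toy sketch in plain Python, with hypothetical data, of what a query like "total revenue by region" boils down to:

# Toy illustration of how a SQL-style GROUP BY maps onto MapReduce phases.
# Data and column names are hypothetical; a system like Hive generates and
# distributes these phases across a Hadoop cluster.
from collections import defaultdict

records = [
    {"region": "EMEA", "revenue": 120.0},
    {"region": "APJ",  "revenue": 80.0},
    {"region": "EMEA", "revenue": 40.0},
]

# Map phase: emit (key, value) pairs -- one pair per input record.
mapped = [(r["region"], r["revenue"]) for r in records]

# Shuffle phase: group values by key (handled by the framework between phases).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each key's values, i.e. the SUM(...) per group.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'EMEA': 160.0, 'APJ': 80.0}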

The cloud is not a good platform for I/O-intensive applications such as BI:

One of the major issues with large data warehouses is, well, the data itself. Any complex query typically involves intensive I/O. But I/O virtualization on the cloud simply does not work for large data sets, and remote I/O, due to its latency, is not a viable option. Block I/O is a popular approach for I/O-intensive applications; Amazon EC2 does have block I/O for each instance, but it obviously can’t hold all the data, and it’s still a disk-based approach.

For BI in the cloud to be successful, what we really need is the ability to scale out block I/O, just like scale-out computing. The good news is that there is at least one company that I know of, Solidfire, working on it. I met Dave, the founder, at the Structure conference reception, and he explained to me what he is up to. Solidfire has a software solution that uses solid state drives (SSDs) as scale-out block I/O. I see huge potential in how this can be used for BI applications.

When you put all the pieces together, it makes sense. The data is distributed across the cloud on a number of SSDs that are available to the processors as block I/O. You run some flavor of NoSQL to store and access this data, leveraging modern algorithms and, more importantly, a horizontally elastic cloud platform. What you get is commodity, blazingly fast BI at a fraction of the cost, with a pay-as-you-go subscription model.
Now, that’s what I call the future of BI in the cloud.

Friday, October 15, 2010

Can A Product Manager Be Effective Without Product Design Skills?

I am very passionate about the topic of design and design thinking. When I saw this question on Quora, I decided to post my answer. The following is taken directly from my answer there:

The answer is "Definitely not."

It's not about the product design by itself, but it's about applying core and transferable product design skills to product management. Let's break it down:

1) Understanding users: Good product designers have great user research, observation, and listening skills to put themselves into the shoes of a user and understand the real, mostly unspoken and latent, needs of the end users.

2) Being self-critical: If you are a trained designer, you would stay away from self-referential design, which is a root cause for many failed products. Good product designers are self-critical about their approach and the deliverables and are always open to feedback to iterate on their design.

3) Working with designers: If you are a designer, you have great empathy for fellow designers. I have seen products fail simply because the product managers can't work with the designers and don't share the same mindset.

4) A "maker" mentality: The designers are makers. They make things. The product managers typically don't, the engineers do. For a product manager, it's incredibly important to have a "maker" mentality. They should continuously be making and refining, by themselves or with the help of the engineers. The product managers, who believe that their responsibility ends when they are done gathering the requirements are likely to fail, miserably in most cases.

5) A "T-shaped" product manager: If you're a product manager, the vertical line of the "T" is your core PM skills. However, successful product managers go beyond their core skills, the horizontal line in the letter "T", to learn more about product design, engineering etc. This ensures that they have a holistic perspective of the product. That leads me to my last point.

6) General manager: viable, feasible, and desirable: A good product, from a vendor's perspective, is commercially viable, technologically feasible, and desirable by the end users. Many product managers stop at the business needs, but they truly need to go beyond that: work with engineering to make the product technologically feasible, and have a design mindset to work with the designers to make it desirable by the end users. Product managers should strive for a "general manager" mindset, of which product design is a core element.