Data Mart
Data mart vs data warehouse
Data warehouses attempt to be a central repository for enterprise-wide analytics by consolidating structured data from relational database management systems, customer relationship management systems, and other data sources.
A data mart is a mini data warehouse. Copying data most relevant to a business unit’s needs from the warehouse to the business unit’s data mart simplifies exploration, discovery, and analytics.
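As a rough illustration of that copy step, the sketch below uses two in-memory SQLite databases as stand-ins for a warehouse and a marketing data mart. The table names, columns, and rows are hypothetical and exist only for this example; a production pipeline would use the warehouse's own tooling.

```python
import sqlite3

# In-memory database standing in for the enterprise warehouse (hypothetical schema).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE shipments (id INTEGER PRIMARY KEY, customer_id INTEGER, weight_kg REAL);
    INSERT INTO customers VALUES (1, 'Ada', 'EMEA'), (2, 'Grace', 'AMER');
    INSERT INTO shipments VALUES (10, 1, 4.2), (11, 2, 1.5);
""")

# Separate database standing in for the marketing data mart.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")

# Copy only the subject area marketing needs; the shipments table stays behind.
rows = warehouse.execute("SELECT id, name, region FROM customers").fetchall()
mart.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
mart.commit()

# List the tables the mart exposes to its users.
mart_tables = [r[0] for r in mart.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
```

Here `mart_tables` contains only `customers`, so analysts exploring the mart never see the warehouse's logistics tables.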
Data mart vs data lake
While data warehouses only store structured data, data lakes can store raw data in any format. These data repositories let users access more diverse data to generate insights and inform decision-making. However, they lack the analytics resources of a data warehouse.
Although data marts do not store unstructured data, they include the fast querying tools and analytics resources business units need to generate reports and perform ad hoc analyses.
Data mart vs data products
Data products are the automated reports, business dashboards, and other methods used to make data useful. They depend on reliable sources of high-quality data such as a data mart. Developing these products typically involves a collaboration between a business unit’s analysts and the data management team.
Data mart use cases
To serve as a single source of truth for the entire organization, a data warehouse consolidates data from various sources to meet diverse analytics needs. Even experienced data scientists, let alone non-technical end users, lack the engineers’ deep understanding of the warehouse’s design, which makes data access difficult.
Data marts improve accessibility by creating a walled garden containing the data business users need without the warehouse’s complexity.
A common use case occurs at the department level, where a data mart will contain specific data the department uses regularly. For example, marketing can access customer data without navigating a warehouse full of logistics data.
Data marts also support short-term projects like data mining or machine learning initiatives by providing a focused pool of relevant data the project team can process efficiently.
Benefits of a data mart
Avoiding the complexity of a large enterprise data warehouse helps business units in several ways.
Focused source of truth: Data marts create a common reference for everyone in the business unit, so there’s little risk of interpreting data in different ways.
Accessibility: By only storing data pertaining to one subject area, data marts are less cluttered and easier to access.
Faster analytics: Streamlining data access eliminates delays in a business unit’s analytics processes so executives can make more informed decisions faster.
Cost-effectiveness: Requesting help from the central data team adds costs to a business unit’s IT budget. Data marts reduce these costs by opening data access to more users.
Security and governance: If designed correctly, ELT pipelines can apply data security, privacy, and governance rules while copying data into the data mart.
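The governance point above can be sketched as a small transformation applied while rows are copied. Everything here is an assumption made for illustration: the column names, the allow-list, and the use of a truncated SHA-256 hash as a pseudonym. A real pipeline would follow the organization's own masking and key-management policies.

```python
import hashlib

def mask_email(email: str) -> str:
    """Replace an email with a short, stable pseudonym so rows stay joinable.

    Simplification: an unsalted, truncated hash is not strong protection.
    """
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()[:12]

def copy_with_governance(rows, allowed_columns, masked_columns):
    """Yield mart-ready rows: keep only allowed columns, mask sensitive ones."""
    for row in rows:
        out = {}
        for col in allowed_columns:
            value = row[col]
            out[col] = mask_email(value) if col in masked_columns else value
        yield out

# Hypothetical warehouse rows; "ssn" must never reach the mart.
warehouse_rows = [
    {"customer_id": 1, "email": "Ada@example.com", "ssn": "000-00-0000", "region": "EMEA"},
]

mart_rows = list(copy_with_governance(
    warehouse_rows,
    allowed_columns=["customer_id", "email", "region"],
    masked_columns={"email"},
))
```

Using an allow-list rather than a block-list means a new sensitive column added to the warehouse stays out of the mart until someone explicitly approves it.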
Types of data marts
In most cases, a data mart supports its business unit’s analytics needs, but some variations play a more operational role. The context of each use case will determine the types of data marts a company implements.
Dependent data marts
A top-down approach creates data mart partitions within the centralized data warehouse. These dependent data marts maintain the warehouse’s role as the single source of truth while giving departments access to curated data sets.
Dependent data marts can be more cost-effective for business units since the data team handles most management tasks at the warehouse level. That allows business intelligence analysts to handle data mart maintenance.
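One way to picture a dependent data mart is as a curated view defined inside the warehouse itself. The sketch below uses an in-memory SQLite database as a stand-in warehouse; the `orders` table, its `dept` column, and the view name are hypothetical, and real warehouses may use schemas, partitions, or materialized views instead.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for the central warehouse
wh.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, dept TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'Ada', 'marketing', 120.0),
        (2, 'Grace', 'logistics', 75.5),
        (3, 'Linus', 'marketing', 40.0);
    -- The view acts as the mart: it stores nothing and always reflects the warehouse.
    CREATE VIEW marketing_mart_orders AS
        SELECT id, customer, amount FROM orders WHERE dept = 'marketing';
""")

mart_rows = wh.execute(
    "SELECT id, customer, amount FROM marketing_mart_orders ORDER BY id"
).fetchall()
```

Because the view reads directly from the warehouse table, there is no second copy to reconcile, which is how this arrangement keeps the warehouse as the single source of truth.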
Independent data marts
Small or mid-sized organizations without an existing data warehouse will operate independent data marts for each business unit. In effect, each department has its own mini-warehouse that ingests data from internal and external sources through traditional extract, transform, and load data pipelines.
Hybrid data marts
Hybrid data marts are stand-alone systems that extract data from a central data warehouse and external data sources. Hybrid data marts are commonly used for new business initiatives that can’t justify the cost of modifying the data warehouse. Should the project become operational, the engineering team can modify the warehouse to ingest data from the initiative’s external sources.
Analytical and operational data marts
Most data marts serve the analytical needs of their users, letting them process historical data to generate insights. However, data warehousing is less appropriate for managing operational systems that use real-time data sets to make second-by-second or minute-by-minute adjustments.
An operational data store is a type of data mart for real-time analytics and control. Pipelines extract data from operational systems into the operational data store’s short-term storage. Visualization tools, dashboards, and other reporting systems let users continuously monitor their systems.
Unlike an analytical data mart, however, the operational data store does not keep this transitory data. If needed for future analysis, the data moves to the company’s warehouse.
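The retention behavior described above can be sketched with a small in-memory structure. The class name, the 60-second retention window, and the plain list standing in for the warehouse are all assumptions for illustration, not a description of any particular product.

```python
from collections import deque

class OperationalStore:
    """Keeps only readings newer than `retention_s`; older ones go to an archive."""

    def __init__(self, retention_s: float, archive: list):
        self.retention_s = retention_s
        self.archive = archive    # stand-in for the company's warehouse
        self.readings = deque()   # (timestamp, value) pairs, oldest first

    def ingest(self, timestamp: float, value: float) -> None:
        self.readings.append((timestamp, value))
        cutoff = timestamp - self.retention_s
        # Evict expired readings, handing them to the archive instead of dropping them.
        while self.readings and self.readings[0][0] < cutoff:
            self.archive.append(self.readings.popleft())

    def latest(self):
        """Most recent reading, or None if nothing has been ingested."""
        return self.readings[-1] if self.readings else None

warehouse_archive = []
ods = OperationalStore(retention_s=60, archive=warehouse_archive)
for t, v in [(0, 20.1), (30, 20.4), (90, 21.0)]:
    ods.ingest(t, v)
```

After the third reading arrives at `t=90`, the reading from `t=0` falls outside the window and moves to `warehouse_archive`, while dashboards polling `ods.latest()` see only current data.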
Challenges of a data mart
While data marts simplify analytics at the department level, they can introduce new data management challenges.
Integration and data silos
Independent data marts can become silos that prevent data from being used elsewhere in the organization. Smaller businesses may not have the internal expertise to manage data sharing between independent data marts, leaving each system a separate island of data.
Hybrid data marts in large organizations can also become isolated data repositories when initiatives keep running without being formally operationalized. The data mart will continue ingesting data from internal and external sources, but it won’t be part of the warehouse that the rest of the enterprise searches.
Difficulties with maintenance
Adopting a data mart model results in the proliferation of ETL and ELT pipelines that data engineers must create and maintain. In the case of hybrid and dependent data marts, this work is on top of the data warehouse’s pipeline maintenance workloads. These burdens can fall harder on smaller organizations with limited resources for maintaining independent data marts.
Scalability
Dependent data marts can struggle to keep pace with growing data volumes and source diversity. Once a data mart is in place, it only contains the data it was designed to collect. The business unit won’t benefit from any new sources of relevant data added to the warehouse without a regular process of exploration and redesign.
Scalability is also an issue due to the financial model of data warehouse solutions. The vendor’s storage fees multiply as more data gets copied into the data mart. Where a pricing structure couples storage and compute, the company can also end up paying for compute capacity it doesn’t use.
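A back-of-the-envelope calculation shows how copied data inflates the storage bill. The volumes and the per-terabyte price below are made-up figures for illustration only; actual vendor pricing varies and is often more complex than a flat rate.

```python
def monthly_storage_cost(warehouse_tb: float, mart_tb: list[float], price_per_tb: float) -> float:
    """Total storage bill when each mart holds its own copy of warehouse data."""
    return (warehouse_tb + sum(mart_tb)) * price_per_tb

# Hypothetical figures: a 100 TB warehouse at 20.0 per TB per month.
base = monthly_storage_cost(100, [], 20.0)
# Three marts copy 15, 10, and 25 TB of that same data.
with_marts = monthly_storage_cost(100, [15, 10, 25], 20.0)

extra_share = (with_marts - base) / base  # fraction of the bill added by duplication
```

With these assumed numbers the bill grows from 2000.0 to 3000.0, a 50% increase paid entirely for data the company already stores once.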
Redundancy
Companies that rely on isolated, independent data marts often force their business units to collect the same data independently. This fragmented approach not only adds to the company’s storage costs but also makes it more difficult for departments to work together and with customers.
Consider what happens when the marketing and service departments use different systems to collect customer information. Each department stores names, addresses, and phone numbers in different formats, making the records difficult to synchronize. They may even assign different identification numbers, treating one customer as two relationships.
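The synchronization problem in this example can be made concrete with a small matching sketch. The record layouts, ID formats, and normalization rules below are hypothetical, and real entity resolution usually needs fuzzier matching than this.

```python
import re

def normalize_phone(raw: str) -> str:
    """Keep digits only and use the last 10 (assumes 10-digit national numbers)."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:]

def normalize_name(raw: str) -> str:
    """Handle 'Last, First' vs 'First Last'; ignore case and extra spaces."""
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
        raw = f"{first} {last}"
    return " ".join(raw.lower().split())

def match_key(record: dict) -> tuple:
    """Build a comparison key that ignores each department's formatting."""
    return (normalize_name(record["name"]), normalize_phone(record["phone"]))

# The same person as each department's data mart stores them (made-up records).
marketing = {"id": "MKT-0042", "name": "Lovelace, Ada", "phone": "+1 (555) 010-0199"}
service = {"id": "SVC-9917", "name": "ada  lovelace", "phone": "555.010.0199"}

ids_agree = marketing["id"] == service["id"]
same_customer = match_key(marketing) == match_key(service)
```

The department IDs disagree, so a naive join on `id` treats this customer as two relationships; only after normalizing both records does `same_customer` evaluate to `True`.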
Data governance
Silos, redundancy, and other data mart risks make data governance more challenging. The company in the previous example collects and stores its personally identifiable information multiple times. If each department manages its data marts independently without central governance oversight, the company’s risk of violating privacy regulations increases dramatically.
Similar issues can arise in large organizations unless dependent data marts aggregate or process warehouse data to preserve privacy. Additional access controls will be needed to prevent unauthorized access to protected data.
In addition, the way data marts duplicate data adds to the company’s attack surface. Hackers have more opportunities to discover and infiltrate data stored in multiple locations and systems.
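The aggregation approach mentioned above for dependent data marts can be sketched as a view that exposes only grouped totals. The table, the columns, and the minimum group size of 2 are assumptions for illustration; an appropriate threshold depends on the applicable privacy rules.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for the warehouse
wh.executescript("""
    CREATE TABLE purchases (customer_name TEXT, postcode TEXT, region TEXT, amount REAL);
    INSERT INTO purchases VALUES
        ('Ada', 'N1 9GU', 'EMEA', 100.0),
        ('Alan', 'SW1A 1AA', 'EMEA', 50.0),
        ('Grace', '10001', 'AMER', 70.0);
    -- The mart view exposes no names or postcodes, only regional totals,
    -- and hides regions with fewer than 2 purchases (hypothetical threshold).
    CREATE VIEW mart_sales_by_region AS
        SELECT region, COUNT(*) AS purchases, SUM(amount) AS revenue
        FROM purchases
        GROUP BY region
        HAVING COUNT(*) >= 2;
""")

summary = wh.execute(
    "SELECT region, purchases, revenue FROM mart_sales_by_region"
).fetchall()
```

The AMER region has a single purchase, so the view suppresses it rather than exposing a figure that maps back to one person.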
How Starburst can help business units that have data marts
Starburst’s federated approach to data analytics is a powerful alternative to the data mart approach. By abstracting enterprise data sources to create a virtual access layer, Starburst unifies data warehouses, relational databases, and other sources into a single point of access for the entire business.
An ANSI-standard SQL engine helps democratize access for business users at any skill level. Data scientists and experienced analysts can write SQL statements to query data, while non-technical users can use SQL-compliant visualization tools like Tableau to get the data they need.
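To give a feel for what a single point of access means, the sketch below joins tables from two separate databases in one SQL statement. This is only an analogy built on SQLite’s ATTACH DATABASE feature, not Starburst’s API or architecture; the database aliases, tables, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                 # plays the "warehouse"
conn.execute("ATTACH DATABASE ':memory:' AS crm")  # plays a separate CRM source

conn.executescript("""
    CREATE TABLE main.orders (customer_id INTEGER, amount REAL);
    CREATE TABLE crm.contacts (customer_id INTEGER, name TEXT);
    INSERT INTO main.orders VALUES (1, 120.0), (1, 30.0), (2, 75.0);
    INSERT INTO crm.contacts VALUES (1, 'Ada'), (2, 'Grace');
""")

# One query spans both sources; no data was copied into a mart first.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.contacts AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The analyst writes ordinary SQL against qualified table names, and the engine resolves where each table lives, which is the general idea behind a federated access layer.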