Today we are excited to announce the general availability of attribute-based access controls (ABAC) in Gravity!
With ABAC, customers can grant or deny access to data based on factors related to the user and the business context of the data. This lets data admins assign policies based on factors like data sensitivity, department, or any other relevant metadata, making it a highly scalable approach to data governance.
In our public preview post, we showed how easy it is to get started: you tag your data with relevant contextual information and then create policies based on those tags. In this post we quickly review the key features going GA and introduce a new feature included in this release: AI Data Classification.
Key features of ABAC
While the primary functionality of ABAC is the ability to tag data and drive permissions based on those tags, there are other features included as well. These are all in use in mission critical applications for our customers.
Tags and ABAC policies
At the core, users are able to tag data at any level (i.e. Catalog, Schema, Table, or Column) with custom tags, and these tags can be inherited for consistency and ease. Tags are centrally managed by admins in order to reduce tag sprawl and can have two levels for organization or to act as K/V pairs.
Once data is tagged, a policy is applied to a role with a few components:
- Scope: defines the range of objects the policy applies to (i.e. one or more combinations of Catalog, Schema, or Table). This lets you make a policy as broad or as narrow as needed
- Matching Expression: a logic statement that matches tags in order to apply this policy to objects in the scope
- Privileges: the specific permissions applied to the matching entities. These can be ALLOW or DENY, allowing flexibility and exceptions
- (Optional) Expiration: Policies can be time-bound to grant temporary access
- (Optional) Column Masks & Row Filters
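To make the components above concrete, here is a minimal Python sketch of how a scope, a matching expression, privileges, and an expiration could fit together. This is not Galaxy's API or its evaluation engine: the `Table` and `Policy` classes, the `is_allowed` function, the tag strings, and the rule that a DENY overrides an ALLOW are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class Table:
    catalog: str
    schema: str
    name: str
    tags: frozenset  # tags attached directly or inherited from a parent


@dataclass
class Policy:
    scope: tuple                          # (catalog, schema, table); None acts as a wildcard
    matches: Callable[[frozenset], bool]  # matching expression over an object's tags
    privileges: frozenset                 # e.g. {"SELECT"}
    effect: str                           # "ALLOW" or "DENY"
    expires_at: Optional[datetime] = None # optional expiration

    def applies_to(self, table: Table, now: datetime) -> bool:
        # An expired policy no longer applies.
        if self.expires_at is not None and now >= self.expires_at:
            return False
        target = (table.catalog, table.schema, table.name)
        in_scope = all(want is None or want == got
                       for want, got in zip(self.scope, target))
        return in_scope and self.matches(table.tags)


def is_allowed(policies, table: Table, privilege: str, now: datetime) -> bool:
    relevant = [p for p in policies
                if privilege in p.privileges and p.applies_to(table, now)]
    # Assumption: any matching DENY wins over a matching ALLOW.
    if any(p.effect == "DENY" for p in relevant):
        return False
    return any(p.effect == "ALLOW" for p in relevant)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
orders = Table("sales", "public", "orders", frozenset({"department:sales"}))
customers = Table("sales", "public", "customers",
                  frozenset({"department:sales", "pii"}))

policies = [
    # Broad grant: anything in the sales catalog tagged department:sales.
    Policy(("sales", None, None), lambda tags: "department:sales" in tags,
           frozenset({"SELECT"}), "ALLOW"),
    # Exception: deny anything in the same scope tagged pii.
    Policy(("sales", None, None), lambda tags: "pii" in tags,
           frozenset({"SELECT"}), "DENY"),
]

print(is_allowed(policies, orders, "SELECT", now))
print(is_allowed(policies, customers, "SELECT", now))
```

In this sketch, one broad ALLOW plus one tag-driven DENY covers both tables without naming either of them, which is the scalability argument for driving permissions from tags.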
Real-time column masks
Column masks are designed to obfuscate sensitive data in real time when you want to provide some access to a column but not the full contents. Doing this in real time removes the need to write data multiple times for multiple user types.
The way this works in Galaxy is simple: a SQL expression is defined with a variable called “@column”, which is applied at read time to that column for roles that have a mask applied. Starburst Galaxy comes with five common mask types (e.g. mask all but the last four characters), but admins can also define their own.
The net result is that less privileged users can see obfuscated data when need be, while those with elevated permissions can see the full data.
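As a rough illustration of the idea, the Python sketch below mimics a "mask all but the last four characters" mask and shows the “@column” placeholder being swapped for a real column name. The function names, the role names, and the example SQL expression are hypothetical, and the plain text substitution is only a stand-in; the post does not describe how Galaxy expands or evaluates mask expressions internally.

```python
def mask_all_but_last_four(value: str, fill: str = "*") -> str:
    """Replace every character except the last four with a fill character."""
    visible = value[-4:]  # values of four characters or fewer stay fully visible
    return fill * (len(value) - len(visible)) + visible


def expand_mask(expression: str, column: str) -> str:
    """Naive textual substitution of the @column placeholder (illustration only)."""
    return expression.replace("@column", column)


def read_value(value: str, role: str, masked_roles: set) -> str:
    """Apply the mask at read time only for roles that have a mask assigned."""
    return mask_all_but_last_four(value) if role in masked_roles else value


masked_roles = {"support_agent"}           # hypothetical role names
card = "4111111111111111"

print(read_value(card, "support_agent", masked_roles))
print(read_value(card, "fraud_analyst", masked_roles))
print(expand_mask("concat('****', substr(@column, -4))", "card_number"))
```

The stored value is never rewritten; both roles read the same underlying column and only the result differs.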
Real-time row filters
Similar to column masks, row filters are designed to reduce the number of rows available for a table – again at read time. Also like masks, admins define a SQL expression that can be as simple as column = ‘value’ or even subqueries to leverage lookup tables. This functionality combined with column level security gives cell level security to meet the most stringent security requirements.
This removes the need for multiple tables or views to be defined for different roles, simplifying data management workloads.
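The sketch below shows the row filter idea with Python's built-in `sqlite3` module rather than Galaxy itself: one physical table, and a per-role SQL predicate (here a subquery against a lookup table) added when the data is read. The table names, role names, and the `filtered_query` helper are assumptions for illustration, not how Galaxy rewrites queries.

```python
import sqlite3
from typing import Optional


def filtered_query(table: str, row_filter: Optional[str]) -> str:
    """Build the read query for a table, adding the role's filter if one exists.

    String composition is acceptable here only because both inputs are
    admin-defined constants in this sketch, never end-user input.
    """
    if row_filter is None:
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE {row_filter}"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EMEA', 10.0), (2, 'APAC', 20.0), (3, 'EMEA', 30.0);
    CREATE TABLE role_regions (role TEXT, region TEXT);
    INSERT INTO role_regions VALUES ('emea_analyst', 'EMEA');
""")

# Hypothetical filter: the lookup table decides which regions a role may see.
ROW_FILTERS = {
    "emea_analyst":
        "region IN (SELECT region FROM role_regions WHERE role = 'emea_analyst')",
}


def read_orders(role: str):
    sql = filtered_query("orders", ROW_FILTERS.get(role))
    return conn.execute(sql + " ORDER BY id").fetchall()


print(read_orders("emea_analyst"))  # only EMEA rows
print(read_orders("admin"))         # no filter assigned, so all rows
```

Because the filter lives alongside the role rather than in a view, adding a new role means adding one predicate, not another copy of the table.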
NEW! AI data classification available for preview
ABAC uses the concept of tags to allow users to identify the data’s business context (e.g. attaching a “PII” tag to an email or name). Users can then grant or deny access to that data based on those tags.
While ABAC is more scalable than RBAC for many larger organizations, it shifts a lot of manual work to tagging that data. Data classification in Galaxy uses AI to remove that burden by proactively suggesting relevant tags that administrators can choose to accept or reject.
This occurs through a concept called “data classifier jobs” in Starburst Galaxy. Users will be able to run a classifier job against an attached cluster either on demand or on a schedule.
When the classifier job is running, it will take a sample of the underlying data, run that sample through an AI language model that can identify different types of data (called classifications), and then send back the model’s tag recommendations to the Galaxy UI.
The data steward can then choose to accept or reject the recommended tags.
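The flow described above (sample, classify, recommend, review) can be sketched in a few lines of Python. Everything here is a stand-in: a simple regular expression plays the role of the AI model, "sampling" just takes the first values of each column, and the names `run_classifier_job`, `Recommendation`, `review`, and the confidence threshold are hypothetical rather than part of Starburst Galaxy.

```python
import re
from dataclasses import dataclass

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_value(value: str):
    """Stand-in for the language model: returns a classification or None."""
    return "email" if EMAIL.match(value) else None


@dataclass
class Recommendation:
    column: str
    tag: str
    confidence: float  # share of sampled values that received this classification


def run_classifier_job(columns: dict, sample_size: int = 100, threshold: float = 0.8):
    """Sample each column, classify the sample, and return tag recommendations."""
    recommendations = []
    for column, values in columns.items():
        sample = values[:sample_size]  # assumption: a real job would sample differently
        if not sample:
            continue
        labels = [classify_value(v) for v in sample]
        for tag in sorted({label for label in labels if label is not None}):
            confidence = labels.count(tag) / len(sample)
            if confidence >= threshold:
                recommendations.append(Recommendation(column, tag, confidence))
    return recommendations


def review(recommendations, accept):
    """The steward's decision: only accepted recommendations become tags."""
    applied = {}
    for rec in recommendations:
        if accept(rec):
            applied.setdefault(rec.column, set()).add(rec.tag)
    return applied


columns = {
    "contact": ["a@example.com", "b@example.org"],
    "city": ["Boston", "Lyon"],
}
recs = run_classifier_job(columns)
print(recs)
print(review(recs, accept=lambda rec: True))
```

The important design point is the last step: the job only produces suggestions, and nothing is tagged until a person accepts it.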
In early testing, we’ve seen automatic data classification reduce the risk of human error in the tagging process – increasing the data security by ensuring no sensitive data is “missed”, all while significantly reducing the burden on data stewards to tag their data.
For the AI aficionados out there: the current AI model powering the classification is derived from a DeBERTaV3 language model that has been fine-tuned on ~100 languages and has nearly 300 million parameters. This makes it well suited to this type of zero-shot classification, including cross-language use cases. Starburst will continue to refine the AI model over time, increasing its inference performance and adding new classifiers.
What’s next for automatic data classification?
Right now, automatic data classification is limited to approximately 22 PII and location-based classifiers. However, we are already working to include more “built-in” classifiers, as well as allow customers to define their own custom classifiers.
To learn how to get started with automatic data classification in Starburst Galaxy, see the docs page or watch the demo below.