Data masking and control practices|V4.3.4| docs|Distributed Database

Data masking and control practices

Last Updated：2025-05-23 02:03:30 Updated

Background information

In the field of collaborative database development, sensitive data management is crucial for both enterprises and users. In recent years, with the implementation of laws and regulations such as the Data Security Law and Personal Information Protection Law, national and societal emphasis on privacy data security has continuously increased. Operational auditing and privacy data protection have become increasingly important and are now key considerations when selecting database collaboration tools.

Privacy data typically includes sensitive information such as user ID numbers, phone numbers, and home addresses - essential data for normal business operations. To prevent data leakage, strict restrictions on business database access are necessary. However, overly restrictive access may also reduce collaboration efficiency. Therefore, for enterprise-level database development scenarios, we must carefully consider: how to proactively prevent internal data leakage risks in a secure and compliant manner to ensure the strictest protection of user privacy data? This article will share ODC's design philosophy and solutions for privacy data protection and security compliance.

Scenarios

During operations like maintenance, development, and data analysis, enterprises often need to perform database queries. These databases may contain sensitive data required for business operations that cannot be separated. Direct queries of such data may pose privacy leakage risks, while overly strict access restrictions may cause inefficiencies. Data masking technology addresses this by processing sensitive data during SQL queries, data exports, and other outbound scenarios, maintaining database accessibility while protecting privacy.

Consider this example: A database contains a "student" table recording basic student information. When needing to analyze student ages while protecting addresses and contact details, the query results should appear as follows after masking. This ensures both workflow efficiency and maximum privacy protection.

ODC prioritizes sensitive data protection, providing enterprises with comprehensive, trustworthy data security solutions. Database access is often the primary (sometimes only) way to obtain sensitive data. Masking at the query stage establishes the first line of defense for data security.

Security administrators can configure sensitive data rules and masking algorithms as needed. Even DBAs and developers cannot access raw sensitive data after configuration, significantly reducing leakage risks. This strict protection mechanism ensures maximum security for enterprise data.

In ODC's collaborative workflow, users cannot directly query databases. Data access is limited to:

Viewing table data in the database object management page
Submitting export tickets to view data in files
Submitting change tickets with SELECT statements to view results
Executing SELECT statements in the SQL window

Methods 1 and 2 are easier to mask since accessed columns are known. Methods 3 and 4 present greater challenges as users may input complex queries, making dynamic masking more difficult than static masking.

Notably, ODC's data masking covers all outbound scenarios and supports all SQL syntax types for OceanBase in both MySQL compatible and Oracle compatible modes.

Technical background

The following are key terms used in ODC for data masking:

Sensitive Column: Database columns storing sensitive data
Masking Algorithm: Algorithms for processing sensitive data
Identification Rule: Conditions for marking columns as sensitive

Technical architecture

ODC's data masking implementation focuses on two core problems:

Identifying sensitive columns in databases
Determining masking methods for sensitive data

For problem 1, we introduced the sensitive column concept - marking physical columns containing sensitive data as sensitive column, and all data in the sensitive columns will be considered as sensitive data. For problem 2, we provide 21 masking algorithms covering common data types like Chinese/English names, phone numbers, emails, IDs, addresses, license plates, and IPs.

If you are interested to learn more technical details, visit ODC's open source community.

Domain model

Since ODC V4.2.0 introduced project-based management with isolated resources across organizations but shared data sources within organizations, our design implements data masking as shown:

Masking algorithms

Each organization shares a set of masking algorithms. All members within the project are allowed to view and test the algorithms. We referenced Ant Group's standards to provide 21 algorithms covering most scenarios:

Sensitive data management

Since data sensitivity depends on content typically known only to project members, sensitive columns are managed at the project level by admins or DBAs. Users can manually add sensitive columns or use automated scanning with identification rules.

ODC supports three identification rule types:

Path: Matches by database/table/column location
Regex: Matches by names/comments using regular expressions
Script: Custom Groovy scripts for granular matching

For example, to mark all "email" columns as sensitive, you can use a script as follows:

Then scan the data source based on this identification rule within just 3 steps:

Business scenario examples

Prepare the verification data

Before you start, use ODC to create two tables: employee_info and employee_salary.

CREATE TABLE test.employee_info (
id int NOT NULL COMMENT 'Employee ID',
name varchar(32) NULL COMMENT 'Employee name',
email varchar(64) NULL COMMENT 'Employee email',
address varchar(128) NULL COMMENT 'Employee address',
CONSTRAINT pk_id PRIMARY KEY (id)
) DEFAULT CHARSET = utf8mb4 COLLATE = utf8mb4_general_ci;

CREATE TABLE test.employee_salary (
id int NOT NULL COMMENT 'Employee ID',
salary float(10) NULL COMMENT 'Employee salary',
CONSTRAINT pk_id PRIMARY KEY (id)
) DEFAULT CHARSET = utf8mb4 COLLATE = utf8mb4_general_ci;

Insert the simulated data. The following shows examples of the original table data.

Note

The data below is simulated and does not contain real user information.
Add the columns that need to be masked to the sensitive column list. As only a few data columns are displayed, manually configure the sensitive column list:

Scenario verification

Next, let's explore how ODC can help you mask data in various data export scenarios.

Scenario 1: Viewing table data in GUI

As you can see, when you view the data in the employee_info table, sensitive data in the columns has been masked.

Scenario 2: Exporting to CSV

Next, we export a table that contains sensitive columns as a CSV file.

The sensitive columns are also masked in the exported CSV file:

Scenario 3: Executing SELECT queries

As mentioned earlier, the core challenge of dynamic masking lies in accurately identifying sensitive columns in various complex SQL syntaxes. The following sections describe the SQL syntaxes supported by ODC for dynamic masking.

First, a simple single-table query:

As you can see, the email and address columns have been successfully masked. Then, we tested some built-in functions, the CASE WHEN statement, and JOIN and UNION queries:

These queries are also covered by the dynamic data masking of ODC. All three columns in the result set have been properly masked. Let's analyze each case:

The id column is not sensitive by itself, but because it is combined with the sensitive address column via UNION operation, the resulting case_id column requires masking.
The name column is not marked as sensitive, but when concatenated with the salary column using CONCAT() function, the resulting concat(t1.name, '-', salary) column inherits sensitivity.
The merged result of email and name columns naturally requires masking as well.

We further tested multi-level nested subqueries:

As shown, ODC correctly handles both correlated and non-correlated subqueries appearing in FROM and SELECT clauses.

While ODC already handles most SQL queries effectively, we have extended support to include both recursive and non-recursive Common Table Expressions (CTEs), including sensitive data propagation in recursive CTEs.

You might wonder why the cte_id column was masked in this example. This demonstrates the "sensitive data contagion" effect in Recursive CTEs. In the first recursion, the CONCAT(cte_email, cte_name) operation makes cte_name inherit sensitivity from cte_email, thus cte_name becomes a sensitive column. In the second recursion, the CONCAT(cte_name, cte_id) operation propagates sensitivity to cte_id. Consequently, all three columns in the final result set require masking.

Additional scenarios

Database changes and SQL window queries are fundamentally similar - both execute SQL statements and output results. Therefore, the dynamic masking capabilities available in SQL windows apply equally to database change tickets.

For stored procedures, packages, triggers, and custom functions, we currently lack an optimal solution for dynamically intercepting their sensitive data access. However, this does not compromise overall data protection, because these objects must first be created before they can access sensitive data, and in ODC V4.2.0, CREATE permissions for these objects and PL debugging/execution permissions are controlled, therefore, unauthorized users cannot perform these "risky" operations. This comprehensive permission management ensures ODC's sensitive data protection across all aspects.