Normally, table data in a database is stored within the database’s own storage space, while external table data is stored in external storage services. OceanBase introduces the external table feature, which allows users to directly query data stored in external systems, or quickly import external data into OceanBase through external tables. With the external table feature, OceanBase can connect to and access data in file systems such as HDFS, OSS, and S3, and supports various data file formats including CSV, ORC, and Parquet.
Considerations
When the external data file directory has a clearly defined hierarchical structure, such as being organized by time, province, or city, it is recommended to specify it as an external partitioned table. This makes it easier to perform partition pruning during external table queries, improving query speed. For the syntax to create external partitioned tables, refer to Creating external table partitions in MySQL-compatible mode and Creating external table partitions in Oracle-compatible mode.
Before creating an external table, make sure that the network connection between the external data source and the OBServer is available.
Scenarios
As a database feature commonly used in analytical workloads, some typical scenarios for external tables include:
- Log analysis: Enterprises generate a large amount of log data every day, including access logs and error logs. These logs are often first stored in object storage. By creating external tables that point to these log files, you can quickly perform complex queries on the logs. For example, you can statistics the number of accesses within a specific time period or identify the module with the highest error frequency.
- Historical data analysis: For historical data that needs to be retained for a long time but is infrequently accessed, you can choose to store it in object storage or HDFS, which offers lower costs. You can then create external tables to directly query this data when needed. This approach saves storage space while maintaining efficient data access.
- Data integration from multiple sources: Modern enterprise business processes often involve multiple information systems, each generating data in different formats. External tables allow you to load data files in various formats (such as CSV, Parquet, and ORC) from different sources and execute cross-system queries on a unified platform, providing enterprises with comprehensive data insights.
- Data lake analytics: As data lake architectures become more popular, more and more enterprises are building their own data lakes to centrally manage and analyze raw or semi-structured data from various sources. In such cases, you can create external tables that point to specific locations in the data lake to directly participate in data exploration activities within the data lake, accelerating the decision-making process.
- Accelerated lakehouse analytics: Offline data warehouses in data lakes can provide non-real-time query analytics, but may not meet the real-time requirements in some scenarios. By using external tables to quickly import data to internal tables, you can enable more timely and performant query analytics.
Example
The following example simulates the scenario of reading and analyzing local log files.
A folder contains files in CSV format. You can use a partitioned external table to manage and organize these files. The file organization is as follows:
external_table_mock_log ├── 2023-06-01 │ ├── server_log1.csv │ └── server_log2.csv ├── 2023-06-02 │ └── server_log1.csv ├── 2023-06-03 │ ├── server_log1.csv │ ├── server_log2.csv │ └── server_log3.csv └── 2023-07-01 └── server_log1.csvTake the server_log1.csv file as an example. The content of the file is as follows:
2023-06-01 14:42:37.568624, INTERNAL ERROR, -4007, Not supported 2023-06-01 14:42:38.861356, ITER END, -4008, traverse map failed 2023-06-01 14:42:39.931161, NEED WAIT, -4076, query and update last id fail 2023-06-01 14:42:39.931161, SUCCESS, 0, do flush cache successSet the path of the imported file.
Set the system variable
secure_file_privto specify the path where files can be imported or exported. For more information, see secure_file_priv.Notice
For security reasons, you can only connect to the database locally to execute the SQL statement that modifies the global variable through a local socket when you set the system variable
secure_file_priv.Connect to the database and create an auto-partitioned external table to access the files in the directory. The SQL command is as follows:
obclient> create external table ex_t1 ( time date, errstate varchar(30), errcode int, errcontent varchar(100), date_key date as (substr(substr(metadata$fileurl, instr(metadata$fileurl, '%') + 1), 1, 10)) ) location='/home/admin/external_table_mock_log' FORMAT ( type = 'csv', field_delimiter = ',', SKIP_BLANK_LINES = TRUE ) partition by (date_key) ;The value in the
METADATA$FILEURLcolumn is the path and file name. If the file is a local file, the value is returned in the format ofip:port%2023-06-01/server_log2.csv.Query data.
Query data in the external table.
You can query the data by specifying the value or range of the partitioning key. In this case, partition pruning is performed. The external table reads data only from the specified partition.
obclient> select * from ex_t1 where date_key = '2023-06-01';Query the file list of the external table.
After the external table is created, you can also query the file list in the
ALL_OB_EXTERNAL_TABLE_FILESview.
SELECT * FROM oceanbase.ALL_OB_EXTERNAL_TABLE_FILES WHERE TABLE_NAME='ex_t1';Import data from the external table to an internal table.
You can modify the
default_load_modeparameter to specify the data import behavior. For more information, see default_load_mode.Create an internal table.
obclient> create table t1 ( time date, errstate varchar(30), errcode int, errcontent varchar(100), date_key date ) partition by list columns(date_key) (partition p20230601 values in ('2023-06-01'), partition p20230602 values in ('2023-06-02'), partition p20230603 values in ('2023-06-03'), partition p20230701 values in ('2023-07-01') );Import external data to the internal table.
obclient> insert into t1 select * from ex_t1;