Connect to Databricks

Connection Modes: Direct

Connection

To connect to a Databricks Compute Cluster, select Add Datasource on the datasources overview page to open the Create Datasource dialog and select Databricks as the database type.

The Name is a required human-friendly identifier or description of the datasource in Cluvio. Datasource names need not be unique, but we recommend giving each datasource a unique and meaningful name for ease of identification, especially if your organization uses multiple datasources.

The Host is the host name of your Databricks workspace and the Port defaults to the standard SSL/TLS port 443.

The HTTP Path identifies the Databricks compute cluster. The Default Catalog is the catalog that Cluvio connections use by default, so queries on schemas in this catalog need not be qualified with the catalog name.
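To illustrate what the Default Catalog means for query authors, the following is a minimal sketch of how a two-part table reference could be resolved against a default catalog. The function name `qualify` and the names `main`, `sales`, and `orders` are hypothetical and chosen only for illustration; this is not Cluvio's or Databricks' actual resolution logic.

```python
def qualify(table_ref: str, default_catalog: str) -> str:
    """Return a catalog.schema.table name, adding the default catalog if omitted.

    Simplification: the reference is split on dots, so backtick-quoted
    identifiers that themselves contain dots are not handled.
    """
    parts = table_ref.split(".")
    if len(parts) == 3:
        # Already fully qualified: the catalog was given explicitly.
        return table_ref
    if len(parts) == 2:
        # schema.table: falls back to the default catalog.
        return f"{default_catalog}.{table_ref}"
    raise ValueError(f"expected schema.table or catalog.schema.table, got {table_ref!r}")


# With a (hypothetical) default catalog "main", both references name the same table.
print(qualify("sales.orders", "main"))
print(qualify("main.sales.orders", "main"))
```

Tables in any other catalog still need the full three-part name.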

For authentication with Databricks, Cluvio supports OAuth secrets (recommended) or personal access tokens. Follow the instructions in the Databricks documentation to generate the chosen credentials. For OAuth secrets, you must select the expiration as shown in the Databricks UI. Cluvio uses the expiration to send e-mail reminders to organization admins one week before the secret expires. This helps ensure uninterrupted operation for your analysts and viewers.
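As a worked example of the reminder timing, the sketch below computes the date that falls one week before a given expiration date. The function name `reminder_date` and the sample dates are hypothetical; how Cluvio schedules the e-mails internally is not described here.

```python
from datetime import date, timedelta


def reminder_date(expires_on: date) -> date:
    """Return the date one week (7 days) before the secret expires."""
    return expires_on - timedelta(days=7)


# A secret expiring on 2025-03-15 would trigger a reminder on 2025-03-08.
print(reminder_date(date(2025, 3, 15)))
```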

Databricks connections always use verified SSL/TLS connections.
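If you want to check independently that your workspace host accepts a verified TLS connection on the configured port, a sketch using only the Python standard library might look like the following. The helper names `make_verified_context` and `check_tls` are hypothetical, and this is a local pre-check under the assumption that the host is reachable from your machine; it does not reproduce what Cluvio's connection test does.

```python
import socket
import ssl
from typing import Optional


def make_verified_context() -> ssl.SSLContext:
    """Create a TLS context that verifies the server certificate and host name."""
    # ssl.create_default_context() enables CERT_REQUIRED and check_hostname.
    return ssl.create_default_context()


def check_tls(host: str, port: int = 443, timeout: float = 5.0) -> Optional[str]:
    """Open a verified TLS connection and return the negotiated protocol version.

    Raises ssl.SSLCertVerificationError if the certificate cannot be verified,
    or OSError if the host cannot be reached.
    """
    context = make_verified_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.version()
```

Calling `check_tls` with the value you entered as Host (and the port, if it is not 443) either returns a protocol version string or raises an error that indicates whether the problem is network reachability or certificate verification.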

When you have entered all the required information, select Test Connection to check that Cluvio can connect to your Databricks compute cluster. The connection test will report errors if the connection fails. See Troubleshooting for common problems.

Configuration

The Configuration tab of the datasource dialog shows settings that affect the datasource's behavior.

The Data Time Zone defaults to UTC and is the time zone that Cluvio assumes for any timestamps returned from queries that do not contain time zone information. See Data Time Zone for details.
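The effect of this setting can be sketched as follows: a timestamp without time zone information is interpreted in the configured data time zone, while a timestamp that already carries a time zone is left unchanged. The function name `assume_zone` is hypothetical and the code is an illustration of the rule described above, not Cluvio's implementation.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def assume_zone(ts: datetime, data_tz: tzinfo = timezone.utc) -> datetime:
    """Attach data_tz to a naive timestamp; return aware timestamps unchanged."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        # No time zone information: interpret the wall-clock time in data_tz.
        return ts.replace(tzinfo=data_tz)
    return ts


naive = datetime(2024, 1, 1, 12, 0)
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))

print(assume_zone(naive).isoformat())  # interpreted as UTC (the default)
print(assume_zone(aware).isoformat())  # keeps its own +02:00 offset
```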

The Maximum number of concurrent query executions setting controls the maximum concurrency that Cluvio allows for the datasource. You can use this setting to limit the maximum load on your database. The default is 20.
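The general idea of a concurrency cap can be sketched with a semaphore: work beyond the limit waits until a running task finishes. The helper `run_with_limit` is hypothetical, and the tasks are stand-ins for queries; this illustrates the technique only and makes no claim about how Cluvio enforces the limit.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 20  # the default named in the text above


def run_with_limit(tasks, limit=MAX_CONCURRENT):
    """Run callables with at most `limit` executing at once.

    Returns (results in input order, peak number of tasks observed running).
    """
    gate = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    running = 0
    peak = 0

    def guarded(task):
        nonlocal running, peak
        with gate:  # blocks while `limit` tasks are already running
            with lock:
                running += 1
                peak = max(peak, running)
            try:
                return task()
            finally:
                with lock:
                    running -= 1

    # The pool has more workers than the limit, so the semaphore is
    # what actually bounds concurrency.
    with ThreadPoolExecutor(max_workers=limit * 2) as pool:
        results = list(pool.map(guarded, tasks))
    return results, peak


def fake_query(i):
    time.sleep(0.01)  # stand-in for query execution time
    return i


results, peak = run_with_limit([lambda i=i: fake_query(i) for i in range(30)], limit=4)
print(len(results), "tasks finished; peak concurrency was", peak)
```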

The Included Schemas setting configures which database schemas are included in the almanac. The default selection of All includes all current and future schemas in all current and future accessible catalogs. A selection of Catalogs includes current and future schemas from the selected catalogs. A selection of Schemas includes only the selected schemas.
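The three selection modes can be summarized as a small membership rule. The sketch below uses a hypothetical function `is_included`, hypothetical mode strings (`"all"`, `"catalogs"`, `"schemas"`), and made-up catalog and schema names; it mirrors the description above rather than any actual Cluvio code.

```python
def is_included(catalog, schema, mode="all", selected=frozenset()):
    """Decide whether catalog.schema appears in the almanac.

    mode "all":      every schema in every accessible catalog
    mode "catalogs": `selected` holds catalog names; all their schemas match
    mode "schemas":  `selected` holds (catalog, schema) pairs; only those match
    """
    if mode == "all":
        return True
    if mode == "catalogs":
        return catalog in selected
    if mode == "schemas":
        return (catalog, schema) in selected
    raise ValueError(f"unknown mode: {mode!r}")


# A schema added later to a selected catalog is picked up automatically,
# whereas in "schemas" mode it would have to be selected explicitly.
print(is_included("main", "new_schema", mode="catalogs", selected={"main"}))
print(is_included("main", "new_schema", mode="schemas", selected={("main", "sales")}))
```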

The toggle Update schema nightly controls whether Cluvio queries your database schema nightly to ensure that the almanac in the report editor has up-to-date information on your database schema. Together with fetching schema information, Cluvio also tries to retrieve approximate row counts in each table. If you disable nightly schema updates, the almanac is only updated when you manually trigger a schema refresh on the datasource from the Cluvio datasources overview. Nightly schema updates are enabled by default.

The toggle Update exact row counts is only available when Update schema nightly is enabled. This setting controls whether Cluvio will determine exact row counts for every table in your database schema. Row counts are shown in the report editor almanac. Determining exact row counts usually involves issuing a COUNT(*) query on each table, which may cause undesirable load on your database. You can disable this setting to avoid these nightly queries. When disabled, the tables in the report editor almanac may not show row count information if the database does not provide approximate row counts.
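To give a sense of the load involved, the sketch below builds the kind of COUNT(*) statement that an exact row count requires, one per table. The helper names `quote_identifier` and `count_query` and the table names are hypothetical, and the exact statements Cluvio issues are not documented here; the backtick quoting follows the Databricks SQL convention for delimited identifiers.

```python
def quote_identifier(identifier: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + identifier.replace("`", "``") + "`"


def count_query(catalog: str, schema: str, table: str) -> str:
    """Build an exact row count query for one fully qualified table."""
    name = ".".join(quote_identifier(part) for part in (catalog, schema, table))
    return f"SELECT COUNT(*) FROM {name}"


# One such query per table: a schema with many large tables means many scans.
for table in ["orders", "customers"]:
    print(count_query("main", "sales", table))
```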

Troubleshooting

If you need help connecting to your Databricks compute cluster, please contact support@cluvio.com.