Data Sources
SQL-Only
As a SQL-based analytics tool, Cluvio needs to connect to a SQL-capable database or data warehouse in order to run the queries that produce the data for your dashboards. Most commonly, the database used for analytics with Cluvio is either a read-only replica of a primary database, or is a dedicated analytics database that contains data extracted from other sources. In Cluvio, the connection configuration for a database is called a datasource.
In Cluvio, a datasource is queried in-place. This has several benefits:
- There is no required data extraction or data modeling in Cluvio. You can connect to your database and run queries within seconds.
- It is easy to get started, as you can use your existing SQL experience, with only a few Cluvio-specific concepts to learn.
- You have the full expressiveness of SQL at hand - and where things get complicated or impossible in SQL, you can use R to post-process the SQL results.
Nevertheless, there are situations where Cluvio's SQL-only approach may not be ideal:
- Because the analytical queries are expressed in SQL, Cluvio is less suitable in situations where non-technical business users need to create reports themselves (a feature that is usually called self-service BI). We have many customers where brave people from operations managed to learn the basics of SQL and be productive with Cluvio, but this is not always the case.
- You (the analyst) have to do part of what an OLAP BI tool would do: Translate the analytical questions to be answered into SQL - typically involving the joining of multiple tables, together with grouping and aggregate functions. This requires a solid understanding of the relational data model. A simple
SELECT * FROM table
does not typically give answers to interesting questions, unless the table already contains pre-computed data.
3 Connection Options
A datasource in Cluvio is used to configure access to your database, including the connection method, the database user credentials and any additional connection options. Cluvio supports the following 3 connection methods:
-
Cluvio Agents, which you can install and run on your host or laptop/PC to provide easy and secure database connectivity, even to databases on your local computer, without requiring any inbound firewall rules for Cluvio. The agent only needs to be able to establish outbound connections both to Cluvio and to your internal database.
-
SSH Tunnels, whereby you set up an SSH server with a publicly reachable IP address that Cluvio can connect to. Cluvio owns a secret key whose corresponding public key needs to be trusted by your SSH server. As with agents, the SSH tunnel host must have network access to the database you want to connect to. Additionally, the SSH server must allow inbound traffic from a set of static public IP addresses controlled by Cluvio.
-
Direct connections, which can be used when your database is a "cloud-native" database such as BigQuery or has a publicly reachable IP address. In the latter case, analogous to a setup with an SSH server, the database server must allow inbound traffic from a set of static public IP addresses controlled by Cluvio.
Not all connection methods are available with all databases. The Cluvio UI will only offer those connection methods which are supported for the chosen database type.
Connection Security
If you use a Cluvio Agent or SSH tunnel to connect to a database that resides in a private network, your database connections are encrypted and secure by default. The same applies to direct connections to cloud databases like BigQuery, Athena, Snowflake and Databricks.
Using a direct connection to a self-hosted database is not recommended, as Cluvio does not offer full certificate verification with custom certificates used on these connections. As a result, your database connections, even if encrypted, may be subject to man-in-the-middle attacks. The recommended connection method for self-hosted databases, which includes services such as AWS RDS, is a Cluvio Agent running in a private network that has access to the database.
For added safety, unless the ability for your analysts to make changes to the
data via Cluvio is explicitly desired, we also recommend to restrict Cluvio to
read-only access to avoid
inadvertent changes to your data. Otherwise Cluvio can be used by admins and
analysts in your organization to execute any type of SQL query, not only
SELECT
statements.
Database Metadata
When you successfully configure a datasource, Cluvio will query database information schema to learn database metadata such as the tables and columns in the connected database. Cluvio will also obtain table row counts. All of this information is made available in in the almanach of the report editor to help you explore your data.
To keep the information in the almanach current, Cluvio refreshes the database metadata once per day. If the periodic queries for database metadata put unwanted load on your database you may choose to disable this behavior in the datasource configuration, at the cost of potentially working with stale schema information in Cluvio's report editor.
Even if you disable the periodic metadata updates, you can always trigger an explicit refresh of a datasource's database metadata from the context menu of the specific datasource on the datasource overview page.
Query Result Data
The data in your database is only queried by Cluvio via the SQL queries that you use to define reports, dashboards, SQL alerts, and other Cluvio features.
How Cluvio treats your data and how it is kept secure is described in detail on our website: Security & Privacy.
Limits
When Cluvio runs a SQL query against your database, the result size is limited as follows:
- The number of rows is limited to 5,000 on the Free & Pro plans and to 10,000 on the Business plan.
- The result size in bytes: 50 MB
Note that these are constraints on query result data, not the size of your database tables. Analytical queries can run on tables with millions or billions of rows of data but should produce aggregated results that are practical to process and visualize in web browsers.
There are some exceptions to these limits, such as when exporting results as "unlimited" CSV files, which is available on paid plans.
If you find the default plan limits to be insufficient, customer-specific limits can be offered.
Please detail your requirements and use-cases by contacting support@cluvio.com
.
Smart Caching
When you create reports in Cluvio and your SQL queries are run, the results are cached by Cluvio and reused by all objects (reports & filters) with the same SQL query and parameters. On any dashboard, report or SQL-based filter, you can configure your requirement for "data freshness", i.e. the maximum age that is still considered "current" and can be served from the cache without re-running the SQL query. See Dashboard Data Caching.
This smart caching of results provides a significant load reduction for your database or data warehouse, esp. when lots of users are accessing your dashboards. It also makes the dashboards load much faster for your users. In this way, Cluvio can reduce the size and cost requirements on your own database.
Supported Databases
The following databases are supported as data sources:
- Amazon Redshift / Panoply
- PostgreSQL / Heroku Postgres / CrateDB
- MySQL / MariaDB / Amazon Aurora
- MongoDB "MySQL" BI Connector
- Google BigQuery
- Presto
- Trino
- Amazon Athena
- Snowflake
- Microsoft SQL Server
- Oracle
- Vertica
- Databricks
Local Databases
Cluvio is generally used with databases running on servers, whether on cloud providers (AWS, Azure, Google Cloud, ...) or within your own server infrastructure. However, you can also connect Cluvio to a database running on your PC or laptop by installing and running a Cluvio Agent locally. A Cluvio Agent is open-source software that can run on your local computer and connects to Cluvio over HTTPS. Note that the SQL queries run in Cluvio can only succeed if your computer is running and the the Cluvio Agent is running and connected to Cluvio.
Other Data Sources
Cluvio does currently not support working directly with raw data (CSV, Excel, etc.), data from APIs or business software (Salesforce, Zendesk, Xero, etc.) or connecting to non-SQL databases (except for MongoDB via the BI Connector).
To integrate with such data sources, a common approach is to use an ETL tool, such as StitchData, Fivetran, ETLeap, Airbyte or similar. You can set these up to aggregate data from many different sources, loading them into a SQL database or data warehouse. This setup offers a lot of flexibility to evolve and perform your data analytics with different tools, like Cluvio, without a vendor lock-in.