As a SQL-based analytics tool, Cluvio needs to connect to a SQL database or data warehouse in order to run the queries that produce the data for your dashboards. Most commonly, the database used for analytics with Cluvio is either a read-only replica of a primary database, or is a dedicated analytics database that contains data extracted from one or more other sources. In Cluvio, the connection configuration for a database is called a datasource.
The data from a datasource is queried in-place without the need to first perform any extraction or to create data models. This has several benefits:
- There is no extraction step for your data: You can connect and run queries within seconds.
- It is easy to get started, as you can use your existing SQL experience, with only few Cluvio-specific concepts to learn.
- You have the full expressiveness of SQL at hand - and where things get complicated or impossible in SQL, you can use R to further process the SQL results.
Nevertheless, there are situations where Cluvio's SQL-based approach is not ideal:
- Because the analytical queries are expressed in SQL, Cluvio is less suitable for cases where business users need to create analyses themselves (usually called self-service BI). We have many customers where brave people from operations managed to learn the basics of SQL and be productive with Cluvio, but this is not always the case.
- You (the analyst) have to do part of what an OLAP BI tool would do: Translate the analytical questions to be answered into SQL - typically involving the joining of multiple tables, together with grouping and aggregate functions. A simple
SELECT * FROM tabledoes not typically give answers to interesting questions, unless the table already contains pre-computed results that only need visualization.
3 Connection Options
A datasource in Cluvio is used to configure access to your database, including the connection method, the database user credentials and any additional connection options. Cluvio supports the following 3 connection methods:
- Cluvio Agents, which you can install and run on your host or laptop/PC to provide easy and secure database connectivity without requiring any inbound firewall rules for Cluvio. The agent only needs to be able to establish outbound connections both to Cluvio and your internal database.
- SSH Tunnels, whereby you set up an SSH server with a publicly reachable IP address that Cluvio can connect to. Cluvio owns a secret key whose corresponding public key needs to be trusted by your SSH server. As with agents, the SSH tunnel host must have network access to the database you want to connect to, and it must allow inbound traffic from a set of static public IP addresses controlled by Cluvio.
- Direct connections, which can be used when your database is a "cloud-native" database such as BigQuery or has a publicly reachable IP address.
How Cluvio treats your data and how it is kept secure is described in detail on our website: Security & Privacy.
The following databases can be used as data sources:
- Amazon Redshift / Panoply
- PostgreSQL / Heroku Postgres / CrateDB
- MySQL / MariaDB / Amazon Aurora
- MongoDB "MySQL" BI Connector
- Google BigQuery
- Trino / Amazon Athena
- Microsoft SQL Server
Not all connection methods are applicable to all database types. The UI will only offer those connection methods which are available for the chosen database type.
When you successfully configure a datasource, Cluvio will query some database metadata, schema information (tables and columns of the selected database) and table row counts. This information is queried to give you a convenient database schema overview in form of the almanach to begin exploring your data.
The data in your database is only queried by Cluvio via the SQL that you use to define reports, dashboards, SQL alerts, etc. Query result data is cached for ~24 hours to improve load times and scalability of your dashboards for your viewers without overloading your database with redundant queries.
It is usually not a good idea to expose a production database (or any database with real data) to the internet. Apart from direct TLS connections to your database, there are two main approaches that can be employed to ensure that your database access remains secure:
running a Cluvio Agent in front of your database, which establishes outbound connections to Cluvio without the need to explicitly permit inbound traffic from Cluvio server IP addresses and provides an encrypted channel for all query executions and result data between the agent machine and Cluvio servers. See Using Cluvio Agents for details.
running an SSH server in front of your database and connecting your Cluvio datasource to your database through an SSH tunnel. By restricting access on your SSH server to a specific user and Cluvio's SSH public key, only Cluvio servers can access your database through the tunnel as that particular user and using Cluvio's SSH secret key. See Connecting through an SSH tunnel for details. When using direct connections or connections through an SSH tunnel for a Cluvio datasource, your firewall must permit inbound traffic from a particular set of Cluvio IP addresses.
For direct connections or SSH tunnels, Cluvio's query executors connect via one of the following IP addresses, depending on the location of your Cluvio account (found in the Admin Settings). You should further secure the connectivity by adding firewall rules to only allow connections from these IP addresses to the DB or the SSH tunnel.
If you run on AWS, you can add the applicable three IPs to your Security Group inbound rules that guard access to the database (e.g. RDS, Redshift, or your own EC2 instance).
DB user permissions
Additionally, for most use cases, you would want to create a read-only user in the database for use by Cluvio to avoid inadvertent changes to your data.
When we run a SQL query against your database, the result size is limited in:
- The number of rows: 5,000 on the Free & Pro plans and 10,000 on the Business plan (custom limits can be offered for convincing use-cases).
- The result size in bytes: 50 MB
This does not mean that you cannot query against tables with millions or billions of rows of data, just that you need to apply some aggregation to get the actual results to display within a size that would be practical to process and visualize. There are some exceptions to these limits, such as when exporting results as "unlimited" CSV files.
When you create reports in Cluvio and SQL queries are run, the results are cached by Cluvio and reused by all objects (i.e. reports & filters) with the same SQL query. On any of the objects, you can specify requirements for "data freshness", i.e. the maximum age that is still considered "current" and can be served from the cache without re-running the SQL query. See Dashboard Data Caching.
This smart caching of results provides a significant load reduction for your database or data warehouse, esp. when lots of users are accessing your dashboards. It also makes the dashboards load much faster for your users.
Other Data Sources
Cluvio does not support working directly with raw data (CSV, Excel, etc.), data from APIs or business software (Salesforce, Zendesk, Xero, etc.) or connecting to non-SQL databases (except for MongoDB, supported via a BI Connector).
To integrate with such data sources, a common approach is to use an ETL tool, such as StitchData, Fivetran, ETLeap, Airbyte or similar. Setting these up to fetch data from many different sources and load them into a SQL database or data warehouse is typically easy to set up and operate. A setup like this provides a lot of flexibility to evolve and perform your data analytics with different tools (like Cluvio) without a vendor lock-in.