API Overview
dremio-arrow
package is a single module Python API that exposes two methods; a class DremioArrowClient
and a function dremio_query
. DremioArrowClient
class implements flight middlewares and is the gateway to dremio flight server. dremio_query
function is a shorthand to DremioArrowClient
and a very fast way to invoke the client especially when re-use of client parameters is not useful!
Using the client in R environment
While the client is developed in python
, it can be used to interact with dremio flight engine; thanks to reticulate.
From {reticulate}
s official documentation, the reticulate package provides a comprehensive set of tools for interoperability between Python and R. The package includes facilities for:
-
Calling Python from R in a variety of ways including R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively within an R session.
-
Translation between R and Python objects (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).
-
Flexible binding to different versions of Python including virtual environments and Conda environments.
Given that the client outputs a pandas.DataFrame
data, reticulate
can conveniently convert the data to native R data.frame
outside the box (no hacks whatsoever).
Handling python datetime objects
At the moment, conversion between python datetime
and R timestamp
is not well covered in reticulate
. The client provides ts_col
and ts_fmt
extra options during query operation to enable conversion of python datetime
to string counterpart.
Working Environment
This example assumes you have python3
, a virtual environment created and the client installed in the virtual environment. If this is not the case, See Python Downloads for python installation and Client Installation for virtual environment creation and client installation.
Install reticulate
reticulate
is a R
package and thus should be installed in R
environment. Below commands can be executed in a Rmarkdown
, R script
or R interpreter/terminal
.
-
Install required
R
package {reticulate} alongside its dependencies. -
Load
reticulate
package into working environment -
Set path to
python interpreter
. The/path/to/vertual/environment
can be absolute or relative and should not includebin/python
part. The path points to directory containingbin/python
.
Now that we have the environment set, it is time to rea some data from dremio engine!
using class DremioArrowClient
-
First step is to import dremio client package into
R
. This makes the client available in our working environment. -
Initialize the client with dremio flight server connection credentials. These are the credentials you use to access your dremio engine.
-
Here we define SQL query string with data fetch instructions. The table name path must exist, whether as phisical or virtual dataset.
-
We now execute our SQL query against dremio flight server.
-
This step is not necessary but a way to preview the dataset we just fetched. Form here you have your data as a
pandas.DataFrame
and thus can proceed with analysis, processing and reporting bit.
'Cleaner
' Credentials Provision
Dremio flight server connection parameters can be supplied using environment variables. In that case, it is not necessary to supply the credentials when initializing the client.
The environment variables can be defined by either exporting the variables in current terminal session (We can use them only if running R from the terminal), persist them on ~/.Renviron
, ~/.Rprofile
, ~/.profile
(ubuntu) or ~/.zshrc
(Mac).
Write the variables into /.Renviron
.Renviron is a user-controllable file that can be used to create environment variables. This is especially useful to avoid including credentials like API keys inside R scripts. This file is written in a key-value format, so environment variables are created in the format:
Write the variables into /.Rprofile
R will source only one .Rprofile file. So if you have both a project-specific .Rprofile file and a user .Rprofile file that you want to use, you explicitly source the user-level .Rprofile at the top of your project-level .Rprofile with source("~/.Rprofile").
.Rprofile files are sourced as regular R code, so setting environment variables must be done inside a Sys.setenv(key = "value") call.
Define the virables on current terminal session
Execute below commands replacing placeholder texts with actual credential value.
Write the variables into ~/.profile
(ubuntu) or ~/.zshrc
(Mac).
- Copy the enviroment variables into the file replacing placeholder texts with actual credential value.
Once done, exit all running applications and logout or restart the computer for the changes to take effect.
With the environment variables correctly set, we nolonger need to set connection credentials. The client is smart enough to extract the variables from the environmet!
If we had a second query to execute, we would just have reused the client with a different SQL query string.
We didn't create a new client
In this repeat operation, we did not create a new client object, instead we reused the one we earlier created above!
Using function dremio_query
This function is very useful when we are interested in executing a single query and are not sure when a second query might be executed. The method takes authentication credentials and returns data. In essence, this is to mean session bearer token cannot be re-used because we are not using the client directly!
-
First step is to import the client package.
-
Define SQL query string with data fetch instructions.
-
Execute the SQL query string against dremio flight server.
With credentials defined as environment variables, the above is refactired to:
Created: July 4, 2023