Title: | Access the 'Spark Catalog' API via 'sparklyr' |
---|---|
Description: | Gain access to the 'Spark Catalog' API making use of the 'sparklyr' API. 'Catalog' <https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/catalog/Catalog.html> is the interface for managing a metastore (aka metadata catalog) of relational entities (e.g. database(s), tables, functions, table columns and temporary views). |
Authors: | Nathan Eastwood [aut, cre] |
Maintainer: | Nathan Eastwood <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.1 |
Built: | 2025-02-07 03:26:23 UTC |
Source: | https://github.com/nathaneastwood/catalog |
Spark SQL can cache tables using an in-memory columnar format by calling cache_table(). Spark SQL will scan only the required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call uncache_table() to remove a table from memory, or clear_cache() to remove all cached tables from the in-memory cache. Finally, use is_cached() to test whether or not a table is cached.
cache_table(sc, table)
clear_cache(sc)
is_cached(sc, table)
uncache_table(sc, table)
sc | A spark_connection.
table | character(1). The name of the table.
cache_table(): If successful, TRUE, otherwise FALSE.
clear_cache(): NULL, invisibly.
is_cached(): A logical(1) vector indicating TRUE if the table is cached and FALSE otherwise.
uncache_table(): NULL, invisibly.
See also: create_table(), get_table(), list_tables(), refresh_table(), table_exists(), uncache_table()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)

# By default the table is not cached
is_cached(sc = sc, table = "mtcars")

# We can manually cache the table
cache_table(sc = sc, table = "mtcars")
# And now the table is cached
is_cached(sc = sc, table = "mtcars")

# We can uncache the table
uncache_table(sc = sc, table = "mtcars")
is_cached(sc = sc, table = "mtcars")
## End(Not run)
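clear_cache() is not exercised above; a minimal sketch of removing every cached table at once:

## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)
cache_table(sc = sc, table = "mtcars")

# Remove all cached tables from the in-memory cache
clear_cache(sc = sc)

# The table is no longer cached
is_cached(sc = sc, table = "mtcars")
## End(Not run)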
Creates a table in the Hive warehouse from the given path and returns the corresponding DataFrame. The table will contain the contents of the file located at the path parameter.
create_table(sc, table, path, source, ...)
sc | A spark_connection.
table | character(1). The name of the table to create.
path | character(1). The path to the file(s) to load into the table.
source | character(1). The data source type, e.g. "parquet" or "csv".
... | Additional options to be passed to the data source.
The default data source type is parquet. This can be changed using the source parameter, or by setting the configuration option spark.sql.sources.default when creating the Spark session:

config <- sparklyr::spark_config()
config[["spark.sql.sources.default"]] <- "csv"
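A minimal sketch of passing that configuration through when connecting (the local master is illustrative):

sc <- sparklyr::spark_connect(master = "local", config = config)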
A tbl_spark.
See also: cache_table(), get_table(), list_tables(), refresh_table(), table_exists(), uncache_table()
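A minimal sketch, assuming a parquet file exists at the hypothetical path /tmp/mtcars.parquet:

## Not run:
sc <- sparklyr::spark_connect(master = "local")

# Create a table in the Hive warehouse from a parquet file
# (the path below is hypothetical)
mtcars_tbl <- create_table(
  sc = sc,
  table = "mtcars_tbl",
  path = "/tmp/mtcars.parquet",
  source = "parquet"
)
## End(Not run)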
Returns the current database in this session. By default your session will be connected to the database named "default"; to change database you can use set_current_database().
current_database(sc)
sc | A spark_connection.
character(1), the current database name.
See also: set_current_database(), database_exists(), list_databases()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
current_database(sc = sc)
## End(Not run)
Check if the database with the specified name exists. This checks the list of Hive databases in the current session to see if the database exists.
database_exists(sc, name)
sc | A spark_connection.
name | character(1). The name of the database.
A logical(1) vector indicating TRUE if the database exists and FALSE otherwise.
See also: current_database(), set_current_database(), list_databases()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
database_exists(sc = sc, name = "default")
database_exists(sc = sc, name = "fake_database")
## End(Not run)
drop_global_temp_view(): Drops the global temporary view with the given view name in the catalog.

drop_temp_view(): Drops the local temporary view with the given view name in the catalog. A local temporary view is session-scoped: its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It is not tied to any database.
drop_global_temp_view(sc, view)
drop_temp_view(sc, view)
sc | A spark_connection.
view | character(1). The name of the view.
A logical(1) vector indicating whether the temporary view was dropped (TRUE) or not (FALSE).
## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)

# We can check which temporary tables are in scope
list_tables(sc = sc)

# And then drop those we wish to drop
drop_temp_view(sc = sc, view = "mtcars")
## End(Not run)
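A sketch for the global variant, creating the view through the underlying SparkSession first (the view name mtcars_global is illustrative):

## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)

# Create a global temporary view from the local one
sparklyr::invoke(
  sparklyr::spark_session(sc),
  "sql",
  "CREATE GLOBAL TEMPORARY VIEW mtcars_global AS SELECT * FROM mtcars"
)

# Global temporary views are dropped by their unqualified name
drop_global_temp_view(sc = sc, view = "mtcars_global")
## End(Not run)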
Check if the function with the specified name exists in the specified database.
function_exists(sc, fn, database = NULL)
sc | A spark_connection.
fn | character(1). The name of the function.
database | character(1). The name of the database, or NULL (the default).
function_exists() includes built-in functions such as abs. To check whether a built-in function exists you must use its unqualified name and specify the database as NULL. If you create a function yourself you can use its qualified name.
A logical(1) vector indicating TRUE if the function exists within the specified database and FALSE otherwise.
## Not run:
sc <- sparklyr::spark_connect(master = "local")
function_exists(sc = sc, fn = "abs")
## End(Not run)
Get the function with the specified name.
get_function(sc, fn, database = NULL)
sc | A spark_connection.
fn | character(1). The name of the function.
database | character(1). The name of the database, or NULL (the default).
If you are trying to get a built-in function then use the unqualified name and pass NULL as the database name.
A spark_jobj which includes the class name, database, description, whether it is temporary and the name of the function.
See also: function_exists(), list_functions()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
get_function(sc = sc, fn = "Not")
## End(Not run)
Get the table or view with the specified name in the specified database. You can use this to find the table's description, database, type and whether it is a temporary table or not.
get_table(sc, table, database = NULL)
sc | A spark_connection.
table | character(1). The name of the table.
database | character(1). The name of the database, or NULL (the default).
An object of class spark_jobj and shell_jobj.
See also: cache_table(), create_table(), list_tables(), refresh_table(), table_exists(), uncache_table()
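A minimal sketch, following the pattern of the other examples (the temporary view comes from copy_to()):

## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)
get_table(sc = sc, table = "mtcars")
## End(Not run)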
Returns a list of columns for the given table/view in the specified database. The result includes the name, description, data type, whether the column is nullable, whether it is a partition column and whether it is bucketed.
list_columns(sc, table, database = NULL)
sc | A spark_connection.
table | character(1). The name of the table.
database | character(1). The name of the database, or NULL (the default).
A tibble with 6 columns:

name - The name of the column.
description - Description of the column.
dataType - The column data type.
nullable - Whether the column is nullable or not.
isPartition - Whether the column is a partition column or not.
isBucket - Whether the column is bucketed.
## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)
list_columns(sc = sc, table = "mtcars")
## End(Not run)
Returns a list of databases available across all sessions. The result contains the name, description and locationUri of each database.
list_databases(sc)
sc | A spark_connection.
A tibble containing 3 columns:

name - The name of the database.
description - Description of the database.
locationUri - Path (in the form of a URI) to data files.
See also: current_database(), database_exists(), set_current_database()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
list_databases(sc = sc)
## End(Not run)
Returns a list of functions registered in the specified database. This includes all temporary functions. The result contains the class name, database, description, whether it is temporary and the name of each function.
list_functions(sc, database = NULL)
sc | A spark_connection.
database | character(1). The name of the database, or NULL (the default).
A tibble containing 5 columns:

name - Name of the function.
database - Name of the database the function belongs to.
description - Description of the function.
className - The fully qualified class name of the function.
isTemporary - Whether the function is temporary or not.
See also: function_exists(), get_function()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
list_functions(sc = sc)
list_functions(sc = sc, database = "default")
## End(Not run)
Returns a list of tables/views in the current database. The result includes the name, database, description, table type and whether the table is temporary or not.
list_tables(sc, database = NULL)
sc | A spark_connection.
database | character(1). The name of the database, or NULL (the default).
A tibble containing 5 columns:

name - The name of the table.
database - Name of the database the table belongs to.
description - Description of the table.
tableType - The type of table (e.g. view/table).
isTemporary - Whether the table is temporary or not.
See also: cache_table(), create_table(), get_table(), refresh_table(), table_exists(), uncache_table()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)
list_tables(sc = sc)
## End(Not run)
recover_partitions(): Recovers all the partitions in the directory of a table and updates the catalog. This only works for partitioned tables, not un-partitioned tables or views.

refresh_by_path(): Invalidates and refreshes all the cached data (and the associated metadata) for any Dataset that contains the given data source path. Path matching is by prefix, i.e. "/" would invalidate everything that is cached.

refresh_table(): Invalidates and refreshes all the cached data and metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache. If this table is cached as an InMemoryRelation, this drops the original cached version and makes the new version cached lazily.
recover_partitions(sc, table)
refresh_by_path(sc, path)
refresh_table(sc, table)
sc | A spark_connection.
table | character(1). The name of the table.
path | character(1). The path of the data source to refresh.
NULL, invisibly. These functions are mostly called for their side effects.
See also: cache_table(), create_table(), get_table(), list_tables(), table_exists(), uncache_table()
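A minimal sketch of the refresh functions, following the pattern of the other examples (the path passed to refresh_by_path() is hypothetical):

## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)

# Invalidate and refresh the cached data and metadata for the table
refresh_table(sc = sc, table = "mtcars")

# Or refresh everything cached under a (hypothetical) data source path
refresh_by_path(sc = sc, path = "/tmp/data")
## End(Not run)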
Sets the current default database in this session.
set_current_database(sc, name)
sc | A spark_connection.
name | character(1). The name of the database.
If successful, TRUE; otherwise an error is thrown.
See also: current_database(), database_exists(), list_databases()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
set_current_database(sc = sc, name = "new_db")
## End(Not run)
Check if the table or view with the specified name exists in the specified database. This can either be a temporary view or a table/view.
table_exists(sc, table, database = NULL)
sc | A spark_connection.
table | character(1). The name of the table.
database | character(1). The name of the database, or NULL (the default).
If database is NULL, table_exists refers to a table in the current database (see current_database()).
A logical(1) vector indicating TRUE if the table exists within the specified database and FALSE otherwise.
See also: cache_table(), create_table(), get_table(), list_tables(), refresh_table(), uncache_table()
## Not run:
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)
table_exists(sc = sc, table = "mtcars")
## End(Not run)