Package 'sparkavro'

Title: Load Avro file into 'Apache Spark'
Description: Load Avro Files into 'Apache Spark' using 'sparklyr'. This allows reading files in the 'Apache Avro' <https://avro.apache.org/> format into 'Apache Spark'.
Authors: Aki Ariga
Maintainer: Aki Ariga <[email protected]>
License: Apache License 2.0 | file LICENSE
Version: 0.3.0
Built: 2025-02-24 05:19:02 UTC
Source: https://github.com/chezou/sparkavro

Help Index


Reads an Avro File into Apache Spark

Description

Reads an Avro file into Apache Spark using sparklyr.

Usage

spark_read_avro(
  sc,
  name,
  path,
  readOptions = list(),
  repartition = 0L,
  memory = TRUE,
  overwrite = TRUE
)

Arguments

sc

An active spark_connection.

name

The name to assign to the newly generated table.

path

The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols.

readOptions

A list of strings with additional options.

repartition

The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.

memory

Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)

overwrite

Boolean; overwrite the table with the given name if it already exists?

Examples

## Not run: 
## If you don't have a Spark cluster, you can install Spark locally like this
library(sparklyr)
spark_install(version = "2.0.1")

sc <- spark_connect(master = "local")
df <- spark_read_avro(
  sc,
  "twitter",
  system.file("extdata/twitter.avro", package = "sparkavro"),
  repartition = 0L,
  memory = FALSE,
  overwrite = FALSE
)

spark_disconnect(sc)

## End(Not run)
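
The path argument also accepts cluster filesystem URIs. A minimal sketch (the HDFS host, path, and table name below are placeholders, not files shipped with the package):

## Not run: 
## Assumes an active connection `sc` (see the example above).
## Read an Avro file directly from HDFS; the location must be reachable
## from every node in the cluster.
df <- spark_read_avro(
  sc,
  "events",
  "hdfs://namenode:8020/data/events.avro",
  memory = FALSE
)

## End(Not run)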

Write a Spark DataFrame to an Avro file

Description

Serialize a Spark DataFrame to the Avro format.

Usage

spark_write_avro(x, path, mode = NULL, options = list())

Arguments

x

A Spark DataFrame or dplyr operation

path

The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols.

mode

Specifies the behavior when data or table already exists.

options

A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
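
Examples

A minimal usage sketch (the output directory is a placeholder; mode = "overwrite" is one of the standard Spark save modes):

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")

## Copy a small R data frame into Spark, then serialize it as Avro.
iris_tbl <- sdf_copy_to(sc, iris, "iris", overwrite = TRUE)
spark_write_avro(iris_tbl, "/tmp/iris-avro", mode = "overwrite")

spark_disconnect(sc)

## End(Not run)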