Configuration

Default configuration

pyspark-data-mocker configures Spark in a way that optimizes test execution.

>>> from pyspark_data_mocker import DataLakeBuilder
>>> builder = DataLakeBuilder.load_from_dir("./tests/data/basic_datalake")  # byexample: +timeout=20 +pass
>>> spark = builder.spark
>>> spark_conf = spark.conf
>>> spark_conf.get("spark.app.name")
'test'
>>> spark_conf.get("spark.master")  # 1 thread for the execution
'local[1]'
>>> spark_conf.get("spark.sql.warehouse.dir")  # Temporal directory to store the data warehouse
'/tmp/tmp<...>/spark_warehouse'
>>> spark_conf.get("spark.sql.shuffle.partitions")
'1'

>>> spark_conf.get("spark.ui.showConsoleProgress")
'false'

>>> spark_conf.get("spark.ui.enabled")
'false'
>>> spark_conf.get("spark.ui.dagGraph.retainedRootRDDs")
'1'
>>> spark_conf.get("spark.ui.retainedJobs")
'1'
>>> spark_conf.get("spark.ui.retainedStages")
'1'
>>> spark_conf.get("spark.ui.retainedTasks")
'1'
>>> spark_conf.get("spark.sql.ui.retainedExecutions")
'1'
>>> spark_conf.get("spark.worker.ui.retainedExecutors")
'1'
>>> spark_conf.get("spark.worker.ui.retainedDrivers")
'1'

>>> spark_conf.get("spark.sql.catalogImplementation")
'in-memory'

To better understand what these configurations mean and why they are set this way, take a look at Sergey Ivanychev's excellent research on "Faster PySpark Unit Test".
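If you prefer to build an equally lightweight session by hand (for example in a test fixture that does not use this library), a minimal sketch applying the same values shown above could look like the following. None of it is pyspark-data-mocker API; it only mirrors the configuration listed in the doctest.

import tempfile

from pyspark.sql import SparkSession

# Throwaway warehouse directory, mirroring the temporary directory used by default above.
warehouse_dir = tempfile.TemporaryDirectory()

spark = (
    SparkSession.builder
    .appName("test")
    .master("local[1]")                                  # a single thread is enough for tiny test data
    .config("spark.sql.warehouse.dir", f"{warehouse_dir.name}/spark_warehouse")
    .config("spark.sql.shuffle.partitions", "1")         # avoid the default 200 shuffle partitions
    .config("spark.ui.showConsoleProgress", "false")
    .config("spark.ui.enabled", "false")                 # the Spark UI is pure overhead in tests
    .config("spark.ui.dagGraph.retainedRootRDDs", "1")
    .config("spark.ui.retainedJobs", "1")
    .config("spark.ui.retainedStages", "1")
    .config("spark.ui.retainedTasks", "1")
    .config("spark.sql.ui.retainedExecutions", "1")
    .config("spark.worker.ui.retainedExecutors", "1")
    .config("spark.worker.ui.retainedDrivers", "1")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()
)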

Custom configuration

Some of these configurations can be overridden by providing a YAML config file. For example, let's build a custom configuration.

$ echo "
> spark_configuration:
>    app_name: test_complete
>    number_of_cores: 4
>    enable_hive: True
>    warehouse_dir: "/tmp/full_delta_lake"
>    delta_configuration:
>        scala_version: '2.12'
>        delta_version: '2.0.2'
>        snapshot_partitions: 2
>        log_cache_size: 3
> " > /tmp/custom_config.yaml

To use a custom configuration, pass an optional string or pathlib.Path argument to load_from_dir (a pathlib.Path variant is sketched right after this example).

>>> builder = DataLakeBuilder.load_from_dir("./tests/data/basic_datalake", "/tmp/custom_config.yaml")  # byexample: +timeout=20
<...>
>>> from pyspark.sql import SparkSession
>>> spark_conf = SparkSession.builder.getOrCreate().conf
>>> spark_conf.get("spark.app.name")
'test_complete'
>>> spark_conf.get("spark.master")
'local[4]'
>>> spark_conf.get("spark.sql.warehouse.dir")
'/tmp/full_delta_lake/spark_warehouse'

>>> spark_conf.get("spark.jars.packages")
'io.delta:delta-core_2.12:2.0.2'
>>> spark_conf.get("spark.sql.extensions")
'io.delta.sql.DeltaSparkSessionExtension'
>>> spark_conf.get("spark.databricks.delta.snapshotPartitions")
'2'
>>> spark_conf.get("spark.sql.catalog.spark_catalog")
'org.apache.spark.sql.delta.catalog.DeltaCatalog'

>>> spark_conf.get("spark.sql.catalogImplementation")
'hive'

Note that the Spark session now uses 4 CPU cores, the Delta framework is enabled, and the Hive catalog implementation is used.
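The same call also accepts pathlib.Path arguments, as mentioned above; an equivalent (untested) sketch:

from pathlib import Path

from pyspark_data_mocker import DataLakeBuilder

# Same call as above, passing pathlib.Path objects instead of plain strings.
builder = DataLakeBuilder.load_from_dir(
    Path("./tests/data/basic_datalake"),
    Path("/tmp/custom_config.yaml"),
)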

Configuration file explanation

But what do those values represent? Let's take a closer look at the levers we can control in this configuration file.

App configuration

| config name | type | default value | description |
|---|---|---|---|
| schema | SCHEMA_CONFIG | DEFAULT_CONFIG | Schema configuration. You can set a custom yaml file where you define the schema of each table, or let Spark infer it. More info below. |
| disable_spark_configuration | BOOL | False | If set to true, all the Spark optimizations mentioned above are disabled. It is the developer's responsibility to configure Spark as they wish. |
| spark_configuration | SPARK_CONFIG | | A reduced set of levers to modify the Spark configuration, tuned for tests. |
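Putting the top-level keys together, a config file could look like the sketch below. The nesting of the schema options is inferred from the tables in this section, so double-check it against your version of the package.

import pathlib
import textwrap

# Hypothetical config combining the app-level keys described above (sketch only).
config = textwrap.dedent(
    """
    schema:
        infer: True                 # let Spark infer the column types
    spark_configuration:
        app_name: my_app_test
        number_of_cores: 2
    """
)
pathlib.Path("/tmp/app_config.yaml").write_text(config)

# Then load the datalake with it:
# builder = DataLakeBuilder.load_from_dir("./tests/data/basic_datalake", "/tmp/app_config.yaml")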

Schema configuration

Inside the app configuration there is a dedicated section for the schema, where you can set the following options as you please:

| config name | type | default | description |
|---|---|---|---|
| infer | BOOL | false | Enables automatic column type inference |
| config_file | STRING | schema_config.yaml | Name of the config file to read for manual schema definitions |

More about schema inference can be found here.

Spark configuration

Use this parameter if you want pyspark-data-mocker to handle the Spark configuration for you. It abstracts away how the session should be set up, so you can concentrate on writing good tests without worrying too much about performance and fine-tuning.

| config name | type | default value | description |
|---|---|---|---|
| number_of_cores | INTEGER | 1 | Number of CPU cores the Spark session will use |
| enable_hive | BOOL | false | Enables the usage of Apache Hive's catalog |
| warehouse_dir | STRING | tempfile.TemporaryDirectory() | If set, a persistent directory is created where the warehouse will live. By default pyspark_data_mocker uses a TemporaryDirectory that exists only as long as the builder instance exists |
| delta_configuration | DELTA_CONFIG | None | If set, the Delta Lake framework is enabled |
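As a smaller variant of the earlier custom configuration, the sketch below enables Hive and a persistent warehouse without Delta; the key names come from the table above and the values are illustrative only.

import pathlib
import textwrap

# Minimal spark_configuration without Delta: persistent warehouse plus the Hive catalog (sketch).
config = textwrap.dedent(
    """
    spark_configuration:
        app_name: hive_only_test
        number_of_cores: 2
        enable_hive: True
        warehouse_dir: /tmp/persistent_warehouse   # survives after the builder instance is gone
    """
)
pathlib.Path("/tmp/hive_only_config.yaml").write_text(config)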

Delta configuration

Among the things you can change when enabling Delta capabilities are:

| config name | type | description |
|---|---|---|
| scala_version | STRING | Scala version the Spark session will use. It MUST be compatible with the delta-core version used |
| delta_version | STRING | Version of delta-core to use. It depends heavily on the pyspark version |
| snapshot_partitions | INTEGER | Tells Delta how partitioning should be done |
| log_cache_size | INTEGER | Limits the size of the Delta log cache |

Important note: if you enable Delta capabilities, check your pyspark version and configure the matching Scala and Delta versions.

Important note 2: for the Delta configuration, ALL values must be set explicitly; none of them has a default.

Disable spark optimizations

An advanced developer may ask: "why are you hiding how the Spark session is configured? Let me handle it". For that use case, you can disable the automatic Spark configuration by setting disable_spark_configuration to True.

That engineer is then responsible for configuring Spark before using this package, with whatever jars they want. We still recommend sticking to the recommendations discussed here to keep the tests as fast as possible. Keep in mind that the default Spark configuration behaves poorly with small amounts of data, and if you write a considerable number of tests, the pipeline that runs the whole test suite may take forever!
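In practice that means building (and tuning) the session yourself before loading the datalake, together with a config file that sets the flag. A rough sketch follows; how it interacts with an already running session should be verified against your version of the package.

import pathlib
import textwrap

from pyspark.sql import SparkSession
from pyspark_data_mocker import DataLakeBuilder

# Build the session yourself, ideally still with test-friendly values.
spark = (
    SparkSession.builder
    .appName("my_own_session")
    .master("local[1]")
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()
)

# Tell pyspark-data-mocker not to touch the Spark configuration.
pathlib.Path("/tmp/no_spark_config.yaml").write_text(
    textwrap.dedent(
        """
        disable_spark_configuration: True
        """
    )
)

builder = DataLakeBuilder.load_from_dir("./tests/data/basic_datalake", "/tmp/no_spark_config.yaml")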