Configuration
Default configuration
pyspark-data-mocker configures Spark in a way that optimizes test execution.
>>> from pyspark_data_mocker import DataLakeBuilder
>>> builder = DataLakeBuilder.load_from_dir("./tests/data/basic_datalake") # byexample: +timeout=20 +pass
>>> spark = builder.spark
>>> spark_conf = spark.conf
>>> spark_conf.get("spark.app.name")
'test'
>>> spark_conf.get("spark.master") # 1 thread for the execution
'local[1]'
>>> spark_conf.get("spark.sql.warehouse.dir") # Temporal directory to store the data warehouse
'/tmp/tmp<...>/spark_warehouse'
>>> spark_conf.get("spark.sql.shuffle.partitions")
'1'
>>> spark_conf.get("spark.ui.showConsoleProgress")
'false'
>>> spark_conf.get("spark.ui.enabled")
'false'
>>> spark_conf.get("spark.ui.dagGraph.retainedRootRDDs")
'1'
>>> spark_conf.get("spark.ui.retainedJobs")
'1'
>>> spark_conf.get("spark.ui.retainedStages")
'1'
>>> spark_conf.get("spark.ui.retainedTasks")
'1'
>>> spark_conf.get("spark.sql.ui.retainedExecutions")
'1'
>>> spark_conf.get("spark.worker.ui.retainedExecutors")
'1'
>>> spark_conf.get("spark.worker.ui.retainedDrivers")
'1'
>>> spark_conf.get("spark.sql.catalogImplementation")
'in-memory'
To better understand what these configurations mean and why they are set this way, you can take a look at Sergey Ivanychev's excellent research on "Faster PySpark Unit Test".
Custom configuration
Some of these configurations can be overridden by providing a YAML config file. For example, let's build a custom configuration.
$ echo "
> spark_configuration:
> app_name: test_complete
> number_of_cores: 4
> enable_hive: True
> warehouse_dir: "/tmp/full_delta_lake"
> delta_configuration:
> scala_version: '2.12'
> delta_version: '2.0.2'
> snapshot_partitions: 2
> log_cache_size: 3
> " > /tmp/custom_config.yaml
To use a custom configuration, you can pass an optional string or pathlib.Path argument to load_from_dir.
>>> builder = DataLakeBuilder.load_from_dir("./tests/data/basic_datalake", "/tmp/custom_config.yaml") # byexample: +timeout=20
<...>
>>> from pyspark.sql import SparkSession
>>> spark_conf = SparkSession.builder.getOrCreate().conf
>>> spark_conf.get("spark.app.name")
'test_complete'
>>> spark_conf.get("spark.master")
'local[4]'
>>> spark_conf.get("spark.sql.warehouse.dir")
'/tmp/full_delta_lake/spark_warehouse'
>>> spark_conf.get("spark.jars.packages")
'io.delta:delta-core_2.12:2.0.2'
>>> spark_conf.get("spark.sql.extensions")
'io.delta.sql.DeltaSparkSessionExtension'
>>> spark_conf.get("spark.databricks.delta.snapshotPartitions")
'2'
>>> spark_conf.get("spark.sql.catalog.spark_catalog")
'org.apache.spark.sql.delta.catalog.DeltaCatalog'
>>> spark_conf.get("spark.sql.catalogImplementation")
'hive'
Note that the Spark session now uses 4 CPU cores, the Delta framework is enabled, and it uses the hive catalog implementation.
Configuration file explanation
But what do those values represent? Let's take a closer look at the levers we can control in this configuration file.
App configuration
| config name | type | default value | description |
|---|---|---|---|
| schema | SCHEMA_CONFIG | DEFAULT_CONFIG | Schema configuration. You can set a custom yaml where you define the schema of each table, or let Spark infer it. More info below |
| disable_spark_configuration | BOOL | False | If set to true, all the Spark optimizations mentioned before are disabled. It is the developer's responsibility to configure Spark as they wish |
| spark_configuration | SPARK_CONFIG | | A reduced set of levers to modify the Spark configuration, recommended for tests |
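Putting those keys together, an app configuration file follows roughly this shape (a sketch with illustrative values; each nested block is detailed in the sections below):

    schema:
        infer: True
    disable_spark_configuration: False
    spark_configuration:
        app_name: my_tests
        number_of_cores: 1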
Schema configuration
Inside the app configuration there is a special section for the schema. There you can set these options as you please:
| config name | type | default | description |
|---|---|---|---|
| infer | BOOL | false | Enable automatic column type inference |
| config_file | STRING | schema_config.yaml | Config file name to read for manual schema definition |
More about schema inference can be seen here.
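For example, a schema block that disables inference and reads a manual definition could look like this (a sketch; the file name is simply the default from the table above):

    schema:
        infer: False
        config_file: schema_config.yaml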
Spark configuration
This parameter is for when you want to let pyspark-data-mocker handle the Spark configuration for you. It abstracts away how the session should be set up and lets you concentrate on defining good tests, without worrying too much about performance and fine-tuning.
| config name | type | default value | description |
|---|---|---|---|
| number_of_cores | INTEGER | 1 | Amount of CPU cores the Spark session will use |
| enable_hive | BOOL | false | Enables the usage of Apache Hive's catalog |
| warehouse_dir | STRING | tempfile.TemporaryDirectory() | If set, creates a persistent directory where the warehouse will live. By default pyspark_data_mocker uses a TemporaryDirectory that exists as long as the builder instance exists |
| delta_configuration | DELTA_CONFIG | None | If set, enables the Delta Lake framework |
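For instance, a sketch of a Hive-enabled configuration that keeps everything else lightweight and simply omits delta_configuration (values here are illustrative):

    spark_configuration:
        app_name: my_hive_tests
        number_of_cores: 2
        enable_hive: True
        warehouse_dir: /tmp/my_warehouse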
Delta configuration
Among the things you can change when enabling Delta capabilities are:
| config name | type | description |
|---|---|---|
| scala_version | STRING | Version of Scala that the Spark session will use. Take into consideration that the Scala version MUST be compatible with the delta-core version used |
| delta_version | STRING | Version of delta-core used. The version to use depends highly on the pyspark version |
| snapshot_partitions | INTEGER | Tells Delta how the partitions should be done |
| log_cache_size | INTEGER | Limits the Delta log cache |
Important note: if you enable Delta capabilities, check your pyspark version and configure the right Scala and Delta versions for it.
Important note 2: for the Delta configuration, take into account that ALL values must be explicitly set; none of them has a default value.
Disable spark optimizations
An advanced developer may ask: "why are you hiding how the Spark session is configured? Let me handle it". For that user, the automatic Spark configuration can be disabled by setting the value disable_spark_configuration to True.
That engineer is then responsible for configuring Spark before using this package, with whichever jars they want. We still recommend sticking to the settings described here in order to keep the tests as fast as possible. Keep in mind that the default Spark configuration behaves poorly when handling small amounts of data, and if you write a considerable number of tests, the pipeline that runs the whole test suite may take forever!
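As a sketch, such a configuration file can be as small as this (the Spark session, jars included, must then be configured by you before building the datalake):

    disable_spark_configuration: True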