Welcome to “Debugging Apache Spark Application Issues.” After watching this video, you will be able to: identify common Apache Spark application issues, debug issues through the application UI, and describe application log file locations and content.

Running a Spark application on a cluster is a complex process with many working parts and many ways in which an application can fail. Common reasons for application failure on a cluster include user code, system and application configurations, application dependencies that are missing or of an incorrect version, improper resource allocation, and network communication among cluster nodes.

What is user code? User code is made up of the driver program, which runs in the driver process, and the serialized functions and variables that the executors run in parallel. Both the driver and executor processes run the user code of the application passed to the spark-submit script. The user code in the driver creates the SparkContext and creates jobs based on operations applied to DataFrames. These DataFrame operations become serialized closures sent throughout the cluster and run on executor processes as tasks. The serialized closures contain the functions, classes, and variables needed to run each task.

Spark usually terminates immediately when syntax, serialization, data validation, or other user errors occur. Spark reports task errors to the driver and immediately cancels the related executor tasks, terminating the application. For example, closures are cleaned before being serialized, flushing out most issues right away and ensuring that closures can be serialized and executed remotely. However, user code serialized as a closure might not produce an error until after it runs on an executor process. These errors could be due to runtime calculations, network communication issues, or unexpected data issues; the first sketch below shows such a failure. If a task fails due to an error, Spark can attempt to rerun the task for a set number of retries, which is configurable, as the second sketch shows. If all attempts to run the task fail, Spark reports an error to the driver and the application is terminated. The cause of an application failure can usually be found in the driver event log.

A Spark application can have many dependencies, including application files such as Python script files, Java JAR files, and even required data files. Applications also depend on the libraries they use and those libraries' own dependencies. Dependencies must be made available on all nodes of the cluster, either by pre-installing them or by shipping them with the application through arguments to the spark-submit script; the dependency sketch below illustrates this. For example, a task will fail if a required Python library is not installed in the Python environment of the executor process. An even more subtle error can occur if a library is installed with different versions on different executors, which might expose different APIs or produce unexpected results. The best way to identify this type of issue is by examining the event log for stack trace errors that identify which libraries the application loaded.

Application resources, such as CPU cores and memory, can become an issue if a task is in the scheduling queue and the available workers do not have enough resources to run it. As a worker finishes a task, the CPU cores and memory it was using are freed, allowing another task to be scheduled. However, if the application asks for more resources than can ever become available, the tasks might never run and eventually time out. Similarly, suppose that the executors are running long tasks that never finish. In that case, their resources never become available, which also prevents future tasks from running, again resulting in a timeout error. The resource configuration sketch below shows where these requests are set. You can readily access these errors when you view the UI or event logs.
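To make the closure life cycle concrete, here is a minimal PySpark sketch (the file name, app name, and data are hypothetical, not from this video) in which the user code passes closure cleaning and serialization at submission time but fails only once a task runs on an executor:

```python
# driver_example.py: a hypothetical application, submitted with, e.g.:
#   spark-submit driver_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("closure-error-demo").getOrCreate()

# A small DataFrame; the last row will trigger the failure.
df = spark.createDataFrame([(1.0,), (2.0,), (0.0,)], ["value"])

# This closure serializes and ships cleanly, so no error appears when the
# job is created; the ZeroDivisionError is raised only when an executor
# actually runs the task against the 0.0 row.
reciprocal = udf(lambda v: 1.0 / v, DoubleType())

# show() triggers execution; the task error is reported back to the driver.
df.select(reciprocal("value").alias("reciprocal")).show()
```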
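The retry count mentioned above can be tuned. A minimal sketch, assuming the standard `spark.task.maxFailures` property (its documented cluster default is 4; the value 8 here is purely illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("retry-demo")
    # Number of times any single task may fail before Spark gives up,
    # reports the error to the driver, and terminates the application.
    .config("spark.task.maxFailures", "8")
    .getOrCreate()
)
```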
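As one way of shipping dependencies with the application, here is a hedged sketch using PySpark's `addPyFile` and the `spark.jars.packages` property; `helpers.py`, `app.py`, and the Maven coordinate are placeholders, not real artifacts:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dependency-demo")
    # Resolve a JAR dependency from a Maven repository for the driver
    # and all executors (the coordinate shown is a placeholder).
    .config("spark.jars.packages", "org.example:example-lib:1.0.0")
    .getOrCreate()
)

# Ship a local Python module to every node so that closures importing it
# can run inside executor tasks. The command-line equivalent would be:
#   spark-submit --py-files helpers.py app.py
spark.sparkContext.addPyFile("helpers.py")
```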
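Resource requests are part of the application's configuration. A sketch with illustrative sizes follows; if no worker can supply what is requested, the executors are never granted and queued tasks eventually time out:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resource-demo")
    .config("spark.executor.memory", "2g")  # memory per executor process
    .config("spark.executor.cores", "2")    # CPU cores per executor
    .config("spark.cores.max", "8")         # total-core cap (standalone mode)
    .getOrCreate()
)
```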
While the application UI provides a great deal of information, viewing the log files provides details that give more insight into possible causes of an application failure. You'll find the application log files in the Spark installation directory under `work/<application-id>/`, with one log file for `stdout` and one log file for `stderr` output. The application-id is a unique ID that Spark assigns to each application. These log files appear for each executor and driver process that the application runs; a sketch of locating them follows at the end of this video. Additionally, if you are running a Spark standalone cluster, the master and workers both write log files to the `logs/` directory under the Spark installation directory from which they run.

In this video, you learned that: common reasons for application failure on a cluster include user code, system and application configurations, missing dependencies, improper resource allocation, and network communications; and that application log files, located in the Spark installation directory, often show the complete details of a failure.
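To close, here is a minimal sketch of constructing the expected log file paths at run time. It assumes a standalone worker's default `work/` layout under `SPARK_HOME`, with `<executor-id>` standing in for the numbered per-executor subdirectory:

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-location-demo").getOrCreate()

app_id = spark.sparkContext.applicationId     # e.g. app-20240101120000-0000
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")  # placeholder fallback

# Each executor that runs this application writes its output here:
print(os.path.join(spark_home, "work", app_id, "<executor-id>", "stdout"))
print(os.path.join(spark_home, "work", app_id, "<executor-id>", "stderr"))
```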