One of the main discussions developers have during their coffee breaks is about the best programming language and framework for their particular projects. Such discussions are less prominent among data scientists. The general understanding is that R and Python are the languages of the present, and Scala and Julia are the languages of the future. The preferred framework is Apache Spark, a framework we are already using in this course. So let's have a look at the programming language options and at why we have chosen Python for this course.

Apache Spark is a fast and general engine for big data processing with built-in modules for streaming, SQL, machine learning, and graph processing. One thing to notice is that Apache Spark itself is implemented in Scala and runs on top of the Java virtual machine, but fortunately this doesn't limit us to writing Spark applications only in Scala. In fact, there are currently bindings for Java, Scala, Python, and even R. So with Apache Spark, we generally have a choice among the most prominent languages, including R and Python. Let's take a brief look at each option to give you some support for making this decision in your future projects.

Scala is the de facto standard when it comes to Apache Spark. Every Apache Spark API is supported in Scala, and Scala code normally runs faster than the other options. Let's see what our first example looks like when using Scala instead of Python. Remember that the example just invokes basic functions of the RDD API, which makes it a good example for comparing the languages, because it demonstrates using the same API from each of them.

We start with Scala by creating a Jupyter notebook in the Data Science Workbench. Choose any name, select Scala as the programming language, and click on Create Notebook. Now you have a new, empty notebook ready for writing Apache Spark applications in Scala. Once again we create an RDD, a resilient distributed dataset. In Scala you have to use the val keyword, which tells Scala that a constant is being defined. The sc object is used in the same way as it is in Python: sc stands for SparkContext, and it is used to create an RDD from an array or any other supported data source. And here is the main syntactic difference between the two languages: in Scala, generating an array ranging from 0 to 99 looks a bit different than in Python. Let's run the code. Since Spark uses lazy evaluation, this doesn't take long. Now we count the number of elements: rdd.count gives back 100 as the answer, since there are 100 elements in the array. We call rdd.take with a parameter of 10 to take just the first 10 elements, or rdd.collect to copy the whole contents of the RDD to the Spark driver. Until now, everything looks quite similar. The differences become more noticeable when you use external libraries like NumPy, which gives Python powerful matrix and vector operations that are lacking in Apache Spark, but that is beyond the scope of this topic.

So let's have a look at Java. Java is definitely not the primary choice of data scientists because of the overhead of Java syntax, but when using Apache Spark, the same complete set of APIs is available for Java as for Scala. Java is also the de facto standard in enterprise IT, so unless you are in academic research or working for a startup, you will most probably have to use Java at some point or another.
Finally, Java is the programming language of Hadoop, which was the de facto standard for big data processing before Apache Spark came into play. So let's see how to implement our simple example in Java. We are in an Eclipse environment. For our Java application, the first thing to do is to create a new class. Within the class, we create a Spark configuration, which can be used to create a Spark context. So now let's create the Spark context out of the Spark configuration object, using a JavaSparkContext, which implements the Spark context interface. Now we are ready to create an RDD. Java is strongly typed, so we have to declare the type of the RDD as well as the type of the contents of the RDD we intend to create. Now it's time to create an array containing the integers from 0 to 99, but unfortunately there is no way to do this inline in Java. Therefore, we create an empty list and use a loop to fill it with the integers from 0 to 99. The RDD is of type Integer and is now ready to be used. Let's start with the count call, then get the first 10 records, and conclude with a call to the collect method to copy the whole contents of the RDD back to the driver's Java virtual machine. Note that in Java, the return values of method calls are not automatically printed to standard output as they are in Jupyter notebooks. Therefore, we have to add an additional command to achieve this. Let's do this for all three calls. Now we are ready to run this class on an Apache Spark cluster.

R is THE data science programming language, but only a subset of the Apache Spark API is available in R. R closely tracks the newest academic research, and academic researchers are basically the main contributors of its more than 8,000 add-on packages. R has awesome plotting and charting libraries, which are simply outstanding. But R is one of the slowest programming languages I have ever seen. As long as you are using R only to execute computations on Apache Spark, this won't be a problem, but as soon as you mix and match local and parallel computations, you will notice the limitations of the language. Once again, let's create a new notebook in the IBM Data Science Experience tool. Provide a name for the notebook, select R as the programming language, and finally click on Create Notebook. Let's again create an RDD from an array ranging from 0 to 99. After running the application, let's count the number of elements and take the first 10 RDD elements using R.

You have already seen the Python example in the IBM Data Science Workbench. Let me highlight a few more reasons why we have chosen Python for this course and why it is a preferred language for data scientists. Python is nearly as widely used in data science as R, but from a developer's point of view, Python is much more common, and in case you know neither Python nor R, you will have an easier time learning Python than R. The same holds when comparing Python with Scala and Java, which have the additional disadvantage of being less widely used among data scientists. Again, not all Apache Spark APIs have bindings for Python, but for this course, this will not limit us. Python has a very nice plotting library called matplotlib, which we are going to use, although it can't compete with the plotting capabilities you find in R. Finally, Python is an interpreted language, which can get slow at times, especially when used in conjunction with Apache Spark, since a lot of inter-process memory copying takes place.
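For comparison with the Scala, Java, and R walkthroughs above, here is roughly what the same example looks like in PySpark, the language used in this course. This is only a minimal sketch, not code taken from the lecture, and it assumes a notebook in which the SparkContext is already available as sc:

```python
# Minimal PySpark sketch of the same example (not taken from the lecture),
# assuming a SparkContext is already available as `sc`, as in the course notebooks.

rdd = sc.parallelize(range(100))  # create an RDD holding the integers 0..99

print(rdd.count())    # 100 -- the number of elements in the RDD
print(rdd.take(10))   # the first 10 elements: [0, 1, ..., 9]
print(rdd.collect())  # copy the whole contents of the RDD back to the driver
```

The calls mirror the Scala version almost one-to-one; the main syntactic difference is simply how the range of integers is generated.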
This decision matrix summarizes some of the key considerations and how they are addressed by each language, in order to help you decide on the best programming language for your own projects. Scala and Java have complete API support in Apache Spark. So if you want, for example, to use the graph processing engine GraphX, there is no way of using it from R or Python. In contrast to R and Python, Scala and Java are more complex to learn. As you have seen, Java is a very verbose language, and in my opinion you should only use it if there is no other choice. Python and R, on the other hand, are very easy to learn and use. This is especially true for Python, but R can also be learned in a very short time frame. Java and Scala usually perform better than R and Python; Scala is one of the fastest languages and is only surpassed by C and C++.

When it comes to third-party libraries that support data science tasks, the ecosystem on the Java virtual machine is still limited, although it is catching up. For example, ND4J and ND4S provide the functionality of the famous Python NumPy library, but for Java and Scala. R and Python, in contrast, are very rich in libraries that support data scientists. In Python, the famous pandas, NumPy, and SciPy libraries are commonly used among data scientists. But unfortunately, none of those libraries are parallelized, so unless you are using the Apache Spark API, you are only running on a single machine (see the sketch at the end of this section). So in the next video, we'll get our hands dirty and actually learn how to compute on RDDs using the RDD functional programming API in order to create distributed parallel data processing jobs.
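To make the point about non-parallelized libraries concrete, here is a small, hypothetical sketch contrasting a local NumPy computation, which runs only on the driver machine, with the same computation expressed through the Spark RDD API, which is distributed across the cluster. As before, it assumes a PySpark notebook in which sc is already defined:

```python
import numpy as np

# Local computation: NumPy runs entirely on the driver, i.e. on a single machine.
local_data = np.arange(100)
print(local_data.sum())   # 4950, computed locally

# The same computation expressed through the Spark RDD API: the data and the
# summation are distributed across the executors of the cluster.
rdd = sc.parallelize(range(100))
print(rdd.sum())          # 4950, computed in parallel on the cluster
```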