Install Hadoop with Spark and the Scala Programming Language

Introduction to Hadoop, Spark, and Scala

The open-source Apache Hadoop framework is used to process huge datasets across cluster nodes. At its heart is the Hadoop Distributed File System (HDFS). Because Hadoop is built in Java around a simple programming model, it scales smoothly from a single machine to thousands of nodes.

Apache Spark is an analytics engine that runs on top of Hadoop. It gives large dataset processing what it needs: a way to handle and distribute heavy workloads reliably and quickly, plus easy application building.

Scala, short for "Scalable Language," is a functional, statically typed programming language that runs on the Java Virtual Machine (JVM). Apache Spark, which is written in Scala, is highly proficient at processing large datasets.

This tutorial explains how to set up the big three essentials for large dataset processing: Hadoop, Spark, and Scala.

NOTE: Proceed with this tutorial only if your operating system is UNIX-based, such as Linux or macOS. The steps given here for setting up Hadoop, Spark, and Scala don't apply to Windows systems.

Scala, Java, and the Spark framework

There are benefits to using Scala with Spark. Scala favors functional programming over a purely object-oriented coding style, a direction that can make it outperform equivalent Java code. The Spark framework is written in Scala and runs on top of Hadoop.

OpenJDK is another perk of using Scala. Unlike Oracle JDK, OpenJDK is free, so developers bypass the added expense of subscription or license fees.

Prerequisites to using Spark

  • Install Visual Studio Code as your development environment.

  • Install the Scala Syntax extension in Visual Studio Code for Scala language support (a command-line install option appears after this list).

  • Install Eclipse or another IDE.

NOTE: If you select an IDE other than Eclipse, make sure it supports both Scala and Java syntax.
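If you prefer the command line, the VS Code CLI can install the extension directly. This is a minimal sketch; the extension ID scala-lang.scala is an assumption, so confirm it in the Visual Studio Code Marketplace first:

code --install-extension scala-lang.scala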

Screenshot of the Scala Syntax extension for Visual Studio Code

Install Java before using Spark

  • Install Java, and then make sure it runs.

  • Verify the installation with this command:

java -version

NOTE: Linux users, the best way to install Java is the default-jdk package (OpenJDK) from your distro's package manager and repository.
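For example, on a Debian-based distro (an assumption; substitute your distro's package manager), the commands look like this:

sudo apt-get update
sudo apt-get install default-jdk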

Installing Java on macOS with Homebrew

  • If you're installing Java on macOS, use Homebrew with this command:
brew cask install java
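NOTE: On newer versions of Homebrew, brew cask install has been replaced by brew install --cask. The exact cask name may have changed as well, so treat this as a sketch and check brew search java first:

brew install --cask java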

Install the Hadoop cluster

Install Hadoop on your cluster's primary node before installing Scala or Spark.

Install Hadoop on macOS X using Homebrew

  • macOS users, install Hadoop with Homebrew using the brew command:
brew install hadoop
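To verify the Homebrew installation, print the version (the same check used in the Linux steps later in this tutorial):

hadoop version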

Install Hadoop on Linux

  • Linux users, Apache hosts archived versions of Hadoop for your OS. Download one with the wget command, then extract the archive by passing the downloaded filename to the tar command. Here's the code:
cd ~/Downloads/
wget http://apache.claz.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xzf hadoop-3.2.1.tar.gz
  • When you’re finished with the tar extraction process, change the name of the folder.

  • Use sudo privileges with the mv command to move the folder into the /usr/local directory:

mv hadoop-3.2.1 hadoop
sudo mv hadoop /usr/local/hadoop
Set the path for Hadoop in the ~/.bashrc file
  • Use a text editor such as gedit, nano, vim, or Sublime Text to add the Hadoop path exports to the ~/.bashrc script. The command for Sublime Text is subl ~/.bashrc. Otherwise, use one of the other text editors mentioned to continue the Hadoop, Spark, and Scala installation.

  • Add the export commands:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin/
export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
  • Save the file with the edits, and then use the source command to reload it:
source ~/.bashrc
  • From a terminal window, confirm the addition to the $PATH variable with the echo $PATH command.

  • Alternatively, input the command below to get the Hadoop version number:

hadoop version
  • Find the root path of your Hadoop HDFS with this command:
hadoop fs -ls

NOTE: If errors were returned when you checked the path, try exporting the Hadoop configuration directory (HADOOP_CONF_DIR) again.

Fixing the command not found error in Hadoop

  • Fix a hadoop: command not found error by verifying the directory path. Check system paths such as /usr/bin or /usr/local/.

  • Next, confirm that the export commands in your bash profile point to that system path.

  • Finally, reload the bash profile with the source command, as shown below.
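Putting those checks together, here is a minimal sketch that assumes the paths used earlier in this tutorial:

which hadoop      # shows where the shell resolves the command, if anywhere
echo $PATH        # confirm /usr/local/hadoop/bin appears in the output
source ~/.bashrc  # reload the bash profile after any edits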

Screenshot of Linux terminal using wget to download Hadoop and tar extract

Install the Scala programming language

  • If you’re using a macOS, use the sbt compiler to install Scala or read on to learn how to use Homebrew command.

  • On Linux distros, apt-get installs the default stable Scala release with this code:

sudo apt-get install scala

NOTE: Always check that you have the latest stable version of Scala. It's listed on the Scala website (scala-lang.org).

  • As promised, here’s the Scala installation via the Homebrew method for macOS users:
brew install scala
  • From a terminal window, start Scala with the scala command, and then get the version information:
util.Properties.versionString
  • To leave the Scala interface and get back to a command prompt, type :q.

Screenshot of Scala interface in a terminal window getting version

Troubleshooting compiling and execution in Scala

  • Troubleshoot Linux compiling errors by uninstalling scala and the default scala-library. The problem could be a compatibility issue, so reinstall a different version of Scala to continue moving forward with the Hadoop, Spark, and Scala installation steps.

Uninstall Scala from Debian Linux

  • Repair Debian Linux distros by reinstalling Scala. First, uninstall scala and the scala-library with the sudo command:
sudo apt remove scala-library scala
sudo apt-get remove scala
sudo apt-get remove --auto-remove scala
sudo apt-get purge scala

Install a different version of Scala

  • Get a more compatible version of Scala with the wget command:
wget https://www.scala-lang.org/files/archive/scala-2.13.1.deb
  • Lastly, install the .deb Debian package with the dpkg command:
sudo dpkg -i scala-2.13.1.deb
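Verify the result with a quick version check (note that some Scala versions print this output to stderr):

scala -version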

Install Spark

  • Use Homebrew for the Spark installation if you have a macOS:
brew install apache-spark
  • Linux systems must download a pre-built Spark package and set it up manually. See the next section to learn how.

Download and set up Spark on Linux

  • If you run Hadoop version 2.7 or later, download the Spark 3.0 preview version.

  • For all other versions of Hadoop, get the stable Spark release. An example download command appears below.
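For example, the preview release can be fetched from the Apache archive with wget. This URL is an assumption based on the archive's layout, so check spark.apache.org/downloads for the current link:

cd ~/Downloads/
wget https://archive.apache.org/dist/spark/spark-3.0.0-preview/spark-3.0.0-preview-bin-hadoop3.2.tgz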

  • Next, from a terminal window, use tar xvf to extract the files:

tar xvf spark-3.0.0-preview-bin-hadoop3.2.tgz
  • Rename the extracted folder and move it into the Linux /opt directory with a single mv command:
sudo mv spark-3.0.0-preview-bin-hadoop3.2 /opt/spark

NOTE: You must have elevated sudo privileges to run the command above because of permission restrictions on the /opt directory.

  • Modify the bash profile script once more by adding these lines:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
  • Save the file, and then reload it with the source ~/.bashrc command.

Use the Spark Shell

  • From a terminal window, access the Spark shell:
spark-shell
  • Within the interface, import the Apache Spark libraries:
import org.apache.spark.{SparkConf, SparkContext}
SparkContext
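  • Optionally, run a quick sanity check before you exit. This is a minimal sketch that uses the sc SparkContext that spark-shell creates automatically:
val data = sc.parallelize(1 to 100)  // distribute a small range as an RDD
val sum = data.reduce(_ + _)         // aggregate the values across partitions
println(s"Sum: $sum")                // expect 5050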
  • Exit with the command :q.

Screenshot of Spark importing libraries scala spark-shell

Conclusion

Hadoop, Spark, and Scala form a set of large dataset processing fundamentals that benefit developers. Application development moves fast when all three are used together, and when datasets are huge, processing speed matters. Now is a good time to install Hadoop with Spark and the Scala programming language; when you do, increased productivity is within reach.
