Install Hadoop with Spark and the Scala Programming Language
Introduction to Hadoop, Spark, and Scala
The open-source Apache Hadoop framework is used to process huge datasets across cluster nodes. At its heart is the Hadoop Distributed File System (HDFS). Because Hadoop is built in Java, it works with simple programming models and scales from a single server to thousands of machines.
Apache Spark is an analytics engine that runs on top of Hadoop. Processing large datasets requires a reliable way to distribute heavy workloads quickly and to build applications with ease, and Spark provides both.
Scala (short for Scalable Language) is an object-oriented, functional, statically typed programming language that runs on the Java Virtual Machine (JVM). Apache Spark, which is written in Scala, is highly proficient at processing large datasets.
This tutorial explains how to set up the big three essentials for large dataset processing: Hadoop, Spark, and Scala.
NOTE: Proceed with this tutorial only if your operating system is UNIX-based. For example, if you use Linux or macOS, these instructions will work for you. However, the steps given here for setting up Hadoop, Spark, and Scala don’t apply to Windows systems.
Scala, Java, and the Spark framework
There are benefits to using Scala with the Spark framework. Scala favors functional programming over purely object-oriented coding, which often makes it more concise and expressive than Java. Spark itself is written in Scala, and Hadoop is written in Java, so both run on the same JVM ecosystem.
OpenJDK is another perk. Unlike the Oracle JDK, OpenJDK is free, so developers bypass the added expense of paying fees for subscriptions or licenses.
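For a quick taste of why this matters, here is a minimal, illustrative Scala snippet (the names are hypothetical) showing both the concise functional style and direct access to Java APIs from the JVM:

// Functional style: transform a collection without an explicit loop
val squares = (1 to 5).map(n => n * n)  // Vector(1, 4, 9, 16, 25)

// JVM interop: Java standard library calls work directly in Scala
println(System.getProperty("java.version"))

The same transformation in Java would typically need an explicit loop or a more verbose stream pipeline.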
Prerequisites to using Spark
- Install Visual Studio Code as your development environment.
- Install the Scala Syntax extension from the Visual Studio Code marketplace to add Scala language support.
- Alternatively, install Eclipse or another IDE.
NOTE: If you select an IDE other than Eclipse, make sure it supports both Scala and Java syntax.
Install Java before using Spark
Install and run Java.
Verify the installation with this command:
java -version
NOTE: Linux users, the package manager and repository for your distro is the best way to install Java; install the default-jdk package, which provides OpenJDK.
Installing Java on macOS with Homebrew
- If you’re installing Java on macOS, use Homebrew with this command:
brew cask install java
NOTE: Newer releases of Homebrew have retired the cask subcommand; there, the equivalent syntax is brew install --cask <cask-name>.
Install the Hadoop cluster
Install Hadoop on the primary node of your cluster before installing Scala or Spark.
Install Hadoop on macOS using Homebrew
- macOS users, install Hadoop with the Homebrew brew command:
brew install hadoop
Install Hadoop on Linux
- Linux users, Apache keeps archived versions of Hadoop for your OS. Download one with the wget command. Next, extract the archive by running tar xzf followed by the filename of the archive you downloaded. Here’s the code:
cd ~/Downloads/
wget http://apache.claz.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xzf hadoop-3.2.1.tar.gz
- When you’re finished with the tar extraction process, change the name of the folder.
- Use sudo privileges with the mv command to put the folder in the /usr/local directory:
mv hadoop-3.2.1 hadoop
sudo mv hadoop /usr/local/hadoop
Set the path for Hadoop in the ~/.bashrc file
- Use a text editor such as gedit, nano, or vim, or try the Sublime text editor, to add the Hadoop path exports to the ~/.bashrc script. The command for Sublime is subl ~/.bashrc. Otherwise, use one of the other text editors mentioned to continue the installation process for Hadoop, Spark, and Scala.
- Add the export commands:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin/
export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
- Save the file with the edits, and then reload it with the source command:
source ~/.bashrc
- From a terminal window, confirm the $PATH variable addition with the echo $PATH command.
- Alternatively, run the command below to get the Hadoop version number:
hadoop version
- The root path of your Hadoop HDFS can be listed with this command:
hadoop fs -ls
NOTE: If errors were returned when you checked the path, try exporting the Hadoop configuration directory (HADOOP_CONF_DIR) again.
Fixing the command not found error in Hadoop
- Fix a hadoop: command not found error by verifying the path of the directory. Check system paths such as /usr/bin or /usr/local/.
- Next, confirm that the export commands located in your bash profile point to the system path.
- Finally, use the source command to reload the bash profile.
Install the Scala programming language
If you’re using macOS, you can use the sbt open-source build tool to install Scala, or read on to learn the Homebrew method.
- On Linux distros, the apt-get repository gives you the default stable Scala release with this code:
sudo apt-get install scala
NOTE: Always check that you have the latest stable version of Scala. It’s listed on the Scala website (scala-lang.org).
- As promised, here’s the Scala installation via the Homebrew method for macOS users:
brew install scala
- From a terminal window, enter the Scala REPL with the scala command, and then get the version information:
util.Properties.versionString
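The REPL responds with the version as a string. The exact output depends on the release you installed; for Scala 2.13.1 it looks similar to this:

res0: String = version 2.13.1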
- To leave the Scala interface and get back to a command prompt, type :q.
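If you want a quick sanity check before quitting, run a short snippet in the REPL to confirm that Scala compiles and evaluates code. A minimal sketch:

// Define and call a simple function in the REPL
def greet(name: String): String = s"Hello, $name"
println(greet("Scala"))  // prints: Hello, Scala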
Troubleshooting compiling and execution in Scala
- Troubleshoot Linux compiling errors by uninstalling the default scala and scala-library packages. The problem could be a compatibility issue, so try installing a different version of Scala to continue moving forward with the installation steps for Hadoop, Spark, and Scala.
Uninstall Scala from Debian Linux
- Repair Debian Linux distros by reinstalling Scala. First, uninstall the scala and scala-library packages using apt with the sudo command:
sudo apt remove scala-library scala
sudo apt-get remove scala
sudo apt-get remove --auto-remove scala
sudo apt-get purge scala
Install a different version of Scala
- Get a more compatible version of Scala with the wget command:
wget https://www.scala-lang.org/files/archive/scala-2.13.1.deb
- Lastly, install the .deb Debian package with the dpkg command:
sudo dpkg -i scala-2.13.1.deb
Install Spark
- Use Homebrew for the Spark installation if you have macOS:
brew install apache-spark
- Linux systems must download Spark and set it up manually. See the next section to learn how.
Build Spark on Linux
- For Hadoop versions 2.7 or later, download the Spark 3.0 preview version. For all other versions of Hadoop, get the stable version.
- Next, from a terminal window, use tar xvf to extract the files:
tar xvf spark-3.0.0-preview-bin-hadoop3.2.tgz
- Rename the extracted folder, and then put it in the /opt directory in Linux using the mv command:
sudo mv spark-3.0.0-preview-bin-hadoop3.2 /opt/spark
NOTE: You must have elevated sudo privileges to use the command above because of the permission restrictions on the /opt directory.
- Modify the bash profile script once more by adding these lines:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
- Save the file, and then reload it with source ~/.bashrc.
Use the Spark Shell
- From a terminal window, access the Spark shell:
spark-shell
- Within the shell, add the Apache Spark libraries you need with import statements, for example:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
- Exit with the command :q.
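Before exiting, you can also run a small job to confirm Spark works end to end. A minimal sketch, using the spark session that spark-shell creates automatically:

// Distribute a small collection as an RDD and sum it
val nums = spark.sparkContext.parallelize(1 to 100)
println(nums.sum())  // expected output: 5050.0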
Conclusion
Hadoop, Spark, and Scala form a set of large dataset processing fundamentals that benefit developers. Application development goes faster when all three are used together, and it’s no secret that when datasets are huge, processing speed matters. All things considered, now’s the time to install Hadoop with Spark and the Scala programming language. When you do, increased productivity is within your reach.