BigSnarf blog

Infosec FTW

Running Apache Spark EMR and EC2 scripts on AWS with read write S3

Video Demo of Spark on EMR

Other posts I wrote while learning EMR

https://bigsnarf.wordpress.com/2014/10/22/process-logs-with-kinesis-s3-apache-spark-on-emr-amazon-rds/

https://bigsnarf.wordpress.com/2015/01/05/apache-spark-1-0-0-emr-via-command-line/

Script to launch your own cluster on EC2

wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop1.tgz
tar zxvf spark-1.2.0-bin-hadoop1.tgz
cd spark-1.2.0-bin-hadoop1
cd ec2/
export AWS_ACCESS_KEY_ID=ASDFOA4L1234ASDF4A
export AWS_SECRET_ACCESS_KEY=ZEASDF9087ASDF987987ASDF987IlX
./spark-ec2 -i /Users/homeFolder/spark-east.pem -k spark-east -t m1.small --copy-aws-credentials launch test-cluster
./spark-ec2 -i /Users/homeFolder/spark-east.pem -k spark-east login test-cluster
./spark-ec2 -i /Users/homeFolder/spark-east.pem -k spark-east destroy test-cluster
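The launch, login, and destroy calls differ only in the action argument, so a small wrapper keeps the key path, key name, and cluster name in one place. A minimal sketch (the `spark_ec2` helper name and the echo-only dry run are my own additions, not part of the spark-ec2 tool; drop the `echo` to actually run the commands):

```shell
# Hypothetical wrapper around the three spark-ec2 calls above; it only
# prints each command so it can be reviewed before execution.
KEY_FILE=/Users/homeFolder/spark-east.pem
KEY_NAME=spark-east
INSTANCE_TYPE=m1.small
CLUSTER=test-cluster

spark_ec2() {
  # $1 is the spark-ec2 action: launch, login, or destroy
  echo ./spark-ec2 -i "$KEY_FILE" -k "$KEY_NAME" -t "$INSTANCE_TYPE" "$1" "$CLUSTER"
}

spark_ec2 launch
spark_ec2 login
spark_ec2 destroy
```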


Spark Cluster Build Output for EC2

~/spark-1.2.0-bin-hadoop1/ec2 $./spark-ec2 -k spark-east -i /Users/home/spark-east.pem -t m1.small -z us-east-1a launch test-cluster
Setting up security groups…
Searching for existing cluster test-cluster…
Spark AMI: ami-5bb18832
Launching instances…
Launched 1 slaves in us-east-1e, regid = 1981cd
Launched master in us-east-1e, regid = 66fe1d
Waiting for all instances in cluster to enter 'ssh-ready' state……………
Generating cluster's SSH key on master…
Warning: Permanently added 'ec2-5-1-1-64.compute-1.amazonaws.com' (RSA) to the list of known hosts.
Connection to ec2-5-1-1-64.compute-1.amazonaws.com closed.
Transferring cluster's SSH key to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
Warning: Permanently added 'ec2-5.1.1.2.compute-1.amazonaws.com' (RSA) to the list of known hosts.
Cloning into 'spark-ec2'…
remote: Counting objects: 1698, done.
remote: Compressing objects: 100% (805/805), done.
remote: Total 1698 (delta 607), reused 1698 (delta 607)
Receiving objects: 100% (1698/1698), 273.67 KiB, done.
Resolving deltas: 100% (607/607), done.
Connection to ec2-5-1-1-64.compute-1.amazonaws.com closed.
Deploying files to master…
building file list … done
root/spark-ec2/ec2-variables.sh
sent 1594 bytes received 42 bytes 1090.67 bytes/sec
total size is 1453 speedup is 0.89
Running setup on master…
Connection to ec2-5-1-1-64.compute-1.amazonaws.com closed.
Setting up Spark on ip-1.1.1.2.ec2.internal…
Setting executable permissions on scripts…
Running setup-slave on master to mount filesystems, etc…
Setting up slave on ip-1.1.1.2.ec2.internal…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 18.8494 s, 57.0 MB/s
mkswap: /mnt/swap: warning: don't erase bootbits sectors
on whole disk. Use -f to force.
Setting up swapspace version 1, size = 1048572 KiB
no label, UUID=4396-ba35-fcf6f1a66819
Added 1024 MB swap file /mnt/swap
SSH'ing to master machine(s) to approve key(s)…
ec2-5-1-1-64.compute-1.amazonaws.com
Warning: Permanently added 'ec2-5-1-1-64.compute-1.amazonaws.com,1.3.3.7' (ECDSA) to the list of known hosts.
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ip-1.1.1.2.ec2.internal' (ECDSA) to the list of known hosts.
SSH'ing to other cluster nodes to approve keys…
ec2-5.1.1.2.compute-1.amazonaws.com
Warning: Permanently added 'ec2-5.1.1.2.compute-1.amazonaws.com,10.2.3.4' (ECDSA) to the list of known hosts.
RSYNC'ing /root/spark-ec2 to other cluster nodes…
ec2-5.1.1.2.compute-1.amazonaws.com
id_rsa 100% 1679 1.6KB/s 00:00
Running slave setup script on other cluster nodes…
ec2-5.1.1.2.compute-1.amazonaws.com
Setting up slave on ip-1-3-4-2.ec2.internal…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 18.6761 s, 57.5 MB/s
mkswap: /mnt/swap: warning: don't erase bootbits sectors
on whole disk. Use -f to force.
Setting up swapspace version 1, size = 1048572 KiB
no label, UUID=4959-a071-e5a02fe4bfee
Added 1024 MB swap file /mnt/swap
Connection to ec2-5.1.1.2.compute-1.amazonaws.com closed.
Initializing scala
~ ~/spark-ec2
Unpacking Scala
--2015-01-20 19:39:48--  http://s3.amazonaws.com/spark-related-packages/scala-2.10.3.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)… 1.2.3.4
Connecting to s3.amazonaws.com (s3.amazonaws.com)|1.2.3.4|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 30531249 (29M) [application/x-compressed]
Saving to: ‘scala-2.10.3.tgz’
100%[==================================================================================================================================>] 30,531,249 3.20MB/s in 9.8s
2015-01-20 19:39:58 (2.99 MB/s) - ‘scala-2.10.3.tgz’ saved [30531249/30531249]
~/spark-ec2
Initializing spark
~ ~/spark-ec2
--2015-01-20 19:39:59--  http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)… 1.2.3.4
Connecting to s3.amazonaws.com (s3.amazonaws.com)|1.2.3.4|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 203977086 (195M) [application/x-compressed]
Saving to: ‘spark-1.2.0-bin-hadoop1.tgz’
100%[==================================================================================================================================>] 203,977,086 38.2MB/s in 4.8s
2015-01-20 19:40:04 (40.4 MB/s) - ‘spark-1.2.0-bin-hadoop1.tgz’ saved [203977086/203977086]
Unpacking Spark
~/spark-ec2
Initializing shark
~ ~/spark-ec2
ERROR: Unknown Shark version
~/spark-ec2
Initializing ephemeral-hdfs
~ ~/spark-ec2
--2015-01-20 19:40:14--  http://s3.amazonaws.com/spark-related-packages/hadoop-1.0.4.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)… 1.2.3.4
Connecting to s3.amazonaws.com (s3.amazonaws.com)|1.2.3.4|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 62793050 (60M) [application/x-gzip]
Saving to: ‘hadoop-1.0.4.tar.gz’
100%[==================================================================================================================================>] 62,793,050 41.4MB/s in 1.4s
2015-01-20 19:40:15 (41.4 MB/s) - ‘hadoop-1.0.4.tar.gz’ saved [62793050/62793050]
Unpacking Hadoop
RSYNC'ing /root/ephemeral-hdfs to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
~/spark-ec2
Initializing persistent-hdfs
~ ~/spark-ec2
--2015-01-20 19:40:49--  http://s3.amazonaws.com/spark-related-packages/hadoop-1.0.4.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)… 1.2.3.4
Connecting to s3.amazonaws.com (s3.amazonaws.com)|1.2.3.4|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 62793050 (60M) [application/x-gzip]
Saving to: ‘hadoop-1.0.4.tar.gz’
100%[==================================================================================================================================>] 62,793,050 44.8MB/s in 1.3s
2015-01-20 19:40:51 (44.8 MB/s) - ‘hadoop-1.0.4.tar.gz’ saved [62793050/62793050]
Unpacking Hadoop
RSYNC'ing /root/persistent-hdfs to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
~/spark-ec2
Initializing spark-standalone
Initializing tachyon
~ ~/spark-ec2
--2015-01-20 19:41:24--  https://s3.amazonaws.com/Tachyon/tachyon-0.4.1-bin.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)… 1.2.3.4
Connecting to s3.amazonaws.com (s3.amazonaws.com)|1.2.3.4|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 22751619 (22M) [application/x-gzip]
Saving to: ‘tachyon-0.4.1-bin.tar.gz’
100%[==================================================================================================================================>] 22,751,619 3.34MB/s in 7.1s
2015-01-20 19:41:32 (3.08 MB/s) - ‘tachyon-0.4.1-bin.tar.gz’ saved [22751619/22751619]
Unpacking Tachyon
~/spark-ec2
Initializing ganglia
Connection to ec2-5.1.1.2.compute-1.amazonaws.com closed.
Creating local config files…
Connection to ec2-5.1.1.2.compute-1.amazonaws.com closed.
Connection to ec2-5.1.1.2.compute-1.amazonaws.com closed.
Configuring /etc/ganglia/gmond.conf
Configuring /etc/ganglia/gmetad.conf
Configuring /etc/httpd/conf.d/ganglia.conf
Configuring /etc/httpd/conf/httpd.conf
Configuring /root/mapreduce/hadoop.version
Configuring /root/mapreduce/conf/core-site.xml
Configuring /root/mapreduce/conf/slaves
Configuring /root/mapreduce/conf/mapred-site.xml
Configuring /root/mapreduce/conf/hdfs-site.xml
Configuring /root/mapreduce/conf/hadoop-env.sh
Configuring /root/mapreduce/conf/masters
Configuring /root/persistent-hdfs/conf/core-site.xml
Configuring /root/persistent-hdfs/conf/slaves
Configuring /root/persistent-hdfs/conf/mapred-site.xml
Configuring /root/persistent-hdfs/conf/hdfs-site.xml
Configuring /root/persistent-hdfs/conf/hadoop-env.sh
Configuring /root/persistent-hdfs/conf/masters
Configuring /root/ephemeral-hdfs/conf/core-site.xml
Configuring /root/ephemeral-hdfs/conf/slaves
Configuring /root/ephemeral-hdfs/conf/mapred-site.xml
Configuring /root/ephemeral-hdfs/conf/hadoop-metrics2.properties
Configuring /root/ephemeral-hdfs/conf/hdfs-site.xml
Configuring /root/ephemeral-hdfs/conf/hadoop-env.sh
Configuring /root/ephemeral-hdfs/conf/masters
Configuring /root/spark/conf/core-site.xml
Configuring /root/spark/conf/spark-defaults.conf
Configuring /root/spark/conf/spark-env.sh
Configuring /root/tachyon/conf/slaves
Configuring /root/tachyon/conf/tachyon-env.sh
Configuring /root/shark/conf/shark-env.sh
Deploying Spark config files…
RSYNC'ing /root/spark/conf to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
Setting up scala
RSYNC'ing /root/scala to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
Setting up spark
RSYNC'ing /root/spark to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
Setting up shark
RSYNC'ing /root/shark to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
File or directory /root/hive* doesn't exist!
Setting up ephemeral-hdfs
~/spark-ec2/ephemeral-hdfs ~/spark-ec2
ec2-5.1.1.2.compute-1.amazonaws.com
Connection to ec2-5.1.1.2.compute-1.amazonaws.com closed.
RSYNC'ing /root/ephemeral-hdfs/conf to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
Formatting ephemeral HDFS namenode…
Warning: $HADOOP_HOME is deprecated.
15/01/20 19:42:35 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ip-1.1.1.2.ec2.internal/1.3.3.7
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
15/01/20 19:42:35 INFO util.GSet: VM type = 64-bit
15/01/20 19:42:35 INFO util.GSet: 2% max memory = 19.33375 MB
15/01/20 19:42:35 INFO util.GSet: capacity = 2^21 = 2097152 entries
15/01/20 19:42:35 INFO util.GSet: recommended=2097152, actual=2097152
15/01/20 19:42:36 INFO namenode.FSNamesystem: fsOwner=root
15/01/20 19:42:37 INFO namenode.FSNamesystem: supergroup=supergroup
15/01/20 19:42:37 INFO namenode.FSNamesystem: isPermissionEnabled=false
15/01/20 19:42:37 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
15/01/20 19:42:37 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
15/01/20 19:42:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/01/20 19:42:37 INFO common.Storage: Image file of size 110 saved in 0 seconds.
15/01/20 19:42:37 INFO common.Storage: Storage directory /mnt/ephemeral-hdfs/dfs/name has been successfully formatted.
15/01/20 19:42:37 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-1.1.1.2.ec2.internal/1.3.3.7
************************************************************/
Starting ephemeral HDFS…
./ephemeral-hdfs/setup.sh: line 31: /root/ephemeral-hdfs/sbin/start-dfs.sh: No such file or directory
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /mnt/ephemeral-hdfs/logs/hadoop-root-namenode-ip-1.1.1.2.ec2.internal.out
ec2-5.1.1.2.compute-1.amazonaws.com: starting datanode, logging to /mnt/ephemeral-hdfs/logs/hadoop-root-datanode-ip-1-3-4-2.ec2.internal.out
ec2-5.1.1.2.compute-1.amazonaws.com: Warning: $HADOOP_HOME is deprecated.
ec2-5.1.1.2.compute-1.amazonaws.com:
ec2-5-1-1-64.compute-1.amazonaws.com: Warning: $HADOOP_HOME is deprecated.
ec2-5-1-1-64.compute-1.amazonaws.com:
ec2-5-1-1-64.compute-1.amazonaws.com: starting secondarynamenode, logging to /mnt/ephemeral-hdfs/logs/hadoop-root-secondarynamenode-ip-1.1.1.2.ec2.internal.out
~/spark-ec2
Setting up persistent-hdfs
~/spark-ec2/persistent-hdfs ~/spark-ec2
Pseudo-terminal will not be allocated because stdin is not a terminal.
RSYNC'ing /root/persistent-hdfs/conf to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
Formatting persistent HDFS namenode…
Warning: $HADOOP_HOME is deprecated.
15/01/20 19:42:1 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ip-1.1.1.2.ec2.internal/1.3.3.7
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
15/01/20 19:42:55 INFO util.GSet: VM type = 64-bit
15/01/20 19:42:55 INFO util.GSet: 2% max memory = 19.33375 MB
15/01/20 19:42:55 INFO util.GSet: capacity = 2^21 = 2097152 entries
15/01/20 19:42:55 INFO util.GSet: recommended=2097152, actual=2097152
15/01/20 19:42:57 INFO namenode.FSNamesystem: fsOwner=root
15/01/20 19:42:58 INFO namenode.FSNamesystem: supergroup=supergroup
15/01/20 19:42:58 INFO namenode.FSNamesystem: isPermissionEnabled=false
15/01/20 19:42:58 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
15/01/20 19:42:58 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
15/01/20 19:42:58 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/01/20 19:42:59 INFO common.Storage: Image file of size 110 saved in 0 seconds.
15/01/20 19:42:59 INFO common.Storage: Storage directory /vol/persistent-hdfs/dfs/name has been successfully formatted.
15/01/20 19:42:59 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-1.1.1.2.ec2.internal/1.3.3.7
************************************************************/
Persistent HDFS installed, won't start by default…
~/spark-ec2
Setting up spark-standalone
RSYNC'ing /root/spark/conf to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
RSYNC'ing /root/spark-ec2 to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
ec2-5.1.1.2.compute-1.amazonaws.com: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-ip-1.1.1.2.ec2.internal.out
ec2-5.1.1.2.compute-1.amazonaws.com: starting org.apache.spark.deploy.worker.Worker, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-1-3-4-2.ec2.internal.out
Setting up tachyon
RSYNC'ing /root/tachyon to slaves…
ec2-5.1.1.2.compute-1.amazonaws.com
ec2-5.1.1.2.compute-1.amazonaws.com: Formatting Tachyon Worker @ ip-1-3-4-2.ec2.internal
ec2-5.1.1.2.compute-1.amazonaws.com: Removing local data under folder: /mnt/ramdisk/tachyonworker/
Formatting Tachyon Master @ ec2-5-1-1-64.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Formatting UNDERFS_DATA_FOLDER: hdfs://ec2-5-1-1-64.compute-1.amazonaws.com:9000/tachyon/data
Formatting UNDERFS_WORKERS_FOLDER: hdfs://ec2-5-5-1-4.compute-1.amazonaws.com:9000/tachyon/workers
TACHYON_LOGS_DIR: /root/tachyon/libexec/../logs
Killed 0 processes
Killed 0 processes
ec2-5-5-1-4.compute-1.amazonaws.com: Killed 0 processes
Starting master @ ec2-5-5-1-4.compute-1.amazonaws.com
ec2-5-5-1-4.compute-1.amazonaws.com: TACHYON_LOGS_DIR: /root/tachyon/libexec/../logs
ec2-5-5-1-4.compute-1.amazonaws.com: Formatting RamFS: /mnt/ramdisk (512mb)
ec2-5-5-1-4.compute-1.amazonaws.com: Starting worker @ ip-1-1-3-2.ec2.internal
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves…
ec2-5-5-1-4.compute-1.amazonaws.com
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-5-5-1-4.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: [ OK ]
Connection to ec2-5-5-1-4.compute-1.amazonaws.com closed.
Spark standalone cluster started at http://ec2-5-5-1-4.compute-1.amazonaws.com:8080
Ganglia started at http://ec2-5-5-1-4.compute-1.amazonaws.com:5080/ganglia
Done!
~/Downloads/spark-1.2.0-bin-hadoop1/ec2 $


Commands to experiment with the Spark shell and read/write to S3

// In the spark-shell, load the file from S3
val myFile = sc.textFile("s3://some-s3-bucket/us-constitution.txt")
// Classic word count: lowercase, strip basic punctuation, split on spaces,
// then map each word to (word, 1) and sum the counts per word
val counts = myFile.flatMap(line => line.toLowerCase().replace(".", " ").replace(",", " ").split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)
// Collect the (word, count) tuples and sort by descending count
val sorted_counts = counts.collect().sortBy(wc => -wc._2)
// Print a sample of 10 to inspect the results
sorted_counts.take(10).foreach(println)
// Save the results out to an S3 bucket
sc.parallelize(sorted_counts).saveAsTextFile("s3n://some-s3-bucket/wordcount-us-constitution")
// Or maybe you want to write it out as CSV
val csvResults = sorted_counts map { case (key, value) => Array(key, value).mkString(",\t") }
// Save the CSV out to S3
sc.parallelize(csvResults).saveAsTextFile("s3n://some-s3-bucket/wordcount-csv-constitution")
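The same lowercase / strip-punctuation / split / count / sort pipeline can be sanity-checked locally with standard Unix tools before spending cluster time on it. A sketch using a short inline sample (swap the `printf` for `cat us-constitution.txt` to run it on the real file):

```shell
# Same steps as the Spark job: lowercase, replace '.' and ',' with spaces,
# split on whitespace, count each word, then sort by descending count.
printf 'We the People of the United States, in Order to form a more perfect Union.\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr '.,' '  ' \
  | tr -s ' \t' '\n' \
  | grep -v '^$' \
  | sort | uniq -c | sort -rn | head -10
```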


Output for Simple Word Count job on EMR

[Screenshot: Spark shell output of the word count job on EMR, 2015-01-20]

Links to Apache Spark and Collection of Spark EMR Posts

2 responses to “Running Apache Spark EMR and EC2 scripts on AWS with read write S3”

  1. David April 16, 2015 at 6:09 pm

    The video is about running Apache Spark on AWS EMR, but the text describes running Apache Spark as a stand-alone cluster (not on EMR). Also, the video seems to be from early 2014.
