Wednesday, April 03, 2013

Running a Pig job inside a Java wrapper on MapR Hadoop


Here's how to get Pig code running from a Java wrapper on MapR Hadoop.

[root@nmk-centos-60-1 ~]# cat idmapreduce.java
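import java.io.IOException;
import org.apache.pig.PigServer;

public class idmapreduce {
    public static void main(String[] args) {
        try {
            // "mapreduce" runs the query on the cluster rather than in local mode
            PigServer pigServer = new PigServer("mapreduce");
            runIdQuery(pigServer, "/test/Mapr_rpm_Files");
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
            System.exit(1);
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile)
            throws IOException {
        // Split each input line on '/' and keep the first field as the id
        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage('/');");
        pigServer.registerQuery("B = foreach A generate $0 as id;");
        // store() triggers execution and writes relation B to the output directory
        pigServer.store("B", "/test/idout");
    }
}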
[root@nmk-centos-60-1 ~]#


Then compile it

[root@nmk-centos-60-1 ~]# javac -cp /opt/mapr/pig/pig-0.10/pig-0.10.0.jar idmapreduce.java

The compiled class is now in /root (my current working directory)

[root@nmk-centos-60-1 ~]# ls idmapreduce.*
idmapreduce.class  idmapreduce.java

Then run the java wrapper program with the pig jar location,

[root@nmk-centos-60-1 pig-0.10]# ls
autocomplete*  CHANGES.txt*  contrib/       ivy.xml*  lib-src/      mapr-build.properties*  pig-0.10.0.jar*                pigperf.jar*            README.txt*         src/
bin/           conf/         doap_Pig.rdf*  KEYS*     license/      NOTICE.txt*             pig-0.10.0-withouthadoop.jar*  pig-withouthadoop.jar*  RELEASE_NOTES.txt*  test/
build.xml*     conf.new/     ivy/           lib/      LICENSE.txt*  pig-0.10.0-core.jar*    pig.jar*                       readme.md*              shims/              tutorial/


and the location where the wrapper binary is located,

[root@nmk-centos-60-1 pig-0.10]# cd -
/root

and the whole classpath that `hadoop classpath` expands to,

and also the location of the native I/O library as a -D option.

So the final command would be:

[root@nmk-centos-60-1 ~]# java -cp /opt/mapr/pig/pig-0.10/pig.jar:.:`hadoop classpath` \
    -Djava.library.path=/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64 idmapreduce

13/04/03 09:33:48 INFO executionengine.HExecutionEngine: Connecting to
hadoop file system at: maprfs:///
13/04/03 09:33:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/03 09:33:48 INFO security.JniBasedUnixGroupsMapping: Using
JniBasedUnixGroupsMapping for Group resolution
13/04/03 09:33:48 INFO executionengine.HExecutionEngine: Connecting to
map-reduce job tracker at: maprfs:///
13/04/03 09:33:49 INFO pigstats.ScriptState: Pig features used in the
script: UNKNOWN
13/04/03 09:33:49 INFO mapReduceLayer.MRCompiler: File concatenation
threshold: 100 optimistic? false
13/04/03 09:33:49 INFO mapReduceLayer.MultiQueryOptimizer: MR plan
size before optimization: 1
13/04/03 09:33:49 INFO mapReduceLayer.MultiQueryOptimizer: MR plan
size after optimization: 1
13/04/03 09:33:49 INFO pigstats.ScriptState: Pig script settings are
added to the job
13/04/03 09:33:49 INFO mapReduceLayer.JobControlCompiler:
mapred.job.reduce.markreset.buffer.percent is not set, set to default
0.3
13/04/03 09:33:49 INFO mapReduceLayer.JobControlCompiler: creating jar
file Job6414903787816249153.jar
13/04/03 09:33:56 INFO mapReduceLayer.JobControlCompiler: jar file
Job6414903787816249153.jar created
13/04/03 09:33:56 INFO mapReduceLayer.JobControlCompiler: Setting up
single store job
13/04/03 09:33:56 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce
job(s) waiting for submission.
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:zookeeper.version=3.3.6-1366786, built on 07/29/2012 06:22
GMT
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:host.name=nmk-centos-60-1
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:java.version=1.6.0_25
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:java.vendor=Sun Microsystems Inc.
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:java.home=/usr/java/jdk1.6.0_25/jre
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:java.class.path=/opt/mapr/pig/pig-0.10/pig.jar:.:/opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/java/default/lib/tools.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/amazon-s3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/asm-3.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjtools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aws-java-sdk-1.3.26.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-cli-1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-codec-1.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-configuration-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-daemon-1.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-lang-2.6.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-math-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-3.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/emr-metrics-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/gson-1.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/guava-13.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop-
0.20.2/bin/../lib/hsqldb-1.8.0.10.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/httpclient-4.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/httpcore-4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-core-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-mapper-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-compiler-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jersey-core-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jersey-json-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jersey-server-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/logging-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.20.2-2.1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-jni-0.20.2-2.1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mysql-connector-java-5.0.8-bin.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/oro-2.0.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/protobuf-java-2.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlenc-0.52.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.6.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.
1/jsp-api-2.1.jar
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:java.library.path=/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:java.io.tmpdir=/tmp
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:java.compiler=
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client
environment:os.version=2.6.32-71.el6.x86_64
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:user.name=root
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:user.home=/root
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:user.dir=/root
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Initiating client
connection, connectString=nmkc1:5181,nmkc2:5181,nmkc3:5181
sessionTimeout=30000 watcher=com.mapr.fs.JobTrackerWatcher@40bb2bc3
13/04/03 09:33:56 INFO zookeeper.ClientCnxn: Opening socket connection
to server nmkc3/10.10.80.93:5181
13/04/03 09:33:56 INFO zookeeper.ClientCnxn: Socket connection
established to nmkc3/10.10.80.93:5181, initiating session
13/04/03 09:33:56 INFO zookeeper.ClientCnxn: Session establishment
complete on server nmkc3/10.10.80.93:5181, sessionid =
0x23db71ab8a301d9, negotiated timeout = 30000
13/04/03 09:33:56 INFO fs.JobTrackerWatcher: Current running
JobTracker is: nmk-centos-60-1/10.10.80.91:9001
13/04/03 09:33:56 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
13/04/03 09:33:57 INFO mapReduceLayer.MapReduceLauncher: 0% complete
13/04/03 09:33:57 INFO input.FileInputFormat: Total input paths to process : 1
13/04/03 09:33:57 INFO util.MapRedUtil: Total input paths to process : 1
13/04/03 09:33:57 WARN snappy.LoadSnappy: Snappy native library not loaded
13/04/03 09:33:57 INFO util.MapRedUtil: Total input paths (combined)
to process : 1
13/04/03 09:33:58 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId:
job_201304010834_0003
13/04/03 09:33:58 INFO mapReduceLayer.MapReduceLauncher: More
information at:
http://maprfs:50030/jobdetails.jsp?jobid=job_201304010834_0003
13/04/03 09:33:58 INFO fs.JobTrackerWatcher: Current running
JobTracker is: nmk-centos-60-1/10.10.80.91:9001
13/04/03 09:34:17 INFO mapReduceLayer.MapReduceLauncher: 50% complete
13/04/03 09:34:18 INFO mapReduceLayer.MapReduceLauncher: 100% complete
13/04/03 09:34:18 INFO pigstats.SimplePigStats: Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.3   0.10.0  root    2013-04-03 09:33:49     2013-04-03 09:34:18     UNKNOWN

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime
 MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature
Outputs
job_201304010834_0003   1       0       3       3       3       0
 0       0       A,B     MAP_ONLY        /test/idout,

Input(s):
Successfully read 4826 records (5229 bytes) from: "/test/Mapr_rpm_Files"

Output(s):
Successfully stored 4826 records in: "/test/idout"

Counters:
Total records written : 4826
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201304010834_0003


13/04/03 09:34:18 INFO mapReduceLayer.MapReduceLauncher: Success!

[root@nmk-centos-60-1 ~]#
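
The classpath in the java command above has three parts joined by colons: the Pig jar, the directory holding the wrapper class, and the output of `hadoop classpath`. As a minimal sketch, a small shell helper can assemble it. The function name pig_wrapper_classpath is made up for illustration and is not part of Pig or MapR, and the example call uses placeholder values instead of a live `hadoop classpath`.

```shell
# Hypothetical helper: join the three classpath parts in the same order
# as the java command above.
#   $1 = Pig jar, $2 = directory with idmapreduce.class, $3 = Hadoop classpath
pig_wrapper_classpath() {
    printf '%s:%s:%s\n' "$1" "$2" "$3"
}

# Example with placeholder values instead of a live `hadoop classpath` call
pig_wrapper_classpath /opt/mapr/pig/pig-0.10/pig.jar . /example/conf:/example/lib/some.jar
```

On a cluster node the third argument would be "$(hadoop classpath)", and the result would be handed to java -cp.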

To see output,

[root@nmk-centos-60-1 ~]# hadoop fs -ls /test*
Found 2 items
-rwxr-xr-x   3 root root     342071 2013-04-03 08:54 /test/Mapr_rpm_Files
drwxr-xr-x   - root root          2 2013-04-03 09:34 /test/idout
[root@nmk-centos-60-1 ~]#

Thursday, March 07, 2013

Error sqoop seems to think you are running a JRE, Please install JDK


If you hit this error

13/03/07 12:31:51 ERROR orm.CompilationManager: It seems as though you are running sqoop with a JRE.
13/03/07 12:31:51 ERROR orm.CompilationManager: Sqoop requires a JDK that can compile Java code.
13/03/07 12:31:51 ERROR orm.CompilationManager: Please install a JDK and set $JAVA_HOME to use it.
13/03/07 12:31:51 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Could not start Java compiler.
at org.apache.sqoop.orm.CompilationManager.compile(CompilationManager.java:175)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:83)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)

you need to install a package that provides javac, such as java-1.6.0-openjdk-devel, and set $JAVA_HOME to point at that JDK.
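
A quick way to tell whether $JAVA_HOME points at a JDK rather than a JRE is to look for an executable javac under it. This is a minimal sketch; is_jdk_home is a hypothetical helper name, and it assumes the usual layout where the compiler lives in bin/javac.

```shell
# Hypothetical helper: succeeds when the given Java home contains an
# executable bin/javac, i.e. it looks like a JDK rather than a JRE.
is_jdk_home() {
    [ -n "$1" ] && [ -x "$1/bin/javac" ]
}

if is_jdk_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME points at a JDK"
else
    echo "JAVA_HOME is unset or has no bin/javac (JRE only?)"
fi
```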


Thursday, January 10, 2013

unable to resolve class org.apache.solr.client.solrj.impl.CommonsHttpSolrServer



[root@nmk-centos-60-3 samples]# groovy bugadd.groovy 7603.xml



org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
/opt/bugs/sources/samples/bugadd.groovy: 4: unable to resolve class org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
 @ line 4, column 1.
   import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
   ^

1 error

[root@nmk-centos-60-3 samples]#


The problem was that, as of Solr 4.0, CommonsHttpSolrServer has been renamed to HttpSolrServer and StreamingUpdateSolrServer is now ConcurrentUpdateSolrServer.


So replace the line
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
with
import org.apache.solr.client.solrj.impl.HttpSolrServer

and

SolrServer server = new CommonsHttpSolrServer("http://10.10.80.93:8080/solr/");    
with
SolrServer server = new HttpSolrServer("http://10.10.80.93:8080/solr/");    

and you should be ready to go.
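
If several scripts need the same change, the two renames can be applied as a plain text substitution. This is only a sketch: rename_solrj_classes is a hypothetical helper name, and it does nothing more than swap the class names, so the result should still be reviewed by hand.

```shell
# Hypothetical helper: rewrite the old SolrJ class names on stdin to the
# names used from Solr 4.0 onwards. Plain textual substitution only.
rename_solrj_classes() {
    sed -e 's/CommonsHttpSolrServer/HttpSolrServer/g' \
        -e 's/StreamingUpdateSolrServer/ConcurrentUpdateSolrServer/g'
}

echo 'import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer' | rename_solrj_classes
```

To convert a whole script, redirect it through the function, for example rename_solrj_classes < bugadd.groovy > bugadd.groovy.new, and compare the two files before replacing the original.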

Saturday, November 03, 2012

Pig HBase integration on MapR

This procedure helps integrate HBase transactions using Pig on MapR Hadoop clusters.

On the client node where Pig is installed, add the following to /opt/mapr/conf/env.sh:

export PIG_CLASSPATH=$PIG_CLASSPATH:/location-to-hbase-jar

If you are launching Pig on a node where hbase-regionserver
or hbase-master is installed, simply add the location of
hbase-0.92.1.jar to the PIG_CLASSPATH variable above. For example:

export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-0.92.1/hbase-0.92.1.jar"

If you don't have HBase installed, copy the HBase jar from any node
where HBase is installed to some location on the Pig client node, and
include that location in the definition above. For example:

export PIG_CLASSPATH=$PIG_CLASSPATH:/opt/mapr/lib/hbase-0.92.1.jar

Then identify your zookeeper nodes,
maprcli node listzookeepers

and accordingly add this variable to /opt/mapr/conf/env.sh

export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=10.10.80.61,10.10.80.62,10.10.80.63"
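
The PIG_OPTS value is just the ZooKeeper client port plus a comma-separated list of the ZooKeeper hosts. As a rough sketch, it can be built from a host list like this; pig_hbase_opts is a hypothetical helper name, and the hosts are typed in by hand rather than parsed from maprcli output.

```shell
# Hypothetical helper: build the PIG_OPTS value from a ZooKeeper client port
# followed by one or more ZooKeeper hosts.
pig_hbase_opts() {
    port=$1
    shift
    # Join the remaining arguments with commas
    quorum=$(IFS=,; printf '%s' "$*")
    printf '%s %s\n' "-Dhbase.zookeeper.property.clientPort=$port" "-Dhbase.zookeeper.quorum=$quorum"
}

pig_hbase_opts 5181 10.10.80.61 10.10.80.62 10.10.80.63
```

In env.sh the result would be used as export PIG_OPTS="$(pig_hbase_opts 5181 10.10.80.61 10.10.80.62 10.10.80.63)".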

Launch the pig job and you should be able to access HBase.
NB: Use only the HBase table name directly for accessing tables. Do not use
hbase:// prefixes. See the sample script below.

Sample env.sh

[root@nmk-centos-60-3 ~]# cat /opt/mapr/conf/env.sh
#!/bin/bash
# Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved
# Please set all environment variable you want to be used during MapR cluster
# runtime here.
# namely MAPR_HOME, JAVA_HOME, MAPR_SUBNETS

export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=10.10.80.61,10.10.80.62,10.10.80.63"
export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-0.92.1/conf:/usr/java/default/lib/tools.jar:/opt/mapr/hbase/hbase-0.92.1:/opt/mapr/hbase/hbase-0.92.1/hbase-0.92.1.jar"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$PIG_CLASSPATH"
export CLASSPATH="$CLASSPATH:$HADOOP_CLASSPATH"
#export JAVA_HOME=
#export MAPR_SUBNETS=
#export MAPR_HOME=
#export MAPR_ULIMIT_U=
#export MAPR_ULIMIT_N=
#export MAPR_SYSCTL_SOMAXCONN=
#export PIG_CLASSPATH=:$PIG_CLASSPATH
[root@nmk-centos-60-3 ~]#

Sample hbase insertion script

[root@nmk-centos-60-3 nabeel]# cat hbase_pig.pig
raw_data = LOAD '/user/mapr/input2.csv' USING PigStorage(',') AS (
listing_id: chararray,
fname: chararray,
lname: chararray );

STORE raw_data INTO 'sample_names' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage (
'info:fname info:lname');


Thursday, October 04, 2012

Awk display / print file from column N onwards


Any output can be piped to this to display each line from the Nth column onwards (here with ":" as the field separator).
awk -F: -v nr=N '{ for (x=nr; x<=NF; x++) printf "%s ", $x; print "" }'
Example usage: find the first disk re-inited by MapRFS this month.
find . -name mfs.log -exec grep -H '^2012-10-' {} \; | grep spinit.cc:1002 | awk -F: -v nr=2 '{ for (x=nr; x<=NF; x++) printf "%s ", $x; print "" }' \
  | awk -v nr=2 '{ for (x=nr; x<=NF; x++) printf "%s ", $x; print "" }' | sort -n | head -n 5

UPDATE:

Just realized cut could do this in a much simpler way:
cut -d : -f N-

where N is the column from where you need to display the lines.
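
A small sketch of the cut approach on made-up input; from_field is a hypothetical wrapper name. Unlike the awk loop, cut keeps the original delimiter between the remaining fields, and lines that contain no delimiter at all are passed through unchanged unless -s is given.

```shell
# Hypothetical helper: print fields N onwards from stdin.
#   $1 = first field to keep (1-based), $2 = single-character delimiter
from_field() {
    cut -d "$2" -f "$1-"
}

printf 'a:b:c:d\n' | from_field 2 :   # prints b:c:d
printf 'a:b:c:d\n' | from_field 3 :   # prints c:d
```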

Thursday, September 13, 2012

Finding the culprit in df disk utilization different from du

Sometimes df and du report different usage because of files that have been deleted but are still held open by a process. That's explained in detail all over Google, so I'm not touching that here. Here's a quick command to find the culprit behind the difference. It shows the top 5 open files that consume around 1GB of space or more (a size of at least 10 digits in bytes). You can adjust the number after the digit regex to change that threshold. Keep in mind that temp files occasionally have names containing long runs of digits, so those will also show up in the grep results, but the sort is still done on the file size column.


lsof -n -P | grep -E '[[:digit:]]{10}' | sort -k 7,7nr | head -n 5
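
A variation on the command above, as a sketch rather than a drop-in replacement: instead of grepping for a run of digits anywhere on the line, it compares the size column itself against a byte threshold, which avoids matching digits in file names. big_open_files is a hypothetical helper name, the sample rows are made up, and it assumes the size is the 7th whitespace-separated column, which will not hold for rows where lsof leaves a field blank.

```shell
# Hypothetical helper: read lsof-style lines on stdin and print the largest open files.
#   $1 = minimum size in bytes, $2 = how many rows to show
# Assumes the size is the 7th whitespace-separated column (SIZE/OFF).
big_open_files() {
    awk -v min="$1" '$7 ~ /^[0-9]+$/ && $7 + 0 >= min + 0' |
        sort -k7,7nr |
        head -n "$2"
}

# Made-up sample rows in lsof's default column order
big_open_files 1000000000 5 <<'EOF'
java 100 root 5w REG 8,1 2147483648 12 /tmp/big.log (deleted)
bash 200 root 1u CHR 136,0 0t0 3 /dev/pts/0
mfs 300 root 9w REG 8,1 1500000000 44 /opt/mapr/logs/mfs.log
vi 400 root 4u REG 8,1 4096 77 /tmp/1234567890.swp
EOF
```

Against a live system the input would come from lsof itself, for example lsof -n -P | big_open_files 1000000000 5.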

Monday, July 02, 2012

"java.io.IOException: Could not create FileClient"

If you see this error in the jobtracker logs for a failed job, you are most likely looking at an irrelevant error. To be clearer, your job failed before this point and was subsequently killed by the jobtracker. Some processes that linger in memory after the job was killed keep attempting to access resources that were assigned to them, and this continues until they time out and eventually get killed. Those resources, for example the file or its parent directory, would already have been removed by the jobtracker as part of killing the job, so the stray processes are not able to create any new files.