How can I calculate the exact median with Apache Spark?

0 votes

This page contains some statistics functions (mean, stdev, variance, etc.), but it does not include the median. How can I calculate the exact median?

Oct 8, 2018 in Big Data Hadoop by slayer
• 29,300 points
2,377 views

1 answer to this question.

0 votes

You need to sort the RDD and take the middle element, or the average of the two middle elements when the count is even. Here is an example with RDD[Int]:

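  import org.apache.spark.SparkContext._  // pair-RDD implicits; only needed on Spark releases before 1.3
  import org.apache.spark.rdd.RDD

  val rdd: RDD[Int] = ???  // your input data

  // Sort, attach each element's position, and key by that position for lookup.
  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }.cache()  // reused by count() and lookup(), so avoid re-sorting each time

  val count = sorted.count()
  require(count > 0, "median of an empty RDD is undefined")

  val median: Double = if (count % 2 == 0) {
    // Even count: average the two middle elements.
    val l = count / 2 - 1
    val r = l + 1
    // Convert to Double before adding so the sum cannot overflow Int.
    (sorted.lookup(l).head.toDouble + sorted.lookup(r).head.toDouble) / 2
  } else sorted.lookup(count / 2).head.toDouble  // odd count: the single middle element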
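
To check the even/odd index arithmetic without a cluster, the same logic can be run on a local collection. This is only a minimal sketch: `MedianSketch` and `localMedian` are hypothetical names introduced here for illustration, and the code uses plain Scala collections rather than any Spark API.

```scala
// Spark-free sketch of the median index logic (hypothetical helper, not part of Spark).
object MedianSketch {
  def localMedian(values: Seq[Int]): Double = {
    require(values.nonEmpty, "median of an empty collection is undefined")
    val sorted = values.sorted.toIndexedSeq
    val count = sorted.length
    if (count % 2 == 0) {
      // Even count: average the two middle elements.
      val l = count / 2 - 1
      val r = l + 1
      // Convert to Double before adding so the sum cannot overflow Int.
      (sorted(l).toDouble + sorted(r).toDouble) / 2
    } else sorted(count / 2).toDouble // odd count: the single middle element
  }
}
```

For example, `MedianSketch.localMedian(Seq(5, 1, 3))` gives 3.0, and `MedianSketch.localMedian(Seq(4, 1, 3, 2))` gives 2.5, the average of the two middle values 2 and 3.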
answered Oct 8, 2018 by Omkar
• 69,130 points
