RDD word count with line numbers

Hi,

Could you please share a PySpark snippet that counts each word and lists the line numbers on which that word appears?

Example: the text file contains the following lines:

Hello world
Hello world
Hello

Expected output:

Hello 3 [1,2,3]
world 2 [1,2]

Here, "Hello" appears on lines 1, 2 and 3, and "world" appears on lines 1 and 2.
Jul 25, 2019 in Apache Spark by Rishi

1 answer to this question.

df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])
df.show()

+-----+----+
|Group|Date|
+-----+----+
|    A|2000|
|    A|2002|
|    A|2007|
|    B|1999|
|    B|2015|
+-----+----+


# Number the rows within each group, ordered by Date

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))).show()



# Output (the order in which the groups are printed may differ)

+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    B|1999|      1|
|    B|2015|      2|
|    A|2000|      1|
|    A|2002|      2|
|    A|2007|      3|
+-----+----+-------+

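After this, you can group by the key and gather the row numbers into a list, for example with a UDF or with the built-in collect_list aggregate function.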
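Since the question asks for an RDD version with words and line numbers, here is a minimal sketch of one possible approach. It is not part of the original answer. It assumes that line numbers start at 1, that words are separated by whitespace, and that matching is case-sensitive. The function names (word_line_pairs, summarize, word_count_with_lines) and the path "input.txt" are made up for illustration; zipWithIndex, flatMap, groupByKey and mapValues are standard RDD methods.

```python
def word_line_pairs(indexed_line):
    """Turn one (line_text, zero_based_index) pair into (word, line_number) pairs."""
    line, index = indexed_line  # zipWithIndex yields (element, index)
    # Assumption: words are whitespace-separated and line numbers are 1-based.
    return [(word, index + 1) for word in line.split()]


def summarize(line_numbers):
    """Return (total occurrences, sorted distinct line numbers) for one word."""
    numbers = list(line_numbers)
    return len(numbers), sorted(set(numbers))


def word_count_with_lines(lines_rdd):
    """Map an RDD of text lines to an RDD of (word, (count, [line numbers]))."""
    return (
        lines_rdd
        .zipWithIndex()            # attach a 0-based index to every line
        .flatMap(word_line_pairs)  # one (word, line_number) pair per occurrence
        .groupByKey()              # gather all line numbers for each word
        .mapValues(summarize)      # count them and list the distinct lines
    )


# Usage in a PySpark shell, where `sc` already exists:
#   lines = sc.textFile("input.txt")  # hypothetical path
#   for word, (count, numbers) in word_count_with_lines(lines).collect():
#       print(word, count, numbers)
```

For the three sample lines in the question, this should give a count of 3 with lines [1, 2, 3] for "Hello" and a count of 2 with lines [1, 2] for "world". If "World" and "world" should be treated as the same word, lower-case each line before splitting. For very frequent words, groupByKey pulls every line number for a word onto one executor, so aggregateByKey may be a better fit on large inputs.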

answered Jul 25, 2019 by Siri
