How to implement my own clustering algorithm (for example, k-means) in PySpark without using the ready-made library

0 votes
Hello, I'm a beginner at PySpark and have a question. I have my clustering algorithm written in Python and want to implement it in PySpark without using a ready-made library (for example, the built-in k-means). Please help me understand how to implement it. Thanks.
Oct 14, 2020 in Apache Spark by dani
• 160 points

edited Oct 14, 2020 by MD

1 answer to this question.

+1 vote

Hi@dani,

As you said you are a beginner in this area, you should first go through the existing modules. You are trying to implement the K-means algorithm, so start by learning the mathematical concept behind it. Once you are clear on the concept, try to analyze the code of an existing implementation. These steps will help you create your own K-means module in Python or any other language.

answered Oct 14, 2020 by MD
• 95,060 points
I know clustering algorithms like k-means, and I can implement the algorithm using its ready-made library in PySpark. But suppose I want to implement k-means without using that library. If I can do this, then I can implement my own clustering algorithm (or please point me to the address of a k-means source-code example in PySpark). Thanks.

Hi@dani,

If you know the concept, then you can start with Python. For example, K-means works on the shortest distance, and from that it finds the centroids. So try to create a simple Python script that finds the shortest distance from a list of points.
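That shortest-distance step can be sketched in plain Python. This is only an illustrative snippet (the function name and sample points are made up, not from any library):

```python
import math

def closest_centroid(point, centroids):
    # Return the index of the centroid nearest to `point` (Euclidean distance).
    best_index, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, c)))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

print(closest_centroid((0.9, 1.1), [(1.0, 1.0), (5.0, 5.0)]))  # prints 0
```

Once this works for ordinary Python tuples, the same function can be applied to each point of a distributed dataset.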

My problem is writing the code in PySpark. If possible, please give me a link to k-means example source code in PySpark (without using "from pyspark.ml.clustering import KMeans").

Hi,

I understood your requirement. You are trying to create your own customized module, which is why I suggested using Python: PySpark simply means Spark with Python. Write a mathematical expression to find the shortest distance, implement it in Python, and then import that script into your PySpark job. For example, your module name could be something like dani.pyspark.ml.
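Putting the pieces together, a rough sketch of such a module might look like this, using only RDD operations rather than pyspark.ml.clustering. All names, the sampling seed, and the convergence threshold below are illustrative assumptions, not a definitive implementation; `rdd` is assumed to be a PySpark RDD of numeric tuples:

```python
# A sketch of k-means on a PySpark RDD without importing pyspark.ml.clustering.
# Function and variable names are illustrative.

def closest(point, centroids):
    # Index of the nearest centroid by squared Euclidean distance.
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_rdd(rdd, k, max_iter=20, tol=1e-4):
    # Pick k initial centroids by sampling the data.
    centroids = [tuple(p) for p in rdd.takeSample(False, k, seed=42)]
    for _ in range(max_iter):
        # Assignment step: pair each point with its nearest cluster id.
        assigned = rdd.map(lambda p: (closest(p, centroids), (tuple(p), 1)))
        # Update step: sum points and counts per cluster, then average.
        sums = assigned.reduceByKey(
            lambda a, b: (tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1]))
        new_centroids = sums.mapValues(
            lambda s: tuple(x / s[1] for x in s[0])).collectAsMap()
        # Total squared movement of the centroids this iteration.
        moved = sum(sum((a - b) ** 2 for a, b in zip(centroids[i], c))
                    for i, c in new_centroids.items())
        centroids = [new_centroids.get(i, centroids[i]) for i in range(k)]
        if moved < tol:
            break
    return centroids

# Usage inside a PySpark session (not run here):
#   rdd = spark.sparkContext.parallelize([(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.9)])
#   kmeans_rdd(rdd, k=2)
```

The assignment and update steps map naturally onto `map` and `reduceByKey`, which is what makes the algorithm parallelizable in Spark.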

Thanks for your answer.
