Unable to use ml library in pyspark

Question

>>> from pyspark.ml.feature import Tokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/__init__.py", line 22, in <module>
    from pyspark.ml.base import Estimator, Model, Transformer
  File "/usr/lib/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 21, in <module>
    from pyspark.ml.param import Params
  File "/usr/lib/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/param/__init__.py", line 26, in <module>
    import numpy as np
ImportError: No module named numpy

score 0 · Answer 1 · Jul 30, 2019

The traceback shows that the import fails because the numpy package is missing: pyspark.ml imports numpy as soon as it is loaded. Run the commands below in your terminal to install pip and then numpy, and then try importing Tokenizer again.

1. Add the EPEL Repository
Pip is not available in CentOS 7 core repositories. To install pip we need to enable the EPEL repository:

sudo yum install epel-release

2. Install pip
Once the EPEL repository is enabled, we can install pip and all of its dependencies with the following command:

sudo yum install python-pip

3. Verify Pip installation
To verify that pip is installed correctly, run the following command, which prints the pip version:

pip --version

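4. Install numpy
Once pip is available, install the numpy package:

pip install numpy

If pip reports a permissions error, run the command with sudo or add the --user option. Make sure this pip belongs to the same Python interpreter that PySpark runs on, otherwise the import will keep failing.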
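To confirm that the install reached the interpreter PySpark actually uses, a small check like the one below may help. It is only a sketch: the helper name missing_modules is made up for illustration, and it assumes you run it with the same Python that starts your pyspark shell (for example by pasting it into the shell itself).

```python
import importlib
import sys


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Same failure class as the "No module named numpy" error above
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Shows which Python binary is running, so it can be compared with pip's
    print("Interpreter: %s" % sys.executable)
    missing = missing_modules(["numpy", "pyspark"])
    if missing:
        print("Not importable here: %s" % ", ".join(missing))
    else:
        print("numpy and pyspark are both importable")
```

If numpy is still listed as not importable, compare the interpreter path printed here with the Python version shown by pip --version; a mismatch means numpy was installed for a different Python.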

Hope this helps!


Thanks.

answered Jul 30, 2019 by Karan
