Usecase 1: Titanic¶
Usecase 1a: k-nearest neighbors (k-NN) model in Python¶
- Goal: predict survival of the Titanic's passengers
- Resources: https://www.kaggle.com/c/titanic/overview/
- Technique: machine learning (a minimal model sketch follows below)
- Tools used: scikit-learn, the MinIO Python library
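To give an idea of the modelling part, here is a minimal k-NN sketch with scikit-learn. It is not the repository's `titanic_pandas.py` script: the local path `data/train.csv` and the feature selection are assumptions based on the Kaggle dataset.

```python
# Hedged sketch: train a k-NN classifier on the Kaggle Titanic training set.
# Assumptions: a local copy of the data at data/train.csv and this feature set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/train.csv")

# Encode sex as a number and keep a few numeric features; drop missing values.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
df = df.dropna(subset=features)

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.2, random_state=42
)

# Scale the features, then fit a k-nearest neighbors classifier (k = 5).
scaler = StandardScaler().fit(X_train)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(scaler.transform(X_train), y_train)

print("accuracy:", model.score(scaler.transform(X_test), y_test))
```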
Implementation¶
Processing the dataset involves several steps:
- Get the `titanic` usecase from the Github repository https://github.com/criann/datalab-normandie-demos-saagie.git
- Upload the dataset to the group's S3 bucket
- Create a Service Account
- Define environment variables at the project level
- Create an archive containing the processing script and a `requirements.txt` file listing the required Python modules
- Create a Python job (min. 3.6), then set the package (the previously created archive) and the execution command line
Get a copy of the Github repository¶
```bash
git clone https://github.com/criann/datalab-normandie-demos-saagie.git
cd datalab-normandie-demos-saagie/usecases/titanic
```
Upload the dataset to the group's S3 bucket¶
- Connect to the S3 datalake web console with your login and password: https://s3-console.atelier.datalab-normandie.fr
- Go to the Buckets section, then click on the Browse button of your group bucket `group-xxxx`
- Click on the New path button and fill in `titanic`
- Once in the `titanic` path, click on Upload Files, then Upload folder
- Select the `data` folder of your copy of the Github repository and validate
- The `data` folder is now imported into the `titanic` folder of your group bucket
Create a Service Account¶
- Create a Service Account for the project
- Go to Identity / Service Account
- Click Create Service Account button
- Validate the form with the default values
- Click on Download to save a file containing the generated Access Key and Secret Key tokens
Define environment variables at the project level¶
- Connect to the data processing tool with your login and password: https://dln-p1.atelier.datalab-normandie.fr
- Go to the project and then to the Environment variables section
- Define variables to keep the values for the Access Key and Secret Key from the previous step
- Click New variable button
- Define the values of the fields
- Click Save button
Variable name | Description | is password | value
---|---|---|---
DATALAKE_ACCESS_KEY | Access key for the S3 datalake | | value of the Access Key from the previous step
DATALAKE_SECRET_KEY | Secret key for the S3 datalake | | value of the Secret Key from the previous step
The following variables are defined at the global level; projects and their jobs inherit them automatically.
Variable name | Description | is password | value
---|---|---|---
DATALAKE_HOST | Hostname of the S3 datalake | | s3.atelier.datalab-normandie.fr
DATALAKE_SCHEME | HTTP scheme of the complete URL | | https
DATALAKE_URL | Complete API access URL (https://...) | | https://s3.atelier.datalab-normandie.fr
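For reference, here is a minimal sketch of how a job script can pick these variables up and open a MinIO client. The bucket name `group-xxxx` and the `titanic/` prefix follow the upload step above; this is an illustration, not necessarily what the repository's script does.

```python
# Hedged sketch: build a MinIO client from the project/global environment
# variables defined above and list the uploaded objects.
import os
from minio import Minio

client = Minio(
    os.environ["DATALAKE_HOST"],
    access_key=os.environ["DATALAKE_ACCESS_KEY"],
    secret_key=os.environ["DATALAKE_SECRET_KEY"],
    secure=os.environ.get("DATALAKE_SCHEME", "https") == "https",
)

# List the objects uploaded under the titanic/ prefix of the group bucket
# (replace group-xxxx with your actual group bucket name).
for obj in client.list_objects("group-xxxx", prefix="titanic/", recursive=True):
    print(obj.object_name)
```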
Create an archive containing the processing script¶
The archive used by the job must contain 2 files:
- `requirements.txt`, which is used automatically when the job is launched to install the required Python modules
- `__main__.py`, the script that runs the processing
```bash
# from the usecases/titanic folder
cd with_datalake_s3
cp titanic_pandas.py __main__.py
zip archive.zip requirements.txt __main__.py
```
!!! warning "Be careful when creating the .zip file"
    Make sure that both files are at the root of the archive: when you unzip it, you must get the two files directly, not a sub-folder containing them.
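If in doubt, the layout can be verified with a short check such as the following (a sketch, assuming it runs in the folder that contains `archive.zip`):

```python
# Hedged sketch: confirm that both files sit at the root of archive.zip,
# with no enclosing sub-folder.
import zipfile

names = zipfile.ZipFile("archive.zip").namelist()
print(names)  # expected: ['requirements.txt', '__main__.py']
assert "requirements.txt" in names and "__main__.py" in names
```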
Create a Python job¶
- Connect to the data processing tool with your login and password: https://dln-p1.atelier.datalab-normandie.fr
- Go to the project and then the Jobs section
- Click on the New job button
- Define a job name (for example `titanic_s3`)
- Select `Extraction / Python`
- Select the desired Python version (default is `3.9`)
- Select the archive created in the previous step as Package
- Fill in the following command line: `python {file}`
- Validate
Launch the job¶
- Go to the job `titanic_s3` created in the previous step from the job list
- Click on the Run button: this creates an instance of the job for the currently selected version and configuration
- You can refresh the page via the symbol at the top right
- The status of the instance goes through the following states: `Requested`, `Queued`, `Running`, `Failed`, `Killing`, `Killed`, `Succeeded` and `Unknown`. The goal is to reach a `Succeeded` status
- Go to the Job instances section: you will find the information (status, dates, job version, logs) about the job instances
- The logs can be downloaded with the Download button and can also be consulted directly
Usecase 1b: selective extraction with S3 SELECT¶
- Objective: retrieve the number of surviving Titanic passengers who embarked at Cherbourg
- Technique: S3 SELECT
- Tools used: the boto3 Python library
Steps:
- Repeat the first steps of usecase 1
- Go to the `s3_select` folder
- Create the job archive
- Create a Python job
Go to the `s3_select` folder¶
```bash
# from the usecases/titanic folder on your computer
cd s3_select
```
Create the job archive¶
```bash
zip archive.zip requirements.txt __main__.py
```
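For reference, here is a minimal sketch of the kind of S3 SELECT request the job can perform with boto3. The bucket name, object key, column names and SQL expression are assumptions based on the steps above and on the Kaggle dataset (`Embarked = 'C'` stands for Cherbourg); the repository's script may differ.

```python
# Hedged sketch: count surviving passengers who embarked at Cherbourg using an
# S3 SELECT query, reusing the environment variables defined earlier.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["DATALAKE_URL"],
    aws_access_key_id=os.environ["DATALAKE_ACCESS_KEY"],
    aws_secret_access_key=os.environ["DATALAKE_SECRET_KEY"],
)

response = s3.select_object_content(
    Bucket="group-xxxx",                # assumption: your group bucket
    Key="titanic/data/train.csv",       # assumption: path of the uploaded file
    ExpressionType="SQL",
    Expression=(
        "SELECT COUNT(*) FROM s3object s "
        "WHERE s.\"Survived\" = '1' AND s.\"Embarked\" = 'C'"
    ),
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result comes back as a stream of events; print the Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```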
Create a Python job for usecase 1b¶
- Log in to the data processing tool with your login and password: https://dln-p1.atelier.datalab-normandie.fr
- Go to the project and then the Jobs section
- Click on the New job button
- Define a job name (for example `titanic_s3_select`)
- Select `Extraction / Python`
- Select the desired Python version (default is `3.9`)
- Select the archive created in the previous step as Package
- Fill in the following command line: `python {file}`
- Validate
Launch the `titanic_s3_select` job¶
- Go to the job `titanic_s3_select` created in the previous step from the job list
- Click on the Run button: this creates an instance of the job for the currently selected version and configuration
- You can refresh the page via the symbol at the top right
- The status of the instance goes through the following states: `Requested`, `Queued`, `Running`, `Failed`, `Killing`, `Killed`, `Succeeded` and `Unknown`. The goal is to reach a `Succeeded` status
- Go to the Job instances section: you will find the information (status, dates, job version, logs) about the job instances
- The logs can be downloaded with the Download button and can also be consulted directly