
Usecase 1: Titanic

Usecase 1a: k-nearest neighbors (k-NN) model in Python

Implementation

Processing the dataset involves several steps (a rough sketch of the k-NN processing itself follows the list):

  1. Get the titanic usecase from the Github repository https://github.com/criann/datalab-normandie-demos-saagie.git
  2. Upload the dataset in the group's S3 bucket
  3. Create a Service Account
  4. Define environment variables at the project level
  5. Create an archive containing the processing script and a requirements.txt file for the necessary python modules
  6. Create a Python job (min. 3.6) and define the package (previously created archive) and the execution command line.
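
The actual processing is implemented in titanic_pandas.py from the repository and is not reproduced here. As a rough, hedged illustration of what a k-NN model on the Titanic dataset can look like, the sketch below trains a scikit-learn KNeighborsClassifier on a local copy of the data; the file path, column names and preprocessing are assumptions, not the usecase's actual code.

# Illustrative sketch only: a simple k-NN model on the Titanic data.
# The file path and column names are assumptions (standard Titanic CSV layout).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("data/train.csv")  # assumed location inside the data folder
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})  # encode sex numerically
features = ["Pclass", "Sex", "Age", "Fare"]
df = df.dropna(subset=features + ["Survived"])  # drop rows with missing values
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.2, random_state=42
)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))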

Get a copy of the Github repository

git clone https://github.com/criann/datalab-normandie-demos-saagie.git
cd datalab-normandie-demos-saagie/usecases/titanic

Upload the dataset in the group's S3 bucket

  1. Connect to the S3 datalake web console with your login and password: https://s3-console.atelier.datalab-normandie.fr
  2. Go to the Buckets section
  3. Click on the Browse button of your group bucket group-xxxx
  4. Click on the New path button and fill in titanic.
  5. Once in the titanic path, click on Upload Files then Upload folder.
  6. Select the data folder of your copy of the Github repository and validate.
  7. The data folder is now imported in the titanic folder of your group bucket.
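
The web console is the simplest way to do this upload; as an optional alternative, the same transfer can be scripted. The sketch below is a minimal, hedged example using the Python library boto3 with the datalake endpoint described in the sections that follow; the bucket name and credential placeholders must be replaced with your own values.

# Optional sketch: upload the data folder to the group bucket with boto3.
# The bucket name and credentials below are placeholders, not real values.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.atelier.datalab-normandie.fr",
    aws_access_key_id="<ACCESS_KEY>",
    aws_secret_access_key="<SECRET_KEY>",
)
bucket = "group-xxxx"  # replace with your group bucket name
for root, _, files in os.walk("data"):
    for name in files:
        local_path = os.path.join(root, name)
        key = "titanic/" + local_path.replace(os.sep, "/")  # e.g. titanic/data/...
        s3.upload_file(local_path, bucket, key)
        print("uploaded", key)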

Create a Service Account

  1. Create a Service Account for the project
    • Go to Identity / Service Account
    • Click Create Service Account button
    • Validate the form with the default values
    • Click on Download to save a file containing the generated Access Key and Secret Key tokens; you will need them in the next step

Define environment variables at the project level

  1. Connect to the data processing tool with your login and password: https://dln-p1.atelier.datalab-normandie.fr
  2. Go to the project and then to the Environment variables section
  3. Define variables to keep the values for the Access Key and Secret Key from the previous step
    • Click New variable button
    • Define the values of the fields
    • Click Save button
| Variable name       | Description                               | Is password | Value                                          |
|---------------------|-------------------------------------------|-------------|------------------------------------------------|
| DATALAKE_ACCESS_KEY | Access key for S3 datalake access         |             | Value of the Access Key from the previous step |
| DATALAKE_SECRET_KEY | Secret key for access to the S3 datalake  |             | Value of the Secret Key from the previous step |

The following variables are defined at the global level; projects and their jobs inherit them automatically.

| Variable name   | Description                            | Is password | Value                                    |
|-----------------|----------------------------------------|-------------|------------------------------------------|
| DATALAKE_HOST   | Hostname of the S3 datalake            |             | s3.atelier.datalab-normandie.fr          |
| DATALAKE_SCHEME | HTTP scheme of the complete URL        |             | https                                    |
| DATALAKE_URL    | Complete API access URL (https://...)  |             | https://s3.atelier.datalab-normandie.fr  |
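
These variables are exposed to job scripts as ordinary environment variables. As an illustration of how a script could use them, the hedged sketch below builds a boto3 S3 client from them and lists the uploaded objects; the actual titanic_pandas.py script may access the datalake differently, and the bucket name is a placeholder.

# Sketch: build an S3 client from the environment variables defined above.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["DATALAKE_URL"],
    aws_access_key_id=os.environ["DATALAKE_ACCESS_KEY"],
    aws_secret_access_key=os.environ["DATALAKE_SECRET_KEY"],
)
bucket = "group-xxxx"  # replace with your group bucket name
response = s3.list_objects_v2(Bucket=bucket, Prefix="titanic/")
for obj in response.get("Contents", []):
    print(obj["Key"])  # e.g. the files uploaded under titanic/data/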

Create an archive containing the processing script

The archive that will be used by the job should contain 2 elements:

  • a requirements.txt file, which is used automatically when the job is launched to install the necessary Python modules
  • a __main__.py file, which is the script that performs the processing

# from the usecases/titanic folder
cd with_datalake_s3
cp titanic_pandas.py __main__.py
zip archive.zip requirements.txt __main__.py

!!! warning "Be careful when creating the .zip file"
    When you create the archive, make sure that both files are at the root of the archive. When you unzip it, you must get the files directly, not a sub-folder containing them.
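
If in doubt, the archive layout can also be checked programmatically; the small sketch below (an illustration, not part of the usecase) lists the archive contents and fails if the two expected files are not at the root.

# Sketch: verify that requirements.txt and __main__.py sit at the archive root.
import zipfile

with zipfile.ZipFile("archive.zip") as archive:
    names = archive.namelist()
    print(names)
    assert "requirements.txt" in names and "__main__.py" in names, (
        "both files must be at the root of the archive, not inside a sub-folder"
    )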

Create a Python job

  1. Connect to the data processing tool with your login and password: https://dln-p1.atelier.datalab-normandie.fr
  2. Go to the project and then the Jobs section
  3. Click on the New job button
  4. Define a job name (for example titanic_s3)
  5. Select Extraction / Python.
  6. Select the desired python version (default is 3.9)
  7. Select the archive created in the previous step as Package.
  8. Fill in the following Command line:

    python {file}

  9. Validate

Launch the job

  1. Go to the titanic_s3 job created in the previous step from the job list
  2. Click on the Run button to run the job with its current configuration (version): this creates an instance of the job for the last selected version
  3. You can refresh the page via the refresh symbol at the top right
    • The status of the instance goes through the following states: Requested, Queued, Running, Failed, Killing, Killed, Succeeded and Unknown
    • The goal is to reach the Succeeded status
  4. Go to the Job instances section: there you will find information (status, dates, job version, logs) about the job instances
    • The logs can be consulted directly or downloaded with the Download button

Usecase 1b: selective extraction with S3 SELECT

  • Objective: retrieve the number of surviving passengers of the Titanic who embarked at Cherbourg
  • Technique: S3 SELECT
  • Tools used: the Python library boto3
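
The reference implementation is the script in the s3_select folder; as a hedged outline of the technique, an S3 SELECT query with boto3 looks roughly like the sketch below. The bucket name, object key and column values are assumptions based on the dataset uploaded earlier, not the usecase's actual code.

# Sketch of an S3 SELECT query with boto3: count surviving passengers
# who embarked at Cherbourg without downloading the whole CSV file.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["DATALAKE_URL"],
    aws_access_key_id=os.environ["DATALAKE_ACCESS_KEY"],
    aws_secret_access_key=os.environ["DATALAKE_SECRET_KEY"],
)
response = s3.select_object_content(
    Bucket="group-xxxx",                  # replace with your group bucket
    Key="titanic/data/train.csv",         # assumed key of the uploaded CSV
    ExpressionType="SQL",
    Expression=(
        "SELECT COUNT(*) FROM S3Object s "
        "WHERE s.Survived = '1' AND s.Embarked = 'C'"
    ),
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())  # the count, as CSV text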

Steps:

  • Repeat the first steps of usecase 1a
  • Go to the s3_select folder
  • Create the job archive
  • Create a python job

Go to the s3_select folder

# from the usecases/titanic folder on your computer
cd s3_select

Create the job archive

zip archive.zip requirements.txt __main__.py

Create a Python job for the usecase 1b

  1. Log in to the data processing tool with your login and password: https://dln-p1.atelier.datalab-normandie.fr
  2. Go to the project and then the Jobs section
  3. Click on the New job button
  4. Define a job name (for example titanic_s3_select)
  5. Select Extraction / Python.
  6. Select the desired python version (default is 3.9)
  7. Select the archive created in the previous step as Package.
  8. Fill in the following Command line:

    python {file}

  9. Validate

Launch the titanic_s3_select job

  1. Go to the titanic_s3_select job created in the previous step from the job list
  2. Click on the Run button to run the job with its current configuration (version): this creates an instance of the job for the last selected version
  3. You can refresh the page via the refresh symbol at the top right
    • The status of the instance goes through the following states: Requested, Queued, Running, Failed, Killing, Killed, Succeeded and Unknown
    • The goal is to reach the Succeeded status
  4. Go to the Job instances section: there you will find information (status, dates, job version, logs) about the job instances
    • The logs can be consulted directly or downloaded with the Download button
