Export MLflow experiments from restricted HPC systems

High-performance computing (HPC) environments, especially in research and educational institutions, often restrict outbound TCP connections. A simple ping or curl against the MLflow tracking URL from the HPC bash shell may succeed, but when a job runs on a compute node, the connection fails or times out.
This makes it impossible to track and manage experiments on a remote MLflow server. I ran into this problem and put together a workaround that avoids direct communication. We will focus on:
- Setting up a local MLflow server on the HPC, bound to a free port and backed by local directory storage.
- Pointing machine learning experiments at the local tracking URI while they run.
- Exporting the experiment data to a local temporary folder.
- Transferring the experiment data from that temporary folder on the HPC to the remote MLflow server.
- Importing the experiment data into the remote MLflow server's database.
I have deployed Charmed MLflow (MLflow server, MySQL, MinIO) using Juju, with everything hosted on a local MicroK8s instance. You can find Canonical’s installation guide here.
Prerequisites
Make sure Python is loaded on your HPC (for example via your module system) and installed on your MLflow server. Throughout this article I assume Python 3.12; adjust the commands accordingly.
On the HPC:
1) Create a virtual environment
python3 -m venv mlflow
source mlflow/bin/activate
2) Install MLflow
pip install mlflow
On both the HPC and the MLflow server:
1) Install mlflow-export-import
pip install git+
On the HPC:
1) Decide which port the local MLflow server should run on. You can check that the port is free (the command below should not return any process):
lsof -i :
2) Set the tracking URI environment variable for any application that should log to MLflow:
export MLFLOW_TRACKING_URI=
3) Start the MLflow server with the following command:
mlflow server \
  --backend-store-uri file:/path/to/local/storage/mlruns \
  --default-artifact-root file:/path/to/local/storage/mlruns \
  --host 0.0.0.0 \
  --port 5000
Here we point the backend store and artifact root at a local folder called mlruns. Metadata such as experiments, runs, parameters, metrics, and tags, as well as artifacts (model files, loss curves, and other images), will be stored in the mlruns directory. The host can be 0.0.0.0 or 127.0.0.1 (more secure); since the whole process is short, I used 0.0.0.0. Finally, pick a port that no other application is using.
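With the server running and MLFLOW_TRACKING_URI pointing at it, your training code logs runs exactly as it would to a remote server. A minimal sketch, assuming the server listens on port 5000 on the same node; the experiment name, parameter, and metric below are only illustrative:

import mlflow

# Point at the local server started above; this line can be omitted if
# MLFLOW_TRACKING_URI is already exported in the job environment.
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("hpc-demo")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # illustrative parameter
    mlflow.log_metric("loss", 0.42, step=1)  # illustrative metric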
(Optional) Sometimes your HPC may not find libpython3.12, the shared library Python needs in order to run. You can follow the steps below to locate it and add it to your library path.
Search for libpython3.12:
find /hpc/packages -name "libpython3.12*.so*" 2>/dev/null
Returns something similar to: /path/to/python/3.12/lib/libpython3.12.so.1.0
Add that directory to the library path environment variable:
export LD_LIBRARY_PATH=/path/to/python/3.12/lib:$LD_LIBRARY_PATH
4) Export the experiment data from the local mlruns storage directory to a temporary folder:
python3 -m mlflow_export_import.experiment.export_experiment --experiment "" --output-dir /tmp/exported_runs
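If you are not sure which experiment name or ID to pass to --experiment, you can first list what the local server has tracked. A short sketch, assuming MLFLOW_TRACKING_URI still points at the local server and a recent MLflow (2.x) is installed:

import mlflow

client = mlflow.MlflowClient()  # reads MLFLOW_TRACKING_URI from the environment
for exp in client.search_experiments():
    print(exp.experiment_id, exp.name)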
(Optional) Running export_experiment on the HPC bash shell can cause thread-exhaustion errors such as:
OpenBLAS blas_thread_init: pthread_create failed for thread X of 64: Resource temporarily unavailable
This happens because MLflow internally uses SciPy for artifact and metadata processing, which requests threads through OpenBLAS, and this exceeds the limit allowed by the HPC configuration. In this case, limit the number of threads by setting the following environment variables:
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
If the problem persists, try lowering the thread limit to 2.
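The same limits can also be pinned from inside your own Python scripts if they hit this error, as long as the variables are set before NumPy/SciPy is first imported, since OpenBLAS reads them when the library loads. A minimal sketch:

import os

# Must run before numpy/scipy/mlflow are imported
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

import numpy as np  # imported after the limits are set, on purpose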
5) Transfer the exported runs to the MLflow server:
Move everything from the HPC to a temporary folder on the MLflow server.
rsync -avz /tmp/exported_runs @:/tmp
6) Stop the local MLflow server and free up the port:
lsof -i :
kill -9
On the MLflow server:
Our goal is to move the experiment data from the /tmp folder into MySQL and MinIO.
1) Since MinIO is Amazon S3 compatible, boto3 (the AWS Python SDK) is used to communicate with it. So we set up AWS-style access credentials and use them to talk to MinIO through boto3.
juju config mlflow-minio access-key= secret-key=
2) The following commands transfer the data.
Set the MLflow server and MinIO addresses in the environment. To avoid repeating this, you can add these lines to your .bashrc file:
export MLFLOW_TRACKING_URI="http://:port"
export MLFLOW_S3_ENDPOINT_URL="http://:port"
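Because the import pushes artifacts to MinIO through boto3, a quick sanity check of the endpoint and the credentials from step 1 can save a failed import. A small sketch, assuming you have also exported the matching AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (the standard variable names boto3 looks for):

import os
import boto3

# Talk to MinIO through its S3-compatible API using the exported endpoint
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["MLFLOW_S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])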
All the exported experiment files are in the exported_runs folder under the /tmp directory. The import_experiment command does the job:
python3 -m mlflow_export_import.experiment.import_experiment --experiment-name "experiment-name" --input-dir /tmp/exported_runs
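Once the import finishes, you can confirm that the experiment and its runs landed on the remote server. A quick check, where the name is whatever you passed to --experiment-name:

import mlflow

client = mlflow.MlflowClient()  # MLFLOW_TRACKING_URI points at the remote server here
exp = client.get_experiment_by_name("experiment-name")
print(exp.experiment_id, exp.lifecycle_stage)
print(len(client.search_runs([exp.experiment_id])), "runs imported")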
Conclusion
This workaround helped me keep track of experiments even when communication and data transfer were restricted on my HPC cluster. I spin up a local MLflow server instance, export the experiments, and then import them into my remote MLflow server, all without changing my training workflow.
However, if you are dealing with sensitive data, make sure the transfer method is secure. Cron jobs and automation scripts can remove the manual overhead. Also keep an eye on your local storage space, as it fills up quickly.
Finally, if you work in a similar environment, this article offers a quick solution that does not require any administrative privileges. Hopefully it helps teams stuck on the same problem. Thank you for reading this article!
You can contact me on LinkedIn.