How to Copy/Move S3 Files With Apache Airflow

With Apache Airflow's Amazon provider package, S3 operations are a cakewalk.

  1. Create an AWS account and make sure the IAM user or role you will use has the S3 read/write permissions it needs before running the code below

  2. Set up a working Apache Airflow instance, either locally or on your preferred cloud provider

  3. Create an Airflow connection of type Amazon Web Services with your AWS access key ID, secret access key, and a role_arn in the extras (the operator looks up the aws_default connection id unless told otherwise)

  4. The connection extras will look something like this; replace <your-role-arn> with the ARN of the role you created. One way to register the connection programmatically is sketched after this list.

    {"region_name": "us-west-2", "role_arn": "<your-role-arn>", "assume_role_method": "assume_role"}

  5. Add the DAG shown after these steps to your dags folder

  6. Run the code
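
One way to register the connection from steps 3 and 4 without clicking through the UI is to run a small script against Airflow's metadata database. This is a minimal sketch, assuming the operator's default connection id aws_default and placeholder credentials; creating the connection through the Airflow UI or CLI works just as well. The DAG itself follows after this snippet.

from airflow import settings
from airflow.models import Connection

# Registers the AWS connection used by the DAG below.
# Replace the placeholder credentials and <your-role-arn> with real values.
conn = Connection(
    conn_id="aws_default",  # default connection id used by S3FileTransformOperator
    conn_type="aws",
    login="<AWS_ACCESS_KEY_ID>",
    password="<AWS_SECRET_ACCESS_KEY>",
    extra='{"region_name": "us-west-2", "role_arn": "<your-role-arn>", '
          '"assume_role_method": "assume_role"}',
)

session = settings.Session()
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()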

import os
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3FileTransformOperator

# Works around a SIGSEGV error seen when running Airflow locally on some setups
os.environ['no_proxy'] = '*'

DAG_ID = "s3_file_transform"


with DAG(
    dag_id=DAG_ID,
    schedule=None,
    start_date=datetime(2022, 11, 10),
    tags=["example"],
    catchup=False,
) as dag:

    move_files = S3FileTransformOperator(
        task_id="move_files",
        # Reads and writes S3 through the aws_default connection unless
        # source_aws_conn_id / dest_aws_conn_id are set explicitly.
        source_s3_key="s3://v-glue-example-bucket/example.txt",
        dest_s3_key="s3://v-glue-example-bucket/processed/{{ ds }}/example.txt",
        # /bin/cp copies the downloaded file to the destination path unchanged;
        # use /bin/mv or a custom script instead (see the notes below).
        transform_script="/bin/cp",
        # Overwrite the destination object if the task is re-run for the same date
        replace=True,
    )
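
To execute this DAG once without a running scheduler, one option is to append a small test hook to the bottom of the DAG file and run the file directly with Python. This assumes Airflow 2.5 or newer (where DAG.test() is available) and that the file is saved as s3_file_transform.py; the file name is just an assumption for the example.

# Optional quick test: `python s3_file_transform.py` runs the DAG once
# outside the scheduler. Requires Airflow 2.5+ for DAG.test().
if __name__ == "__main__":
    dag.test()

Otherwise, trigger the DAG from the Airflow UI or with the airflow dags trigger CLI command.
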
  1. You can change the transform_script from /bin/cp to /bin/mv to move files instead of copying them.

  2. Note that dest_s3_key contains {{ ds }}, the run's logical date, so a new object (for example s3://v-glue-example-bucket/processed/2022-11-10/example.txt) is created every time the DAG runs.

  3. You can also point transform_script at your own executable script to transform the file contents between download and upload; a sketch follows below.
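
As a sketch of item 3: transform_script is the path to an executable that the operator calls with the local source file and local destination file as its first two arguments (followed by any script_args); whatever the script writes to the destination path is uploaded to dest_s3_key. The script below is a hypothetical example (uppercase_transform.py is not part of the original setup) that upper-cases every line; make it executable with chmod +x and point transform_script at its absolute path.

#!/usr/bin/env python3
"""Hypothetical transform script for S3FileTransformOperator.

The operator invokes it as:
    <transform_script> <local_source_path> <local_dest_path> [script_args...]
"""
import sys

source_path, dest_path = sys.argv[1], sys.argv[2]

with open(source_path) as src, open(dest_path, "w") as dst:
    for line in src:
        dst.write(line.upper())

In the DAG above, you would then set transform_script to the path of this script instead of "/bin/cp".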

