What is GLUE?
- Fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
- It’s a serverless system.
- Automatically handles discovery and definition of table definitions and schemas.
- Its main use is to serve as a central metadata repository for your data lake.
- Discovers schemas from your unstructured data sitting in S3 (or elsewhere) and publishes table definitions for use with analysis tools such as Athena, Redshift, or EMR.
- The purpose of GLUE itself is to extract structure from your unstructured data.
- If you have data sitting in a data lake, it can provide a schema for it so that you can query it using SQL or SQL-like tools, including Redshift, Athena, Amazon EMR, and pretty much anything else that can use a schema like that.
- Glue runs custom ETL jobs; once it discovers a schema, it can do some processing of the data for you.
- Trigger-driven, scheduled, or on demand.
- Fully managed.
- ETL jobs use Apache Spark under the hood; no need to manage a Spark cluster.
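Every Glue ETL script starts from the same Spark scaffold. A minimal sketch of the boilerplate Glue generates for a Python job (Glue passes JOB_NAME in at run time):

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Glue passes --JOB_NAME (plus any custom arguments) to the script
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # ... read, transform, and write your data here ...

    job.commit()  # marks the run as done (and advances any job bookmarks)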
Glue Crawler & Data Catalog
- Scans your data in S3; often it will infer the schema automatically just from the data, but sometimes you have to give it hints.
- Can run periodically.
- Populate the Glue Data Catalog.
- Stores only table definitions (columns and data types).
- The original data stays in S3 (no duplication).
- Glue uses this information to let Redshift, Athena, or systems running on EMR (such as Hive) query your unstructured data in S3 as if it were structured data, just like a traditional relational database.
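A rough sketch of setting up a crawler with boto3; the crawler name, role, database, and S3 path here are all hypothetical:

    import boto3

    glue = boto3.client('glue')

    # Define a crawler that scans an S3 prefix and populates the Data Catalog
    glue.create_crawler(
        Name='sensor-data-crawler',
        Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',
        DatabaseName='sensor_db',  # catalog database to hold the table definitions
        Targets={'S3Targets': [{'Path': 's3://my-data-lake/sensors/'}]},
        Schedule='cron(0 * * * ? *)')  # optional: re-crawl hourly

    glue.start_crawler(Name='sensor-data-crawler')  # or kick it off on demand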
GLUE and S3 Partitions
- The Glue crawler will extract partitions of your data based on how your S3 data is organized.
- Think up front about how you’re going to be querying your data lake in S3.
- Example: Devices send sensor data every hour.
- Do you query primarily by time ranges?
- If so, organize your buckets as yyyy/mm/dd/device.
- Do you query primarily by device?
- If so, organize your buckets as device/yyyy/mm/dd.
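The partition columns the crawler infers from those paths can then be used to prune reads. A sketch using Glue's push_down_predicate option (database, table, and partition values are hypothetical; glueContext comes from the scaffold shown earlier):

    # Read only the January 2024 partitions instead of scanning the whole table
    events = glueContext.create_dynamic_frame.from_catalog(
        database='sensor_db',
        table_name='sensor_events',
        push_down_predicate="(year == '2024' and month == '01')")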
GLUE + HIVE
- Hive lets you run SQL-like queries from EMR.
- The Glue Data Catalog can serve as a Hive metastore.
- Conversely, you can also import a Hive metastore into Glue.
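On the EMR side, pointing Hive at the Glue Data Catalog is just cluster configuration. A sketch of the relevant classification as you might pass it to boto3's run_job_flow (all other cluster settings omitted):

    import boto3

    emr = boto3.client('emr')

    # 'hive-site' classification telling Hive to use Glue as its metastore
    glue_metastore = [{
        'Classification': 'hive-site',
        'Properties': {
            'hive.metastore.client.factory.class':
                'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
        }
    }]

    # emr.run_job_flow(Name='analytics-cluster', Configurations=glue_metastore, ...)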
GLUE ETL
- Transform data, clean data, enrich data (before doing analysis).
- Automatic code generation for transforming your data.
- Python or Scala
- Targets can be S3, JDBC (RDS, Redshift), or tables in the Glue Data Catalog.
- Fully managed and cost-effective; pay only for the resources consumed.
- Jobs run on a serverless Spark platform.
- Glue Scheduler to schedule the jobs.
- Glue triggers to automate job runs based on “events”
- Encryption
- Server side (at rest)
- SSL (in transit)
- Can be event-driven (so you can have ETL processes that kick off automatically as soon as new data is seen by Glue).
- Sometimes you need to provision additional DPUs (Data Processing Units) to increase the performance of the underlying Spark jobs.
- Errors are reported to CloudWatch.
- Could tie into SNS for notifications.
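Putting it together, a sketch of a job body: read a catalog table into a DynamicFrame, remap columns, and write Parquet back to S3 (database, table, column names, and paths are hypothetical):

    from awsglue.transforms import ApplyMapping

    # Read the source table the crawler registered
    raw = glueContext.create_dynamic_frame.from_catalog(
        database='sensor_db', table_name='sensor_events')

    # Rename/retype columns: (source, source_type, target, target_type)
    mapped = ApplyMapping.apply(frame=raw, mappings=[
        ('devid', 'string', 'device_id', 'string'),
        ('temp', 'double', 'temperature_c', 'double')])

    # Write the transformed data out as Parquet
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type='s3',
        connection_options={'path': 's3://my-data-lake/clean/'},
        format='parquet')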
GLUE ETL TRANSFORMATIONS
- Bundled Transformations:
- DropFields, DropNullFields – remove fields, or remove fields that are null.
- Filter – specify a function to filter records.
- Join – to enrich data.
- Map – add fields, delete fields, perform external lookups.
- Machine Learning Transformations:
- FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
- Format conversions – CSV, JSON, Avro, Parquet, ORC, XML.
- Apache Spark transformations (example: K-Means).
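A quick sketch of a few bundled transformations in use; events and devices are DynamicFrames assumed loaded earlier, and the field names are hypothetical:

    from awsglue.transforms import Filter, Join, DropNullFields

    # Filter: keep only records that pass a predicate function
    active = Filter.apply(frame=events, f=lambda rec: rec['status'] == 'active')

    # Join: enrich event records with device metadata, matching on device_id
    enriched = Join.apply(active, devices, 'device_id', 'device_id')

    # DropNullFields: remove fields that are null across the dataset
    cleaned = DropNullFields.apply(frame=enriched)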
GLUE Development ENDPOINTS
- Develop ETL scripts in a notebook environment, interactively iterating on and testing your code in a web browser connected to a Glue development endpoint instance.
- Then create an ETL job that runs your script (using Spark and Glue).
- Apache Zeppelin
- SageMaker notebook
- Terminal window
- PyCharm Professional Edition
- Use Elastic IPs to access a private endpoint address.
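Creating an endpoint is a single API call; a hedged boto3 sketch (endpoint name, role, and key are hypothetical):

    import boto3

    glue = boto3.client('glue')

    # Spin up a development endpoint to attach a notebook or SSH session to
    glue.create_dev_endpoint(
        EndpointName='dev-etl-sandbox',
        RoleArn='arn:aws:iam::123456789012:role/GlueDevEndpointRole',
        PublicKey='ssh-rsa AAAA...',  # your public key, for SSH/notebook access
        NumberOfNodes=2)  # DPUs allocated to the endpoint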
RUNNING GLUE JOBS
- Time-based schedules (cron style).
- Job bookmark
- Persists state from the job run
- Prevents reprocessing of old data
- Allows you to process new data only when re-running on a schedule.
- Works with S3 sources in a variety of formats.
- Works with relational databases via JDBC (if primary keys are in sequential order).
- Only handles new rows, not updated rows.
- CloudWatch Events
- Fire off a Lambda function or SNS notification when an ETL job succeeds or fails.
- Invoke an EC2 run, send the event to Kinesis, or activate a Step Functions state machine.
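A sketch of both patterns with boto3 (job and trigger names are hypothetical):

    import boto3

    glue = boto3.client('glue')

    # Run a job with bookmarks enabled so only new data gets processed
    glue.start_job_run(
        JobName='nightly-etl',
        Arguments={'--job-bookmark-option': 'job-bookmark-enable'})

    # Or attach a cron-style schedule via a trigger
    glue.create_trigger(
        Name='nightly-schedule',
        Type='SCHEDULED',
        Schedule='cron(0 2 * * ? *)',  # 02:00 UTC daily
        Actions=[{'JobName': 'nightly-etl'}],
        StartOnCreation=True)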
GLUE COST MODEL
- Billed by the minute for crawlers and ETL jobs.
- The first million objects stored and the first million accesses are free for the Glue Data Catalog.
- Development endpoints for developing ETL code are charged by the minute.