AWS Community Day Melbourne — Recap (Part 2)
Post Lunch Presentations
There were many interesting post-lunch sessions that I wanted to attend, but they overlapped, so I chose the ones most relevant to my interests. See Part 1.
Disclaimer: There were many presentations throughout the day and not enough coffee, so I may have interpreted some information incorrectly. Where possible, I sourced extra information online.
Facial Recognition with AWS Rekognition & Lambda
The problem: Difficulty of diagnosing sudden unconsciousness
Aish Kodali told us an interesting story about how she discovered an important issue in medical diagnosis through her friend’s multiple episodes of sudden unconsciousness. Her friend would randomly fall unconscious, and to diagnose the condition the doctor required specific tests to be done during an episode or immediately after.
Aiding the diagnosis with technology
Aish realised that technology, specifically deep learning in the form of facial recognition, could assist the doctor in performing those tests. She needed a solution that was highly scalable, within budget, and easy to integrate.
Amazon Rekognition was the solution to her problem. She showcased an app she had built that lets you create a user profile, add medical details, and attach an image, which is then analysed to capture the emotions on the person’s face.
Several AWS services were involved. S3 stored the images, which Rekognition accessed for facial recognition and analysis. Lambda drove the functionality and populated an index in DynamoDB, which holds a collection of user information. Aish even ran a tutorial on how Rekognition works and how to use Lambda with S3 triggers.
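As a rough sketch of the S3-trigger pattern described above (not Aish's actual code), a Lambda handler in Python might look something like the following. The table name `patient-profiles`, the item attributes and the choice to store only the strongest emotion are assumptions made for illustration; `detect_faces` and `put_item` are standard boto3 calls.

```python
import urllib.parse


def top_emotion(face_detail):
    """Return the highest-confidence emotion label for one detected face."""
    emotions = face_detail.get("Emotions", [])
    if not emotions:
        return "UNKNOWN"
    return max(emotions, key=lambda e: e["Confidence"])["Type"]


def process_record(record, rekognition, table):
    """Analyse one uploaded image and index the result in DynamoDB."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    response = rekognition.detect_faces(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        Attributes=["ALL"],  # "ALL" is needed to get emotions back
    )
    faces = response.get("FaceDetails", [])
    item = {
        "image_key": key,  # hypothetical partition key
        "face_count": len(faces),
        "emotion": top_emotion(faces[0]) if faces else "NONE",
    }
    table.put_item(Item=item)
    return item


def lambda_handler(event, context):
    """Entry point invoked by the S3 ObjectCreated trigger."""
    import boto3  # imported lazily so the helpers can be exercised offline

    rekognition = boto3.client("rekognition")
    # Hypothetical table name, not taken from the talk.
    table = boto3.resource("dynamodb").Table("patient-profiles")
    return [process_record(r, rekognition, table) for r in event["Records"]]
```

Passing the Rekognition client and the table into `process_record` keeps the logic testable with simple stand-in objects, without touching AWS.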
GPU Accelerated Analysis and Machine Learning
Evolution of Data Analysis and Machine Learning
Daniel Bradby showed us the evolution of exploratory data analysis and machine learning through different data processing methodologies.
HDFS allowed us to deal with big data, but only by writing to and reading from disk. Spark was the successor, allowing in-memory processing on CPUs (15–100x improvements). From Spark, moving to GPUs allowed further improvements (5–10x).
Data format conversion — The problem
The issue with using GPUs is that the internal format of the data has to be converted when moving between CPU and GPU, which slows the process. The solution is Apache Arrow, a standardised columnar in-memory format that allows efficient analytic operations, zero-copy reads, and use across systems including Spark, pandas, Impala and HBase without altering the data.
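To make the columnar, zero-copy idea a little more concrete, here is a minimal standard-library sketch. It does not use Apache Arrow itself: `array` and `memoryview` merely stand in for Arrow's typed buffers, and the column names are made up.

```python
from array import array

# Row-oriented layout: each record is a separate object.
rows = [
    {"trip_id": 1, "distance_km": 2.5},
    {"trip_id": 2, "distance_km": 7.0},
    {"trip_id": 3, "distance_km": 4.5},
]

# Column-oriented layout: one contiguous typed buffer per column.
columns = {
    "trip_id": array("q", (r["trip_id"] for r in rows)),
    "distance_km": array("d", (r["distance_km"] for r in rows)),
}

# An analytic operation touches only the column it needs.
total_distance = sum(columns["distance_km"])

# A memoryview exposes the same buffer to another consumer without copying.
view = memoryview(columns["distance_km"])
last_two = view[1:]  # slicing a memoryview does not copy either

# A write through the original buffer is visible through the view,
# which shows that both names refer to the same memory.
columns["distance_km"][2] = 5.5
shared = last_two[1] == 5.5
```

The point of the analogy is that when every consumer agrees on one buffer layout, data can be handed over by reference rather than being converted at each boundary.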
Apache Arrow — The Solution
RAPIDS is an open-source project with its own ecosystem of libraries that support deep learning and end-to-end analytics on GPUs. Building on Apache Arrow, it lets you do analytics without changing the internal format of the data. This allowed further processing improvements (50–100x) over Spark.
RAPIDS ecosystem libraries consist of:
- cuDF — Manipulation of data frames using GPU and Apache Arrow
- cuIO — Supports data ingestion, GPU accelerated parsing and decompression
- BlazingSQL — SQL engine on top of GPU Data that performs ETL 20x faster than Spark at price parity
- cuML — Supports GPU accelerated ML (classification, regression, clustering, time series forecasting etc.)
There are various GPU instances on AWS, most notably P2 and P3. According to Daniel, P3 instances can make use of RAPIDS but P2 instances cannot. I recall talking to a representative from NVIDIA at re:Invent who indicated that any recent NVIDIA GPU can run RAPIDS, but the NVIDIA devblogs specifically compare RAPIDS on V16 Tesla V100 GPU (essentially a P3 instance) with AWS r4.2xlarge.
If you want to get started with RAPIDS, you can use NVIDIA docker and the docker image from RAPIDS — courtesy of Daniel and RAPIDS.
Live coding a self-driving car (without a car)
Unity, the game engine
Paris Buttfield-Addison and Mars Geldard entertained the audience with their presentation on how a virtual self-driving car is built using Unity and TensorFlow.
Unity is a game engine which has been used by many game developers to create some of the most popular games you see today. The engine heavily uses physics and geometry to create motions. Paris and Mars demoed the use of Unity to create a simple bouncing ball simulation.
Brain, Academy, and Agent
Unity Machine Learning Agents (beta) is an open-source Unity plug-in that helps you train a model by providing simulation environments. The agents can be trained using various methods such as Reinforcement Learning. Before going into details, let’s discuss the concepts of Agent, Academy and Brain.
The Brain encapsulates the decision-making logic and receives rewards, whilst the Academy orchestrates the decision process. The Agent, which is attached to a Unity game object, generates observations, assigns rewards and is linked to a Brain. Consequently, the Agent is controlled by the Brain.
Self-driving with Reinforcement Learning
In Reinforcement Learning within Unity, the Agents learn the optimal actions based on rewards. An action could be a number; an observation could be a vector, a picture or an in-game camera view; and a reward would be a high positive number for a good action and a low negative number for a bad one. This is the same concept as creating the reward function for DeepRacer.
The virtual self-driving car uses reinforcement learning to learn how to drive. Whenever the car drove as expected, it gained a reward, but if it collided with something it lost reward. The Machine Learning Agent is trained using a pre-defined EC2 AMI which is built on the Deep Learning AMI. The instructions were tested on a p2 instance; you can find more information here.
By letting the car drive itself and learn from its mistakes, it gets better over time, and the car that gains the most reward is likely the best self-driving car of the lot.
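The reward idea can be sketched in a few lines of Python. This is an illustration only, not the Unity ML-Agents API or the presenters' code; the reward values, the `StepResult` fields and the car names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """What happened during one simulation step (hypothetical fields)."""
    on_track: bool
    collided: bool


def reward(step):
    """Positive reward for driving as expected, negative for a collision."""
    if step.collided:
        return -10.0  # assumed penalty
    return 1.0 if step.on_track else 0.0


def episode_reward(steps):
    """Cumulative reward collected over one episode."""
    return sum(reward(s) for s in steps)


def best_car(episodes):
    """Name of the car whose episode collected the most reward."""
    return max(episodes, key=lambda name: episode_reward(episodes[name]))


# Two made-up episodes: car_a crashes on its sixth step, car_b keeps driving.
episodes = {
    "car_a": [StepResult(True, False)] * 5 + [StepResult(False, True)],
    "car_b": [StepResult(True, False)] * 8,
}
winner = best_car(episodes)
```

In a real training run the learning algorithm, rather than a hand-written comparison, uses these rewards to adjust the policy; the sketch only shows how rewards are assigned and compared.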
Large numbers and non-trivial ETL pipelines
Non-trivial ETL Pipelines
Jess Flanagan presented an informative talk on delivering data through Extract, Transform and Load (ETL). She described a non-trivial pipeline as one that sticks something into S3 and does something with it in batch. The problem is that, without proper monitoring, a large number of pipelines can be hard to manage and their failures hard to debug. Hence, she based her talk around how she and her team built non-trivial ETL pipelines that work in large numbers and adhere to their architectural goals.
What are the architectural goals
- Pipelines that are easy to debug: when they fail, we need to see where the failure occurred
- A whole-system view so we can see which pipelines are running, and which have succeeded or failed
- Monitoring so we can see if our pipelines are degrading or producing fewer results over time
- A way of making sure that, when a pipeline fails, someone is tasked with fixing the problem
Her data pipeline is quite fascinating. Docker containers on AWS Batch (for batch jobs) and Lambda (for streams or small jobs) were used to extract the data from various sources into an S3 bucket (name-Extract).
Another Docker container on AWS Batch, or a Lambda function, was then used to transform and preprocess the data, producing CSV or JSON, and drop it into another S3 bucket (name-Structured). Before the load process, a Glue Crawler runs to create the data catalog, which Athena can use for ad-hoc queries.
The load step consisted of converting the transformed data into Parquet format via Glue jobs and dropping it into another S3 bucket (name-Curated). This is likely to improve efficiency and reduce the cost of running Athena ad-hoc queries.
Jess also explained the need for different S3 buckets for E, T and L to make it easier to set different storage and backup options.
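One way to picture the three stages is as a mapping from stage to bucket plus a small transform step. The sketch below is illustrative only: the prefix `acme`, the record fields and the helper names are invented, and the Parquet conversion that the talk attributed to Glue jobs is not shown.

```python
import csv
import io
import json

STAGES = ("extract", "structured", "curated")


def bucket_for(prefix, stage):
    """Bucket name for a pipeline stage, following a name-stage pattern."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return f"{prefix}-{stage}"


def transform(raw_json_lines, fields):
    """Turn raw JSON lines from the extract stage into CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=fields,
        extrasaction="ignore",  # drop fields the structured layer does not keep
        lineterminator="\n",
    )
    writer.writeheader()
    for line in raw_json_lines.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()


# Made-up raw records as they might land in the extract bucket.
raw = '{"id": 1, "price": 9.5, "debug": "x"}\n{"id": 2, "price": 3.0}\n'
structured_csv = transform(raw, fields=["id", "price"])
```

Keeping each stage in its own bucket means the raw input and the transformed output both remain available for inspection, which fits the separate storage and backup options mentioned above.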
It was very interesting to see the frequent use of Athena on the transform and load buckets. I believe this aligns with what Jess said about data accessibility: show your work, and allow the Data Scientists to access the data before and after transformation.
The options for orchestrating the ETL jobs were carefully evaluated on whether they were data-specific or flexible. The flexible orchestrator, Step Functions, was chosen and scheduled to run via CloudWatch Events. I have attached an example above that Jess had kindly shown.
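For readers who have not seen one, a Step Functions state machine is defined in Amazon States Language, which is JSON. The sketch below builds a hypothetical definition in Python for an extract, transform and load sequence with a shared failure handler. It is not Jess's state machine: the job names, queue, account ID, region and Lambda function are placeholders, and the Glue Crawler step is left out for brevity.

```python
import json

# Placeholder identifiers used only for illustration.
ACCOUNT = "123456789012"
REGION = "ap-southeast-2"
NOTIFY_FN = f"arn:aws:lambda:{REGION}:{ACCOUNT}:function:notify-failure"

BATCH_SYNC = "arn:aws:states:::batch:submitJob.sync"
GLUE_SYNC = "arn:aws:states:::glue:startJobRun.sync"


def task(resource, parameters, next_state=None):
    """Build a Task state that routes any error to the failure handler."""
    state = {
        "Type": "Task",
        "Resource": resource,
        "Parameters": parameters,
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def batch_job(name):
    """Parameters for submitting a (hypothetical) AWS Batch job."""
    return {"JobName": name, "JobQueue": "etl-queue", "JobDefinition": f"{name}-job"}


definition = {
    "Comment": "Extract -> Transform -> Load, sketched from the talk",
    "StartAt": "Extract",
    "States": {
        "Extract": task(BATCH_SYNC, batch_job("extract"), next_state="Transform"),
        "Transform": task(BATCH_SYNC, batch_job("transform"), next_state="Load"),
        "Load": task(GLUE_SYNC, {"JobName": "to-parquet"}),
        "NotifyFailure": {"Type": "Task", "Resource": NOTIFY_FN, "Next": "Failed"},
        "Failed": {"Type": "Fail", "Error": "PipelineFailed"},
    },
}

definition_json = json.dumps(definition, indent=2)
```

Routing every state's errors to one handler is one way to meet the goal of always seeing where a failure occurred and making sure someone is notified.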
Due to the importance of monitoring, the logs from each pipeline were first sent to CloudWatch, but this got messy and hard to track, especially when the logs were expanded. Jess used JSON logging because her monitoring platform, Datadog, reads JSON logs easily and has custom checks to see whether a pipeline has failed or completed.
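JSON logging itself is simple to set up. Here is a minimal sketch using only Python's standard `logging` module; the field names `pipeline` and `status` are assumptions rather than the schema Jess's team used, and an in-memory stream stands in for wherever the logs are actually shipped.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "pipeline": getattr(record, "pipeline", None),
            "status": getattr(record, "status", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for the real log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("etl")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info("load finished", extra={"pipeline": "sales", "status": "completed"})
```

Because every line is a self-describing object, a monitoring tool can filter on fields such as `status` instead of pattern-matching free text.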
A Lambda function formatted the metrics and published them to SNS, which delivers them to its subscribers. Datadog used the formatted metrics for monitoring and alerting, with alerts raised via Slack or a Jira ticket and with reduced noise.
Unfortunately, the host had to interrupt her presentation as she was running over time, so we did not get to see the rest of the slides, but it was quite interesting indeed.
Datalake — A step towards being data driven
The major issues with traditional data warehouse platforms have always been reliability, fitness for purpose as the single source of truth, lack of documentation and data cataloguing, and data accessibility.
Modern data platform inceptions
Leila Sagharichi and Amey Khedekar gave a very interesting talk on how they leverage AWS to solve some of the typical data warehouse issues, and on the inceptions behind their modern data platform, which include:
- Broader access to insights and analytics
- The ability to serve personalised products
- Improved performance to produce insights quickly
- The ability to run A/B tests and ML experiments
- A global platform, and improvements in governance and security
It was interesting to see all these inceptions (I had a big smile on my face), as I have seen them with many clients and have helped enable them through interactive support in improving their data platforms. Carsales, however, have taken the appropriate approach of:
- Making data a core competency
- Educating people on the use of data and providing ongoing encouragement
- Allowing data accessibility
- Establishing data governance by providing access to people based on their needs and not to all
- Establishing a CI/CD pipeline within the data platform and data life-cycle
Since January last year, they have been on a huge data platform re-engineering journey. Leila and Amey guided us through some of their milestones, which were possible because they introduced continuous improvement into the Data and Analytics culture.
Carsales made good use of various AWS services and open-source technologies to “do the undifferentiated heavy lifting”.
In their initial version, Glue was used to perform ETL jobs, with Athena and the Glue Data Catalog as the serving layer. I forget whether they mentioned that data was copied from the data lake into Redshift, but from the architectural diagram I can see the use of Spectrum. I see the benefit of using Spectrum here to save on Redshift storage costs. Presumably, if you have a large amount of infrequently accessed data, such as historical data going back many years, you could lifecycle it out of Redshift into S3 and then use Spectrum for ad-hoc queries over that data.
The second version of their platform uses EMR as opposed to Glue jobs. I believe the intention was not only to validate data, as Leila and Amey announced, but also to gain more flexibility at Carsales with executor memory, libraries, temporary storage, logging, etc.
Improvements in their second version included real-time streaming jobs that use Kinesis Data Firehose, with data coming from streaming sources such as application and browser activity. Apache Airflow was also used to orchestrate their ETL jobs.
Ending the day with a speedy trophy
To wrap things up, AWS trophies were given to the three individuals with the best DeepRacer times.
Happy Ending: AWS Community Day 2019
Overall, I quite enjoyed the AWS Community Day and I hope that those who went enjoyed it as well. The presentations were a mixture of technical and non-technical talks, so it suited different types of audiences.
I always try to keep up to date with how we are evolving with technology, the problems we face, and how we solve them. Consequently, it was fascinating to see all the different talks by our local AWS community members and some of the problems and solutions they had through using AWS.
Looking forward to the next AWS Community Day!
As a Consultant at Servian, I have been helping businesses build scalable cloud and data solutions in the area of Data Warehousing, AI/ML, Personalisation, CI/CD, and Micro-services. During my spare time, I try to beat my squat record.
I love data and cloud so feel free to reach out to me on LinkedIn for a casual chat or to find out more on how we Servianites can help you with your cloud and data solutions.
AWS Community Day Melbourne — Recap (Part 2) was originally published in WeAreServian on Medium, where people are continuing the conversation by highlighting and responding to this story.