Many people don’t know that the idea of learning from data is not new to the current decade: John Tukey pointed it out over 50 years ago in “The Future of Data Analysis” (Donoho, 2017). More and more companies run a Data Warehouse, a Data Lake, or both combined (a Data Lakehouse) capable of handling massive volumes of data, such as 900 TB, with data imported constantly and SQL-like queries and operators hitting it throughout the day. Why? Because identifying and visualizing patterns can give a business a competitive advantage (Bose, 2009). I have covered several data analytics examples in previous posts on this blog:
As for me, I have been working in this field for the last few years and completed a master’s program in Data Science in 2022, yet only now have I decided to pursue a certification on the topic: the AWS Certified Data Analytics – Specialty. To prepare for the exam, I covered every product in the AWS Data Analytics category, listed below:
- Amazon Athena
- Amazon CloudSearch
- Amazon EMR
- Amazon FinSpace
- Amazon Kinesis
- Amazon Kinesis Data Firehose
- Amazon Kinesis Data Analytics
- Amazon Kinesis Data Streams
- Amazon Kinesis Video Streams
- Amazon OpenSearch Service
- Amazon Redshift
- Amazon Redshift Serverless
- Amazon QuickSight
- AWS Data Exchange
- AWS Data Pipeline
- AWS Glue
- AWS Lake Formation
- Amazon Managed Streaming for Apache Kafka (Amazon MSK)
As a matter of fact, I can’t get enough of Amazon Athena; it is one of my favorite products so far.
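Since Athena runs standard SQL directly against data in S3, a minimal sketch of submitting an ad hoc query with boto3 might look like the following. The database name, table, and result bucket are hypothetical placeholders, not real resources from this post:

```python
# Hypothetical resource names -- replace with your own.
DATABASE = "sales_db"
OUTPUT_LOCATION = "s3://my-athena-results-bucket/queries/"

def build_top_products_query(limit: int = 10) -> str:
    """Build an ad hoc aggregation query over an S3-backed table."""
    return (
        "SELECT product_id, SUM(quantity) AS total_sold "
        "FROM orders "
        "GROUP BY product_id "
        "ORDER BY total_sold DESC "
        f"LIMIT {limit}"
    )

def run_query(sql: str) -> str:
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported here so the query builder works without the SDK installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    return response["QueryExecutionId"]

if __name__ == "__main__":
    print(build_top_products_query())
```

Athena executes asynchronously: `start_query_execution` returns immediately with an id, and you poll `get_query_execution` (or read the CSV written to the output location) for results.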
Additionally, I enjoy reading books, so I acquired the AWS Certified Data Analytics Study Guide: Specialty (DAS-C01) Exam, 1st Edition, by Asif Abbasi. On top of that, I worked through some case studies based on AWS posts that I found helpful in my preparation:
- Integrating MongoDB’s Application Data Platform with Amazon Kinesis Data Firehose: https://aws.amazon.com/blogs/big-data/integrating-the-mongodb-cloud-with-amazon-kinesis-data-firehose/
- Difference between data lake and data warehouse: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/ and https://aws.amazon.com/data-warehouse/
- Apache HBase: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html
- Installing Kibana and how it differs from Grafana (I use Grafana at work): https://aws.amazon.com/what-is/elk-stack/
- Using AWS Glue to prepare data and build a dataset: https://aws.amazon.com/glue/
- Apache Spark SQL on Amazon EMR: https://aws.amazon.com/emr/features/spark/
- Amazon Redshift database encryption: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-db-encryption.html
- Build, Train, and Deploy a Machine Learning Model with Amazon SageMaker: https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/
- Data Mart: https://aws.amazon.com/what-is/data-mart/
- Perform ad hoc queries using Amazon Athena: https://www.youtube.com/watch?v=Dmw7HOOmiJQ
- Write prepared data directly into JDBC-supported destinations using AWS Glue DataBrew: https://aws.amazon.com/blogs/big-data/write-prepared-data-directly-into-jdbc-supported-destinations-using-aws-glue-databrew/
- Apache Flink: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html
- Supported formats for Amazon S3 manifest files: https://docs.aws.amazon.com/quicksight/latest/user/supported-manifest-file-format.html
- Data ingestion methods: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/data-ingestion-methods.html
- Using the Parquet format in AWS Glue: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-parquet-home.html
- Run a Spark SQL-based ETL pipeline with Amazon EMR on Amazon EKS: https://aws.amazon.com/blogs/big-data/run-a-spark-sql-based-etl-pipeline-with-amazon-emr-on-amazon-eks/
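Several of the resources above (Athena, AWS Glue, the Parquet format, data ingestion into a data lake) revolve around one idea: laying out objects in S3 so query engines can prune partitions instead of scanning everything. As a rough illustration, with a hypothetical bucket prefix and file name, a Hive-style partitioned key can be built like this:

```python
from datetime import date

def partition_key(prefix: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=),
    the layout Athena and Glue crawlers recognize for partition pruning."""
    return f"{prefix}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/{filename}"

# Hypothetical prefix and file name, for illustration only.
key = partition_key("datalake/orders", date(2022, 7, 5), "part-0000.parquet")
print(key)  # datalake/orders/year=2022/month=07/day=05/part-0000.parquet
```

With this layout, a query filtering on `year`, `month`, and `day` columns only reads the matching prefixes, which is a large part of why columnar formats like Parquet plus partitioning cut both cost and latency in Athena.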
After waiting for my results to be published, I received confirmation from Amazon that I passed the exam. I hope this post helps you, and have a fantastic learning journey!
References:
- Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766.
- Bose, R. (2009). Advanced analytics: opportunities and challenges. Industrial Management & Data Systems.
Hi! I am Bruno, a Brazilian born and bred, and I am also a naturalized Swedish citizen. I am a former Oracle ACE and, to keep up with academic research, I am a Computer Scientist with an MSc in Data Science and another MSc in Software Engineering. I have over ten years of experience working with companies such as IBM, Epico Tech, and Playtech across three different countries (Brazil, Hungary, and Sweden), and I have joined projects remotely in many others. I am super excited to share my interests in Databases, Cybersecurity, Cloud, Data Science, Data Engineering, Big Data, AI, Programming, Software Engineering, and data in general.