Michael E. Byczek, Technical Consultant

Amazon Cloud-based Analytics

Amazon Web Services (AWS) cloud-based analytics services include EMR (Hadoop), Data Pipeline, Elasticsearch, Kinesis (streaming data), Machine Learning, QuickSight (business intelligence), and Redshift (data warehouse).

EMR (Elastic MapReduce)
  • Process vast amounts of data
  • Managed Hadoop framework to distribute data across Amazon EC2 instances
  • Big data use cases: log analysis, scientific simulation, bioinformatics
  • Use up to thousands of compute instances to process data at any scale
Elasticsearch Service
  • Deploy, operate, and scale log analysis, real-time application monitoring, and clickstream analytics
  • Streaming analytics, such as alerts based on patterns of IP addresses
  • eCommerce filtering and navigation
  • Collect social media feeds for customer sentiment
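The log analysis and IP-address alerting items above can be pictured as a map step and a reduce step. The following is a minimal single-machine sketch in plain Python, not EMR or Hadoop code. The log format (client IP as the first field), the function names, and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (ip, 1) for a log line whose first field is assumed to be the client IP."""
    parts = line.split()
    if parts:
        yield parts[0], 1


def reduce_phase(ip, counts):
    """Sum the counts emitted for one IP address."""
    return ip, sum(counts)


def count_requests(lines):
    """Run map, group by key (a stand-in for the shuffle step), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for ip, one in map_phase(line):
            grouped[ip].append(one)
    return dict(reduce_phase(ip, c) for ip, c in grouped.items())


def suspicious_ips(counts, threshold):
    """Return IPs with at least `threshold` requests (hypothetical alert rule)."""
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Made-up sample log lines for illustration only.
SAMPLE_LOG = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /cart",
    "10.0.0.1 GET /login",
    "10.0.0.1 POST /login",
]

counts = count_requests(SAMPLE_LOG)
print(counts)
print(suspicious_ips(counts, threshold=3))
```

In a distributed framework the map and reduce functions would run on many instances, and the grouping step would be handled by the framework rather than by a dictionary in memory.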
Kinesis Firehose
  • Load streaming data into AWS
  • Near real-time analysis
  • Batch, compress, and encrypt
  • Combine hundreds of thousands of data sources
  • Gigabytes per second of streaming data
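The batch and compress steps listed above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the Firehose API: the function names, the batch size, and the newline-delimited JSON layout are assumptions, and encryption is left out to keep the example dependency-free.

```python
import gzip
import json


def batch_records(records, batch_size):
    """Group an iterable of records into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


def compress_batch(batch):
    """Serialize a batch as newline-delimited JSON and gzip it."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in batch)
    return gzip.compress(payload.encode("utf-8"))


def decompress_batch(blob):
    """Reverse compress_batch, returning the original list of records."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Made-up events standing in for a stream of incoming records.
events = [{"id": i, "page": "/home"} for i in range(5)]
blobs = [compress_batch(b) for b in batch_records(events, batch_size=2)]
restored = [r for blob in blobs for r in decompress_batch(blob)]
print(len(blobs), restored == events)
```

Batching before compression matters because compressing many small records together usually yields a much better ratio than compressing each record on its own.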
Kinesis Streams
  • Build custom applications that process/analyze streaming data
  • Continuously capture/store terabytes of data per hour, such as IT logs and website clickstreams
  • Multiple apps can process same stream concurrently
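One way to picture several applications reading the same stream is an append-only log where each consumer tracks its own read position. The sketch below is an in-memory toy under that assumption. The Stream and Consumer classes are hypothetical and are not part of any AWS SDK, and the consumers are polled one after another rather than truly in parallel.

```python
class Stream:
    """Append-only in-memory record log (illustrative stand-in for a stream)."""

    def __init__(self):
        self._records = []

    def put(self, record):
        self._records.append(record)

    def read(self, position, limit=10):
        """Return up to `limit` records from `position`, plus the next position."""
        chunk = self._records[position:position + limit]
        return chunk, position + len(chunk)


class Consumer:
    """An application that reads the stream from its own independent position."""

    def __init__(self, stream, handler):
        self.stream = stream
        self.handler = handler
        self.position = 0

    def poll(self):
        """Process any records not yet seen by this consumer; return how many."""
        records, self.position = self.stream.read(self.position)
        for record in records:
            self.handler(record)
        return len(records)


stream = Stream()
page_views = []
errors = []

# Two independent applications over the same stream.
analytics = Consumer(stream, lambda r: page_views.append(r["path"]))
alerting = Consumer(stream, lambda r: errors.append(r) if r["status"] >= 500 else None)

for rec in [{"path": "/", "status": 200}, {"path": "/buy", "status": 500}]:
    stream.put(rec)

analytics.poll()
alerting.poll()
print(page_views)
print(len(errors))
```

Because reading does not remove records, one consumer's progress never affects what another consumer sees.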
Kinesis Analytics
  • Run SQL queries against streaming data
Redshift
  • Fully managed petabyte-scale data warehouse
  • Improved I/O efficiency and parallelized queries across multiple nodes
  • JDBC/ODBC drivers and PostgreSQL
  • Auto backup to S3
  • Data compression and parallel/distributed SQL operations
  • Use hard disk drives, solid-state drives, fast CPUs, and large amounts of RAM
  • Audit all SQL operations, such as connection attempts and queries
QuickSight
  • All employees can build visualizations, perform ad hoc analysis, and get insights
  • Parallel in-memory calculation engine for advanced calculations
  • Visualize data from a file or data source within one minute
  • Amazon AWS data, CSV/TSV/spreadsheet, Salesforce, PostgreSQL, SQL Server, Oracle, or MySQL
  • Automatically infer data types and relationships to suggest visualizations
  • AutoGraph, a collection of algorithms, learns the best visualizations to match analytical patterns
  • Eliminate need for data engineers to spend months building data models
  • Storyboard tour through evolution of analysis for collaboration
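Inferring data types from a file and suggesting a visualization, as described above, can be illustrated with a small sketch. This is not how QuickSight or AutoGraph works internally. The type names, the date format, the chart rules, and the sample data are all assumptions chosen to keep the example short.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Return 'integer', 'decimal', 'date', or 'string' for a column of text values."""

    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    # Assumes ISO-style dates; other formats fall through to 'string'.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


def infer_schema(csv_text):
    """Map each column name in a CSV document to its inferred type."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


def suggest_chart(schema):
    """Pick a chart from the inferred types (hypothetical rules of thumb)."""
    types = set(schema.values())
    has_number = bool(types & {"integer", "decimal"})
    if "date" in types and has_number:
        return "line chart"
    if "string" in types and has_number:
        return "bar chart"
    return "table"


# Made-up sample data for illustration only.
SAMPLE = (
    "order_date,region,units,revenue\n"
    "2016-01-05,East,3,59.97\n"
    "2016-01-06,West,1,19.99\n"
)

schema = infer_schema(SAMPLE)
print(schema)
print(suggest_chart(schema))
```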
Machine Learning
  • Developers of all skill levels can utilize machine learning
  • Visualization tools and wizards without learning complex algorithms
  • Same technology used by Amazon
  • Generate billions of predictions daily
  • Built-in data processors and scalable machine learning algorithms
  • Use data from Amazon S3, Redshift, or RDS
  • Batch predictions, or real-time predictions for individual data records
  • Click prediction, personalized content, product catalogues, promotional marketing campaigns, and identifying which items a customer is most likely to purchase
  • A learning algorithm creates a binary, multi-class, or regression model. Binary models answer yes/no questions, such as whether a customer will purchase a particular product; multi-class models choose among several categories, such as a customer's favorite product category; and regression models predict a numeric value, such as the selling price of an item
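The distinction between the three model types comes down to what kind of value is being predicted. The sketch below picks a model type from sample target values. It is an illustration of the taxonomy only, not the service's own selection logic, and the function name and the rules are assumptions.

```python
def choose_model_type(target_values):
    """Pick 'binary', 'multi-class', or 'regression' from sample target values."""
    distinct = set(target_values)
    # bool is checked first because Python treats bool as a subclass of int.
    if all(isinstance(v, bool) for v in distinct):
        return "binary"
    if all(isinstance(v, (int, float)) for v in distinct):
        return "regression"
    # Remaining case: labels such as strings. Two labels is still a binary problem.
    return "binary" if len(distinct) == 2 else "multi-class"


# Made-up targets matching the three examples in the text above.
will_purchase = [True, False, True]
favorite_category = ["books", "toys", "garden", "books"]
selling_price = [19.99, 5.0, 42.5]

print(choose_model_type(will_purchase))
print(choose_model_type(favorite_category))
print(choose_model_type(selling_price))
```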
Data Pipeline
  • Process and move data between different AWS compute/storage services
  • Access, transform, and process data and transfer results to other services
  • Drag-and-drop interface with built-in common preconditions
  • Library of pipeline templates
  • Serial or parallel dispatch to process millions of files
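The precondition and serial-versus-parallel dispatch ideas above can be sketched with the Python standard library. This is a local toy, not a Data Pipeline definition. The precondition rule, the word-count transform, and the in-memory file contents are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def precondition_met(files):
    """Hypothetical precondition: at least one input exists and none is empty."""
    return bool(files) and all(files.values())


def transform(item):
    """Example processing step: count the words in one file's contents."""
    name, text = item
    return name, len(text.split())


def run_pipeline(files, parallel=True, workers=4):
    """Check the precondition, then process every file serially or in parallel."""
    if not precondition_met(files):
        raise RuntimeError("precondition failed: missing or empty input")
    items = sorted(files.items())
    if parallel:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map preserves input order, so both modes give the same result.
            results = list(pool.map(transform, items))
    else:
        results = [transform(item) for item in items]
    return dict(results)


# In-memory stand-ins for input files.
FILES = {"a.log": "one two three", "b.log": "four five"}

parallel_result = run_pipeline(FILES, parallel=True)
serial_result = run_pipeline(FILES, parallel=False)
print(parallel_result)
print(serial_result == parallel_result)
```

Checking the precondition before dispatch means a missing input fails the run early instead of partway through processing.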

Copyright © 2016. Michael E. Byczek. All Rights Reserved.