This first Machine Learning tutorial covers the complete data pre-processing stage of building Machine Learning models.
Throughout this tutorial we'll work through pre-processing as data transformation, feature selection, dimensionality reduction, and sampling for machine learning. In a later tutorial, we will apply this process with various algorithms to help you understand what Machine Learning is and how to use it with the Python language.
First of all, we need to define the business problem. After all…
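For a taste of what those pre-processing steps look like in code, here is a minimal sketch assuming scikit-learn and a generic random feature matrix (not the tutorial's actual dataset):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# A generic numeric feature matrix and binary target, for illustration only
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=100)

# Transformation: put all variables on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Selection: keep the k features most related to the target
X_selected = SelectKBest(f_classif, k=8).fit_transform(X_scaled, y)

# Dimensionality reduction: project onto fewer components
X_reduced = PCA(n_components=5).fit_transform(X_selected)

# Sampling: hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.3)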
The purpose of this tutorial is to build graphs that assist in the data science process. We can employ visualizations during exploratory analysis, before or after processing the data; construct statistical graphs to analyze datasets; identify relationships between variables; or verify how the data is distributed.
We can do all this with Matplotlib; however, for statistical graphs there is a library that is much better and much easier: Seaborn. Either way, knowing how to create a visualization, regardless of the tool, is of fundamental importance.
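As a quick illustration of how little code a statistical graph takes in Seaborn, a minimal sketch using its bundled "tips" example dataset (an assumption for the example, not the tutorial's data):

import seaborn as sns
import matplotlib.pyplot as plt

# Load one of Seaborn's bundled example datasets
tips = sns.load_dataset("tips")

# Distribution of a single variable
sns.histplot(data=tips, x="total_bill")
plt.show()

# Relationship between two variables, split by a category
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()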
Visit the Jupyter Notebook to see all the concepts that we…
In the last tutorial on useful operations in Pandas, we always worked with a single object: a Series, a DataFrame, or a NumPy Array. What if we need to work with more than one object?
Visit the Jupyter Notebook to review the concepts covered about SQL Join in Pandas. Note: the functions, outputs, and important terms are in bold to make them easier to understand, at least for me.
The first step is to import Pandas, so we can use its packages, methods, and attributes, and the NumPy package, to create the Arrays:
import pandas as pd
import numpy as np
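To preview the idea of working with more than one object, a minimal sketch of an SQL-style join between two DataFrames (the data and column names here are made up for illustration):

# Two small DataFrames that share a key column
employees = pd.DataFrame({"name": ["Ana", "Bob", "Carla"],
                          "dept": ["Sales", "IT", "Sales"]})
salaries = pd.DataFrame({"name": ["Ana", "Bob", "Carla"],
                         "salary": [5000, 6000, 5500]})

# pd.merge is the Pandas equivalent of an SQL JOIN
merged = pd.merge(employees, salaries, on="name", how="inner")

# pd.concat stacks objects instead of joining them on a key
more_employees = pd.DataFrame({"name": ["Davi"], "dept": ["IT"]})
all_employees = pd.concat([employees, more_employees], ignore_index=True)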
In this tutorial we’ll explore the rental dataset, perform transformations and reorganize the data as if we were actually preparing the data for modeling and creating models.
It is common that we receive the data to solve any problem and need to analyze and explore data, seek relationships, seek how variables are organized, have or not the need to transform…
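As an idea of what that preparation can look like, a minimal Pandas sketch (the file name and columns are assumptions for illustration, not the actual rental dataset):

import pandas as pd

# 'rentals.csv' and its columns are placeholders
rentals = pd.read_csv("rentals.csv")

# Explore structure and missing values first
print(rentals.info())
print(rentals.isnull().sum())

# Typical transformations: fill gaps, encode a categorical
# column, and derive a new variable
rentals["price"] = rentals["price"].fillna(rentals["price"].median())
rentals = pd.get_dummies(rentals, columns=["neighborhood"])
rentals["price_per_m2"] = rentals["price"] / rentals["area"]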
We will look at the process of building Machine Learning models in an easy way! Basically, there is a set of activities that will always be carried out when building a predictive model. Each of these four steps involves a series of technical activities, mathematical and statistical procedures, programming, and business knowledge.
It is important to know this process because each step requires different tools, different techniques, and different procedures; this is nothing more than the routine work of a Data Scientist.
Fundamentally, we have four main steps in the process of building Machine Learning models…
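As a rough illustration of an end-to-end model build, a minimal sketch assuming scikit-learn and its bundled Iris dataset; the step labels below are a common breakdown, not necessarily the tutorial's own four steps:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Problem definition: classify iris species from measurements
X, y = load_iris(return_X_y=True)

# Data preparation: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model building: train a simple classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation: measure performance on unseen data
print(accuracy_score(y_test, model.predict(X_test)))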
Here we’ll see how the combination of Python and Spark — PySpark works. We will run an application, and then we will do a MapReduce operation, addressing various concepts about Spark Core — apache spark’s main engine.
First, let’s boot PySpark through the terminal. We open the command prompt, navigate to the directory where the files are, and type:
PySpark will open in the default browser. If you want to use a different browser, copy the URL and paste it there.
We have a Notebook in which we'll cover an intro to PySpark. The first important note will…
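As a preview of the MapReduce operation, a minimal word-count sketch in PySpark (the input file name is a placeholder; in the PySpark shell, sc already exists and does not need to be created):

from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

# Map phase: split lines into words and emit (word, 1) pairs
# Reduce phase: sum the counts for each word
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()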
When Hadoop was released, it met two Big Data needs: distributed storage and distributed processing. However, this alone is not enough to work with Big Data; other tools are required, i.e., other functionalities to meet different business, application, and architecture needs.
The Hadoop Ecosystem exists to work with billions of records; that is what the whole Apache Hadoop Ecosystem puts at our disposal.
Over the years, other software began to appear that runs seamlessly alongside Hadoop and its Ecosystem. …
In this article, we will cover what the Business Analytics process is: the flow of activities that must be carried out in a specific order to achieve our goal.
While traditional statistical applications focus on relatively small data sets, Data Science involves vast amounts of data, typically what we call Big Data. When we talk about Data Science, we are generally talking about applying analysis techniques to large data sets, which requires Computer Science techniques for storing and processing them.
If we’re going to work with a petabyte of data, it will require a different infrastructure than…
One of Apache Spark's main strengths is applying MapReduce to large datasets. Hadoop popularized MapReduce operations; however, Spark can execute these operations up to 100x faster when running in memory. Now we'll look at the differences between Hadoop MapReduce and Apache Spark.
Hadoop MapReduce and Apache Spark are the two most popular frameworks for cluster computing and large-scale data analysis (Big Data). These two frameworks hide the complexity of data processing, task parallelism, and fault tolerance by exposing a simple software API to users. We have no way…
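To make the in-memory point concrete, a minimal sketch of Spark caching, assuming an existing SparkContext sc (as in the PySpark intro above):

# Transform an RDD and keep the result in memory
squares = sc.parallelize(range(1000000)).map(lambda x: x * x)
squares.cache()

# The first action computes and caches the partitions;
# later actions reuse them from memory instead of recomputing
print(squares.count())
print(squares.sum())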
Composing a repository of books (I bought), authors (I follow) & blogs (direct ones) for my own understanding.