Data Science Dictionary
Algorithm
Definition: A defined set of instructions or steps used to perform calculations or solve problems in
computing. In data science, algorithms process data to uncover patterns or make predictions.
Example: The team designed an algorithm to identify fraudulent transactions in real-time.
Translation: Algoritmo.
Big Data
Definition: Datasets so large, fast-changing, or varied that traditional data-processing techniques
cannot handle them efficiently. Big Data is often analyzed to reveal patterns, trends, and associations.
Example: The retail company used big data analytics to understand shopping habits across multiple
regions.
Translation: Big Data (grandes volumes de dados).
Bias
Definition: A systematic error introduced into data analysis or machine learning models, leading to
skewed results or conclusions. Bias can arise from data collection methods or model design.
Example: The model's predictions were inaccurate due to bias in the training data, which
overrepresented one demographic group.
Translation: Viés.
Clustering
Definition: A type of unsupervised machine learning where the goal is to group data points into
clusters based on their similarities. It helps identify natural patterns in the data.
Example: Clustering was used to segment customers based on purchasing behavior, enabling more
targeted marketing strategies.
Translation: Agrupamento.
Correlation
Definition: A statistical measure that describes the strength and direction of a relationship between
two variables. It indicates how one variable changes as the other does.
Example: A negative correlation was found between daily screen time and sleep hours among
teenagers: the more screen time, the less sleep.
Translation: Correlação.
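As an illustration, the strength and direction of a linear relationship can be measured with Pearson's correlation coefficient, which ranges from -1 to 1. The sketch below is a minimal standard-library version. The function name and the screen-time and sleep figures are hypothetical, chosen only for this example.

```python
import math

def pearson_correlation(xs, ys):
    """Return Pearson's r for two equal-length sequences of numbers.

    Assumes neither sequence is constant (a constant sequence has zero
    variance, which would cause a division by zero).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: daily screen time and sleep, both in hours.
screen_time = [1, 2, 3, 4, 5]
sleep_hours = [9, 8.5, 8, 7, 6.5]
r = pearson_correlation(screen_time, sleep_hours)
print(round(r, 2))
```

With these made-up figures r is close to -1: the two variables move in opposite directions almost perfectly.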
Data Cleaning
Definition: The process of identifying and rectifying errors or inconsistencies in a dataset to ensure
its accuracy and completeness. This is a critical step before analysis can take place.
Example: Data cleaning removed duplicate entries and filled in missing values, making the dataset
ready for analysis.
Translation: Limpeza de dados.
Data Visualization
Definition: The graphical representation of data, typically using charts, graphs, and plots, to make
complex data more accessible and understandable.
Example: The data scientist created an interactive dashboard that visualized trends in sales over the
last quarter.
Translation: Visualização de dados.
Feature Engineering
Definition: The process of transforming raw data into meaningful features that can improve the
performance of machine learning algorithms. It often involves techniques like scaling, encoding, and
creating new variables.
Example: Feature engineering involved creating interaction terms between product price and
customer age to enhance the model’s predictive power.
Translation: Engenharia de características.
Forecasting
Definition: The practice of predicting future values based on historical data using statistical models
and machine learning techniques.
Example: Time series forecasting was used to predict next quarter’s sales based on historical trends.
Translation: Previsão.
Gradient Descent
Definition: An optimization algorithm used in machine learning to minimize a loss function by
iteratively adjusting the model's parameters in the direction opposite to the gradient of the loss.
It is commonly used in training deep learning models.
Example: The model’s performance improved as gradient descent minimized the error between
predicted and actual values.
Translation: Descida de gradiente (gradiente descendente).
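To make the iterative update concrete, here is a minimal sketch for a single parameter. The loss function, starting point, learning rate, and step count are assumptions picked for illustration. Real models have many parameters and usually estimate the gradient from batches of data.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Minimize a one-parameter function given a function for its gradient."""
    w = start
    for _ in range(steps):
        # Move against the gradient, scaled by the learning rate.
        w -= learning_rate * grad(w)
    return w

# Illustrative loss L(w) = (w - 3)^2, with gradient 2 * (w - 3)
# and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(w_min)
```

The learning rate is a hyperparameter. For this loss, a rate above 1.0 makes the updates diverge, and a very small rate converges slowly.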
Graph
Definition: A data structure made up of nodes (vertices) and edges (connections) that represents
relationships among entities. It is widely used to model networks such as social media or
transportation systems.
Example: The social network analysis used graphs to find the most influential nodes within a
community.
Translation: Grafo.
Hypothesis Testing
Definition: A statistical method used to evaluate assumptions about a population based on sample
data. It involves testing a null hypothesis against an alternative hypothesis.
Example: Hypothesis testing showed that the new marketing campaign significantly increased
customer engagement.
Translation: Teste de hipótese.
Hyperparameter
Definition: A parameter that is set before the learning process begins and controls how training
proceeds, such as the learning rate, number of trees, or regularization strength.
Example: The model's accuracy improved after tuning the hyperparameters, including the learning
rate and batch size.
Translation: Hiperparâmetro.
Imputation
Definition: The process of replacing missing values in a dataset with substituted values, often using
statistical techniques such as mean, median, or predictive models.
Example: Imputation helped fill in missing data for age and income, which was crucial for the
analysis.
Translation: Imputação.
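A minimal sketch of mean imputation, the simplest of the techniques mentioned above. It assumes missing values are represented as None. The function name and the age values are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column of ages with two missing entries.
ages = [20, None, 30, 40, None]
imputed_ages = impute_mean(ages)
print(imputed_ages)
```

Mean imputation keeps the column average unchanged but shrinks its variance. Median or model-based imputation may be preferable when the data are skewed.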
Inferential Statistics
Definition: A branch of statistics that allows for making predictions or generalizations about a
population based on a sample. It involves techniques like hypothesis testing and confidence intervals.
Example: Inferential statistics showed that the average annual salary in the region was significantly
higher than the national average.
Translation: Estatística inferencial.
K-Means
Definition: A popular unsupervised machine learning algorithm used to partition data into k distinct
clusters based on feature similarity.
Example: K-Means clustering divided the customer base into five distinct segments, each with
unique characteristics.
Translation: K-Means.
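The algorithm alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below applies this to one-dimensional data with k = 2. The function name, the starting centroids, and the spending figures are assumptions for illustration. Library implementations also handle multi-dimensional data and choose initial centroids more carefully.

```python
def k_means_1d(points, initial_centroids, iterations=10):
    """Run a fixed number of k-means iterations on 1-D data."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer, with two obvious groups.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(spend, initial_centroids=[1.0, 12.0])
print(centroids, clusters)
```

Here k is the number of starting centroids. Choosing k is up to the analyst, and different starting centroids can lead to different final clusters.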
Kernel
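Definition: A mathematical function used in machine learning, particularly in Support Vector
Machines (SVMs), that measures the similarity between two data points as if they had been mapped
into a higher-dimensional space, where the data is easier to separate, without computing that
mapping explicitly.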
Example: The use of the Gaussian kernel allowed the SVM model to successfully classify non-linear
data.
Translation: Kernel.
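As a small illustration, the Gaussian (RBF) kernel scores the similarity of two points as exp(-gamma * ||x - y||^2). The sketch below is a plain-Python version. The function name and the default gamma value are assumptions made for this example.

```python
import math

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel for two equal-length numeric vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical points have similarity 1; distant points approach 0.
same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])
apart = gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
print(same, apart)
```

A larger gamma makes similarity fall off faster with distance, which tends to produce a more flexible decision boundary.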
Linear Regression
Definition: A statistical method for modeling the relationship between a dependent variable and one
or more independent variables by fitting a linear equation to observed data.
Example: Linear regression was used to predict house prices based on size, location, and amenities.
Translation: Regressão linear.
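For a single independent variable, the fitted line has a closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x. The sketch below shows this calculation. The function name and the house data are hypothetical, constructed so that price is exactly 3 times size.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size (square meters) and price (thousands).
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]
slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # predicted price for a 90 m^2 house
print(slope, intercept, predicted)
```

With several independent variables, such as size, location, and amenities, the same idea is solved with matrix methods, usually through a statistics or machine learning library.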
Logistic Regression
Definition: A statistical model used to predict a binary outcome (e.g., yes/no, success/failure) based
on one or more independent variables.
Example: Logistic regression was employed to classify emails as either spam or non-spam.
Translation: Regressão logística.
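A logistic regression model passes a weighted sum of the inputs through the sigmoid function to obtain a probability between 0 and 1. The sketch below covers only the prediction step. The weights, bias, and the two email features are invented for illustration and do not come from a trained model.

```python
import math

def predict_probability(features, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical features: [number of links, contains the word "free" (0 or 1)].
weights = [0.8, 1.5]   # assumed values, not learned from data
bias = -2.0            # assumed value
p_spam = predict_probability([3, 1], weights, bias)
label = "spam" if p_spam >= 0.5 else "non-spam"
print(round(p_spam, 2), label)
```

The 0.5 threshold is a common default. It can be raised or lowered depending on whether false positives or false negatives are more costly.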
Neural Network
Definition: A computational model inspired by the structure of the human brain, consisting of
interconnected nodes (neurons) that process data for tasks like classification and regression.
Example: The deep neural network achieved high accuracy in recognizing handwritten digits.
Translation: Rede neural.
Normalization
Definition: The process of adjusting the range of numerical data so that it fits within a specific scale,
often to improve the performance of machine learning algorithms. Normalization typically rescales
data into a range between 0 and 1, or -1 and 1.
Example: Before training the model, the data was normalized to ensure that the features with larger
numerical ranges did not dominate the results.
Translation: Normalização.
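A minimal sketch of min-max normalization, which maps the smallest value to the lower bound and the largest to the upper bound. The function name, parameters, and income figures are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]

# Hypothetical annual incomes.
incomes = [20000, 40000, 60000, 100000]
scaled = min_max_normalize(incomes)
centered = min_max_normalize(incomes, new_min=-1.0, new_max=1.0)
print(scaled, centered)
```

Min-max scaling is sensitive to outliers, because a single extreme value compresses all other values into a narrow band.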