Scikit-Learn
The Scikit-Learn flavor of Animal-speak.
Last updated
Borrowed mostly from here: and here:
Data is a set of objects, or individuals (in Stats-speak), each of which has multiple properties (Human), variables (Stats-speak) or features (Scikit-speak). Example: a human census is data. The set consists of human individuals, and each human has height, weight, etc. These are the features, which are also called "attributes" (because of the census, duh!). Notice that a human must have a height and a weight; (s)he cannot not have them. The complete set of data is sometimes called the population (as opposed to a sample, which is a subset of it).
A subset of the data is called a sample (see below). It is exactly that: a subset picked out of the data according to some criterion (including a 'random' selection).
Human language: If you are a student of Newtonian dynamics or mechanics or, God forbid, electrodynamics, you need to wrap your head around this: here everything is totally static. There is only a set of pieces on the chessboard, and of course all the Animal-speak was created for the sole purpose of letting the 'researcher' juggle the super-expensive-to-acquire numbers and answers to questions as desired, in order to obtain a 'prediction' that will sell well. Sorry for being cynical, but that's what it is and nothing more: census data collection, like any other 'field research', required a lot of money in the past. Now you can probably surveil each human in an industrialized country in real time and work with that 'data'. :)
A sample is a subset of the population (all of the data) selected by a predefined procedure (including a 'random' selection). The elements of a sample are known as "sample points", "sampling units" or "observations". Example: a subset of humans in the same age group taken from the census data (the population).
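To make the population/sample distinction concrete, here is a minimal sketch using NumPy. The "census" array and its columns (age, height, weight) are invented purely for illustration; they are not real data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the random sample is repeatable

# Hypothetical "census": one row per human, columns are age, height (cm), weight (kg).
population = np.column_stack([
    rng.integers(18, 90, size=1000),
    rng.normal(170, 10, size=1000),
    rng.normal(70, 12, size=1000),
])

# Sample picked by a criterion: everyone in the 30-39 age group.
age = population[:, 0]
age_group = population[(age >= 30) & (age < 40)]

# Sample picked at random: 50 distinct individuals.
random_rows = rng.choice(len(population), size=50, replace=False)
random_sample = population[random_rows]
```

Both `age_group` and `random_sample` are samples in the sense above: subsets of the same population, chosen by different procedures.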
n_samples is pure Scikit-speak: it means the number of individual "data points" in the data, which is an n_samples * n_features sized array stored under the key data (a Python dictionary key, not an SQL key).
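As a small illustration of the n_samples * n_features layout, the sketch below reads both numbers off the shape of the data array. It assumes the iris dataset bundled with scikit-learn, chosen here only because it needs no download.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The data array has one row per data point and one column per feature.
X = iris.data
n_samples, n_features = X.shape

print(n_samples, n_features)  # 150 4
```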
Feature is the Scikit-speak for a variable in Stats-speak. It's a variable in a very broad sense of the word, because it can be an object representing a complex phenomenon (like a histogram of noise), but it's a variable (or at most a 'representation of a variable') nonetheless. They also use the word attribute with the same meaning here and there, because... they want to. Seriously, there is no need for the word "feature", which is meaningless in this context.
n_features is the number of variables that every individual data point depends on. In Scikit-speak the words features and attributes are used interchangeably to describe the content of the row in the data array that represents an individual data point.
Target is the word that is constantly used in all the functions related to regression. Its synonyms are "label", "class" and even "name of object" (in the sense of "apple" or "orange"). It can actually be multidimensional, but typically it is just 1D. The human meaning of it is: it's the thing you are trying to predict with the help of "regression analysis". It may be one of the variables the data points depend on, or it may come from outside the data table entirely; in that case they call it an external_variable, assigned to each individual data point separately and for that reason called a label.
the other dimension of 'output' matrix y.
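The sketch below shows what a second dimension of y looks like in practice. It assumes this entry corresponds to the n_targets parameter of scikit-learn's make_regression generator, which is an assumption on my part rather than something stated above.

```python
from sklearn.datasets import make_regression

# One target per data point: y is a 1D array of length n_samples.
X, y_single = make_regression(n_samples=20, n_features=3, n_targets=1, random_state=0)

# Two targets per data point: y becomes an n_samples * n_targets array.
_, y_multi = make_regression(n_samples=20, n_features=3, n_targets=2, random_state=0)

print(y_single.shape)  # (20,)
print(y_multi.shape)   # (20, 2)
```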
Loader and fetcher functions return a dictionary-like object holding at least two items: an array of shape n_samples * n_features with key data, and a numpy array of length n_samples, containing the target values, with key target. In both cases "key" means a Python dictionary key, not an SQL key.
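A minimal sketch of that dictionary-like object, again assuming the bundled iris dataset as a stand-in for any loader. The object scikit-learn returns is called a Bunch; it accepts both dictionary-style and attribute-style access.

```python
from sklearn.datasets import load_iris

bunch = load_iris()
print(type(bunch).__name__)  # Bunch

# Dictionary-style access with the two keys every loader provides.
X = bunch["data"]
y = bunch["target"]

# Attribute-style access returns the same arrays.
same_X = bunch.data

print(X.shape, y.shape)  # (150, 4) (150,)
```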
Notice that for some inexplicable reason (probably because "it just happened so") the dataset generation functions return a tuple (X, y), while the loaders and fetchers return a dictionary-like object...
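A quick way to see the difference in return types, assuming the tuple in question is the (X, y) pair returned by generator functions such as make_blobs (my choice of generator for this sketch):

```python
from sklearn.datasets import load_iris, make_blobs

# A generator function hands back a plain tuple (X, y).
generated = make_blobs(n_samples=10, n_features=2, random_state=0)
print(type(generated).__name__)  # tuple
X, y = generated

# A loader hands back a dictionary-like Bunch by default...
loaded = load_iris()
print(type(loaded).__name__)  # Bunch

# ...but can be asked for the same tuple form with return_X_y=True.
X_iris, y_iris = load_iris(return_X_y=True)
```

Passing return_X_y=True is a convenient way to treat both kinds of function uniformly.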
Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called "target" or "labels". Most often, y is a 1D array of length n_samples.
All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
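The fit/predict pattern can be sketched as follows. The choice of KNeighborsClassifier, the iris data and the 80/20 split are assumptions made for the example; any supervised estimator follows the same two-step shape.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold back 20% of the rows to play the role of "unlabeled observations".
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)    # learn the link between X and y
y_pred = model.predict(X_new)  # predicted labels, one per row of X_new
```

Here y_pred has the same length as X_new, and comparing it with y_new shows how well the link was learned.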
The dataset generation functions (the make_* family) can be used to generate controlled synthetic datasets.
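A minimal sketch of what "controlled" means here, assuming make_regression as the generator: size, number of features, noise level and random seed are all chosen by the caller, so the same call reproduces the same data.

```python
import numpy as np
from sklearn.datasets import make_regression

# Size, dimensionality, noise and seed are all under the caller's control.
params = dict(n_samples=100, n_features=3, noise=5.0, random_state=42)

X, y = make_regression(**params)
X_again, y_again = make_regression(**params)

# Same parameters and same seed give exactly the same synthetic dataset.
reproducible = np.array_equal(X, X_again) and np.array_equal(y, y_again)
```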