I use a linear SVM to predict the sentiment of tweets. The LinearSVC classifies the tweets as negative or positive. I use a Pipeline to (in order) clean, vectorize and classify the tweets. But when predicting the sentiment I only get a 0 (negative) or a 4 (positive). I want prediction scores between -1 and 1, as decimals, to get a better scale/understanding of 'how' positive or negative the tweets are.

The code:

import re
import string
import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

tok = WordPunctTokenizer()  # tokenizer used in tweet_cleaner

# read in influential twitter users on the stock market
twitter_users = pd.read_csv('core/infl_users.csv', encoding="ISO-8859-1")
twitter_users.columns = ['users']
df = pd.DataFrame()

# MODEL TRAINING
# read the training set for the model: csv to dataframe
df = pd.read_csv("../trainingset.csv", encoding='latin-1')

# label the training set dataframe columns
frames = [df]
for colnames in frames:
    colnames.columns = ["target", "id", "data", "query", "user", "text"]

# remove unnecessary columns
df = df.drop("id", axis=1)
df = df.drop("data", axis=1)
df = df.drop("query", axis=1)
df = df.drop("user", axis=1)

pat1 = r'@[A-Za-z0-9_]+'                 # remove @mentions from tweets
pat2 = r'https?://[^ ]+'                 # remove URLs from tweets
combined_pat = r'|'.join((pat1, pat2))   # combination of pat1 and pat2
www_pat = r'www.[^ ]+'                   # remove www URLs from tweets
# expand contractions like isn't to is not
negations_dic = {"isn't": "is not", "aren't": "are not", "wasn't": "was not",
                 "weren't": "were not", "haven't": "have not", "hasn't": "has not",
                 "hadn't": "had not", "won't": "will not", "wouldn't": "would not",
                 "don't": "do not", "doesn't": "does not", "didn't": "did not",
                 "can't": "can not", "couldn't": "could not", "shouldn't": "should not",
                 "mightn't": "might not", "mustn't": "must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner(text):  # clean a single tweet
    soup = BeautifulSoup(text, 'lxml')   # build a BeautifulSoup object
    souped = soup.get_text()             # keep only the text of the tweet
    try:
        # strip the utf-8-sig BOM (on Python 3, str has no decode, so this
        # falls through to the except branch)
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except:
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed)  # remove @mentions and URLs
    stripped = re.sub(www_pat, '', stripped)          # remove www URLs
    lower_case = stripped.lower()                     # convert to lower case
    # expand negations like isn't to is not
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)  # replace non-letters with spaces
    # WordPunct-tokenize and keep only words longer than one character
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()      # join the words

# build a list of stopwords to filter with
stopwords = list(STOP_WORDS)
# use the punctuation characters of the string module
punctuations = string.punctuation
# create a spaCy parser
parser = English()

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# basic function to clean the text
def clean_text(text):
    return text.strip().lower()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_
                for word in mytokens]
    mytokens = [word for word in mytokens
                if word not in stopwords and word not in punctuations]
    return mytokens

# Vectorization: convert a collection of text documents to a matrix of token counts.
# n-grams extend the unigram model by taking n words together. Big advantage:
# they preserve context -> words that appear together in the text also appear
# together in an n-gram, which can increase accuracy in classifying pos & neg.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))

# Linear Support Vector Classification: similar to SVC with kernel='linear',
# but with more flexibility in the choice of penalties and loss functions, and
# it should scale better to large numbers of samples. LinearSVC takes two
# arrays as input: an array X of shape [n_samples, n_features] holding the
# training samples, and an array y of class labels (strings or integers) of
# shape [n_samples].
classifier = LinearSVC(C=0.5)

# using tf-idf
tfvectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)

# put the tweet text in X and the target in ylabels to train the model
X = df['text']
ylabels = df['target']

# The next step is to load the data and split it into training and test sets.
# In this example, 80% of the dataset is used to train the model. That 80% is
# then split again 80-20: 80% to train the model, 20% to test results. The
# remaining 20% is kept to test the final model.
X_tr, X_kast, y_tr, y_kast = train_test_split(X, ylabels, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

# Create the pipeline to clean, tokenize, vectorize, and classify. Tying
# together different pieces of the ML process is known as a pipeline: each
# stage is fed data processed by its preceding stage, and pipelines only
# transform the observed data (X). A Pipeline chains multiple estimators into
# one and is built from (key, value) pairs, where the key is a string naming a
# particular step and the value is the estimator for that step. fit() applies
# all the transforms one after the other, then fits the final estimator on the
# transformed data.
pipe_tfid = Pipeline([("cleaner", predictors()),
                      ('vectorizer', tfvectorizer),
                      ('classifier', classifier)])

# fit our data; fit = training the model
pipe_tfid.fit(X_train, y_train)

# predicting with a test dataset
# sample_prediction1 = pipe_tfid.predict(X_test)
accur = pipe_tfid.score(X_test, y_test)

When predicting a sentiment score I do pipe_tfid.predict(['textoftweet']) (note that predict expects an iterable of documents, so the tweet text goes in a list).

1 Answer

Best answer
During training, the SVM computes the weights w that maximize the margin separating the classes. Predictions (in the binary case) are then made with the rule: choose A1 if w^T x + bias > 0, otherwise choose A2, where A1 and A2 are the two class labels. The SVM cannot return probabilities because it is not a probabilistic model. There are probabilistic interpretations of SVMs (such as Platt scaling), but if you want the confidence of a prediction you can use a standard probabilistic model (NaiveBayes, LogisticRegression, etc.) instead.
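That said, if a continuous score is enough rather than a true probability, the LinearSVC at the end of the question's pipeline already exposes exactly this signed quantity w^T x + bias through scikit-learn's decision_function, and CalibratedClassifierCV can wrap an already-fitted classifier to produce Platt-style probabilities. A minimal sketch, reusing the asker's pipe_tfid, X_test and y_test; the example tweets are made up, and the tanh squashing into [-1, 1] is only an illustrative choice, not a standard API:

import numpy as np
from sklearn.calibration import CalibratedClassifierCV

tweets = ["great earnings, very bullish",          # hypothetical inputs
          "terrible quarter, selling everything"]

# Signed distance to the separating hyperplane, i.e. w^T x + bias per tweet:
# negative values lean towards class 0 (negative), positive towards class 4.
scores = pipe_tfid.decision_function(tweets)

# The raw scores are unbounded; squash them into [-1, 1] if that scale is
# wanted (tanh is just one possible choice).
scaled = np.tanh(scores)

# For probability-like confidences, calibrate the fitted pipeline instead
# (Platt scaling via a sigmoid). Calibration data should be held out from the
# SVM's training data; the question's X_test / y_test split serves here.
calibrated = CalibratedClassifierCV(pipe_tfid, method="sigmoid", cv="prefit")
calibrated.fit(X_test, y_test)
probs = calibrated.predict_proba(tweets)  # columns ordered as calibrated.classes_

decision_function is monotone in the model's confidence, but its magnitude depends on the features, so the values are comparable within one model yet are not probabilities; the calibrated predict_proba returns per-class probabilities in [0, 1] instead.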
