I use a linear SVM to predict the sentiment of tweets. The LinearSVC classifies the tweets as negative or positive. I use a Pipeline to (in order) clean, vectorize and classify the tweets. But when predicting the sentiment I only get a 0 (negative) or a 4 (positive). I want prediction scores between -1 and 1, as decimals, to get a better scale/understanding of 'how' positive or negative the tweets are.

The code:

import re
import string
import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

tok = WordPunctTokenizer()  # tokenizer used in tweet_cleaner

# read in influential twitter users on the stock market
twitter_users = pd.read_csv('core/infl_users.csv', encoding="ISO-8859-1")
twitter_users.columns = ['users']
df = pd.DataFrame()

# MODEL TRAINING
# read the training set for the model: csv to dataframe
df = pd.read_csv("../trainingset.csv", encoding='latin-1')

# label the training set dataframe columns
frames = [df]
for colnames in frames:
    colnames.columns = ["target", "id", "data", "query", "user", "text"]

# remove unnecessary columns
df = df.drop("id", axis=1)
df = df.drop("data", axis=1)
df = df.drop("query", axis=1)
df = df.drop("user", axis=1)

pat1 = r'@[A-Za-z0-9_]+'                 # remove @mentions from tweets
pat2 = r'https?://[^ ]+'                 # remove URLs from tweets
combined_pat = r'|'.join((pat1, pat2))   # combination of pat1 and pat2
www_pat = r'www.[^ ]+'                   # remove www URLs from tweets
# expand contractions like isn't to is not
negations_dic = {"isn't": "is not", "aren't": "are not", "wasn't": "was not",
                 "weren't": "were not", "haven't": "have not", "hasn't": "has not",
                 "hadn't": "had not", "won't": "will not", "wouldn't": "would not",
                 "don't": "do not", "doesn't": "does not", "didn't": "did not",
                 "can't": "can not", "couldn't": "could not", "shouldn't": "should not",
                 "mightn't": "might not", "mustn't": "must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner(text):  # clean a single tweet
    soup = BeautifulSoup(text, 'lxml')   # build a BeautifulSoup object
    souped = soup.get_text()             # keep only the text of the tweet
    try:
        # strip the utf-8-sig BOM (on Python 3, str has no decode, so this
        # falls through to the except branch)
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except:
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed)  # remove @mentions and URLs
    stripped = re.sub(www_pat, '', stripped)          # remove www URLs
    lower_case = stripped.lower()                     # convert to lower case
    # expand negations like isn't to is not
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)  # replace non-letters with spaces
    # WordPunct-tokenize and keep only words longer than one character
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()      # join the words

# build a list of stopwords to filter with
stopwords = list(STOP_WORDS)
# use the punctuation characters of the string module
punctuations = string.punctuation
# create a spaCy parser
parser = English()

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# basic function to clean the text
def clean_text(text):
    return text.strip().lower()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_
                for word in mytokens]
    mytokens = [word for word in mytokens
                if word not in stopwords and word not in punctuations]
    return mytokens

# Vectorization: convert a collection of text documents to a matrix of token counts.
# n-grams extend the unigram model by taking n words together. Big advantage:
# they preserve context -> words that appear together in the text also appear
# together in an n-gram, which can increase accuracy in classifying pos & neg.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))

# Linear Support Vector Classification: similar to SVC with kernel='linear',
# but with more flexibility in the choice of penalties and loss functions, and
# it should scale better to large numbers of samples. LinearSVC takes two
# arrays as input: an array X of shape [n_samples, n_features] holding the
# training samples, and an array y of class labels (strings or integers) of
# shape [n_samples].
classifier = LinearSVC(C=0.5)

# using tf-idf
tfvectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)

# put the tweet text in X and the target in ylabels to train the model
X = df['text']
ylabels = df['target']

# The next step is to load the data and split it into training and test sets.
# In this example, 80% of the dataset is used to train the model. That 80% is
# then split again 80-20: 80% to train the model, 20% to test results. The
# remaining 20% is kept to test the final model.
X_tr, X_kast, y_tr, y_kast = train_test_split(X, ylabels, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

# Create the pipeline to clean, tokenize, vectorize, and classify. Tying
# together different pieces of the ML process is known as a pipeline: each
# stage is fed data processed by its preceding stage, and pipelines only
# transform the observed data (X). A Pipeline chains multiple estimators into
# one and is built from (key, value) pairs, where the key is a string naming a
# particular step and the value is the estimator for that step. fit() applies
# all the transforms one after the other, then fits the final estimator on the
# transformed data.
pipe_tfid = Pipeline([("cleaner", predictors()),
                      ('vectorizer', tfvectorizer),
                      ('classifier', classifier)])

# fit our data; fit = training the model
pipe_tfid.fit(X_train, y_train)

# predicting with a test dataset
# sample_prediction1 = pipe_tfid.predict(X_test)
accur = pipe_tfid.score(X_test, y_test)

When predicting a sentiment score I do pipe_tfid.predict(['textoftweet']) (note that predict expects an iterable of documents, so the tweet text goes in a list).

1 Answer

Best answer
During training, the SVM computes the weights w that maximize the margin separating the classes. Predictions (in the binary case) are then made with the rule: choose A1 if w^T x + bias > 0, otherwise choose A2, where A1 and A2 are the two class labels. The SVM cannot return probabilities because it is not a probabilistic model. There are probabilistic interpretations of SVMs (such as Platt scaling), but if you want the confidence of a prediction you can use a standard probabilistic model (NaiveBayes, LogisticRegression, etc.) instead.
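That said, if a continuous score is enough rather than a true probability, the LinearSVC at the end of the question's pipeline already exposes exactly this signed quantity w^T x + bias through scikit-learn's decision_function, and CalibratedClassifierCV can wrap an already-fitted classifier to produce Platt-style probabilities. A minimal sketch, reusing the asker's pipe_tfid, X_test and y_test; the example tweets are made up, and the tanh squashing into [-1, 1] is only an illustrative choice, not a standard API:

import numpy as np
from sklearn.calibration import CalibratedClassifierCV

tweets = ["great earnings, very bullish",          # hypothetical inputs
          "terrible quarter, selling everything"]

# Signed distance to the separating hyperplane, i.e. w^T x + bias per tweet:
# negative values lean towards class 0 (negative), positive towards class 4.
scores = pipe_tfid.decision_function(tweets)

# The raw scores are unbounded; squash them into [-1, 1] if that scale is
# wanted (tanh is just one possible choice).
scaled = np.tanh(scores)

# For probability-like confidences, calibrate the fitted pipeline instead
# (Platt scaling via a sigmoid). Calibration data should be held out from the
# SVM's training data; the question's X_test / y_test split serves here.
calibrated = CalibratedClassifierCV(pipe_tfid, method="sigmoid", cv="prefit")
calibrated.fit(X_test, y_test)
probs = calibrated.predict_proba(tweets)  # columns ordered as calibrated.classes_

decision_function is monotone in the model's confidence, but its magnitude depends on the features, so the values are comparable within one model yet are not probabilities; the calibrated predict_proba returns per-class probabilities in [0, 1] instead.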
